cs373 »


Gradient Descent

So we saw a lot of confusion in the forums about what was going on with this gradient descent. The equations that were given to you were presented as a black box, and I'd like to open up that black box and see what's going on inside. For those of you who looked at Wikipedia to try to understand gradient descent, the first formula you would have encountered is this one, which describes gradient descent:

b = a - gamma * gradient of F evaluated at a

Now, this is a little intimidating at first. What do all of these parameters mean? Let's figure out what's going on here.

Before we get into gradient descent, let's first answer the question: what is a gradient? To help me answer this question, I'm going to get the help of my blindfolded hill climber. Here's our blindfolded hill climber, and he's climbing this rounded hill. He wants to get to the top, and he wants to do it in as few steps as possible. What is his method going to be? Well, as long as the hill is steep, he knows he can afford to take a really big step, because he's still not really close to the top. As the hill smooths out and becomes more level, he's going to want to take smaller and smaller steps, because he's afraid of overshooting the top and winding up on the other side of the hill. How can we describe this mathematically? We do it using the gradient, and when we do, our climber gets to the top of the hill.

Let's look at this hill from a top-down view instead of from the side. I'm going to draw these isolines: each of these lines represents points of the same altitude. We can see they're really close together near the edge and become further and further apart as we move inward, and that's because the hill is getting less and less steep. So at any point, the gradient is a vector. Say I took the gradient at this point.
I would get a vector pointing in the direction of steepest ascent, and the length of that vector would depend on exactly how steep the hill is. So at this point, which maybe corresponds to somewhere down here on the hill, the hill is quite steep, so our vector is quite long, and it points directly uphill. If I took the gradient over here instead, the vector would still point uphill, but it wouldn't be quite as long, because now we're up here, the slope isn't so great, and with reduced slope comes a reduced gradient and a reduced step size for our blindfolded hill climber.

Gradient descent is very similar to blindfolded hill climbing, except that instead of climbing up a hill, we're descending into a valley and trying to find its bottom, because we want to minimize a function. That minimization is encoded in the minus sign: instead of adding the gradient, we subtract it. Let's talk about what each of these variables means. a is our current position; if we were the hill climber, it would be right here where he starts off. b is going to be his next position, so this is how the hill climber decides where to go next. The gradient term tells us the direction of steepest ascent, and the minus sign flips that around; it says, okay, let's actually find the direction of steepest descent. Gamma is just a weighting factor. That's all this formula is doing: it says, take a step down the hill and use that as our new position.

Now, to elaborate on this, we're going to need a little bit of calculus, so if you don't know calculus, don't be intimidated. Try your best to follow along, and if you can't, that's okay too; I think this captures the essence of what's going on with gradient descent. First, we have to figure out what exactly the function we're trying to minimize is.
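To make the update b = a - gamma * gradient concrete, here's a minimal sketch in Python (my own illustration, not code from the course) that descends a simple bowl-shaped function whose gradient we can write by hand:

```python
# Gradient descent on the bowl F(x, y) = x^2 + y^2, whose minimum is at (0, 0).
# Illustrative sketch only; function and parameter names are my own.

def grad_F(a):
    """Gradient of F(x, y) = x^2 + y^2 evaluated at the point a = (x, y)."""
    x, y = a
    return (2.0 * x, 2.0 * y)

def gradient_descent(a, gamma=0.1, steps=100):
    """Repeatedly apply the update b = a - gamma * grad_F(a)."""
    for _ in range(steps):
        gx, gy = grad_F(a)
        a = (a[0] - gamma * gx, a[1] - gamma * gy)
    return a

print(gradient_descent((3.0, 4.0)))  # converges toward the minimum at (0, 0)
```

Notice that the step size shrinks automatically as the slope flattens, just like the hill climber's: gamma stays fixed, but the gradient vector itself gets shorter near the minimum.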
For the purposes of the lecture, the function we're trying to minimize was a function of y_i: some weighting alpha times (x_i - y_i) squared, plus some weighting beta times (y_i - y_{i+1}) squared, for the next y coordinate, plus the same term for the previous y coordinate, y_{i-1}:

f(y_i) = alpha * (x_i - y_i)^2 + beta * (y_i - y_{i+1})^2 + beta * (y_i - y_{i-1})^2

We want to minimize this, and we're going to use gradient descent. Gradient descent says we should iterate this process until we reach a sufficiently shallow slope that we're confident we're at the bottom, that we've really found a minimum. So at each step we follow the update: b, our new location, becomes y_i'. That equals our old location, y_i, minus the gradient, and the gradient here is just the derivative with respect to y_i. If you don't know calculus, again, don't worry about it; if you do, this is a fairly simple derivative. I'm going to write it simplified and not show all the intermediate steps. Once we work it out, we find:

y_i' = y_i + 2 * alpha * (x_i - y_i) + 2 * beta * (y_{i+1} + y_{i-1} - 2 * y_i)

This almost, almost looks like the equation you were given in class. There's one little problem: these 2s. I'm going to do something where I just erase them. You may be saying, hey, you're not allowed to do that, but if we pretend I'd originally called these parameters alpha over 2 and beta over 2, then everything works out just fine. That's gradient descent, and that's where this equation in our update step comes from. This would be a simultaneous update, which is another good way of using gradient descent; it's not how we implemented it in lecture, but that's okay. I hope this was helpful. Good luck!
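Putting the pieces together, here's a sketch of this update applied to smoothing a 1-D path, with the 2s absorbed into alpha and beta and using a simultaneous update. The function and parameter names are my own, and the tolerance-based stopping rule is just one reasonable way to decide the slope is "shallow enough":

```python
# Path smoothing with the update
#   y_i' = y_i + alpha * (x_i - y_i) + beta * (y_{i+1} + y_{i-1} - 2 * y_i)
# (the 2s absorbed into alpha and beta). Sketch based on the lecture's
# equations; names and the stopping rule are my own.

def smooth(path, alpha=0.5, beta=0.1, tolerance=1e-6):
    """path: list of original coordinates x_i. Returns smoothed coordinates y_i.
    Endpoints stay fixed; iterate until the total change per sweep is tiny."""
    y = list(path)                 # y starts as a copy of the original path
    change = tolerance
    while change >= tolerance:
        change = 0.0
        new_y = list(y)            # simultaneous update: read old y, write new_y
        for i in range(1, len(path) - 1):
            step = alpha * (path[i] - y[i]) + beta * (y[i + 1] + y[i - 1] - 2.0 * y[i])
            new_y[i] = y[i] + step
            change += abs(step)
        y = new_y
    return y

print(smooth([0.0, 0.0, 1.0, 1.0]))  # interior points settle between the data and a straight line
```

The alpha term pulls each smoothed point back toward its original position, while the beta term pulls it toward the midpoint of its neighbors; the bottom of the valley is wherever those two pulls balance.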