In the section of momentum gradient descent: The equation v1=λv0 +α∇f(x0) x1 = x0 + v1 The first equation, it will be gradient steepness, So, the correction is: v1=λv0 - α∇f(x0) to be gradient descent

You are correct! That sneaky minus sign makes all the difference. Thanks for pointing it out -- the typo should be fixed pretty soon.

Why all the focus on minimizing functions? Can we also use a sort of "gradient ascent" to maximize (differentiable real-valued) functions? I don't see why not...(of course though, any problem about maximizing a function f can be reframed in terms of minimizing its negation -f)

Hi, tau is indeed better than pi, though I did celebrate pi day today! You are absolutely right. By considering -f(x), we turn any maximization problem into a minimization problem. Thus, the ideas are all the same really, and we pick either to minimize or maximize mostly arbitrarily. In the context of machine learning, the custom is to define a loss function because it is often easier to count how many mistakes an algorithm makes than to count how well it is doing. Thus we minimize the loss.

Any python code written for approximating the sinx function?

Hello siddiqr67, here's the original JavaScript code used to create the animation in the article. https://www.khanacademy.org/computer-programming/gradient-descent-animator-polynomial-approximation/4973855889768448 It should be fairly straightforward to translate the ideas into Python. Note: if you want the best approximation for sin(x), there are methods that will produce better results than gradient descent. Taylor series or least-squares approximations come to mind. But it's just really cool to see gradient descent in action! Let me know if you have other questions :)

This might be asking a lot, but you could add another article, or extend this one, to include stochastic gradient descent and mini-batch gradient descent?

Dear Shanie, I don't know of any plans to add another article on SGD or mini-batch methods. These would go a little beyond the scope of a multivariable calculus course, though it would be amazing of course if Khan Academy ever developed a machine learning course. In the meantime, there are many good resources online to learn about the methods you mentioned.

Could you algebraically show the first few iterations for example 2 so it becomes a little clearer? Edit: OK, part of my confusion is that I am thinking of x as the variable, when actually its a 6 variable function: f(a0,a1,a2,a3,a4,a5). So I guess the gradient has 6 elements, and so on? ie, for a given x, we are operating in a 6-D input space?

Hi bkmurthy99, you are absolutely correct. The function f(a0, a1, a2, a3, a4, a5) has six inputs, so the analogy is we're rolling a ball down a hill in six dimensions to find where f is minimized. Like you said, the variable x is not related to how good of an approximation we have -- it's just a dummy variable we use to evaluate any given approximation p. It's a lot to keep in your head. That's one reason gradient descent is so cool. It doesn't care whether we need to optimize one variable, two, six, or a million!

Main content

Course: Multivariable calculus > Unit 3

Lesson 4: Optimizing multivariable functions (articles)

Gradient descent

Google Classroom

Gradient descent is a general-purpose algorithm that numerically finds minima of multivariable functions.

Background

So what is it?

Gradient descent is an algorithm that numerically estimates where a function outputs its lowest values. That means it finds local minima, but not by setting

\nabla f = 0

like we've seen before. Instead of finding minima by manipulating symbols, gradient descent approximates the solution with numbers. Furthermore, all it needs in order to run is a function's numerical output, no formula required.

This distinction is worth emphasizing because it's what makes gradient descent useful. If we had a simple formula like

f (x) = x^{2} - 4 x

, then we could easily solve

\nabla f = 0

to find that

x = 2

minimizes

f (x)

. Or we could use gradient descent to get a numerical approximation, something like

x \approx 1.99999967

. Both strategies arrive at the same answer.

But if we don't have a simple formula for our function, then it becomes hard or impossible to solve

\nabla f = 0

. If our function has a hundred or a million variables, then manipulating symbols isn't feasible. That's when an approximate solution is valuable, and gradient descent can give us these estimates no matter how elaborate our function is.

How gradient descent works

The way gradient descent manages to find the minima of functions is easiest to imagine in three dimensions.

Think of a function

f (x, y)

that defines some hilly terrain when graphed as a height map. We learned that the gradient evaluated at any point represents the direction of steepest ascent up this hilly terrain. That might spark an idea for how we could maximize the function: start at a random input, and as many times as we can, take a small step in the direction of the gradient to move uphill. In other words, walk up the hill.

To minimize the function, we can instead follow the negative of the gradient, and thus go in the direction of steepest descent. This is gradient descent. Formally, if we start at a point

x_{0}

and move a positive distance

α

in the direction of the negative gradient, then our new and improved

x_{1}

will look like this:

x_{1} = x_{0} - α \nabla f (x_{0})

More generally, we can write a formula for turning

x_{n}

into

x_{n + 1}

x_{n + 1} = x_{n} - α \nabla f (x_{n})

Starting from an initial guess

x_{0}

, we keep improving little by little until we find a local minimum. This process may take thousands of iterations, so we typically implement gradient descent with a computer.

Example 1

Consider the function

f (x) = \frac{x^{2} \cos (x) - x}{10}

As we can see from the graph, this function has many local minima. Gradient descent will find different ones depending on our initial guess and our step size.

If we choose

x_{0} = 6

and

α = 0.2

, for example, gradient descent moves as shown in the graph below. The first point is

x_{0}

, and lines connect each point to the next in the sequence. After only

10

steps, we have converged to the minimum near

x = 4

If we use the same

x_{0}

but

α = 1.5

, it seems as if the step size is too large for gradient descent to converge on the minimum. We'll return to this when we discuss the limitations of the algorithm.

If we start at

x_{0} = 7

with

α = 0.2

, we descend into a completely different local minimum.

Example 2

Let's use gradient descent to solve the following problem: how can we best approximate $\sin (x)$ ‍ with a degree $5$ ‍ polynomial within the range $- 3 < x < 3$ ‍?

p (x) = a_{0} + a_{1} x + \dots + a_{5} x^{5}

In order to use gradient descent, we need to phrase the problem in terms of minimizing a function

f

. Intuitively, our goal is to find the coefficients

a_{0}, \dots, a_{5}

that make

p (x)

as close to

\sin (x)

as possible while

x

is between

- 3

and

3

. There are many ways we can turn this idea into a function to minimize. One is called least squares:

f (a_{0}, \dots, a_{5}) = \int_{- 3}^{3} (p (x) - \sin (x))^{2} d x

In short, we define our

f

as the sum of how incorrect

p (x)

is at each point. For example, if

p (x) = 2

near

x = 0

, then

f

will increase by a lot because

\sin (0) = 0

not

2

. Gradient descent will try to get

f

as low as possible, which is the same as making

p (x)

closer to

\sin (x)

. We square the difference so the integral is always positive.

Here's what happens if we start with

a_{0}, \dots, a_{5}

as random numbers and then move along the negative gradient. The number in the top left shows how many steps we've taken so far, where we use a step size of

α = 0.001

We get a pretty good approximation of

\sin (x)

\begin{aligned} p (x) = & - 0.00177 + 0.88974 x + 0.00287 x^{2} \\ - 0.11613 x^{3} - 0.00048 x^{4} + 0.00224 x^{5} \end{aligned}

Technically, the animation above uses something called momentum gradient descent, which is a variant that helps it avoid getting stuck in local minima.

The idea is to imagine we are a ball rolling down a hill. As we move, we go faster and faster. If we encounter a small enough local minimum, we will go straight over it because we have the momentum to carry us through it.

Without momentum, we get stuck in a local minimum:

With momentum, we find the global minimum:

Formally, we start with a position

x_{0}

and a velocity

v_{0} = 0

. Every step, we add a fraction

λ

(usually around

0.9

) of the old velocity to the new gradient. That way we keep some of our old momentum each step.

\begin{aligned} v_{n + 1} & = λ v_{n} - α \nabla f (x_{n}) \\ x_{n + 1} & = x_{n} + v_{n + 1} \end{aligned}

Limitations

Gradient descent has applications whenever we have a function we want to minimize, which is common in machine learning, for example. But it's important to know its shortcomings so that we can plan around them.

One of its limitations is that it only finds local minima (rather than the global minimum). As soon as the algorithm finds some point that's at a local minimum, it will never escape as long as the step size doesn't exceed the size of the ditch.

In the graph above, each local minimum has its own valley that would trap a gradient descent algorithm. After all, the algorithm only ever tries to go down, so once it finds a point where every direction leads up, it will stop. Looking at the graph from a different perspective, we also see that one local minimum is lower than the other.

When we minimize a function, we want to find the global minimum, but there is no way that gradient descent can distinguish global and local minima.

Another limitation of gradient descent concerns the step size

α

. A good step size moves toward the minimum rapidly, each step making substantial progress.

Good step size converges quickly.

If the step size is too large, however, we may never converge to a local minimum because we overshoot it every time.

Large step size diverges.

If we are lucky and the algorithm converges anyway, it still might take more steps than it needed.

Large step size converges slowly.

If the step size is too small, then we'll be more likely to converge, but we'll take far more steps than were necessary. This is a problem when the function we're minimizing has thousands or millions of variables, and evaluating it is cumbersome.

Tiny step size converges slowly.

A final limitation is that gradient descent only works when our function is differentiable everywhere. Otherwise we might come to a point where the gradient isn't defined, and then we can't use our update formula.

Gradient descent fails for non-differentiable functions.

Summary

Gradient descent minimizes differentiable functions that output a number and have any amount of input variables.
It does this by taking a guess $x_{0}$ ‍ and successively applying the formula $x_{n + 1} = x_{n} - α \nabla f (x_{n})$ ‍. In words, the formula says to take a small step in the direction of the negative gradient.
Gradient descent can't tell whether a minimum it has found is local or global.
The step size $α$ ‍ controls whether the algorithm converges to a minimum quickly or slowly, or whether it diverges.
Many real world problems come down to minimizing a function.

Want to join the conversation?

Sort by:

Muhammed El-Yamani
Posted 4 years ago. Direct link to Muhammed El-Yamani's post “In the section of momentu...”
In the section of momentum gradient descent:
The equation
v1=λv0 +α∇f(x0)
x1 = x0 + v1
The first equation, it will be gradient steepness,
So, the correction is:
v1=λv0 - α∇f(x0)
to be gradient descent
Button navigates to signup pageButton navigates to signup page
(5 votes)
Answer
- lakern
  Posted 4 years ago. Direct link to lakern's post “You are correct! That sne...”
  You are correct! That sneaky minus sign makes all the difference. Thanks for pointing it out -- the typo should be fixed pretty soon.
  Button navigates to signup page
  (4 votes)
𝜏 Is Better Than 𝝅
Posted 3 years ago. Direct link to 𝜏 Is Better Than 𝝅's post “Why all the focus on mini...”
Why all the focus on minimizing functions? Can we also use a sort of "gradient ascent" to maximize (differentiable real-valued) functions? I don't see why not...(of course though, any problem about maximizing a function f can be reframed in terms of minimizing its negation -f)
Button navigates to signup pageButton navigates to signup page
(3 votes)
Answer
- lakern
  Posted 2 years ago. Direct link to lakern's post “Hi, tau is indeed better ...”
  Hi, tau is indeed better than pi, though I did celebrate pi day today! You are absolutely right. By considering -f(x), we turn any maximization problem into a minimization problem. Thus, the ideas are all the same really, and we pick either to minimize or maximize mostly arbitrarily. In the context of machine learning, the custom is to define a loss function because it is often easier to count how many mistakes an algorithm makes than to count how well it is doing. Thus we minimize the loss.
  Button navigates to signup page
  (4 votes)
Amy
Posted a year ago. Direct link to Amy's post “Could the gradient descen...”
Could the gradient descent be trapped by a saddle point. Assume we are trying to minimize f = x^3, in the interval [-1,1]. let x_0 = 1/2 then the gradient descent will stop at x = 0 which is not even a local minimum.
Button navigates to signup pageComment on Amy's post “Could the gradient descen...”
(4 votes)
Answer
- alexbriyou
  Posted 10 months ago. Direct link to alexbriyou's post “its unlikely you would la...”
  its unlikely you would land on such an exact even number such as x=0. You would typically initialise from random values as well
  Button navigates to signup page
  (1 vote)
siddiqr67
Posted 3 years ago. Direct link to siddiqr67's post “Any python code written f...”
Any python code written for approximating the sinx function?
Button navigates to signup pageComment on siddiqr67's post “Any python code written f...”
(2 votes)
Answer
- lakern
  Posted 3 years ago. Direct link to lakern's post “Hello siddiqr67, here's t...”
  Hello siddiqr67, here's the original JavaScript code used to create the animation in the article.
  
  https://www.khanacademy.org/computer-programming/gradient-descent-animator-polynomial-approximation/4973855889768448
  
  It should be fairly straightforward to translate the ideas into Python. Note: if you want the best approximation for sin(x), there are methods that will produce better results than gradient descent. Taylor series or least-squares approximations come to mind. But it's just really cool to see gradient descent in action! Let me know if you have other questions :)
  Comment on lakern's post “Hello siddiqr67, here's t...”
  (4 votes)
Jonathan MFRC
Posted 3 years ago. Direct link to Jonathan MFRC's post “does gradient descent wor...”
does gradient descent work on multi-output functions?
Button navigates to signup pageComment on Jonathan MFRC's post “does gradient descent wor...”
(2 votes)
Answer
- Matthew Miller
  Posted 3 years ago. Direct link to Matthew Miller's post “No it will not on its own...”
  No it will not on its own; however most applications of the algorithm can be turned into single variable with a couple of equations. In fact the gradient descent algorithm is extremely effective for deep-learning because of this fact.
  Button navigates to signup page
  (1 vote)
ShanieGlean
Posted 2 years ago. Direct link to ShanieGlean's post “This might be asking a lo...”
This might be asking a lot, but you could add another article, or extend this one, to include stochastic gradient descent and mini-batch gradient descent?
Button navigates to signup pageButton navigates to signup page
(1 vote)
Answer
- lakern
  Posted 2 years ago. Direct link to lakern's post “Dear Shanie, I don't know...”
  Dear Shanie, I don't know of any plans to add another article on SGD or mini-batch methods. These would go a little beyond the scope of a multivariable calculus course, though it would be amazing of course if Khan Academy ever developed a machine learning course. In the meantime, there are many good resources online to learn about the methods you mentioned.
  Button navigates to signup page
  (3 votes)
Ronald
Posted a year ago. Direct link to Ronald's post “Does anyone have a link t...”
Does anyone have a link to a coding exercise for understanding the algorithm practically? It would be really helpful
Button navigates to signup pageButton navigates to signup page
(2 votes)
Answer
bkmurthy99
Posted 4 years ago. Direct link to bkmurthy99's post “Could you algebraically s...”
Could you algebraically show the first few iterations for example 2 so it becomes a little clearer?
Edit: OK, part of my confusion is that I am thinking of x as the variable, when actually its a 6 variable function: f(a0,a1,a2,a3,a4,a5). So I guess the gradient has 6 elements, and so on? ie, for a given x, we are operating in a 6-D input space?
Button navigates to signup pageButton navigates to signup page
(1 vote)
Answer
- lakern
  Posted 4 years ago. Direct link to lakern's post “Hi bkmurthy99, you are ab...”
  Hi bkmurthy99, you are absolutely correct. The function f(a0, a1, a2, a3, a4, a5) has six inputs, so the analogy is we're rolling a ball down a hill in six dimensions to find where f is minimized. Like you said, the variable x is not related to how good of an approximation we have -- it's just a dummy variable we use to evaluate any given approximation p. It's a lot to keep in your head. That's one reason gradient descent is so cool. It doesn't care whether we need to optimize one variable, two, six, or a million!
  Button navigates to signup page
  (3 votes)
nick.soberon
Posted 9 months ago. Direct link to nick.soberon's post “It's not clear to me that...”
It's not clear to me that the negative of the gradient is the direction of steepest descent from the fact that stepping in the direction of the gradient is the steepest ascent... I can imagine being at a point where orthogonally to the gradient is a steep drop off.

I am actually sort of picturing a saddle point in my mind... does anyone have any thoughts?
Button navigates to signup pageButton navigates to signup page
(1 vote)
Answer
- Charles Morelli
  Posted 4 months ago. Direct link to Charles Morelli's post “Picture a tangent plane a...”
  Picture a tangent plane at a point on your function's 'surface'...
  On that plane (in general), some directions are steeper than others, 2 directions are level (these are opposite to each other); one direction will be the steepest ascent and one (the opposite) direction will be the steepest descent (these are perpendicular to the level directions)...
  The gradient is the direction of steepest ascent on that plane (it is perpendicular (orthogonal, if you prefer) to 'contour' lines of the function's surface...
  Given that the gradient gives the vector (direction) of steepest ascent on the tangent plane at the point of interest, it seems logical (to me, at least) that the negative gradient vector must give the direction of steepest descent.
  If there is a sudden change of 'steepness' (i can only picture this as a 'corner' or 'edge' on the surface) at the point in question, the function is not differentiable at that point, so its gradient won't be defined.
  It might be that you're visualising the gradient vector as a small (finite) step in the direction of steepest ascent - i can see how that might lead to the idea you described... however, the gradient is an infinitessimal step... if this is the issue, try to visualise what happens as the step you take gets ever smaller, approaching zero-length... then i think you're more likely to see that the opposite direction (i.e., the negative) to the gradient vector (remembering that the gradient is only valid at the exact point it's calculated for) must be the direction of steepest descent for an infinitessimal step.
  If you're point is the special (i.e., non-general) case of a saddle point, then as Quinn says: the gradient is zero (and all directions on the tangent plane at that point are level.
  Button navigates to signup page
  (1 vote)
Dzmitry Dauhalevich
Posted 2 years ago. Direct link to Dzmitry Dauhalevich's post “Hello. Can you explain br...”
Hello. Can you explain briefly why we are using integral in the least-squares method?
Button navigates to signup pageButton navigates to signup page
(1 vote)
Answer
- Charles Morelli
  Posted 4 months ago. Direct link to Charles Morelli's post “You're 'adding' all the s...”
  You're 'adding' all the squared errors at every point over the interval... we're trying to minimise this 'sum' (or 'definite integral', if that's clearer).
  Button navigates to signup page
  (1 vote)