# Gradient descent

Gradient descent is a general-purpose algorithm that numerically finds minima of multivariable functions.

## So what is it?

Gradient descent is an algorithm that numerically estimates where a function outputs its lowest values. That means it finds local minima, but not by setting $\mathrm{\nabla }f=0$ as we've seen before. Instead of finding minima by manipulating symbols, gradient descent approximates the solution with numbers. Furthermore, all it needs in order to run is a function's numerical output; no formula is required.
This distinction is worth emphasizing because it's what makes gradient descent useful. If we had a simple formula like $f\left(x\right)={x}^{2}-4x$, then we could easily solve $\mathrm{\nabla }f=0$ to find that $x=2$ minimizes $f\left(x\right)$. Or we could use gradient descent to get a numerical approximation, something like $x\approx 1.99999967$. Both strategies arrive at the same answer.
But if we don't have a simple formula for our function, then it becomes hard or impossible to solve $\mathrm{\nabla }f=0$. If our function has a hundred or a million variables, then manipulating symbols isn't feasible. That's when an approximate solution is valuable, and gradient descent can give us these estimates no matter how elaborate our function is.

The way gradient descent manages to find the minima of functions is easiest to imagine in three dimensions.
Think of a function $f\left(x,y\right)$ that defines some hilly terrain when graphed as a height map. We learned that the gradient evaluated at any point represents the direction of steepest ascent up this hilly terrain. That might spark an idea for how we could maximize the function: start at a random input, and as many times as we can, take a small step in the direction of the gradient to move uphill. In other words, walk up the hill.
To minimize the function, we can instead follow the negative of the gradient, and thus go in the direction of steepest descent. This is gradient descent. Formally, if we start at a point ${x}_{0}$ and move a positive distance $\alpha$ in the direction of the negative gradient, then our new and improved ${x}_{1}$ will look like this:
${x}_{1}={x}_{0}-\alpha \mathrm{\nabla }f\left({x}_{0}\right)$
More generally, we can write a formula for turning ${x}_{n}$ into ${x}_{n+1}$:
${x}_{n+1}={x}_{n}-\alpha \mathrm{\nabla }f\left({x}_{n}\right)$
Starting from an initial guess ${x}_{0}$, we keep improving little by little until we find a local minimum. This process may take thousands of iterations, so we typically implement gradient descent with a computer.
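In code, the update rule is only a few lines. Here is a minimal sketch (an illustration, not the article's code), applied to the earlier example $f\left(x\right)={x}^{2}-4x$, whose gradient is $2x-4$:

```python
def gradient_descent(grad_f, x0, alpha=0.1, steps=100):
    # Repeatedly apply x_{n+1} = x_n - alpha * grad_f(x_n)
    x = x0
    for _ in range(steps):
        x = x - alpha * grad_f(x)
    return x

# Minimize f(x) = x^2 - 4x; its gradient is 2x - 4, and the minimum is at x = 2
x_min = gradient_descent(lambda x: 2 * x - 4, x0=0.0)
```

After 100 steps the iterate sits within a tiny distance of the true minimizer $x=2$, matching the kind of numerical answer ($x\approx 1.99999967$) described above.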

## Example 1

Consider the function $f\left(x\right)=\frac{{x}^{2}\mathrm{cos}\left(x\right)-x}{10}$.
As we can see from the graph, this function has many local minima. Gradient descent will find different ones depending on our initial guess and our step size.
If we choose ${x}_{0}=6$ and $\alpha =0.2$, for example, gradient descent moves as shown in the graph below. The first point is ${x}_{0}$, and lines connect each point to the next in the sequence. After only $10$ steps, we have converged to the minimum near $x=4$.
If we use the same ${x}_{0}$ but $\alpha =1.5$, it seems as if the step size is too large for gradient descent to converge on the minimum. We'll return to this when we discuss the limitations of the algorithm.
If we start at ${x}_{0}=7$ with $\alpha =0.2$, we descend into a completely different local minimum.
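This first run is easy to reproduce. The sketch below (an illustration, not the article's code) uses a finite-difference derivative, since gradient descent only needs the function's outputs:

```python
import math

def f(x):
    return (x**2 * math.cos(x) - x) / 10

def df(x, h=1e-6):
    # Central finite difference: no symbolic derivative needed
    return (f(x + h) - f(x - h)) / (2 * h)

x, alpha = 6.0, 0.2
for _ in range(10):
    x = x - alpha * df(x)
# x now sits near the local minimum the article finds around x = 4
```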

## Example 2

Let's use gradient descent to solve the following problem: how can we best approximate $\mathrm{sin}\left(x\right)$ with a degree $5$ polynomial within the range $-3\le x\le 3$?
$p\left(x\right)={a}_{0}+{a}_{1}x+\cdots +{a}_{5}{x}^{5}$
In order to use gradient descent, we need to phrase the problem in terms of minimizing a function $f$. Intuitively, our goal is to find the coefficients ${a}_{0},\dots ,{a}_{5}$ that make $p\left(x\right)$ as close to $\mathrm{sin}\left(x\right)$ as possible while $x$ is between $-3$ and $3$. There are many ways we can turn this idea into a function to minimize. One is called least squares:
$f\left({a}_{0},\dots ,{a}_{5}\right)={\int }_{-3}^{3}{\left(p\left(x\right)-\mathrm{sin}\left(x\right)\right)}^{2}\,dx$
In short, we define our $f$ as the sum of how incorrect $p\left(x\right)$ is at each point. For example, if $p\left(x\right)=2$ near $x=0$, then $f$ will increase by a lot because $\mathrm{sin}\left(0\right)=0$ not $2$. Gradient descent will try to get $f$ as low as possible, which is the same as making $p\left(x\right)$ closer to $\mathrm{sin}\left(x\right)$. We square the difference so the integral is always positive.
Here's what happens if we start with ${a}_{0},\dots ,{a}_{5}$ as random numbers and then move along the negative gradient. The number in the top left shows how many steps we've taken so far, where we use a step size of $\alpha =0.001$.
We get a pretty good approximation of $\mathrm{sin}\left(x\right)$!
$p\left(x\right)=-0.00177+0.88974x+0.00287{x}^{2}-0.11613{x}^{3}-0.00048{x}^{4}+0.00224{x}^{5}$
Technically, the animation above uses something called momentum gradient descent, which is a variant that helps it avoid getting stuck in local minima.
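For reference, a plain (no-momentum) version of this fit can be sketched in a few lines. This is an illustration, not the article's code, and plain gradient descent needs a much smaller step size than the animation's $\alpha =0.001$, because the partial derivatives for the higher-degree coefficients are large:

```python
import math

def p(a, x):
    # Evaluate the degree-5 polynomial with coefficients a
    return sum(a[i] * x**i for i in range(6))

def loss(a):
    # Riemann sum (60 slices) approximating the least-squares integral on [-3, 3]
    dx = 0.1
    return sum((p(a, -3 + k * dx) - math.sin(-3 + k * dx)) ** 2 * dx
               for k in range(60))

def grad(a, h=1e-6):
    # Finite-difference estimate of each partial derivative of the loss
    base = loss(a)
    g = []
    for i in range(6):
        shifted = list(a)
        shifted[i] += h
        g.append((loss(shifted) - base) / h)
    return g

a = [0.0] * 6
alpha = 1e-5  # assumed step size for the plain method; larger values diverge here
for _ in range(1000):
    g = grad(a)
    a = [a[i] - alpha * g[i] for i in range(6)]
```

Even this simplified version steadily drives the least-squares error below its starting value, though it converges far more slowly than the momentum variant in the animation.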

## Limitations

Gradient descent has applications whenever we have a function we want to minimize, which is common in machine learning, for example. But it's important to know its shortcomings so that we can plan around them.
One of its limitations is that it only finds local minima (rather than the global minimum). As soon as the algorithm finds a point at a local minimum, it will never escape, so long as the step size is too small to carry it out of the surrounding valley.
In the graph above, each local minimum has its own valley that would trap a gradient descent algorithm. After all, the algorithm only ever tries to go down, so once it finds a point where every direction leads up, it will stop. Looking at the graph from a different perspective, we also see that one local minimum is lower than the other.
When we minimize a function, we want to find the global minimum, but there is no way that gradient descent can distinguish global and local minima.
Another limitation of gradient descent concerns the step size $\alpha$. A good step size moves toward the minimum rapidly, each step making substantial progress.

Good step size converges quickly.
If the step size is too large, however, we may never converge to a local minimum because we overshoot it every time.

Large step size diverges.
If we are lucky and the algorithm converges anyway, it still might take more steps than it needs to.

Large step size converges slowly.
If the step size is too small, then we'll be more likely to converge, but we'll take far more steps than were necessary. This is a problem when the function we're minimizing has thousands or millions of variables, and evaluating it is cumbersome.

Tiny step size converges slowly.
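All three behaviors show up on the simplest possible example, $f\left(x\right)={x}^{2}$, where the gradient is $2x$ and each update multiplies $x$ by $\left(1-2\alpha \right)$ (a small illustrative sketch, not from the article):

```python
def run(alpha, steps=50, x0=1.0):
    # Gradient descent on f(x) = x^2; each update is x <- x - alpha * 2x = (1 - 2*alpha) * x
    x = x0
    for _ in range(steps):
        x = x - alpha * (2 * x)
    return x

good = run(0.1)     # factor |1 - 0.2| = 0.8: converges quickly toward 0
tiny = run(0.001)   # factor 0.998: still far from 0 after 50 steps
large = run(1.1)    # factor |1 - 2.2| = 1.2: overshoots and diverges
```

Convergence requires $|1-2\alpha |<1$, i.e. $0<\alpha <1$ for this function; past that threshold every step overshoots the minimum by more than the last.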
A final limitation is that gradient descent only works when our function is differentiable everywhere. Otherwise we might come to a point where the gradient isn't defined, and then we can't use our update formula.

Gradient descent fails for non-differentiable functions.

## Summary

• Gradient descent minimizes differentiable functions that output a number and can have any number of input variables.
• It does this by taking a guess ${x}_{0}$ and successively applying the formula ${x}_{n+1}={x}_{n}-\alpha \mathrm{\nabla }f\left({x}_{n}\right)$. In words, the formula says to take a small step in the direction of the negative gradient.
• Gradient descent can't tell whether a minimum it has found is local or global.
• The step size $\alpha$ controls whether the algorithm converges to a minimum quickly or slowly, or whether it diverges.
• Many real-world problems come down to minimizing a function.

## Want to join the conversation?

• In the section on momentum gradient descent, the equations are given as
${v}_{1}=\lambda {v}_{0}+\alpha \mathrm{\nabla }f\left({x}_{0}\right)$
${x}_{1}={x}_{0}+{v}_{1}$
The first equation steps in the direction of the gradient, which is the direction of steepest ascent. So the correction is:
${v}_{1}=\lambda {v}_{0}-\alpha \mathrm{\nabla }f\left({x}_{0}\right)$
• You are correct! That sneaky minus sign makes all the difference. Thanks for pointing it out -- the typo should be fixed pretty soon.
• Why all the focus on minimizing functions? Can we also use a sort of "gradient ascent" to maximize (differentiable real-valued) functions? I don't see why not...(of course though, any problem about maximizing a function f can be reframed in terms of minimizing its negation -f)
• Hi, tau is indeed better than pi, though I did celebrate pi day today! You are absolutely right. By considering -f(x), we turn any maximization problem into a minimization problem. Thus, the ideas are all the same really, and we pick either to minimize or maximize mostly arbitrarily. In the context of machine learning, the custom is to define a loss function because it is often easier to count how many mistakes an algorithm makes than to count how well it is doing. Thus we minimize the loss.
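To illustrate the point in this exchange, here is a small sketch of "gradient ascent" on the hypothetical example $f\left(x\right)=-\left(x-2{\right)}^{2}$ (an illustration, not from the article), where stepping *with* the gradient is exactly descent on $-f$:

```python
def gradient_ascent(grad_f, x0, alpha=0.1, steps=100):
    # Step WITH the gradient to climb; equivalent to gradient descent on -f
    x = x0
    for _ in range(steps):
        x = x + alpha * grad_f(x)
    return x

# Maximize f(x) = -(x - 2)^2; its gradient is -2(x - 2), and the maximum is at x = 2
x_max = gradient_ascent(lambda x: -2 * (x - 2), x0=0.0)
```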
• Could gradient descent be trapped by a saddle point? Assume we are trying to minimize $f\left(x\right)={x}^{3}$ on the interval $\left[-1,1\right]$. Let ${x}_{0}=1/2$; then gradient descent will stop at $x=0$, which is not even a local minimum.
• It's unlikely you would land on such an exact number as $x=0$. You would typically initialize from random values as well.
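For what it's worth, the scenario in the question is easy to check numerically (a quick sketch, not from the article): with plain gradient descent the iterates never cross zero; they creep toward the saddle and slow to a crawl there.

```python
def descend_cubic(x0=0.5, alpha=0.01, steps=10000):
    # Plain gradient descent on f(x) = x^3, whose gradient is 3x^2
    x = x0
    for _ in range(steps):
        x = x - alpha * (3 * x**2)
    return x

x = descend_cubic()
# x approaches the saddle at 0 roughly like 1/(3 * alpha * n), never crossing it
```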
• Any Python code written for approximating the sin(x) function?
• Hello siddiqr67, here's the original JavaScript code used to create the animation in the article.

It should be fairly straightforward to translate the ideas into Python. Note: if you want the best approximation for sin(x), there are methods that will produce better results than gradient descent. Taylor series or least-squares approximations come to mind. But it's just really cool to see gradient descent in action! Let me know if you have other questions :)
• Could you algebraically show the first few iterations for example 2 so it becomes a little clearer?
Edit: OK, part of my confusion is that I was thinking of x as the variable, when actually it's a 6-variable function: f(a0,a1,a2,a3,a4,a5). So I guess the gradient has 6 elements, and so on? I.e., for a given x, we are operating in a 6-D input space?
• Hi bkmurthy99, you are absolutely correct. The function f(a0, a1, a2, a3, a4, a5) has six inputs, so the analogy is we're rolling a ball down a hill in six dimensions to find where f is minimized. Like you said, the variable x is not related to how good of an approximation we have -- it's just a dummy variable we use to evaluate any given approximation p. It's a lot to keep in your head. That's one reason gradient descent is so cool. It doesn't care whether we need to optimize one variable, two, six, or a million!
• This might be asking a lot, but you could add another article, or extend this one, to include stochastic gradient descent and mini-batch gradient descent?
• Dear Shanie, I don't know of any plans to add another article on SGD or mini-batch methods. These would go a little beyond the scope of a multivariable calculus course, though it would be amazing of course if Khan Academy ever developed a machine learning course. In the meantime, there are many good resources online to learn about the methods you mentioned.
• Does anyone have a link to a coding exercise for understanding the algorithm practically? It would be really helpful
• Does gradient descent work on multi-output functions?
• Not on its own; however, most applications can be rephrased with a couple of equations so that the algorithm minimizes a single scalar-valued function. In fact, the gradient descent algorithm is extremely effective for deep learning because of this.
• Is this Python code correct? It isn't working properly -- can someone provide a corrected version?

Here is my converted code:
    import math

    coefficients = [1.0] * 6  # arbitrary initial guess for a0, ..., a5

    def polynomial(x, coeffs):
        # Evaluate a0 + a1*x + ... + a5*x^5
        total = 0.0
        for i in range(len(coeffs)):
            total += coeffs[i] * math.pow(x, i)  # the original was missing this "*"
        return total

    def loss(coeffs):
        # Riemann sum approximating the least-squares integral on [-3, 3]
        dx = 0.1
        total = 0.0
        for k in range(60):
            x = -3 + k * dx
            diff = polynomial(x, coeffs) - math.sin(x)
            total += diff * diff * dx
        return total

    def gradient(coeffs):
        # Estimate each partial derivative with a finite difference
        h = 1e-6
        base = loss(coeffs)
        shifted = list(coeffs)
        grad = []
        for i in range(len(coeffs)):
            shifted[i] += h
            grad.append((loss(shifted) - base) / h)
            shifted[i] -= h
        return grad

    def cap_vector(v, upper_limit=1.0):
        # Clip the gradient so its magnitude never exceeds upper_limit
        magnitude = math.sqrt(sum(c * c for c in v))
        if magnitude > upper_limit:
            ratio = upper_limit / magnitude
            return [c * ratio for c in v]
        return v

    def descend(coeffs, step_size=0.001, friction=0.95, steps=2000):
        # Momentum gradient descent: v <- friction*v - step_size*grad, then x <- x + v
        moment = [0.0] * len(coeffs)
        for _ in range(steps):
            grad = cap_vector(gradient(coeffs))
            for i in range(len(coeffs)):
                moment[i] = friction * moment[i] - step_size * grad[i]
                coeffs[i] += moment[i]
        return coeffs

    coefficients = descend(coefficients)