The "Lagrange multipliers" technique is a way to solve constrained optimization problems.  Super useful!

What we're building to:

  • The Lagrange multiplier technique lets you find the maximum or minimum of a multivariable function f(x, y, \dots) when there is some constraint on the input values you are allowed to use.
  • This technique only applies to constraints that look something like this:
    g(x, y, \dots) = c
    Here, g is another multivariable function with the same input space as f, and c is some constant.
  • The core idea is to look for points where the contour lines of f and g are tangent to each other.
  • This is the same as finding points where the gradient vectors of f and g are parallel to each other.
  • The entire process can be boiled down to setting the gradient of a certain function, called the Lagrangian, equal to the zero vector.

Motivating example

Suppose you want to maximize this function:
f(x, y) = 2x + y
But let's also say you limited yourself to inputs (x, y) which satisfy the following equation:
x^2 + y^2 = 1
In other words, for which point (x, y) on the unit circle is the value 2x + y biggest?
This is what's known as a constrained optimization problem. The restriction to points where x^2 + y^2 = 1 is called a "constraint", and f(x, y) = 2x + y is the function that needs to be optimized.
Here's one way to visualize this: First draw the graph of f(x, y), which looks like a slanted plane since f is linear. Next, project the circle x^2 + y^2 = 1 from the xy-plane vertically onto the graph of f. The maximum we are seeking corresponds with the highest point of this projected circle on the graph.
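Before developing any general machinery, it can help to sanity-check this answer numerically. The sketch below is my own illustration, not part of the article's solution: it parametrizes the unit circle as (cos t, sin t) and scans for the largest value of 2x + y, which turns out to be very close to the true maximum, sqrt(5).

```python
import math

# Numeric sanity check: sample many points (cos t, sin t) on the unit
# circle and track the largest value of f(x, y) = 2x + y found.
def f(x, y):
    return 2 * x + y

best = max(
    f(math.cos(2 * math.pi * k / 100_000), math.sin(2 * math.pi * k / 100_000))
    for k in range(100_000)
)

print(best)  # very close to sqrt(5) ≈ 2.23606...
```

Of course, a brute-force scan only works in two dimensions with a simple constraint; the point of the technique below is to solve such problems exactly.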

More general form

In general, constrained optimization problems involve maximizing/minimizing a multivariable function whose input has any number of dimensions:
f(x, y, z, \dots)
Its output will always be one-dimensional, though, since there's not a clear notion of "maximum" with vector-valued outputs.
The type of constraints that the Lagrange multiplier technique applies to must take the form of some other multivariable function g(x, y, z, \dots) being set equal to a constant c.
g(x, y, z, \dots) = c
Since this is meant to be a constraint on the input of f, the number of dimensions in the input of g is the same as that of f. The example outlined above fits this general form as follows:
f(x, y) = 2x + y
g(x, y) = x^2 + y^2
c = 1

Using contour maps

Reasoning about this problem becomes easier if we visualize f not with a graph, but with its contour lines.
As a reminder, a contour line of f(x, y) is the set of all points where f(x, y) = k for some constant k. The following interactive tool shows how this line (drawn in blue) changes as the constant k changes. The circle g(x, y) = 1 is also shown (in red). Try to make k as big/small as possible while still allowing the contour line of f to intersect the circle.
Concept check: What does it mean if, for a particular value of k, the blue line representing f(x, y) = k does not intersect the red circle representing g(x, y) = 1?
Notice, the circle where g(x, y) = 1 can be thought of as a particular contour line of the function g. So with that, here's the clever way to think about constrained optimization problems:
Key observation: The maximum and minimum values of f, subject to the constraint g(x, y) = 1, correspond with contour lines of f that are tangent to the contour representing g(x, y) = 1.
If f were a different function, its contours might not always be straight lines. This is unique to our example, since f is linear. For example, take a look at this function:
f(x, y) = 2x^2 + \sqrt{5y}
Its contour lines look like this:
That said, the key observation still holds, and is worth repeating: When k is a maximum or minimum of f subject to the constraint, the contour line f(x, y) = k will be tangent to the contour representing g(x, y) = 1.

Where the gradient comes into play

How do you put the idea of two contour lines being tangent into a formula you can solve?
To answer this, we turn to our loyal friend the gradient. There are many ways to interpret \nabla f: the direction of steepest ascent, a tool for computing directional derivatives, etc. But for our purposes here, the property we care about is that the gradient of f evaluated at a point (x_0, y_0) always gives a vector perpendicular to the contour line passing through that point.
This means when the contour lines of two functions f and g are tangent, their gradient vectors are parallel. Here's what that might look like for arbitrary functions f and g:
The fact that contour lines are tangent tells us nothing about the magnitude of each of these gradient vectors, but that's okay. When two vectors point in the same direction, it means we can multiply one by some constant to get the other. Specifically, let (x_0, y_0) represent a particular point where the contour lines of f and g are tangent (writing x_0 and y_0 with 0 subscripts just indicates that we are considering constant values, and hence a specific point). Since this tangency means their gradient vectors align, here's what you might write down:
\nabla f(x_0, y_0) = \lambda_0 \nabla g(x_0, y_0)
Here, \lambda_0 represents some constant. Some authors use a negative constant, -\lambda_0, but I personally prefer a positive constant, as it gives a cleaner interpretation of \lambda_0 down the road.
Let's see what this looks like in our example, where f(x, y) = 2x + y and g(x, y) = x^2 + y^2. The gradient of f is
\nabla f(x, y) = \begin{bmatrix} \frac{\partial}{\partial x}(2x + y) \\ \frac{\partial}{\partial y}(2x + y) \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \end{bmatrix}
and the gradient of g is
\nabla g(x, y) = \begin{bmatrix} \frac{\partial}{\partial x}(x^2 + y^2) \\ \frac{\partial}{\partial y}(x^2 + y^2) \end{bmatrix} = \begin{bmatrix} 2x \\ 2y \end{bmatrix}
Therefore, the tangency condition ends up looking like this:
\begin{bmatrix} 2 \\ 1 \end{bmatrix} = \lambda_0 \begin{bmatrix} 2x_0 \\ 2y_0 \end{bmatrix}

Solving the problem in the specific case

To sum up where we are so far, we are looking for input points (x_0, y_0) with the following properties:
  • g(x_0, y_0) = 1, which for our example means
    x_0^2 + y_0^2 = 1
  • \nabla f(x_0, y_0) = \lambda_0 \nabla g(x_0, y_0) for some constant \lambda_0, which for our example means
    \begin{aligned} \quad 2 &= 2\lambda_0 x_0 \\ 1 &= 2\lambda_0 y_0 \end{aligned}
There are three equations and three unknowns, so this is a perfectly solvable situation.
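If you have sympy available, you can also hand this system straight to a solver as a check on the by-hand algebra. This is just an illustration; the symbol names are my own:

```python
import sympy as sp

# The three conditions: 2 = 2*lam*x0, 1 = 2*lam*y0, and x0^2 + y0^2 = 1.
x0, y0, lam = sp.symbols('x0 y0 lam', real=True)
solutions = sp.solve(
    [sp.Eq(2, 2 * lam * x0),
     sp.Eq(1, 2 * lam * y0),
     sp.Eq(x0**2 + y0**2, 1)],
    [x0, y0, lam],
    dict=True,
)

# Evaluate f(x, y) = 2x + y at each tangency point.
for s in solutions:
    print(s, '->', sp.simplify(2 * s[x0] + s[y0]))
```

Solving by hand gives the same two tangency points: (2/√5, 1/√5) with λ₀ = √5/2, where f attains its maximum √5, and the antipodal point (-2/√5, -1/√5), where f attains its minimum -√5.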

The Lagrangian function

In the 1700s, our buddy Joseph Louis Lagrange studied constrained optimization problems of this kind, and he found a clever way to package all of our conditions into a single equation.
You can write these conditions generally by saying we are looking for constants x_0, y_0 and \lambda_0 that satisfy the following conditions:
  • The constraint:
    g(x_0, y_0) = c
  • The tangency condition:
    \nabla f(x_0, y_0) = \lambda_0 \nabla g(x_0, y_0).
    This can be broken into its components as follows:
  • f_x(x_0, y_0) = \lambda_0 g_x(x_0, y_0)
  • f_y(x_0, y_0) = \lambda_0 g_y(x_0, y_0)
Lagrange wrote down a special new function which takes in all the same input variables as f and g, along with the new kid in town \lambda, thought of now as a variable rather than a constant.
\mathcal{L}(x, y, \lambda) = f(x, y) - \lambda (g(x, y) - c)
For example, consider the specific case outlined above:
\begin{aligned} \quad f(x, y) &= 2x + y \\ g(x, y) &= x^2 + y^2 \\ c &= 1 \end{aligned}
Here's how this new function would look:
\mathcal{L}(x, y, \lambda) = 2x + y - \lambda(x^2 + y^2 - 1).
Notice, the partial derivative of \mathcal{L} with respect to \lambda is -(g(x, y) - c):
\begin{aligned} \quad \mathcal{L}_\lambda(x, y, \lambda) &= \dfrac{\partial}{\partial \lambda}\left(f(x, y) - \lambda (g(x, y) - c)\right) \\ &= 0 - (g(x, y) - c) \end{aligned}
So we can translate the condition g(x, y) = c as
\begin{aligned} \quad \mathcal{L}_\lambda(x, y, \lambda) = -g(x, y) + c = 0 \end{aligned}
What's more, look at what we get when we set one of the other partial derivatives equal to 0:
\begin{aligned} \quad \mathcal{L}_x(x, y, \lambda) &= 0 \\ \dfrac{\partial}{\partial x}\left(f(x, y) - \lambda (g(x, y) - c)\right) &= 0 \\ f_x(x, y) - \lambda g_x(x, y) &= 0 \\ f_x(x, y) &= \lambda g_x(x, y) \end{aligned}
That just so happens to be another one of our conditions! Almost identically, the condition \mathcal{L}_y(x, y, \lambda) = 0 unravels to become
\begin{aligned} \quad f_y(x, y) = \lambda g_y(x, y) \end{aligned}
Together, these conditions are the same as saying:
\begin{aligned} \quad \nabla f(x, y) = \lambda \nabla g(x, y) \end{aligned}
Therefore, the three conditions we need to solve to find x, y and \lambda come down to the various partial derivatives of \mathcal{L} being equal to 0. This can be written extremely compactly by setting the gradient of \mathcal{L} equal to the zero vector:
\begin{aligned} \quad \nabla \mathcal{L} = \textbf{0} \end{aligned}
For example, using our specific functions from above, we see how this encodes the system of equations we need to solve:
\begin{aligned} \quad \nabla \mathcal{L} = \left[ \begin{array}{c} \dfrac{\partial}{\partial x}(2x + y - \lambda(x^2 + y^2 - 1)) \\ \\ \dfrac{\partial}{\partial y}(2x + y - \lambda(x^2 + y^2 - 1)) \\ \\ \dfrac{\partial}{\partial \lambda}(2x + y - \lambda(x^2 + y^2 - 1)) \end{array} \right] = \left[ \begin{array}{c} 2 - 2\lambda x \\ 1 - 2\lambda y \\ -x^2 - y^2 + 1 \end{array} \right] = \left[ \begin{array}{c} 0 \\ 0 \\ 0 \end{array} \right] \end{aligned}
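This gradient can also be computed symbolically. Here's a quick sympy check (my own illustration, not part of the derivation) that differentiating the Lagrangian with respect to each variable reproduces the components above:

```python
import sympy as sp

# Build the Lagrangian for f(x, y) = 2x + y, g(x, y) = x^2 + y^2, c = 1,
# and differentiate it with respect to each of x, y, and lambda.
x, y, lam = sp.symbols('x y lam', real=True)
L = (2 * x + y) - lam * (x**2 + y**2 - 1)

grad_L = [sp.diff(L, v) for v in (x, y, lam)]
for component in grad_L:
    print(component)  # the three components of grad L
```

Setting each of these components equal to 0 gives exactly the system of three equations derived earlier.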
As a tribute to ol' Joey Lou, we call this function \mathcal{L} the "Lagrangian", and the new variable \lambda that we introduce is called a "Lagrange multiplier". Imagine if someone added "-ian" to the end of your last name and made it the name of a function everybody uses. Pretty sweet!
Warning: Some authors use a convention where the sign of \lambda is reversed:
\begin{aligned} \quad \mathcal{L}(x, y, \lambda) = f(x, y) + \lambda (g(x, y) - c) \end{aligned}
This doesn't make any difference when it comes to solving the problem, but you should keep it in mind in case the course you are taking or the text you are reading follows this convention.


Summary

When you want to maximize (or minimize) a multivariable function f(x, y, \dots) subject to the constraint that another multivariable function equals a constant, g(x, y, \dots) = c, follow these steps:
  • Step 1: Introduce a new variable \lambda, and define a new function \mathcal{L} as follows:
    \mathcal{L}(x, y, \dots, \lambda) = f(x, y, \dots) - \lambda (g(x, y, \dots) - c)
    This function \mathcal{L} is called the "Lagrangian", and the new variable \lambda is referred to as a "Lagrange multiplier".
  • Step 2: Set the gradient of \mathcal{L} equal to the zero vector.
    \nabla \mathcal{L}(x, y, \dots, \lambda) = \textbf{0} \quad \leftarrow \small{\text{Zero vector}}
    In other words, find the critical points of \mathcal{L}.
  • Step 3: Consider each solution, which will look something like (x_0, y_0, \dots, \lambda_0). Plug each one into f. Or rather, first remove the \lambda_0 component, then plug it into f, since f does not have \lambda as an input. Whichever one gives the greatest (or smallest) value is the maximum (or minimum) point you are seeking.
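The three steps above can be sketched as one small helper. Everything here, including the function name `lagrange_candidates`, is my own scaffolding around sympy, not a standard API; it's a sketch of the procedure, not a robust implementation:

```python
import sympy as sp

def lagrange_candidates(f, g, c, variables):
    """Steps 1-3: build the Lagrangian, find its critical points,
    and evaluate f at each candidate (with lambda dropped)."""
    lam = sp.symbols('lam_mult')
    L = f - lam * (g - c)                                     # Step 1
    grad = [sp.diff(L, v) for v in (*variables, lam)]
    critical = sp.solve(grad, [*variables, lam], dict=True)   # Step 2
    return [(s, f.subs({v: s[v] for v in variables}))         # Step 3
            for s in critical]

# Apply it to the running example: maximize 2x + y on the unit circle.
x, y = sp.symbols('x y', real=True)
for point, value in lagrange_candidates(2 * x + y, x**2 + y**2, 1, (x, y)):
    print(point, '->', sp.simplify(value))
```

For the running example this recovers both candidates, with f-values √5 (the maximum) and -√5 (the minimum). Note that in general `solve` may miss solutions for harder systems, so a by-hand check of Step 2 is still worthwhile.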