
Vector form of multivariable quadratic approximation

This is the more general form of a quadratic approximation for a scalar-valued multivariable function. It is analogous to a quadratic Taylor polynomial in the single-variable world. Created by Grant Sanderson.
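
In symbols, the approximation discussed in the video can be written as follows, where the gradient and the Hessian are both evaluated at the constant point x₀:

```latex
Q_f(\mathbf{x}) \;=\; f(\mathbf{x}_0)
  \;+\; \nabla f(\mathbf{x}_0) \cdot (\mathbf{x} - \mathbf{x}_0)
  \;+\; \tfrac{1}{2}\, (\mathbf{x} - \mathbf{x}_0)^{\mathsf{T}}\,
        \mathbf{H}_f(\mathbf{x}_0)\, (\mathbf{x} - \mathbf{x}_0)
```

The three terms are the constant, linear, and quadratic parts that the video walks through one at a time.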

Want to join the conversation?

  • mathisduguin
    Forgot to transpose at the end
    (10 votes)
  • Daniel Oropeza
    Why does the Hessian(-like) matrix have a 1/2 for each term? Why does the whole formula need an 1/2? What would happen without it?

    I don't get where those 1/2s came from.
    (3 votes)
    • Egor Okhterov
      The origin of this 1/2 is the same as in the Taylor series. If you remember, the Taylor series has the following pattern: f(x) = f(a) + f'(a)(x-a)/1! + f''(a)(x-a)^2/2! + f'''(a)(x-a)^3/3! + ...
      You can interpret these constants as a way of taming the higher powers. As we add terms with higher exponents, the terms grow very quickly, and dividing by the matching factorial scales them back to the right size.

      The factorial itself comes out because we are using derivatives to approximate our function. For example, let's take derivatives of g(x) = x^4:
      g'(x) = 4 * x^3
      g''(x) = 4 * 3 * x^2
      g'''(x) = 4 * 3 * 2 * x
      g''''(x) = 4 * 3 * 2 * 1 = 4!
      (8 votes)
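Egor's point can be seen numerically in one variable. Here is a minimal sketch (my own example, not from the thread), using f(x) = e^x, all of whose derivatives equal 1 at x₀ = 0: with the 1/2 the quadratic term has the right size, and dropping it makes the approximation noticeably worse.

```python
import math

def quad_approx(f0, df0, ddf0, x0, x, half=True):
    """Quadratic Taylor approximation of f about x0."""
    c = 0.5 if half else 1.0
    return f0 + df0 * (x - x0) + c * ddf0 * (x - x0) ** 2

x0, x = 0.0, 0.2           # approximate e^x near 0
f0 = df0 = ddf0 = 1.0      # every derivative of e^x at 0 is 1

with_half = quad_approx(f0, df0, ddf0, x0, x)               # 1 + 0.2 + 0.02 = 1.22
without = quad_approx(f0, df0, ddf0, x0, x, half=False)     # 1 + 0.2 + 0.04 = 1.24

err_with = abs(math.exp(x) - with_half)
err_without = abs(math.exp(x) - without)
```

Here e^0.2 ≈ 1.2214, so the version with the 1/2 is off by about 0.0014, while the version without it is off by about 0.019.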
  • Harrison
    When he wrote the formula for the quadratic-form part of the quadratic approximation, he wrote the Hessian with f_xy in both the top-right and bottom-left corners. By definition, the Hessian has f_yx in the top-right corner. I know that f_xy and f_yx are usually the same value, but shouldn't the definition of the quadratic approximation contain the true Hessian, in case f_xy and f_yx come out to be different?
    (4 votes)
    • Achilles Kappis
      Correct, but he probably didn't bother changing it because (as he mentions in previous videos) most of the time f_xy and f_yx turn out to be the same. If we want to be absolutely correct, we do need to distinguish the two and write it the way you suggest.
      (4 votes)
  • chunawalaowais
    What is the general notation for hessian of n-th degree approximation?
    (3 votes)
    • MBCory
      I believe it would be represented with an H, like all Hessians, so you would still have (1/2)x^T times H times x. An n-by-n Hessian only works with n input variables, though, because the Hessian needs to be square and the number of input variables determines the number of rows.
      Hope this helps!
      (2 votes)
  • Mike Nguyen
    Hi, I was wondering what the purpose of approximation is.
    (2 votes)
    • Alex K
      Quadratic approximation is useful for machine learning (ML). For example, if you want to measure how bad your program is at recognizing handwritten digits, you can represent the errors with a cost function. The cost function depends on a lot of parameters (so it has a lot of dimensions), which is where representing things with vectors comes in handy. We want to minimize this cost function (because low cost = less bad ML program), and we can find its minima more easily when we approximate it with a quadratic approximation.
      (3 votes)
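To make that concrete, here is a tiny sketch (my own toy example, not from the video or the comment): minimizing the quadratic approximation at each point is one way to view Newton's method. For f(x, y) = x² + 3y², which is already quadratic, a single Newton step x₁ = x₀ − H⁻¹∇f(x₀) lands exactly on the minimum at the origin.

```python
def grad(x, y):
    # gradient of f(x, y) = x**2 + 3*y**2
    return (2 * x, 6 * y)

def hessian(x, y):
    # constant, since f is quadratic
    return ((2.0, 0.0), (0.0, 6.0))

def newton_step(x, y):
    """One Newton step: (x, y) - H^{-1} grad f(x, y), via the 2x2 inverse."""
    gx, gy = grad(x, y)
    (a, b), (c, d) = hessian(x, y)
    det = a * d - b * c
    sx = (d * gx - b * gy) / det
    sy = (-c * gx + a * gy) / det
    return (x - sx, y - sy)

x1, y1 = newton_step(4.0, -2.0)   # lands on the minimum (0, 0) in one step
```

For a non-quadratic cost function the step only lands near the minimum, so the step is repeated; that is exactly where the quadratic approximation earns its keep.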
  • alphabetagamma
    For anyone interested, I derived the full multivariable Taylor expansion using directional derivatives -- it matches the quadratic approximation up to second order:

    f(x + h) = sum_{n=0}^infty 1/n! [Del_h^n f](x)
    (1 vote)
  • 𝔄ℕ𝔇ℝ𝔈𝔚 𝔏ℑℕ
    In the vector form of the quadratic approximation, x^{T}·H(f)·x (where ^{T} denotes the transpose and H(f) is the Hessian of f), why should we transpose the vector? Doesn't the dot product turn out the same without transposing?
    (1 vote)
    • Abdo Reda
      As far as I know, it depends on whether you are using matrix multiplication or vector dot products to represent the quadratic form.

      If you are only using matrices, it doesn't make sense to use a dot product, because dot products are usually not defined for matrices. Instead we use matrix multiplication only, and the expression becomes:
      x^{T} H(f) x

      However, you can also represent the quadratic form with a mix of matrix multiplication and a vector dot product, though you should be careful because this is less clear and less common. In that case the expression can be rewritten as:
      x · (H(f) x)

      where H(f) x is a matrix-vector product and the · between x and (H(f) x) is a vector dot product.
      (1 vote)
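A quick numeric check of that equivalence (a hypothetical 2×2 example of mine, using plain Python lists): x^T H x computed as a matrix product and x · (H x) computed as a dot product give the same scalar, which also matches the expanded form a·x² + 2b·xy + c·y².

```python
H = [[2.0, 1.0],   # symmetric matrix: a = 2, b = 1, c = 3
     [1.0, 3.0]]
x = [0.5, -1.0]

# Hx: matrix-vector product
Hx = [sum(H[i][j] * x[j] for j in range(2)) for i in range(2)]

# x^T H x, read as a dot product between x and Hx
xT_H_x = sum(x[i] * Hx[i] for i in range(2))

# the same quadratic form written out by components
expanded = 2.0 * x[0] ** 2 + 2 * 1.0 * x[0] * x[1] + 3.0 * x[1] ** 2
```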
  • Farhan Omar
    Hey, how can we tell from the vector form that the X - X0 that comes after the Hessian matrix is a column vector?
    (1 vote)
    • Tuomo Salo
      X and X0 are both column vectors, since they were defined as such. By the rules of matrix addition and subtraction, their dimensions must be identical, and the result matrix will also have the same dimensions. Therefore, the result will also be a column vector.
      (1 vote)
  • Erik Wu
    For anyone else wondering why there is only one x at the end: it is no longer the single variable x, it is the whole input vector x (containing x, y, z, etc.), and x0 is (x0, y0, z0, etc.).
    (1 vote)
  • Tharangni Sivaji
    Correct me if I am wrong, but while formulating the linear equation in matrix form, shouldn't grad(f) be multiplied by the transpose of [X - X0] in order for the matrix multiplication to work?
    (1 vote)
    • Harrison
      This is not matrix multiplication, but rather a dot product between vectors. For a function with x and y in its input, f(x,y), grad(f) is a vector containing f_x in the first row and f_y in the second row. You take the gradient at X_0, the vector containing x_0 and y_0 in its two rows (you can also think of this as the coordinates (x_0, y_0)), so each row holds a constant.

      X - X_0 is just simple vector subtraction. X has one column and two rows: x in the first row and y in the second. X_0 is the same thing, except each variable gets the subscript naught. The subtraction gives a vector with x - x_0 in the first row and y - y_0 in the second (notice these entries are no longer the bold-faced vector X; they are variables).

      When you perform the dot product (if you need a refresher on dot products, Khan Academy has lessons on them in the Linear Algebra section) between grad(f) and the vector formed by that subtraction, you get f_x*(x - x_0) + f_y*(y - y_0). Remember that each of the partials was evaluated at (x_0, y_0), so you end up with (using A and B as constants) A(x - x_0) + B(y - y_0).
      (1 vote)
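Harrison's expansion can also be checked numerically. This is a sketch with a hypothetical function of my own, f(x, y) = x²·y near (x0, y0) = (1, 2); the dot product grad f · (x - x0, y - y0) tracks the actual change f(x, y) - f(x0, y0) for nearby points.

```python
x0, y0 = 1.0, 2.0
fx, fy = 2 * x0 * y0, x0 ** 2      # partials of f(x, y) = x**2 * y at (x0, y0)

def linear_term(x, y):
    """grad f(x0, y0) dotted with (x - x0, y - y0)."""
    return fx * (x - x0) + fy * (y - y0)

# near (1, 2), the linear term approximates f(x, y) - f(x0, y0)
delta = (1.01 ** 2) * 2.02 - (x0 ** 2) * y0   # actual change at (1.01, 2.02)
```

Here the linear term gives 0.06 while the actual change is about 0.0606; the leftover is exactly what the quadratic term of the approximation accounts for.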

Video transcript

- [Voiceover] Okay, so we are finally ready to express the quadratic approximation of a multivariable function in vector form. So, I have the whole thing written out here where f is the function that we are trying to approximate, x naught, y naught is the constant point about which we are approximating, and then this entire expression is the quadratic approximation, which I've talked about in past videos, and if it seems very complicated or absurd, or you're unfamiliar with it, I'm just dissecting it real quick. This over here is the Constant term, this is just gonna evaluate to a constant, everything over here is the Linear term, because it just involves taking a variable multiplied by a constant, and then the remainder, every one of these components will have two variables multiplied into it. So x squared comes up, and x times y, and y squared comes up, so that's the quadratic term. Quadratic. Now, to vectorize things, first of all, let's write down the input, the input variable (x,y) as a vector, and typically we'll do that with a boldfaced x to indicate that it's a vector, and its components are just gonna be the single variables, x and y, the non-boldfaced. So this is the vector representing the variable input, and then correspondingly a boldfaced x with a little subscript o, x naught, is gonna be the constant input, the single point in space near which we are approximating. So when we write things like that, this Constant term, simply enough, is gonna look like evaluating your function at that boldfaced x naught. So that's probably the easiest one to handle. Now the linear term, this looks like a dot product, and if we kind of expand it out as the dot product, it looks like we're taking the partial derivative of f with respect to x, and then the partial derivative with respect to y, and we're evaluating both of those at that boldfaced x naught input. X naught as its input. 
Now, each one of those partial derivatives is multiplied by variable minus constant numbers, so this looks like taking the dot product, here, I'm gonna erase the word "linear". We're taking with x minus x naught, and y minus y naught. This is just expressing the same linear term, but as a dot product, but the convenience here is that this is totally the same thing as saying the gradient of f, that's the vector that contains all the partial derivatives, evaluated at the special input, x naught, and then we're taking the dot product between that and the variable vector, boldfaced x, minus x naught. Since when you do this component-wise, boldfaced x minus x naught, if we kinda think here, it'll be x the variable minus x naught the constant, y the variable minus y naught the constant, which is what we have up there. So this expression kind of vectorizes the whole linear term, and now the beef here, the hard part, how are we gonna vectorize this quadratic term? Now that's what I was leading to in the last couple videos, where I talked about how you express a quadratic form like this with a matrix, and the way that you do it, I'm just kinda scroll down to give us some room, the way that you do it is we'll have a matrix whose components are all of these constants. It'll be this 1/2 times the second partial derivative evaluated there, and I'm just gonna, for convenience's sake, I'm gonna just take 1/2 times the second partial derivative with respect to x, and leave it as understood that we're evaluating it at this point. And then, on the other diagonal, you have 1/2 times the other kind of partial derivative with respect to y two times in a row. And then we're gonna multiply it by this constant here, but this term kind of gets broken apart into two different components. 
If you'll remember, in the quadratic form video, it was always things where it was a, and then 2b and c, as your constants for the quadratic form, so if we're interpreting this as two times something, then it gets broken down, and on one corner shows up as fxy, and on the other one, kind of 1/2 fxy. So both of these together are gonna constitute the entire mixed partial derivative. And then the way that we express the quadratic form is we're gonna multiply this by, well by what? Well, the first component is whatever the thing is that's squared here, so it's gonna be that x minus x naught, and then the second component is whatever the other thing squared is, which in this case is y minus y naught, and of course we take that same vector but we put it in on the other side too. So let me make a little bit of room, cause this is gonna be wide. So we're gonna take that same vector, and then kind of put it on its side. So it'll be x minus x naught as the first component, and then y minus y naught as the second component, but it's written horizontally, and this, if you multiply out the entire matrix, is gonna give us the same expression that you have up here. And if that seems unfamiliar, if that seems, you know, how do you go from there to there, check out the video on quadratic forms, or you can check out the article where I'm talking about the quadratic approximation as a whole, I kind of go through the computation there. Now, this matrix right here is almost the Hessian matrix, this is why I made a video about the Hessian matrix. It's not quite, because everything has a 1/2 multiplied into it, so I'm just gonna kinda take that out and we'll remember we have to multiply a 1/2 in at some point, but otherwise, it is the Hessian matrix, which we denote with a kind of boldfaced H, boldfaced H, and emphasize that it's the Hessian of f. The Hessian is something you take of a function. 
And like I said, remember each of these terms we should be thinking of as evaluated on the special input point, evaluating it at that special, you know, boldfaced x naught input point. I was just kind of too lazy to write it in, each time the x naught y naught, x naught y naught, x naught y naught, all of that. But what we have then is we're multiplying it on the right by this whole vector is the variable vector, boldfaced x, minus boldfaced x naught, that's what that entire vector is, and then we kind of have the same thing on the right, you know, boldfaced vector x minus x naught, except that we transpose it, we kind of put it on its side, and the way you denote that, you have a little T there, for transpose. So this term captures all of the quadratic information that we need for the approximation. So just to put it all together, if we go back up and we put the Constant term that we have, the Linear term, and this quadratic form that we just found, all together, what we get is that the quadratic approximation of f, which is a function we'll think of it as a vector input, boldfaced x, it equals the function itself evaluated at, you know, whatever point we're approximating near, plus the gradient of f, which is kind of, it's the vector analog of a derivative, evaluated at that point, so this is a constant vector, dot product with the variable vector, x minus the constant vector x naught, that whole thing, plus 1/2 the, then we'll just copy down this whole quadratic term up there, the variable minus the constant multiplied by the Hessian, which is kind of like an extension of the second derivative, to multivariable functions, and we're evaluating that, no, let's see, we're evaluating it at the constant, at the constant, x naught, and then on the right side, we're multiplying it by the variable, x minus x naught. And this, this is the quadratic approximation in vector form, and the important part is, now, it doesn't just have to be of a two variable input. 
You could imagine plugging in a three variable input, or a four variable input, and all of these terms make sense. You know, you take the gradient of a four variable function, you'll get a vector with four components. You take the Hessian of a four variable function, you would get a four by four matrix. And all of these terms make sense. And I think it's also prettier to write it this way, because it looks a lot more like a Taylor expansion in the single variable world. You have, you know, a constant term, plus the value of a derivative, times x minus a constant, plus 1/2 what's kind of like the second derivative term, what's kind of like taking an x squared, but this is how it looks in the vector world. So in that way it's actually maybe a little bit more familiar than writing it out in the full, you know, component by component term, where it's easy to kind of get lost in the weeds there. So, full vectorized form of the quadratic approximation of a scalar valued multivariable function. Boy, is that a lot to say.
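The formula from the transcript can be sketched in code. This is my own minimal example (not from the video), using the hypothetical function f(x, y) = x²·y with its gradient and Hessian worked out by hand, and plain Python lists instead of a matrix library:

```python
def f(x, y):
    return x ** 2 * y

def quad_approx(x, y, x0, y0):
    """f(x0) + grad f(x0) . (x - x0) + 1/2 (x - x0)^T H(x0) (x - x0)."""
    dx, dy = x - x0, y - y0
    gx, gy = 2 * x0 * y0, x0 ** 2          # gradient at (x0, y0)
    H = [[2 * y0, 2 * x0],                 # Hessian at (x0, y0); symmetric,
         [2 * x0, 0.0]]                    # since f_xy = f_yx = 2x here
    linear = gx * dx + gy * dy
    quad = 0.5 * (H[0][0] * dx * dx + 2 * H[0][1] * dx * dy + H[1][1] * dy * dy)
    return f(x0, y0) + linear + quad

approx = quad_approx(1.05, 2.1, 1.0, 2.0)   # approximate near (1, 2)
exact = f(1.05, 2.1)
```

Near (1, 2) the approximation agrees with f to within the cubic remainder, and nothing in `quad_approx` is specific to two variables: with a longer gradient and a larger Hessian the same three terms work for any number of inputs.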