© 2023 Khan Academy

# Proof (part 3) minimizing squared error to regression line

Proof (Part 3) Minimizing Squared Error to Regression Line. Created by Sal Khan.

## Want to join the conversation?

- Intuitively it makes sense that there would be only one best-fit line. But isn't it true that setting the partial derivatives with respect to m and b equal to zero would only locate a REGIONAL minimum in the 3D "bowl"? There could be other minima where both partial derivatives equal zero, correct? And who's to say which minimum is the one true minimum, if it exists?

Also, intuitively, there is no maximum for the best-fit line, but the partial derivatives would equal zero at a maximum point on the 3D surface as well, right? (3 votes)
- Under the assumptions of linear regression, that won't happen. The "loss function" (that is, how we measure the closeness of the predictions, in this case the sum of squared residuals) is convex, so the surface won't be bumpy like you're envisioning. It will be a smooth curve.

And yes, for any local maximum or local minimum, the derivative will be zero. (4 votes)
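The convexity claim in the reply can be checked on a small example. The sketch below uses made-up x values and a hypothetical helper, `hessian_of_squared_error`, that returns the second derivatives of the squared error with respect to m and b. These do not depend on m, b, or the y values, so the surface curves the same way everywhere. For a symmetric 2-by-2 matrix, a positive trace and a positive determinant mean the surface curves upward in every direction.

```python
from fractions import Fraction

def hessian_of_squared_error(xs):
    """Second derivatives of SE(m, b) = sum((y - (m*x + b))**2).

    Returns (d2SE/dm2, d2SE/dm db, d2SE/db2).
    None of the three depends on m, b, or the y values.
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return 2 * sum_x2, 2 * sum_x, 2 * n

xs = [Fraction(v) for v in (1, 2, 4, 7)]  # made-up x values for illustration
h_mm, h_mb, h_bb = hessian_of_squared_error(xs)
trace = h_mm + h_bb
determinant = h_mm * h_bb - h_mb * h_mb
print(trace, determinant)  # 148 336
```

Whenever the x values are not all equal, the determinant is strictly positive, so the bowl has exactly one flat point and that point is the minimum.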

- Why can't you divide "mean of x^2" by "mean of x" like a normal "x^2/x" and end up with just "mean of x"? (4 votes)
- In notation, the mean of x is: `xbar = Σ(xi) / n`

That is, we add up all the numbers `xi` and divide by how many there are.

But the "mean of x^2" is not the square of the mean of x. We square *each value*, then add them up, and then divide by how many there are. Let's call it x2bar: `x2bar = Σ(xi^2) / n`

Now, x2bar is not the same as xbar^2. The reason is that we're squaring and then adding up. Those two operations are not interchangeable: the "sum of the squares" is not equal to the "square of the sum". You can try it out with a really small exercise in algebra. Take two numbers `a` and `b`, and check whether `(a + b)^2` is equal to `(a^2 + b^2)`. The second expression is pretty simple: we just have the square of `a` and the square of `b`. The first expression needs to be expanded first:

`(a + b)^2 = (a + b)*(a + b)`

`(a + b)^2 = a^2 + 2*a*b + b^2`

Now compare that to `(a^2 + b^2)`. We have the two squared terms, but we also have that `2*a*b` term, or the "cross product". That cross product is what makes the "sum of the squares" and the "square of the sum" unequal, and hence why "mean of x^2" divided by "mean of x" doesn't give us "mean of x". (2 votes)
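A quick numeric check of the same point, as a minimal sketch with made-up values (the names `mean_x` and `mean_x2` stand in for xbar and x2bar above). Exact fractions are used so that no rounding gets in the way.

```python
from fractions import Fraction

xs = [Fraction(v) for v in (1, 2, 3)]  # made-up values for illustration
n = len(xs)

mean_x = sum(xs) / n                   # xbar: add up, then divide
mean_x2 = sum(x * x for x in xs) / n   # x2bar: square each value first

ratio = mean_x2 / mean_x
print(mean_x, mean_x2, ratio)  # 2 14/3 7/3
```

Here "mean of x^2" divided by "mean of x" is 7/3, not the mean of x, which is 2.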

- The 3D surface is explained to be parabolic. How do we know that it's going to be parabolic? Why not some other three-dimensional surface, like a cylinder or a cone? (4 votes)
- Thanks for the extremely helpful video series. At 9:08 Sal divides the equation by the mean of x. What happens when the mean of the x's is zero? Is the derivation/formula valid in such cases? (4 votes)
- Likewise, at 4:07, how did Sal take the partial derivative of nb^2 and get 2nb (or 2bn)? Also, I thought the object here was to factor out the b? (3 votes)
- It's just a basic rule for derivatives. The derivative of x^2 is 2*x.

He's finding the derivative with respect to b while everything else in the term is held constant. (3 votes)
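The rule in the reply can be sanity-checked numerically. This is only a sketch: `term` and `central_difference` are hypothetical helper names, and the values of n and b are arbitrary. It estimates the slope of n*b^2 with respect to b and prints it next to 2nb and nb.

```python
def term(b, n):
    """The last term of the squared-error expression: n * b**2."""
    return n * b ** 2

def central_difference(f, b, h=1e-6):
    """Numerical estimate of df/db at the point b."""
    return (f(b + h) - f(b - h)) / (2 * h)

n, b = 5, 3.0  # arbitrary values for illustration
estimate = central_difference(lambda v: term(v, n), b)

# The estimate should sit next to 2*n*b (30.0), not n*b (15.0).
print(estimate, 2 * n * b, n * b)
```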

- I don't understand the part at 7:14. xy-bar does not equal x-bar * y-bar. How is he dividing out an x-bar to obtain this y-bar? (2 votes)
- Oh, I got it. He didn't derive it like that. He used the other partial derivative as the starting point. (4 votes)

- At 4:15, how did he get 2bn? Shouldn't it be just bn, by taking out one b? (3 votes)
- I think I am missing some basic knowledge here. Can you please direct me to more basic videos on partial derivatives? Thanks a ton for your answer. (1 vote)

- Where can I find more information regarding the surface that Sal drew at the beginning? (2 votes)
- Why is the derivative used? What does it mean? (2 votes)
- It's a rate of change. Velocity is the rate of change of distance covered. That is, the greater your velocity, the faster the distance covered is changing.

Sal showed that as you change the slope (m) of the best fit line, the error changes; and as you change the y-intercept (b) the error also changes. He found algebraic expressions for these rates of change. (2 votes)

- How did the 3D surfaces come into play? (2 votes)

## Video transcript

All right, so where we left off, we had simplified our algebraic expression for the squared error to the line from the n data points. We kind of visualized it. This expression right here would be a surface, I guess you could view it as a surface in three dimensions, where any m and b gives a point on that surface that represents the squared error for that line. Our goal is to find the m and the b, which would define an actual line, that minimize the squared error.

The way that we do that is we find a point where the partial derivative of the squared error with respect to m is 0, and the partial derivative with respect to b is also equal to 0. So it's flat with respect to m.
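For readers following without the video, the expression being referred to is the simplified squared error from the previous part, together with the two conditions just described. Written out, with an overline denoting a mean over the n data points (this notation is a choice made for this note and does not appear in the transcript):

```latex
SE(m, b) = n\,\overline{y^2} - 2mn\,\overline{xy} - 2bn\,\overline{y}
         + m^2 n\,\overline{x^2} + 2mbn\,\overline{x} + n b^2,
\qquad
\frac{\partial SE}{\partial m} = 0, \quad \frac{\partial SE}{\partial b} = 0 .
```

The six terms appear in the same order in which the transcript walks through them.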
in this direction is going to be flat. Let me do it in the
same color. So the slope in this direction,
that's the partial derivative with respect to
m, is going to be flat. It's not going to change
in that direction. The partial derivative
with respect to b is going to be flat. So it will be a flat point
right over there. The slope at that point in that
direction will also be 0, and that is our minimum point. So let's figure out the m and
b's that give us this. So if I were to take the partial derivative of this expression with respect to m, well, this first term has no m terms in it. So it's a constant from the point of view of m. Just as a reminder, partial derivatives, it's just like taking a regular derivative. You're just assuming that everything but the variable that you're doing the partial derivative with respect to, you're assuming everything else is a constant. So in this expression, all the x's, the y's, the b's, the n's, those are all constant. The only variable that matters, when we take the partial derivative with respect to m, is the m.

So this is a constant. There's no m here. This term right over here, we're taking with respect to m. So the derivative of this with respect to m, it's kind of the coefficients on the m. So negative 2 times n times the mean of the xy's, that's the partial of this with respect to m. Then this term right here has no m's in it. So it's constant with respect to m. So its partial derivative with respect to m is 0. Then this term here, you have n times the mean of the x squareds times m squared. So this is going to be, since we're talking about a partial derivative with respect to m, 2 times n times the mean of the x squareds times m. The derivative of m squared is 2m, and then you just have this coefficient there as well.

Now this term, you also have an m over there. So let's see, everything else is just kind of a coefficient on this m. So the derivative with respect to m is 2bn times the mean of the x's. If I took the derivative of 3m, the derivative is just 3. It's just the coefficient on it. Then finally, this is a constant with respect to m. So we don't see it. So this is the partial derivative with respect to m. That's that right over there. We want to set this equal to 0.

Now let's do the same thing
with respect to b. This term, once again, is a constant from the perspective of b. There's no b here. There's no b over here. So the partial derivative of either of these with respect to b is 0. Then over here you have a negative 2n times the mean of the y's as a coefficient on a b. So the partial derivative with respect to b is going to be minus 2n, or negative 2n, times the mean of the y's. Then there's no b over here. Then we do have a b over here. So it's plus 2mn times the mean of the x's. This is essentially the coefficient on the b over here. It was written in a mixed order, but all of these are constants from the point of view of b. They are the coefficient in front of the b. The partial derivative of that with respect to b is just going to be the coefficient. Then finally, the partial derivative of this with respect to b is going to be 2nb, or 2nb to the first, you could even say. We want to set this equal to 0.

So it looks very complicated. But remember, we're just trying
to solve for the m's and the b's. We have two equations with two unknowns here. We have the m's and then we have the b's. To simplify this, both of these equations, actually the top one and the bottom one, both sides are divisible by 2n. I mean, 0 is divisible by anything. It'll be just 0. So let's divide the top equation by 2n and see what we get. If we divide the top equation by 2n, this'll become just 1. That goes away, and then those go away. You would just be left with the negative mean of the xy's plus m times the mean of the x squareds, plus b times the mean of the x's is equal to 0. That's this first expression when you divide both sides by 2n.

The second expression will be, this will go away. When you divide this by 2n, that'll go away, that will go away, and then those will go away. You're just left with the negative mean of the y's plus m times the mean of the x's plus b is equal to 0. So if we find the m and the b values that satisfy this system of equations, we have minimized the squared error.

We could just solve it
in a traditional way. But I want to rewrite this, because I think it's kind of interesting to see what these really represent. So let's add this mean of the xy's to both sides of this top equation. What do we get? We get m times the mean of the x squareds plus b times the mean of the x's is equal to, these are going to cancel out, is equal to the mean of the xy's. That's that top equation.

This bottom equation, right here, let's add the mean of y to both sides of this equation. I do that so that that cancels out. And then we're left with m, I'll do that in the blue color to show you the same equation, we have m times the mean of the x's plus b is equal to the mean of the y's.

Now, I actually want to
get both of these into mx plus b form. This is actually already there. Actually you can see that if our best-fitting line is going to be y is equal to mx plus b (we still have to find the m and the b), the m and the b that satisfy both of these equations are going to be the m and the b on that best-fitting line. So that best-fitting line actually contains the point, and we get this from the second equation right here. It contains the point. I should write it this way. The coordinate (mean of x, mean of y) lies on the line. And you could see it right over here. If you put the mean of x in this for the optimal m and b, you are going to get the mean of the y. So that's interesting.

This optimal line. Let's never forget what we're even trying to do. This optimal line is going to contain some point on it, let me do that in a new color, it's going to contain some point on it right here that is the mean of all of the x values and the mean of all the y values. That's just interesting. It kind of makes sense. It kind of makes intuitive sense.

Now this other thing, just to
kind of get it in the same point of view. Then it will actually become a kind of an easier way to solve the system. You could solve this a million different ways. But just to give us an intuition of what even is going on here, what's another point that's on the line? Because if you have two points on the line, you know what the equation of the line is going to be.

Well, the other point, we want this to be in mx plus b form. So let's divide both sides of this equation by this term right here, by the mean of the x's. If we do that, we get m times the mean of the x squareds divided by the mean of the x's plus b is equal to the mean of the xy's divided by the mean of the x's. So when you write it in this form, this is the exact same equation as that, I just divided both sides by the mean of the x's, you get another interesting point that will lie on this optimal fitting line, at least from the point of view of the squared distances. So another point that will lie on it, on this optimal line, the x value is going to be this, the mean of the x squareds divided by the mean of the x's. Then the y value is going to be the mean of the xy's divided by the mean of the x's.

I'll let you think about that a little bit more. But already, these are the two points that lie on the line, so both of these are on the best-fitting line based on how we're measuring a good fit, which is the squared distance. These are on the line that minimizes that squared distance.

What I'm going to do in the next
video, and this is turning into like a six or seven video saga on trying to prove the best-fitting line or finding the formula for the best-fitting line. But it's interesting. There's all sorts of kind of neat little mathematical things to ponder over here. But in the next video, we can actually use this information. We could have just solved the system straight up. But we can actually use this information right here to solve for our m and b's. Maybe we'll do it both ways depending on my mood.
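The two equations and the two points from this video can be tried on a tiny data set. The sketch below is one way to do it, not necessarily the route taken in the next video: it solves the system by substitution, using a hypothetical helper `fit_line` and three made-up points, and then checks that both points described above lie on the resulting line. It assumes the x values are not all equal and that their mean is not zero, since the second point divides by the mean of the x's.

```python
from fractions import Fraction

def fit_line(points):
    """Solve the two equations from the transcript for m and b.

        m * mean(x^2) + b * mean(x) = mean(xy)
        m * mean(x)   + b           = mean(y)
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    mean_xy = sum(x * y for x, y in points) / n
    mean_x2 = sum(x * x for x, _ in points) / n

    # Substitute b = mean_y - m * mean_x into the first equation.
    m = (mean_xy - mean_x * mean_y) / (mean_x2 - mean_x ** 2)
    b = mean_y - m * mean_x

    first_point = (mean_x, mean_y)
    second_point = (mean_x2 / mean_x, mean_xy / mean_x)
    return m, b, first_point, second_point

# Made-up data; exact fractions avoid rounding in the equality checks.
points = [(Fraction(x), Fraction(y)) for x, y in [(1, 2), (2, 1), (4, 3)]]
m, b, first_point, second_point = fit_line(points)
print(m, b)  # 3/7 1

for px, py in (first_point, second_point):
    # Each point should satisfy y = m*x + b exactly.
    print(px, py, m * px + b == py)
```

For this data the line is y = (3/7)x + 1, and both (7/3, 2) and (3, 16/7) satisfy it.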