
## Statistics and probability

### Unit 5: Lesson 6

More on regression
- Squared error of regression line
- Proof (part 1) minimizing squared error to regression line
- Proof (part 2) minimizing squared error to regression line
- Proof (part 3) minimizing squared error to regression line
- Proof (part 4) minimizing squared error to regression line
- Regression line example
- Second regression example
- Calculating R-squared
- Covariance and the regression line

© 2023 Khan Academy

# Second regression example

Second Regression Example. Created by Sal Khan.

## Want to join the conversation?

- Do you have any video on multiple regressions?

Thank you.(66 votes)
- This would really be great, the examples on here are really helping me revise but would love to check if I'm getting correct answers to the practice problems I've gotten as part of my course. :)(1 vote)

- why not add any practise for this topic?(36 votes)
- yeah...practice would be GREAT! it would really help me remember how to do it if I had to muscle through a few on my own.(14 votes)

- I notice that here, Sal has used an alternate way of stating **m**. In an earlier video (Regression Line Example) he showed it as: **m = (mean x * mean y - mean xy) / ((mean of x)^2 - mean of x^2)**. Sal noted (at minute 0:46 of that video) that we might see it reversed in some statistics books, but that didn't matter because it was just multiplying by -1.

In the version here, **m = (mean xy - mean x * mean y) / (mean of x^2 - (mean of x)^2)**.

Wouldn't changing these reverse the direction of the slope (i.e., from positive to negative or negative to positive, depending on the data set)?(11 votes)
- Actually, to get from one to the other you have to multiply both the numerator and the denominator by -1.

So essentially you're multiplying it by -1/-1 = 1, and that means it won't affect the slope at all! (:(8 votes)
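
Written out, the point of the reply above looks roughly like this. The bar notation (an overline for "mean of") is an assumption introduced here for compactness; it does not appear in the comments themselves.

```latex
m = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - (\bar{x})^2}
  = \frac{(-1)\left(\overline{xy} - \bar{x}\,\bar{y}\right)}
         {(-1)\left(\overline{x^2} - (\bar{x})^2\right)}
  = \frac{\bar{x}\,\bar{y} - \overline{xy}}{(\bar{x})^2 - \overline{x^2}}
```

Both the numerator and the denominator flip sign, so the quotient, and therefore the direction of the slope, is unchanged.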

- Thanks for this. How about Multiple Regressions?(11 votes)
- Do you have a vid on multiple regression?(8 votes)
- How would you find a missing pair of values using the regression line and all the information related to the given values?(5 votes)
- Once you have the m and the b, you have a regression equation: y = mx + b. So for a given missing x value, you can plug it into the equation to see what the predicted y value would be to find your missing pair: (x_missing, y_predicted).(6 votes)
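
A minimal sketch of that plug-in step, assuming the slope and intercept worked out in this video (m = 41/42, b = -5/21). The function name `predict` and the value `x_missing = 2` are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

# Slope and intercept of the regression line found in this video:
# y = (41/42) x - 5/21
m = Fraction(41, 42)
b = Fraction(-5, 21)


def predict(x):
    """Return the y value on the regression line for a given x."""
    return m * x + b


# Hypothetical missing x value (not one of the video's data points).
x_missing = 2
y_predicted = predict(x_missing)
print(x_missing, y_predicted)
```

Note that the result is the line's prediction for that x, not a recovered original data value.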

- Why do we know that the average x and average y will always be on the regression line?(3 votes)
- Look up the videos of 'Proof of minimizing squared error to regression line'. You will get your answer in part 3.(2 votes)

- Is the formula for the slope/intercept of the line of best fit to be memorized? I know it can be re-derived, but just wondering.(2 votes)
- I think that's a great question. My opinion is: If you know how to derive it, it's easier in the long run to keep deriving it every time. Because the work of memorizing is wasted, but the work of deriving gives something in return. (Also, memorization is prone to error). (However, I'm a liar, because for the problem in this video, I just used my notes and copied the formulas. That's the 3rd option - refer to notes).

Four videos from now ("Covariance and the regression line") you'll learn that the slope is cov(x,y)/cov(x,x); then b is still ybar - m*xbar. This is an easier thing to remember, and interesting to know. Again, I think the best option is to review the derivation and, if you want, get the formulas from notes to calculate.(1 vote)
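
As a rough check of the covariance form mentioned above, here is a small Python sketch using the four data points from this video. It assumes the population covariance (mean of the products minus the product of the means); the helper names `mean` and `cov` are illustrative, not from the video.

```python
from fractions import Fraction

# The four (x, y) data points from this video.
xs = [-2, -1, 1, 4]
ys = [-3, -1, 2, 3]


def mean(values):
    """Exact arithmetic mean as a Fraction."""
    return Fraction(sum(values), len(values))


def cov(us, vs):
    """Population covariance: mean of products minus product of means."""
    mean_uv = mean([u * v for u, v in zip(us, vs)])
    return mean_uv - mean(us) * mean(vs)


m = cov(xs, ys) / cov(xs, xs)  # slope
b = mean(ys) - m * mean(xs)    # y-intercept
print(m, b)  # 41/42 -5/21
```

The slope and intercept agree with the values computed by hand in the video.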

- can anyone tell me, what's the difference between least square regression and squared error of regression line?(2 votes)
- How would you do a regression for categorical data?(2 votes)
- I'm not able to explain it, but I know that is called logistic regression.(1 vote)

## Video transcript

Let's find the equation for
the regression line that best fits this. Where the fit minimizes the
squared distance to each of the points. And then let's actually
calculate how good of a fit it is using an r squared. And we might have to do that
in the next video, depending on time. So just as a reminder, the
line is going to have the equation y is equal to mx plus b. And we've shown ourselves that
the slope of this line-- the one that best minimizes the
squared distance to each of those points-- is going to be
the mean of the xy's minus the mean of x times the mean of y. All of that over the mean of the
x squareds minus the square of the
mean of the x's. So one way to memorize it, I
guess, is the first terms have the mean of the combined
things. You're multiplying x times
itself first, then meaning. You're multiplying x
times y, times each other first, then meaning. And then the second terms,
you're finding the means of the individual components
and then multiplying. Mean of x, times mean of y,
mean of x times mean of x. So hopefully maybe that helps. Maybe it doesn't. But we can calculate
the slope. And then the y-intercept, b, is
just going to be equal to the mean of y minus whatever we
calculate here for m, times the mean of x. And we can do that because we
know that the point mean of x comma mean of y is going to be
on this regression line. So let's calculate them. And you'll see, in the last
example we did three points. We only have four points here. But the computations get
more and more intense. You can imagine what would
happen if you had 10 or 20 or 100 points. You pretty much have to use a
calculator at that point. Or computer, even better. Or a spreadsheet. So let's calculate m. And to do that, let's calculate
the components. So the mean of x-- the mean of
the x's-- is going to be equal to, this x is negative 2, plus
negative 1, plus 1, plus 4. All of that over, we have
four x data points. These two guys cancel out. Negative 2 plus 4 is 2. 2 over 4 is equal to 1/2. Now let's do the mean
of the y's. We have negative 3, we
have a negative 1. And then we have a 2, and
then we have a 3. And once again, we have
four data points. That guy and that
guy cancel out. Negative 1 plus 2 is 1. So this is equal to 1/4. Now let's figure out the
mean of the xy's. So x times y, the
mean of that. So over here we have negative
2 times negative 3. Negative 2 times negative
3 is positive 6. Plus negative 1 times negative
1 is positive 1. Plus 1 times 2 is 2. Plus 4 times 3 is 12. And we have four of
these points. And what is this? This is 6 plus 1 is 7. 7 plus 2 is 9. 9 plus 12 is 21 over 4. This is equal to 21/4. And then finally, we want-- I'll
do this in a new color-- the mean of the x squareds. And so that is going to be equal
to-- negative 2 squared is positive 4. Plus negative 1 squared
is positive 1. Plus 1 squared is 1. Plus 4 squared is 16. All of that over 4. 4 plus 2 is 6 plus
16 is 22 over 4. So 22/4 is the same
thing as 11/2. So we're now ready to
calculate the actual slope. Let me do it over here. Well actually, let me
do it over here. I want to be able to look at
everything we've done. So this is going to be equal to,
in this case, it's going to be the mean of the
xy's, which is 21/4. Minus the product of the mean
of x, which is 1/2. Times the mean of the
y's, which is 1/4. And then all of that over
the mean of the x squareds, which is 11/2. So we did that. Minus the mean of
the x's squared. The mean of the x's,
once again, is 1/2. And so what is this equal to? I'm just going to go straight
to the calculator. I could deal with the fractions,
but this isn't a review of adding
and subtracting and multiplying fractions. Let's just go straight
to the calculator. Actually, let me simplify
it before. It's just too tempting
to simplify. Let me copy and paste it. Let's go down here
to calculate it. And so this is going to be--
maybe I should have used the calculator, but it's
too tempting. So what's this on top? On top, we have 21/4 minus 1/2
times 1/4 is minus 1/8. All of that over 11/2 minus
1/2 squared, which is 1/4. Now, one way to simplify this
right from the get go is multiply the numerator and
the denominator by 8. And that's just to get rid
of all these fractions. So 21/4 times 8 is going to be
the same thing as 21 times 2, which is equal to 42. Minus 1/8 times 8. We have to, of course,
distribute the eights. So it's going to be minus 1. All of that over, 8 times 11/2
is going to be 11 times 4, which is 44. And then 8 times 1/4 is
2, so it's minus 2. So 42 minus 1 is 41. And then 44 minus 2 is 42. So the slope is 41/42. So a little bit less than
a slope of one. 42/42 would be exactly 1. So our regression slope is
a little bit less than 1. And then our regression
y-intercept, b, is going to be equal to the mean of the y. So 1/4, minus our slope, minus
41/42, times the mean of the x's, so times 1/2. And so this is going to be
equal to 1/4 minus 41/84, which is equal to-- let
me just find a common denominator. So let's go over 84. So what's 1/4 of 84? 1/4 of 80 is 20. So this is 21. 21 times 4 is 84. This is 1/4 of 84. Yep, that's right. So it's going to be 21 minus 41
over 84, which is equal to negative 20. Negative 20 over 84, which is
the same thing, they're both divisible by 4, the numerator
divided by 4 is negative 5, over 21. So our regression line is going
to be y is equal to 41/42 x minus 5/21. And 5/21 is a little
bit less than 1/4. 5/20 would be 1/4. We made the denominator a little
bit bigger, so negative 5/21 is going to be a little bit
closer to 0 than negative 1/4. So our y-intercept is going to
be just above negative 1/4. And then we're going to
have a slope a little bit less than 1. So our line is going to look
something like this. If I were able to actually draw
a straight line, it would look something like
that over there. So I'm going to leave you
there in this video. In the next video, we're
actually going to calculate the r squared for this line. How good of a fit is it? How much of the total variation
in the y values can be explained by the variation
in the x values, or by the line itself?
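
As a cross-check on the arithmetic in the transcript, here is a minimal Python sketch that applies the same mean-based formulas to the four data points. It uses exact fractions so the results can be compared directly with 41/42 and -5/21. The names `mean` and `fit_line` are illustrative assumptions, not anything from the video.

```python
from fractions import Fraction

# The four (x, y) data points used in the video.
points = [(-2, -3), (-1, -1), (1, 2), (4, 3)]


def mean(values):
    """Exact arithmetic mean as a Fraction."""
    values = list(values)
    return Fraction(sum(values), len(values))


def fit_line(points):
    """Return (m, b) for the least-squares line y = m*x + b."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x = mean(xs)                          # 1/2
    mean_y = mean(ys)                          # 1/4
    mean_xy = mean(x * y for x, y in points)   # 21/4
    mean_x_sq = mean(x * x for x in xs)        # 11/2
    # Slope: (mean of xy - mean x * mean y) / (mean of x^2 - (mean x)^2)
    m = (mean_xy - mean_x * mean_y) / (mean_x_sq - mean_x ** 2)
    # Intercept: the line passes through (mean x, mean y).
    b = mean_y - m * mean_x
    return m, b


m, b = fit_line(points)
print(m, b)  # 41/42 -5/21
```

With 10, 20, or 100 points the same function applies unchanged, which is the "use a computer" route the transcript suggests.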