Second Regression Example. Created by Sal Khan.
- Do you have any video on multiple regressions?
Thank you. (66 votes)
- This would really be great. The examples on here are really helping me revise, but I would love to check whether I'm getting correct answers to the practice problems I've gotten as part of my course. :) (1 vote)
- Why not add any practice for this topic? (36 votes)
- Yeah... practice would be GREAT! It would really help me remember how to do it if I had to muscle through a few on my own. (14 votes)
- I notice that here Sal has used an alternate way of stating m. In the earlier video (Regression Line Example) he showed it as:
m = (mean x * mean y - mean xy) / ((mean of x)^2 - mean of x^2). Sal noted (at minute 0:46 of that video) that we might see it reversed in some statistics books, but that it didn't matter because it was just multiplying by -1.
In the version here, m = (mean xy - mean x * mean y) / (mean of x^2 - (mean of x)^2).
Wouldn't changing these reverse the direction of the slope (i.e., from positive to negative or negative to positive, depending on the data set)? (11 votes)
- Actually, to get from one to the other you have to multiply both the numerator and the denominator by -1.
So essentially you're multiplying it by -1/-1 = 1, and that means it won't affect the slope at all! (: (8 votes)
- How would you find a missing pair of values using the regression line and all the information related to the given values?(5 votes)
- Once you have the m and the b, you have a regression equation: y = mx + b. So for a given missing x value, you can plug it into the equation to see what the predicted y value would be to find your missing pair: (x_missing, y_predicted).(6 votes)
- Why do we know that the average x and average y will always be on the regression line?(3 votes)
- Look up the videos of 'Proof of minimizing squared error to regression line'. You will get your answer in part 3.(2 votes)
- Is the formula for the slope/intercept of the line of best fit to be memorized? I know it can be re-derived, but just wondering.(2 votes)
- I think that's a great question. My opinion is: If you know how to derive it, it's easier in the long run to keep deriving it every time. Because the work of memorizing is wasted, but the work of deriving gives something in return. (Also, memorization is prone to error). (However, I'm a liar, because for the problem in this video, I just used my notes and copied the formulas. That's the 3rd option - refer to notes).
4 videos from now ("covariance and the regression line") you'll learn that the slope is cov(x,y)/cov(x,x); then b is still ybar - m*xbar. This is an easier thing to remember, and interesting to know. Again, I think the best option is to review the derivation and, if you want, get the formulas from notes when you calculate. (1 vote)
- can anyone tell me, what's the difference between least square regression and squared error of regression line?(2 votes)
- How would you do a regression for categorical data?(2 votes)
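Two of the questions above can be checked numerically: whether the "reversed" slope formula changes the sign, and how to predict a missing y from a given x. The following is a minimal sketch, not anything shown in the video. It uses the four points from this video's example, and the helper names (`mean`, `slope_reversed`, `slope_here`, `predict`) are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

def mean(values):
    # Exact arithmetic mean; starting the sum at Fraction(0) keeps the result a Fraction.
    return sum(values, Fraction(0)) / len(values)

def slope_reversed(xs, ys):
    # Form quoted from the earlier video: (mean x * mean y - mean xy) / ((mean x)^2 - mean x^2)
    xy = [x * y for x, y in zip(xs, ys)]
    xx = [x * x for x in xs]
    return (mean(xs) * mean(ys) - mean(xy)) / (mean(xs) ** 2 - mean(xx))

def slope_here(xs, ys):
    # Form used in this video: (mean xy - mean x * mean y) / (mean x^2 - (mean x)^2)
    xy = [x * y for x, y in zip(xs, ys)]
    xx = [x * x for x in xs]
    return (mean(xy) - mean(xs) * mean(ys)) / (mean(xx) - mean(xs) ** 2)

def predict(m, b, x):
    # Predicted y for a given x on the line y = m*x + b.
    return m * x + b

xs = [-2, -1, 1, 4]
ys = [-3, -1, 2, 3]

m = slope_here(xs, ys)
b = mean(ys) - m * mean(xs)

print(slope_reversed(xs, ys) == m)  # True: numerator and denominator both flip sign
print(predict(m, b, 2))             # predicted y for a hypothetical missing x of 2
```

Both forms give the same slope because the numerator and the denominator are each multiplied by -1, so the ratio is unchanged.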
Let's find the equation for the regression line that best fits these points, where the best fit is the one that minimizes the sum of the squared vertical distances from the line to each of the points. And then let's actually calculate how good of a fit it is using r squared. We might have to do that in the next video, depending on time.
So just as a reminder, the line is going to have the equation y is equal to mx plus b. And we've shown ourselves that the slope of this line, the one that minimizes the squared distance to each of those points, is going to be the mean of the xy's minus the mean of x times the mean of y, all of that over the mean of the x squareds minus the mean of the x's, squared. So one way to memorize it, I guess, is that the first terms have the mean of the combined things. You're multiplying x times itself first, then taking the mean. You're multiplying x times y first, then taking the mean. And then in the second terms, you're finding the means of the individual components and then multiplying: mean of x times mean of y, mean of x times mean of x. So hopefully that helps. Maybe it doesn't. But we can calculate the slope.
And then the y-intercept, b, is just going to be equal to the mean of y minus whatever we calculate here for m, times the mean of x. And we can do that because we know that the point (mean of x, mean of y) is going to be on this regression line. So let's calculate them. In the last example we did three points, and we only have four points here, but you'll see that the computations get more and more intense. You can imagine what would happen if you had 10 or 20 or 100 points. You pretty much have to use a calculator at that point. Or a computer, even better. Or a spreadsheet.
So let's calculate m. And to do that, let's calculate the components. So the mean of the x's is going to be equal to negative 2, plus negative 1, plus 1, plus 4, all of that over 4, because we have four x data points. The negative 1 and the 1 cancel out.
Negative 2 plus 4 is 2. 2 over 4 is equal to 1/2. Now let's do the mean of the y's. We have negative 3, we have a negative 1, and then we have a 2, and then we have a 3. And once again, we have four data points. The negative 3 and the 3 cancel out. Negative 1 plus 2 is 1. So this is equal to 1/4.
Now let's figure out the mean of the xy's. So x times y, the mean of that. Over here we have negative 2 times negative 3, which is positive 6. Plus negative 1 times negative 1, which is positive 1. Plus 1 times 2, which is 2. Plus 4 times 3, which is 12. And we have four of these points. And what is this? 6 plus 1 is 7. 7 plus 2 is 9. 9 plus 12 is 21, over 4. This is equal to 21/4.
And then finally, we want the mean of the x squareds. I'll do this in a new color. That is going to be equal to: negative 2 squared is positive 4. Plus negative 1 squared is positive 1. Plus 1 squared is 1. Plus 4 squared is 16. All of that over 4. 4 plus 1 plus 1 is 6, plus 16 is 22, over 4. So 22/4 is the same thing as 11/2.
So now we're ready to calculate the actual slope. Let me do it over here. I want to be able to look at everything we've done. So in this case, it's going to be the mean of the xy's, which is 21/4, minus the product of the mean of the x's, which is 1/2, times the mean of the y's, which is 1/4. And then all of that over the mean of the x squareds, which is 11/2, minus the mean of the x's, squared. The mean of the x's, once again, is 1/2, so that's 1/2 squared. And so what is this equal to? I could go straight to the calculator, because this isn't a review of adding and subtracting and multiplying fractions. But actually, let me simplify it first. It's just too tempting to simplify. Let me copy and paste it. Let's go down here to calculate it.
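The four component means worked out above can be double-checked with exact fractions. This is a small sketch rather than part of the video; the variable names are made up for illustration, and Python's standard `fractions` module is used so the results match the hand arithmetic exactly.

```python
from fractions import Fraction

# The four data points from this example, as (x, y) pairs.
points = [(-2, -3), (-1, -1), (1, 2), (4, 3)]
n = len(points)

mean_x = Fraction(sum(x for x, _ in points), n)         # mean of the x's
mean_y = Fraction(sum(y for _, y in points), n)         # mean of the y's
mean_xy = Fraction(sum(x * y for x, y in points), n)    # mean of the xy's
mean_x_sq = Fraction(sum(x * x for x, _ in points), n)  # mean of the x squareds

print(mean_x, mean_y, mean_xy, mean_x_sq)  # 1/2 1/4 21/4 11/2
```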
And so this is going to be, maybe I should have used the calculator, but it's too tempting. So what's this on top? On top, we have 21/4 minus 1/2 times 1/4, and 1/2 times 1/4 is 1/8, so that's 21/4 minus 1/8. All of that over 11/2 minus 1/2 squared, which is 1/4. Now, one way to simplify this right from the get-go is to multiply the numerator and the denominator by 8. And that's just to get rid of all these fractions. So 21/4 times 8 is going to be the same thing as 21 times 2, which is equal to 42. Minus 1/8 times 8. We have to, of course, distribute the 8. So it's going to be minus 1. All of that over: 8 times 11/2 is going to be 11 times 4, which is 44. And then 8 times 1/4 is 2, so it's minus 2. So 42 minus 1 is 41. And then 44 minus 2 is 42. So the slope is 41/42. 42/42 would be exactly 1, so our regression slope is a little bit less than 1.
And then our regression y-intercept, b, is going to be equal to the mean of the y's, so 1/4, minus our slope, 41/42, times the mean of the x's, so times 1/2. And so this is going to be equal to 1/4 minus 41/84. Let me just find a common denominator. So let's go over 84. What's 1/4 of 84? 1/4 of 80 is 20, so this is 21. 21 times 4 is 84. Yep, that's right. So it's going to be 21 minus 41, over 84, which is equal to negative 20 over 84. They're both divisible by 4: the numerator divided by 4 is negative 5, and the denominator divided by 4 is 21. So that's negative 5/21.
So our regression line is going to be y is equal to 41/42 x minus 5/21. And 5/21 is a little bit less than 1/4. 5/20 would be 1/4, and we made the denominator a little bit bigger, so it's a little bit less than 1/4. So our y-intercept is going to be negative, and a little bit closer to 0 than negative 1/4. And then we're going to have a slope a little bit less than 1. So our line is going to look something like this. If I were able to actually draw a straight line, it would look something like that over there.
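As the transcript says, a computer is the better tool once there are more than a handful of points. Here is a minimal sketch of the slope and intercept formulas as a function. The name `fit_line` is hypothetical, not something from the video, and exact fractions are used so the output can be compared with the hand calculation.

```python
from fractions import Fraction

def fit_line(points):
    """Return (m, b) for the least-squares line y = m*x + b through (x, y) pairs."""
    n = len(points)
    mean_x = Fraction(sum(x for x, _ in points), n)
    mean_y = Fraction(sum(y for _, y in points), n)
    mean_xy = Fraction(sum(x * y for x, y in points), n)
    mean_x_sq = Fraction(sum(x * x for x, _ in points), n)

    # Slope: (mean of xy's - mean x * mean y) / (mean of x squareds - (mean x)^2)
    m = (mean_xy - mean_x * mean_y) / (mean_x_sq - mean_x ** 2)
    # Intercept: the point (mean x, mean y) lies on the line, so b = mean y - m * mean x
    b = mean_y - m * mean_x
    return m, b

m, b = fit_line([(-2, -3), (-1, -1), (1, 2), (4, 3)])
print(m, b)                # 41/42 -5/21
print(float(m), float(b))  # decimal approximations of the same values
```

Note that the function would divide by zero if every x value were the same, since the denominator is then 0; a vertical set of points has no regression line of this form.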
So I'm going to leave you there in this video. In the next video, we're actually going to calculate the r squared for this line. How good of a fit is it? How much of the total variation in the y values can be explained by the variation in the x values, or by the line itself?
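As a preview of that calculation, here is a rough sketch. The formula is not given in this video, so it is an assumption here: r squared is taken in its usual form, 1 minus the squared error to the line divided by the total squared variation of the y's around their mean. The slope and intercept are the values found above, and the variable names are illustrative only.

```python
from fractions import Fraction

points = [(-2, -3), (-1, -1), (1, 2), (4, 3)]
m, b = Fraction(41, 42), Fraction(-5, 21)  # regression line from this video

mean_y = Fraction(sum(y for _, y in points), len(points))

# Squared error of the points to the regression line.
se_line = sum((y - (m * x + b)) ** 2 for x, y in points)
# Total squared variation of the y's around their mean.
se_mean = sum((y - mean_y) ** 2 for _, y in points)

# Assumed definition: fraction of the variation in y explained by the line.
r_squared = 1 - se_line / se_mean
print(r_squared, float(r_squared))
```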