More on regression
Second regression example
Let's find the equation for the regression line that best fits these points, where the fit minimizes the squared distance to each of the points. And then let's actually calculate how good of a fit it is using an r squared. And we might have to do that in the next video, depending on time. So just as a reminder, the line is going to have the equation y is equal to mx plus b. And we've shown ourselves that the slope of this line, the one that minimizes the squared distance to each of those points, is going to be the mean of the xy's minus the mean of x times the mean of y. All of that over the mean of the x squareds minus the square of the mean of the x's. So one way to memorize it, I guess, is that the first terms take the mean of the combined things. You're multiplying x times itself first, then taking the mean. You're multiplying x times y first, then taking the mean. And then in the second terms, you're finding the means of the individual components and then multiplying. Mean of x times mean of y, mean of x times mean of x. So hopefully maybe that helps. Maybe it doesn't. But we can calculate the slope. And then the y-intercept, b, is just going to be equal to the mean of y minus whatever we calculate here for m, times the mean of x. And we can do that because we know that the point mean of x comma mean of y is going to be on this regression line. So let's calculate them. And you'll see, in the last example we did three points. We only have four points here. But the computations get more and more intense. You can imagine what would happen if you had 10 or 20 or 100 points. You pretty much have to use a calculator at that point. Or a computer, even better. Or a spreadsheet. So let's calculate m. And to do that, let's calculate the components. So the mean of x, the mean of the x's, is going to be equal to, this x is negative 2, plus negative 1, plus 1, plus 4. All of that over 4, since we have four x data points. The negative 1 and the 1 cancel out.
Negative 2 plus 4 is 2. 2 over 4 is equal to 1/2. Now let's do the mean of the y's. We have negative 3, we have a negative 1. And then we have a 2, and then we have a 3. And once again, we have four data points. The negative 3 and the 3 cancel out. Negative 1 plus 2 is 1. So this is equal to 1/4. Now let's figure out the mean of the xy's. So x times y, the mean of that. So over here we have negative 2 times negative 3. Negative 2 times negative 3 is positive 6. Plus negative 1 times negative 1 is positive 1. Plus 1 times 2 is 2. Plus 4 times 3 is 12. And we have four of these points. And what is this? This is 6 plus 1 is 7. 7 plus 2 is 9. 9 plus 12 is 21, over 4. This is equal to 21/4. And then finally, we want, I'll do this in a new color, the mean of the x squareds. And so that is going to be equal to, negative 2 squared is positive 4. Plus negative 1 squared is positive 1. Plus 1 squared is 1. Plus 4 squared is 16. All of that over 4. 4 plus 1 plus 1 is 6, plus 16 is 22, over 4. So 22/4 is the same thing as 11/2. So now we're ready to calculate the actual slope. Let me do it over here. Well actually, let me do it over here. I want to be able to look at everything we've done. So the slope is going to be equal to, in this case, it's going to be the mean of the xy's, which is 21/4. Minus the product of the mean of x, which is 1/2, times the mean of the y's, which is 1/4. And then all of that over the mean of the x squareds, which is 11/2. So we did that. Minus the square of the mean of the x's. The mean of the x's, once again, is 1/2. And so what is this equal to? I'm just going to go straight to the calculator. I could deal with the fractions, but this isn't a review of adding and subtracting and multiplying fractions. Let's just go straight to the calculator. Actually, let me simplify it before. It's just too tempting to simplify. Let me copy and paste it. Let's go down here to calculate it.
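If you want to check the four means without doing the fractions by hand, here is a minimal sketch in Python. The names `points` and `mean` are just labels chosen for this illustration, and the (x, y) pairs are the four data points read off from the sums above. It uses exact fractions so the results match the hand arithmetic.

```python
from fractions import Fraction

# The four (x, y) data points used in this example.
points = [(-2, -3), (-1, -1), (1, 2), (4, 3)]


def mean(values):
    """Return the exact arithmetic mean of some integers as a Fraction."""
    values = list(values)
    return Fraction(sum(values), len(values))


mean_x = mean(x for x, _ in points)         # mean of the x's
mean_y = mean(y for _, y in points)         # mean of the y's
mean_xy = mean(x * y for x, y in points)    # mean of the xy's
mean_x_sq = mean(x * x for x, _ in points)  # mean of the x squareds

print(mean_x, mean_y, mean_xy, mean_x_sq)  # prints: 1/2 1/4 21/4 11/2
```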
And so this is going to be, maybe I should have used the calculator, but it's too tempting. So what's this on top? On top, we have 21/4 minus 1/2 times 1/4, which is 1/8. All of that over 11/2 minus 1/2 squared, which is 1/4. Now, one way to simplify this right from the get-go is to multiply the numerator and the denominator by 8. And that's just to get rid of all these fractions. So 21/4 times 8 is going to be the same thing as 21 times 2, which is equal to 42. Minus 1/8 times 8. We have to, of course, distribute the 8. So it's going to be minus 1. All of that over, 8 times 11/2 is going to be 11 times 4, which is 44. And then 8 times 1/4 is 2, so it's minus 2. So 42 minus 1 is 41. And then 44 minus 2 is 42. So the slope is 41/42. So a little bit less than a slope of 1. 42/42 would be exactly 1. So our regression slope is a little bit less than 1. And then our regression y-intercept, b, is going to be equal to the mean of the y's, so 1/4, minus our slope, 41/42, times the mean of the x's, so times 1/2. And so this is going to be equal to 1/4 minus 41/84, which is equal to, let me just find a common denominator. So let's go over 84. So what's 1/4 of 84? 1/4 of 80 is 20. So this is 21. 21 times 4 is 84. This is 1/4 of 84. Yep, that's right. So it's going to be 21 minus 41 over 84, which is equal to negative 20 over 84. They're both divisible by 4, so that's the same thing as negative 5 over 21. So our regression line is going to be y is equal to 41/42 x minus 5/21. And 5/21 is a little bit less than 1/4. 5/20 would be 1/4. We made the denominator a little bit bigger, so negative 5/21 is a little bit closer to zero than negative 1/4. So our y-intercept is going to sit just above negative 1/4. And then we're going to have a slope a little bit less than 1. So our line is going to look something like this. If I were able to actually draw a straight line, it would look something like that over there.
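Since the arithmetic only gets heavier with more points, here is a rough sketch of the same slope and intercept calculation in Python. The function name `fit_line` is made up for this illustration, and it assumes the x values are not all the same (otherwise the denominator is zero). It applies the two formulas from this video directly and uses exact fractions so the output can be compared with 41/42 and negative 5/21.

```python
from fractions import Fraction

# The four (x, y) data points used in this example.
points = [(-2, -3), (-1, -1), (1, 2), (4, 3)]


def fit_line(points):
    """Return (m, b) for the least-squares line y = m*x + b.

    Uses the formulas from this example:
        m = (mean(xy) - mean(x) * mean(y)) / (mean(x^2) - mean(x)^2)
        b = mean(y) - m * mean(x)
    Assumes the x values are not all identical.
    """
    n = len(points)
    mean_x = Fraction(sum(x for x, _ in points), n)
    mean_y = Fraction(sum(y for _, y in points), n)
    mean_xy = Fraction(sum(x * y for x, y in points), n)
    mean_x_sq = Fraction(sum(x * x for x, _ in points), n)

    m = (mean_xy - mean_x * mean_y) / (mean_x_sq - mean_x ** 2)
    b = mean_y - m * mean_x
    return m, b


m, b = fit_line(points)
print(m, b)  # prints: 41/42 -5/21
```

Passing a longer list of points to `fit_line` works the same way, which is the spreadsheet-or-computer route mentioned earlier for 10, 20, or 100 points.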
So I'm going to leave you there in this video. In the next video, we're actually going to calculate the r squared for this line. How good of a fit is it? How much of the total variation in the y values can be explained by the variation in the x values, or by the line itself?