More on regression
In the last video, we were able to find the equation for the regression line for these four data points. What I want to do in this video is figure out the r squared for these data points. Figure out how well this line fits the data. Or even better, figure out the percentage, which is really the same thing, of the variation of these data points, especially the variation in y, that is due to, or that can be explained by, variation in x. And to do that, I'm actually going to get a spreadsheet out. I've actually tried to do this with a calculator and it's much harder. So hopefully this doesn't confuse you too much to use a spreadsheet. And spreadsheets actually have functions that'll do all of this automatically, but I really want to do it so that you could do it by hand if you had to. So I'm going to make a couple of columns here. This is going to be my x column. This is going to be my y column. This is going to be the column, I'll call this y star, that'll be the y value that our line predicts based on our x value. This is going to be the error with the line. Let me call it the squared error with the line. I don't want us to take up too much space. And then the next one, I'm going to have the squared variation for that y value from the mean y. And I think these columns by themselves will be enough for us to do everything. So let's first put all the data points in. So we had negative 2 comma negative 3. That was one data point. Negative 1 comma negative 1. And we had 1 comma 2. Then we have 4 comma 3. Now, what does our line predict? Well, our line says, you give me an x value, I'm going to tell you what y value I'll predict. So when x is equal to negative 2, the y value on the line is going to be the slope. So this is going to be equal to 41 divided by 42 times our x value. And I just selected that cell. And just a little bit of a primer on spreadsheets, I'm selecting the cell D2.
I was able to just move my cursor over and select that. But that tells me the x value. Minus 5/21. Minus 5 divided by 21. Just like that. So just to be clear on what we're even doing. This y star here, I got negative 2.19. That tells us that this point right over here is negative 2.19. So when we figure out the error, we're going to figure out the distance between negative 3, that's our y value, and negative 2.19. So let's do that. So the error is just going to be equal to our y value. That's cell E2. Minus the value that our line would predict. So just that value is the actual error. But we want to square it. And then, the next thing we want to do is the squared distance. So this is equal to the squared distance of our y value from the y's mean. So what's the mean of the y's? Mean of the y's is 1/4. So minus 0.25, which is the same thing as 1/4. And we also want to square that. Now, this is what's fun about spreadsheets. I can apply those formulas to every row now. And notice what it did when I did that. Now all of a sudden, this is the y value that my line would predict, it's now using this x value and sticking it over here. It's now figuring out the squared distance from the line using what the line would predict and using the y value, this one. And then it does the same thing over here. It figures out the squared distance of this y value from the mean. So what is the total squared error with the line? So let me just sum this up. The total squared error with the line is 2.73. And then the total variation from the mean, the squared distances from the mean of the y, is 22.75. So let me be very clear what this is. So let me write these numbers down. I'll write it up here so we can keep looking at this actual graph. So our squared error versus our line, our total squared error, we just computed to be 2.74. I rounded a little bit. And what that is, is you take each of these data points' vertical distance to the line.
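For anyone following along without a spreadsheet, here is a rough sketch of the same columns in Python. The variable names are my own labels, not spreadsheet functions, and the slope 41/42 and intercept -5/21 are the ones quoted for the line from the last video.

```python
# The four data points entered in the x and y columns.
xs = [-2, -1, 1, 4]
ys = [-3, -1, 2, 3]

# Regression line quoted in the video: y* = (41/42) * x - 5/21.
slope = 41 / 42
intercept = -5 / 21

# Mean of the y values (1/4).
mean_y = sum(ys) / len(ys)

# One entry per spreadsheet row.
y_star = [slope * x + intercept for x in xs]               # value the line predicts
sq_err_line = [(y - p) ** 2 for y, p in zip(ys, y_star)]   # squared error with the line
sq_dev_mean = [(y - mean_y) ** 2 for y in ys]              # squared distance from mean y

# Column totals, the two numbers summed at the bottom of the sheet.
total_sq_err_line = sum(sq_err_line)
total_sq_dev_mean = sum(sq_dev_mean)

for row in zip(xs, ys, y_star, sq_err_line, sq_dev_mean):
    print("  ".join(f"{v:8.3f}" for v in row))
print(f"totals: {total_sq_err_line:.3f}  {total_sq_dev_mean:.2f}")
```

The unrounded total squared error with the line comes out near 2.738, which is why it shows up as 2.73 or 2.74 depending on how it is rounded.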
So this distance squared, plus this distance squared, plus this distance squared, plus this distance squared. That's all we just calculated in Excel. And that total squared variation to the line is 2.74. Or total squared error with the line. And then the other number we figured out was the total squared distance from the mean. So the mean here is y is equal to 1/4. So that's going to be right over here. This is 1/2. So right over here. So this is our mean y value. Or the central tendency for our y values. And so what we calculated next was the total error, the squared error, from the mean of our y values. That's what we calculated over here in the spreadsheet. You see in the formula. It is this number, E2, minus 0.25, which is the mean of our y's, squared. That's exactly what we calculated. We calculated it for each of the y values. And then we summed them all up. It's 22.75. It is equal to 22.75. So this is essentially the error that the line does not explain. This is the total error, this is the total variation of the numbers. So if you wanted to know the percentage of the total variation that is not explained by the line, you could take this number divided by this number. So 2.74 over 22.75. This tells us the percentage of total variation not explained by the line or by the variation in x. And so what is this number going to be? I can just use Excel for this. So I'm just going to divide this number by this number right over there. I get 0.12. So this is equal to 0.12. Or another way to think about it is 12% of the total variation is not explained by the variation in x. The total squared distance between each of the points, or their kind of spread, their variation, is not explained by the variation in x. So if you want the amount that is explained by the variation in x, you just subtract that from 1. So let me write it right over here.
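Written out as a formula, the unexplained fraction looks like this. The symbols SE_line and SE_ȳ are just my shorthand for the two totals (squared error with the line, and squared distance from the mean of y); they are not notation from the video.

```latex
\frac{SE_{\text{line}}}{SE_{\bar{y}}}
  = \frac{2.74}{22.75}
  \approx 0.12
```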
So we have our r squared, which is the percent of the total variation that is explained by x, is going to be 1 minus that 0.12 that we just calculated. Which is going to be 0.88. So our r squared here is 0.88. It's very, very close to 1. The highest number it can be is 1. So what this tells us, or a way to interpret this, is that 88% of the total variation of these y values is explained by the line or by the variation in x. And you can see that it looks like a pretty good fit. Each of these isn't too far. Each of these points is definitely much closer to the line than it is to the mean line. In fact, all of them are closer to our actual line than to the mean.
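Putting the whole calculation in one place, a minimal sketch in Python might look like the following. The function name `r_squared` and its parameters are my own choices for illustration, and it assumes the y values are not all identical (otherwise the total variation is zero and the division fails).

```python
def r_squared(xs, ys, slope, intercept):
    """Fraction of the variation in ys explained by the line y = slope * x + intercept."""
    mean_y = sum(ys) / len(ys)
    # Total squared error with the line (the part the line does not explain).
    se_line = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Total squared distance of the y values from their mean (the total variation).
    se_mean = sum((y - mean_y) ** 2 for y in ys)
    return 1 - se_line / se_mean


# The four data points and the line from this video.
r2 = r_squared([-2, -1, 1, 4], [-3, -1, 2, 3], 41 / 42, -5 / 21)
print(round(r2, 2))  # 0.88
```

Dividing before rounding gives the same 0.88 as the 1 minus 0.12 worked out by hand.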