## More on regression

# Calculating R-squared

## Video transcript

In the last video, we were able
to find the equation for the regression line for these
four data points. What I want to do in this video
is figure out the r squared for these data points. Figure out how well this
line fits the data. Or even better, figure out the
percentage-- which is really the same thing-- of the
variation of these data points, especially the variation
in y, that is due to, or that can be explained
by variation in x. And to do that, I'm actually
going to get a spreadsheet out. I've actually tried to do this
with a calculator and it's much harder. So hopefully this doesn't
confuse you too much to use a spreadsheet. And I'm going to make a couple
of columns here. And spreadsheets actually have
functions that'll do all of this automatically, but I really
want to do it so that you could do it by hand
if you had to. So I'm going to make a couple
of columns here. This is going to
be my x column. This is going to
be my y column. This is going to be the column--
I'll call this y star-- this'll be the y value
that our line predicts based on our x value. This is going to be the
error with the line. Let me call it the squared
error with the line. I don't want us to take
up too much space. And then the next one, I'm
going to have the squared variation for that y value
from the mean y. And I think these columns by
themselves will be enough for us to do everything. So let's first put all
the data points in. So we had negative 2
comma negative 3. That was one data point. Negative 1 comma negative 1. And we had 1 comma 2. Then we have 4 comma 3. Now, what does our
line predict? Well our line says, you give
me an x value, I'm going to tell you what y value
I'll predict. So when x is equal to negative
2, the y value on the line is going to be the slope. So this is going to be equal
to 41 divided by 42 times our x value. And I just selected that cell. And just a little bit of a
primer on spreadsheets, I'm selecting the cell D2. I was able to just move my
cursor over and select that. But that tells me the x value. Minus 5/21. Minus 5 divided by 21. Just like that. So just to be clear of what
we're even doing. This y star here, I
got negative 2.19. That tells us that this
point right over here is negative 2.19. So when we figure out the error,
we're going to figure out the distance between
negative 3, that's our y value, and negative 2.19. So let's do that. So the error is just going to
be equal to our y value. That's cell E2. Minus the value that our
line would predict. So just that value is
the actual error. But we want to square it. And then, the next thing
we want to do is the squared distance. So this is equal to the squared
distance of our y value from the y's mean. So what's the mean of the y's? Mean of the y's is 1/4. So minus 0.25, which is the
same thing as 1/4. And we also want
to square that. Now, this is what's fun
about spreadsheets. I can apply those formulas
to every row now. And notice, what it did
when I did that. Now all of a sudden, this is the
y value that my line would predict, it's now using
this x value and sticking it over here. It's now figuring out the square
distance from the line using what the line would
predict and using the y value, this one. And then does the same
thing over here. It figures out the squared
distance of this y value from the mean. So what is the total squared
error with the line? So let me just sum this up. The total squared error
with the line is 2.73. And then the total variation
from the mean, squared distances from the mean
of the y, are 22.75. So let me be very clear
what this is. So let me write these
numbers down. I'll write it up here so we
can keep looking at this actual graph. So our squared error versus our
line, our total squared error, we just computed
to be 2.74. I rounded a little bit. And what that is, is you take
each of these data points' vertical distance to the line. So this distance squared, plus
this distance squared, plus this distance squared, plus
this distance squared. That's all we just calculated
on Excel. And that total squared variation
to the line is 2.74. Or total squared error
with the line. And then the other number we
figured out was the total squared distance from the mean. So the mean here is
y is equal to 1/4. So that's going to be
right over here. This is 1/2. So right over here. So this is our mean y value. Or the central tendency
for our y values. And so what we calculated next
was the total error, the squared error, from the
mean of our y values. That's what we calculated over
here in the spreadsheet. You see in the formula. It is this number, E2, minus
0.25, which is the mean of our y's, squared. That's exactly what
we calculated. We calculated for each
of the y values. And then we summed
them all up. It's 22.75. It is equal to 22.75. So this 2.74 is essentially
the error that the line does not explain. And this 22.75 is the total error,
the total variation of the numbers. So if you wanted to know the
percentage of the total variation that is not explained
by the line, you could take this number divided
by this number. So 2.74 over 22.75. This tells us the percentage
of total variation not explained by the line or
by the variation in x. And so what is this number
going to be? I can just use Excel for this. So I'm just going to divide this
number divided by this number right over there. I get 0.12. So this is equal to 0.12. Or another way to think about
it is 12% of the total variation is not explained
by the variation in x. The total squared distance
between each of the points or their kind of spread, their
variation, is not explained by the variation in x. So if you want the amount that
is explained by the variance in x, you just subtract
that from 1. So let me write it
right over here. So we have our r squared, which
is the percent of the total variation that is
explained by x, is going to be 1 minus that 0.12 that
we just calculated. Which is going to be 0.88. So our r squared here is 0.88. It's very, very close to 1. The highest number
it can be is 1. So what this tells us, or a way
to interpret this, is that 88% of the total variation of
these y values is explained by the line or by the
variation in x. And you can see that it looks
like a pretty good fit. Each of these aren't too far. Each of these points are
definitely much closer to the line than they are
to the mean line. In fact, all of them are closer
to our actual line than to the mean.
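
If you'd rather not use a spreadsheet, the same columns can be sketched in a few lines of Python. This is only an illustration, not what was used in the video: the data points and the line y = (41/42)x - 5/21 come from the video, but the function name `r_squared` and the variable names are made up here.

```python
# The four data points from the video, as (x, y) pairs.
points = [(-2, -3), (-1, -1), (1, 2), (4, 3)]

# Regression line found in the previous video: y = (41/42) x - 5/21.
slope = 41 / 42
intercept = -5 / 21


def r_squared(points, slope, intercept):
    """Return (squared error vs. line, squared variation from mean y, r squared).

    Hypothetical helper that mirrors the spreadsheet columns.
    """
    ys = [y for _, y in points]
    mean_y = sum(ys) / len(ys)

    # "y star" column: the y value the line predicts for each x.
    predicted = [slope * x + intercept for x, _ in points]

    # Squared error with the line, summed over all points.
    se_line = sum((y - y_star) ** 2 for y, y_star in zip(ys, predicted))

    # Squared variation of each y from the mean of the y's, summed.
    se_mean = sum((y - mean_y) ** 2 for y in ys)

    # Share of the variation the line does NOT explain, subtracted from 1.
    return se_line, se_mean, 1 - se_line / se_mean


se_line, se_mean, r2 = r_squared(points, slope, intercept)
print(f"squared error vs. line: {se_line:.2f}")   # 2.74
print(f"variation from mean y:  {se_mean:.2f}")   # 22.75
print(f"r squared:              {r2:.2f}")        # 0.88
```

The three printed numbers line up with what we got above: about 2.74 for the squared error with the line, 22.75 for the total variation, and an r squared of about 0.88.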