
## Assessing the fit in least-squares regression


# R-squared or coefficient of determination

## Video transcript

In the last few videos, we saw that if we had n points, each of them with x- and y-coordinates. Let me draw n of those points. So let's call this point one. It has coordinates x1, y1. You have the second point over here. It has coordinates x2, y2. And we keep putting points up here and eventually we get to the nth point. That has coordinates xn, yn.

What we saw is that there is a line that we can find that minimizes the squared distance. This line right here, I'll call it y is equal to mx plus b. There's some line that minimizes the squared distance to the points. And let me just review what those squared distances are. Sometimes it's called the squared error. So this is the error between the line and point one. So I'll call that error one. This is the error between the line and point two. We'll call this error two. This is the error between
the line and point n.

So if you wanted the total error-- if you want the total squared error-- this is actually how we started off this whole discussion-- the total squared error between the points and the line, you literally just take the y-value of each point. So for example, you would take y1. That's this value right over here. You take y1 minus the y-value at this point on the line. Well, that point on the line is, essentially, the y-value you get when you substitute x1 into this equation. So I'll just substitute x1 into this equation: minus (m x1 plus b). This right here, that is the y-value right over here. That is m x1 plus b. I don't want to get my graph too cluttered, so I'll just delete that there. That is error one right over there. And we want the squared errors between each of the points and the line. So that's the first one. Then you do the same thing for the second point-- and we started our discussion this way-- y2 minus (m x2 plus b), squared, all the way-- I'll do dot dot dot to show that there are a bunch of these that we have to do-- until we get to the nth point-- all the way to yn minus (m xn plus b), squared.

And now that we actually know how to find these m's and b's-- I showed you the formula, and in fact, we've proved the formula-- we can find this line. And if we want to say, well, how much error is there, we can then calculate it, because we now know the m's and the b's. So we can calculate it for a certain set of data.
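That calculation is easy to sketch in code. Here's a minimal Python version; the function names (`best_fit`, `squared_error_of_line`) are my own, and the slope formula is the mean-based one derived in the earlier videos:

```python
def best_fit(xs, ys):
    # Least-squares slope and intercept from the formulas in the
    # earlier videos: m = (x̄·ȳ − mean(xy)) / (x̄² − mean(x²)).
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_x2 = sum(x * x for x in xs) / n
    m = (mean_x * mean_y - mean_xy) / (mean_x ** 2 - mean_x2)
    b = mean_y - m * mean_x
    return m, b

def squared_error_of_line(xs, ys, m, b):
    # Sum of (y_i - (m*x_i + b))^2 over all the points.
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
```

For points that lie exactly on a line, `best_fit` recovers that line and the squared error comes out zero; for real data it gives the smallest total squared error any line can achieve.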
Now, what I want to do is come up with a more meaningful estimate of how well this line fits the data points that we have. And to do that, we're going to ask ourselves the question: what percentage of the variation in y is described by the variation in x?

So let's think about this. How much of the total variation in y-- there's obviously variation in y. This y-value is over here. This point's y-value is over here. There is clearly a bunch of variation in the y. But how much of that is essentially described by the variation in x, or described by the line? So let's think about that. First, let's think about what the total variation is. So let's just figure out what the total variation in y is. It's really just a tool for measuring. When we think about variation-- and this is even true when we thought about variance, which was the mean variation in y-- we think about the squared distance from some central tendency, and the best central measure we can have of y is the arithmetic mean.

So we could just say, the total variation in y is just going to be the sum of the squared distances of each of the y's from their mean. So you get y1 minus the mean of all the y's, squared, plus y2 minus the mean of all the y's, squared, plus-- and you just keep going all the way to the nth y-value-- yn minus the mean of all the y's, squared. This gives you the total variation in y.

You can just take all the y-values, find their mean-- it'll be some value, maybe it's right over here someplace-- and you can even visualize it the same way we visualized the squared error from the line. So if you visualize it, you can imagine a line y is equal to the mean of y, which would look just like that. And what we're measuring over here-- this error right over here-- is the square of this distance right over here, between this point vertically and this line. The second one is going to be this distance, just right up to the line. And the nth one is going to be the distance from there all the way to the line right over there. And there are these other points in between.

This is the total variation in y. Makes sense. If you divide this by n, you're going to get what we typically associate as the variance of y, which is kind of the average squared distance. Now, we have the total squared distance.
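This sum translates directly into code. A small sketch (the function name `total_variation` is my own):

```python
def total_variation(ys):
    # Squared error from the mean: sum of (y_i - mean(y))^2.
    mean_y = sum(ys) / len(ys)
    return sum((y - mean_y) ** 2 for y in ys)
```

Dividing this quantity by the number of points gives the variance of y, exactly as described above.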
So what we want to do is figure out how much of the total variation in y is described by the variation in x. So maybe we can think of it this way. For our denominator, we want what percentage of the total variation in y? Let me write it this way. Let me call this the squared error from the average-- maybe I'll call this the squared error from the mean of y. And this is really the total variation in y. So let's put that as the denominator: the total variation in y, which is the squared error from the mean of the y's.

Now we want to know what percentage of this is described by the variation in x. Now, what is not described by the variation in x? We want to know how much is described by the variation in x. But what if we want how much of the total variation is not described by the regression line? Well, we already have a measure for that: the squared error of the line. This tells us the square of the distances from each point to our line. So it is exactly this measure. It tells us how much of the total variation is not described by the regression line. So if you want to know what percentage of the total variation is not described by the regression line, it would just be the squared error of the line-- because this is the total variation not described by the regression line-- divided by the total variation.

So let me make it clear. This, right over here, tells us what percentage of the total variation is not described by the variation in x, or by the regression line. So to answer our question-- what percentage is described by the variation? Well, the rest of it has to be described by the variation in x, because our question is what percent of the total variation is described by the variation in x, and this is the percentage that is not described. So if this number is 30%-- if 30% of the variation in y is not described by the line-- then the remainder will be described by the line. So we could essentially just subtract this from 1.

So if we take 1 minus the squared error between our data points and the line, over the squared error between the y's and the mean y, this actually tells us what percentage of the total variation is described by the line. You can view it as described by the line, or by the variation in x. And this number right here is called the coefficient of determination. It's just what statisticians have decided to name it. And it's also called R-squared. You might have even heard that term when people talk about regression.
term when people talk about regression. Now let's think about it. If the squared error of the
line is really small what does that mean? It means that these
errors, right over here, are really small. Which means that the line
is a really good fit. So let me write it over here. If the squared error of the
line is small, it tells us that the line is a good fit. Now, what would happen
over here? Well, if this number is really
small, this is going to be a very small fraction over here. 1 minus a very small fraction
is going to be a number close to 1. So then, our R-squared will be
close to 1, which tells us that a lot of the variation
in y is described by the variation in x. Which makes sense, because
the line is a good fit. You take the opposite case. If the squared error of the line
is huge, then that means there's a lot of error between
the data points and the line. So if this number is huge, then
this number over here is going to be huge. Or it's going to be a percentage
close to 1. And 1 minus that is going
to be close to 0. And so if the squared error of
the line is large, this whole thing's going to
be close to 1. And if this whole thing is
close to 1, the whole coefficient of determination,
the whole R-squared, is going to be close to 0, which
makes sense. That tells us that very little
of the total variation in y is described by the variation in
x, or described by the line. Well, anyway, everything I've
been dealing with so far has been a little bit
in the abstract. In the next video, we'll
actually look at some data samples and calculate their
regression line. And also calculate the
R-squared, and see how good of a fit it really is.
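As a rough preview of those two cases, here's a self-contained sketch with made-up sample data (not the data from the next video, and the function name is my own) showing R-squared close to 1 for nearly-linear points and close to 0 for scattered ones:

```python
def fit_and_r_squared(xs, ys):
    # Find the least-squares line, then compute R².
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_x2 = sum(x * x for x in xs) / n
    m = (mean_x * mean_y - mean_xy) / (mean_x ** 2 - mean_x2)
    b = mean_y - m * mean_x
    se_line = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    se_mean = sum((y - mean_y) ** 2 for y in ys)
    return 1 - se_line / se_mean

# Nearly-linear data: R-squared comes out close to 1 (good fit).
print(fit_and_r_squared([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]))
# Scattered data: R-squared comes out close to 0 (poor fit).
print(fit_and_r_squared([1, 2, 3, 4], [5, 1, 6, 2]))
```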