
# Introduction to residuals and least squares regression

## Video transcript

- [Narrator] So I'm
interested in finding the relationship between
people's height in inches and their weight in pounds. And so I'm randomly
sampling a bunch of people, measuring their heights,
measuring their weights, and then for each person
I'm plotting a point that represents their height
and weight combination. So for example, let's say I measure someone who is 60 inches tall, that'll
be about five feet tall, and they weigh 100 pounds. And so I'd go to 60 inches
and then 100 pounds, right over there. So that
point right over there is the point (60, 100). One way to think about
it, height we could say is being measured on our
X axis or plotted along our X axis and then
weight along our Y axis. And so this point from
this person is the point (60, 100), representing 60 inches, 100 pounds. And so I've done it
for one, two, three, four, five, six, seven, eight, nine
people, and I could keep going, but even with this I could say, well look, it looks like there's a roughly
linear relationship here. It looks like it's positive,
that generally speaking as height increases so does weight. Maybe I could try to put a line that can approximate this trend. Let me try to do that
so this is my line tool. I could think about a bunch of lines. With something like this,
most of the
data is below the line, so that seems like it's not right. I could do something like this, but that doesn't seem like a good fit either. Most of the data seems to be above the line. And so once again I'm
just eyeballing it here; in the future you will learn
better methods of finding a better fit. But something like this,
and I'm just eyeballing it, looks about right. So that line, you could view
this as a regression line. We could view this as y equals mx plus b, where we would have to
figure out the slope and the Y intercept, and
we could figure it out based on what I just drew. Or
we could even think of this as weight is equal to our slope times height plus whatever
our Y intercept is, if you think of the vertical
axis as the weight axis you could think of it as
your weight intercept. But either way, this is
the model that I'm getting just through eyeballing; this
is my regression line, something that I'm trying
to fit to these points. But clearly one line won't be able to
go through all of these points. There's going to be, for
each point some difference or not for all of them
but for many of them, some difference between the
actual and what would have been predicted by the line. And that idea, the difference
between the actual for a point and what would have been
predicted, given, say, the height, that is called a residual. Gonna write that down. A residual for each
of these data points. And so for example, if
I call this right here, if I call that point one,
the residual for point one is going to be, well, for our height variable, 60 inches, the actual here is 100 pounds. From that we would subtract
what would be predicted. So what would be predicted
is right over here. I could just substitute
60 into this equation so it would be M times 60 plus b. So I could write it as M, maybe let me write it
this way, 60 M plus B. Once again, I would just take the 60 inches and put it into my model here and say, well, what weight would
that have predicted. And just for the
sake of having a number here, let
me get my line tool out and try to get a straight
line from that point. So from this point let
me get a straight line. That doesn't look quite straight, okay, a little bit. Okay, so it looks like it's about 150 pounds. So my model would have
predicted 150 pounds. So the residual here is going
to be 100 minus 150, which is equal to negative 50. So a negative residual is when your actual is below your predicted. So this right over here, this is r1, it is
a negative residual. If you tried to find, let's say, this residual right
over here, for this point, this r2, this would be a positive residual, because the actual is larger than what would have
actually been predicted. And so a residual is good for saying, well, how well does your
line, your regression, your model fit a given data point, or how does a given data
point compare to that. What you probably want
to do is think about some combination of all the residuals and try to minimize it. Now you might say, well
why don't I just add up all the residuals and
try to minimize that. But that gets tricky
because some are positive and some are negative and
so a big negative residual could counterbalance a
big positive residual, and they would add up to zero, and then it would look like there's no residual. So you could just add
up the absolute values. So you could say, well
let me just take the sum of
the absolute values of all the residuals. And then let me change M and B for my line to minimize this, and
that would be a technique of trying to create a regression line. But another way to do
it and this is actually the most typical way that
you will see in statistics is that people take the sum of the squares of the residuals. The sum of the squares, and
when you square something, whether it's negative or positive, the result is never negative,
so it takes care of that issue of negatives and positives
canceling out with each other. And when you square a number,
things with large residuals are gonna become even
larger, relatively speaking. If you think about it this way,
let me put regular numbers, one, two, three, four. These are all one apart from each other, but if I were to square
them, one, four, nine, 16, they get further and further apart. And so the
larger the residual is, when you square it,
the bigger the proportion of the sum of squares it is going to represent. And so what we'll see in future videos is that there is a technique called least squares regression, where you can find an M and
a B for a given set of data so that it minimizes the sum of
the squares of the residuals. And that's valuable, and the
reason why this is used most is that it really tries to take into account things that are significant outliers. Things that sit pretty
far away from the model, something like this, with a least squares regression are going to be weighted
a little bit heavier, because when you square it, it becomes an even bigger factor in this. But this is just a
conceptual introduction. In future videos we'll do
things like calculate residuals. And we'll actually derive the formula for how you figure out
an M and a B for a line that actually minimizes
the sum of the squares of the residuals.
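
The ideas in the transcript can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the narrator's own method: the only data point taken from the video is (60 inches, 100 pounds), the other eight points are made up, and the "eyeballed" slope and intercept are hypothetical values chosen so that the line predicts 150 pounds at 60 inches, as in the video. The function names are also invented for this sketch. The least squares fit uses the standard closed-form result for a line (slope = sum of products of deviations divided by sum of squared height deviations), which the video defers to later lessons.

```python
# Hypothetical data: only the first point (60 inches, 100 pounds) comes from
# the transcript; the other eight points are made up for illustration.
points = [
    (60, 100), (62, 165), (64, 160), (66, 190), (68, 185),
    (70, 205), (72, 215), (74, 215), (76, 235),
]


def predict(height, m, b):
    """Weight predicted by the line: weight = m * height + b."""
    return m * height + b


def residuals(data, m, b):
    """Residual for each point: actual weight minus predicted weight."""
    return [weight - predict(height, m, b) for height, weight in data]


def sum_abs(res):
    """Sum of the absolute values of the residuals."""
    return sum(abs(r) for r in res)


def sum_sq(res):
    """Sum of the squares of the residuals."""
    return sum(r * r for r in res)


def least_squares_line(data):
    """Slope and intercept minimizing the sum of squared residuals.

    Uses the standard closed-form result for fitting a line; it is not
    derived in this video.
    """
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Hypothetical eyeballed line: predicts 5 * 60 - 150 = 150 pounds at 60 inches.
eyeball_m, eyeball_b = 5, -150
eyeball_res = residuals(points, eyeball_m, eyeball_b)
r1 = eyeball_res[0]
print(r1)  # -50: the actual (100) is below the predicted (150)

# Plain sums let positive and negative residuals cancel out.
print(sum([-50, 50]), sum_abs([-50, 50]))  # 0 100

# Squaring spreads values apart: 1, 2, 3, 4 become 1, 4, 9, 16.
squares = [n * n for n in (1, 2, 3, 4)]

# Compare the eyeballed line with the least squares line.
ls_m, ls_b = least_squares_line(points)
ls_res = residuals(points, ls_m, ls_b)
print(sum_sq(eyeball_res), sum_sq(ls_res))
```

With any data set, the sum of squared residuals for the least squares line will be no larger than the one for an eyeballed line, which is exactly the sense in which it is the "best" fit here.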