In the last several videos,
we did some fairly hairy mathematics. And you might have even
skipped them. But we got to a pretty
neat result. We got to a formula for the
slope and y-intercept of the best fitting regression line
when you measure the error by the squared distance
to that line. And our formula is, and I'll
just rewrite it here just so we have something
neat to look at. So the slope of that line is
going to be the mean of the x's times the mean of the y's minus
the mean of the xy's. And don't worry, this seems
really confusing, we're going to do an example of this
actually in a few seconds. Divided by the mean of the x's squared
minus the mean of the x squareds. And if this looks a little
different than what you see in your statistics class or your
textbook, you might see this swapped around. If you multiply both the
numerator and denominator by negative 1, you could see this
written as the mean of the xy's minus the mean of x times
the mean of the y's. All of that over the mean of the
x squareds minus the mean of the x's squared. These are obviously
the same thing. You're just multiplying the
numerator and denominator by negative 1, which is the same thing
as multiplying the whole thing by 1. And of course, whatever you get
for m, you can then just substitute back in this
to get your b. Your b is going to be equal
to the mean of the y's minus your m. Let me write that in yellow
so it's very clear. You solved for the m value. Minus m times the
mean of the x's. And this is all you need. So let's actually put
that into practice. So let's say I have three
points, and I'm going to make sure that these points
aren't collinear. Because, otherwise, it wouldn't
be interesting. So let me draw three
points over here. Let's say that one point
is the point 1 comma 2. So this is 1, 2. And then we also have
the point 2 comma 1. And then, let's say we also
have the point, let's do something a little bit
crazy, 4 comma 3. So this is 4, 3. So those are our three points. And what we want to do is find
the best fitting regression line, which we suspect
is going to look something like that. We'll see what it actually looks
like using our formulas, which we have proven. So a good place to start is just
to calculate these things ahead of time, and then
to substitute them back in the equation. So what's the mean of our x's? The mean of our x's is going
to be 1 plus 2 plus 4 divided by 3. And what's this going to be? 1 plus 2 is 3, plus 4
is 7 divided by 3. It is equal to 7/3. Now, what is the mean
of our y's? The mean of our y's is equal
to 2 plus 1 plus 3. All of that over 3. So 2 plus 1 is 3. Plus 3 is 6. And 6 divided by
3 is equal to 2. Now, what is the mean
of our xy's? So our first xy over
here is 1 times 2. Plus 2 times 1 plus 4 times 3. And we have three
of these xy's. So divided by 3. So what's this going
to be equal to? We have 2 plus 2, which is 4. 4 plus 12, which is 16. So it's going to be 16/3. And then the last one we have
to calculate is the mean of the x squareds. So what's the mean of
the x squareds? The first x squared is just
going to be 1 squared. Plus this 2 squared, plus
this 4 squared. And we have three data
points again. So this is 1 plus
4, which is 5. Plus 16. Is equal to 21/3, which
is equal to 7. So that worked out to a
pretty neat number. So let's actually find
our m's and our b's. So our slope, our optimal slope
for our regression line, the mean of the x's is
going to be 7/3. Times the mean of the y's. The mean of the y's is 2. Minus the mean of the xy's. Well, that's 16/3. And then, all of that over
the mean of the x's squared. The mean of the x's
is 7/3, so that's 7/3 squared. Minus the mean of
the x squareds. So it's going to be minus
this 7 right over here. And we just have to do a little
bit of mathematics. I'm tempted to get out my
calculator, but I'll resist the temptation. It's nice to keep things
as fractions. Let's see if I can
calculate this. This is 14/3 minus 16/3. All of that over,
this is 49/9. And then minus 7. If I wanted to express that as
something over 9, that's the same thing as 63/9. So in our numerator, we
get negative 2/3. And then in our denominator,
what's 49 minus 63? That's negative 14, so we get negative 14/9. And dividing by negative 14/9 is the same thing
as multiplying negative 2/3 by negative 9/14. Divide numerator and
denominator by 3. Well, the negatives are going
to cancel out first of all. You divide by 3. The 3 becomes a 1. The 9 becomes a 3. Divide by 2. The 2 becomes a 1. The 14 becomes a 7. So our slope is 3/7. Not too bad. Now, we can go back and figure
out our y-intercept. So let's figure out our
y-intercept using this right over here. So our y-intercept, b, is going
to be equal to the mean of the y's, the mean of the
y's is 2, minus our slope. We just figured out our
slope to be 3/7. Times the mean of the
x's, which is 7/3. These are just reciprocals
of each other, so they cancel out. Their product just becomes 1. So our y-intercept is literally
just 2 minus 1. So it equals 1. So we have the equation
for our line. Our regression line is going
to be y is equal to-- We figured out m. m is 3/7. y is equal to 3/7 x plus,
our y-intercept is 1. And we are done. So let's actually try
to graph this. So our y-intercept
is going to be 1. It's going to be right
over there. And the slope of our
line is 3/7. So for every 7 we
run, we rise 3. Or another way to think of
it, for every 3.5 we run, we rise 1.5. So we're going to go 1.5
right over here. So this line, if you were to
graph it, and obviously I'm hand drawing it, so it's not
going to be that exact, is going to look like that
right over there. And it actually won't go
directly through that point. So I don't want to give
you that impression. So it might look something
like this. And this line, we have shown,
minimizes the sum of the squared vertical distances
from each of these points to that line. Anyway, that was, at least
in my mind, pretty neat.
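
If you want to check this arithmetic on a computer, here is a minimal sketch in Python. It is not from the videos. The names `mean` and `least_squares_line` are just chosen for this illustration, and it uses Python's standard `fractions` module so that everything stays as exact fractions, the same way we did it by hand. It plugs the three points 1, 2 and 2, 1 and 4, 3 into the formulas for m and b.

```python
from fractions import Fraction


def mean(values):
    """Arithmetic mean, kept as an exact fraction."""
    values = list(values)
    return sum(values, Fraction(0)) / len(values)


def least_squares_line(points):
    """Return (m, b) for y = m*x + b using the mean-based formulas."""
    xs = [Fraction(x) for x, _ in points]
    ys = [Fraction(y) for _, y in points]

    mean_x = mean(xs)                              # mean of the x's
    mean_y = mean(ys)                              # mean of the y's
    mean_xy = mean(x * y for x, y in zip(xs, ys))  # mean of the xy's
    mean_x2 = mean(x * x for x in xs)              # mean of the x squareds

    # Denominator: (mean of the x's) squared minus the mean of the x squareds.
    denominator = mean_x ** 2 - mean_x2
    if denominator == 0:
        raise ValueError("all x values are equal, so the slope is undefined")

    m = (mean_x * mean_y - mean_xy) / denominator
    b = mean_y - m * mean_x
    return m, b


points = [(1, 2), (2, 1), (4, 3)]
m, b = least_squares_line(points)
print(m, b)  # prints: 3/7 1
```

The intermediate values match what we got by hand: 7/3, 2, 16/3, and 7 for the four means, then a slope of 3/7 and a y-intercept of 1.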