R-Squared or Coefficient of Determination
- In the last few videos, we saw that if we had n points, each
- of them has x and y-coordinates.
- Let me draw n of those points.
- So let's call this point one.
- It has coordinates x1, y1.
- You have the second point over here.
- It had coordinates x2, y2.
- And we keep putting points up here and eventually we get to
- the nth point.
- That has coordinates xn, yn.
- What we saw is that there is a line that we can find that
- minimizes the squared distance.
- This line right here, I'll call it y, is
- equal to mx plus b.
- There's some line that minimizes the square distance
- to the points.
- And let me just review what those squared distances are.
- Sometimes, it's called the squared error.
- So this is the error between the line and point one.
- So I'll call that error one.
- This is the error between the line and point two.
- We'll call this error two.
- This is the error between the line and point n.
- So if you wanted the total error, if you want the total
- squared error-- this is actually how we started off
- this whole discussion-- the total squared error between
- the points and the line, you literally just take the y
- value of each point.
- So for example, you would take y1.
- That's this value right over here, you take y1 minus the y
- value at this point in the line.
- Well, that point in the line is, essentially, the y value
- you get when you substitute x1 into this equation.
- So I'll just substitute x1 into this equation.
- So minus m x1 plus b.
- This right here, that is this y value right over here.
- That is m x1 plus b.
- I don't want to get my graph too cluttered.
- So I'll just delete that there.
- That is error one right over there.
- And we want the squared errors between each of the
- points and the line.
- So that's the first one.
- Then you do the same thing for the second point.
- And we started our discussion this way.
- y2 minus m x2 plus b squared, all the way-- I'll do dot dot
- dot to show that there are a bunch of these that we have to
- do until we get to the nth point-- all the way to yn
- minus m xn plus b squared.
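Written out, the sum just described looks like the following. The label SE_line is shorthand chosen here for "squared error of the line"; the parentheses make explicit that the whole quantity m x plus b is subtracted from each y before squaring.

```latex
SE_{\text{line}} = \bigl(y_1 - (m x_1 + b)\bigr)^2 + \bigl(y_2 - (m x_2 + b)\bigr)^2 + \cdots + \bigl(y_n - (m x_n + b)\bigr)^2
                 = \sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)^2
```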
- And now that we actually know how to find these m's and b's,
- I showed you the formula.
- And in fact, we've proved the formula.
- We can find this line.
- And if we want to say, well, how much error is there?
- We can then calculate it.
- Because we now know the m's and the b's.
- So we can calculate it for a certain set of data.
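As a rough sketch of that calculation, the Python below fits a line and then totals the squared error. The transcript does not restate the formula for m and b, so the code assumes the standard least-squares solution. The function names and the sample data are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for the line y = m*x + b.

    Assumes the standard closed-form least-squares solution.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: summed co-deviation of x and y over summed squared deviation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line passes through the point (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b


def squared_error_line(xs, ys, m, b):
    """Total squared vertical distance between the points and the line."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Hypothetical sample data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 6.0]

m, b = fit_line(xs, ys)
se_line = squared_error_line(xs, ys, m, b)
print(f"m = {m:.2f}, b = {b:.2f}, squared error of the line = {se_line:.2f}")
```

For this made-up data the fitted line is y = 1.4x + 0.5 and the total squared error is 0.2.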
- Now, what I want to do is kind of come up with a more
- meaningful estimate of how good this line is fitting the
- data points that we have. And to do that, we're going to ask
- ourselves the question, what percentage of the variation in
- y is described by the variation in x?
- So let's think about this.
- How much of the total variation in y-- there's
- obviously variation in y.
- This y value is over here.
- This point's y value is over here.
- There is clearly a bunch of variation in the y.
- But how much of that is essentially described by the
- variation in x?
- Or described by the line?
- So let's think about that.
- First, let's think about what the total variation is.
- How much of the total variation in y?
- So let's just figure out what the total variation in y is.
- It's really just a tool for measuring.
- When we think about variation, and this is even true when we
- thought about variance, which was the mean variation in y.
- If you think about the squared distance from some central
- tendency, and the best central measure we can have of y is
- the arithmetic mean.
- So we could just say, the total variation in y is just
- going to be the sum of the squared distances of each of the y's from the mean.
- So you get y1 minus the mean of all the y's squared.
- Plus y2 minus the mean of all the y's squared.
- Plus, and you just keep going all the way
- to the nth y value.
- To yn minus the mean of all the y's squared.
- This gives you the total variation in y.
- You can just take out all the y values.
- Find their mean.
- It'll be some value, maybe it's
- right over here someplace.
- And so you can even visualize it the same way we visualized
- the squared error from the line.
- So if you visualize it, you can imagine a line that's y is
- equal to the mean of y.
- Which would look just like that.
- And what we're measuring over here, this error right over
- here, is the square of this distance right over here.
- Between this point vertically and this line.
- The second one is going to be this distance.
- Just right up to the line.
- And the nth one is going to be the distance from there all
- the way to the line right over there.
- And there are these other points in between.
- This is the total variation in y.
- Makes sense.
- If you divide this by n, you're going to get what we
- typically associate as the variance of y, which is kind
- of the average squared distance.
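In symbols, the total variation in y and its relation to the variance can be written as below. The label SE with a y-bar subscript is shorthand chosen here for "squared error from the mean of y", and y-bar denotes the arithmetic mean of the y values.

```latex
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i,
\qquad
SE_{\bar{y}} = \sum_{i=1}^{n} (y_i - \bar{y})^2,
\qquad
\operatorname{Var}(y) = \frac{SE_{\bar{y}}}{n}
```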
- Now, we have the total squared distance.
- So what we want to do is-- how much of the total variation in
- y is described by the variation in x?
- So maybe we can think of it this way.
- So our denominator, we want what percentage of the total
- variation in y?
- Let me write it this way.
- Let me call this the squared error from the average.
- Maybe I'll call this the squared error
- from the mean of y.
- And this is really the total variation in y.
- So let's put that as the denominator.
- The total variation in y, which is the squared error
- from the mean of the y's.
- Now we want to know what percentage of this is described by the
- variation in x.
- Now, what is not described by the variation in x?
- We want to know how much is described by the
- variation in x.
- But what if we want how much of the total variation is not
- described by the regression line?
- Well, we already have a measure for that.
- We have the squared error of the line.
- This tells us the square of the distances from each point
- to our line.
- So it is exactly this measure.
- It tells us how much of the total variation is not
- described by the regression line.
- So if you want to know what percentage of the total
- variation is not described by the regression line, it would
- just be the squared error of the line, because this is the
- total variation not described by the regression line,
- divided by the total variation.
- So let me make it clear.
- This, right over here, tells us what percentage of the
- total variation is not described by the
- variation in x.
- Or by the regression line.
- So to answer our question, what percentage is described
- by the variation in x?
- Well, the rest of it has to be described by the
- variation in x.
- Because our question is what percent of the total variation
- is described by the variation in x.
- This is the percentage that is not described.
- So if this number is 30%-- if 30% of the variation in y is
- not described by the line, then the remainder will be
- described by the line.
- So we could essentially just subtract this from 1.
- So if we take 1 minus the squared error between our data
- points and the line over the squared error between the y's
- and the mean y, this actually tells us what percentage of
- total variation is described by the line.
- You can either view it as described by the line or by
- the variation in x.
- And this number right here, this is called the coefficient
- of determination.
- It's just what statisticians have decided to name it.
- And it's also called R-squared.
- You might have even heard that term when people talk about
- regression.
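Put as a formula, with SE_line standing for the squared error between the data points and the line and SE with a y-bar subscript standing for the squared error between the y's and their mean (both labels are shorthand chosen here), the coefficient of determination is:

```latex
R^2 = 1 - \frac{SE_{\text{line}}}{SE_{\bar{y}}}
    = 1 - \frac{\sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

With the 30% figure used above, the fraction is 0.30, so R-squared is 1 minus 0.30, or 0.70.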
- Now let's think about it.
- If the squared error of the line is really small
- what does that mean?
- It means that these errors, right over
- here, are really small.
- Which means that the line is a really good fit.
- So let me write it over here.
- If the squared error of the line is small, it tells us
- that the line is a good fit.
- Now, what would happen over here?
- Well, if this number is really small, this is going to be a
- very small fraction over here.
- 1 minus a very small fraction is going to be a
- number close to 1.
- So then, our R-squared will be close to 1, which tells us
- that a lot of the variation in y is described by the
- variation in x.
- Which makes sense, because the line is a good fit.
- You take the opposite case.
- If the squared error of the line is huge, then that means
- there's a lot of error between the data points and the line.
- So if this number is huge, then this fraction over here is
- going to be large.
- It's going to be a percentage close to 1.
- And 1 minus that is going to be close to 0.
- And so if the squared error of the line is large, this whole
- thing's going to be close to 1.
- And if this whole thing is close to 1, the whole
- coefficient of determination, the whole R-squared, is going
- to be close to 0, which makes sense.
- That tells us that very little of the total variation in y is
- described by the variation in x, or described by the line.
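The two cases can be tried out with a small Python sketch. It assumes the standard least-squares formulas for m and b, which the transcript does not restate, and both data sets are invented for illustration: one lies almost on a line, the other has no linear trend at all.

```python
def r_squared(xs, ys):
    """Coefficient of determination for the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Standard least-squares slope and intercept (assumed here).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    # Variation NOT described by the line: squared error of the line.
    se_line = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    # Total variation in y: squared error from the mean of y.
    se_mean = sum((y - mean_y) ** 2 for y in ys)
    return 1 - se_line / se_mean


xs = [1.0, 2.0, 3.0, 4.0]

# Hypothetical data lying close to a straight line: small squared error.
ys_good = [2.1, 3.9, 6.1, 7.9]
# Hypothetical data with no linear trend: the best line is flat at the mean.
ys_poor = [3.0, 1.0, 1.0, 3.0]

r2_good = r_squared(xs, ys_good)
r2_poor = r_squared(xs, ys_poor)
print(f"good fit: R^2 = {r2_good:.4f}")
print(f"poor fit: R^2 = {r2_poor:.4f}")
```

In the first data set the squared error of the line is tiny compared with the total variation, so R-squared comes out close to 1. In the second, the best-fit line is flat at the mean of y, the two squared errors are equal, and R-squared is 0.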
- Well, anyway, everything I've been dealing with so far has
- been a little bit in the abstract.
- In the next video, we'll actually look at some data
- samples and calculate their regression line.
- And also calculate the R-squared, and see how good of
- a fit it really is.