Calculating R-Squared to see how well a regression line fits data. Created by Sal Khan.
- Great explanation! I understand everything, but I have a question: what is the practical value of looking at a graph with the R-squared value?
For example, you have a business and your statistician gives you two reports about two unrelated projects / ideas with a large R-squared value on one report, and a small R-squared value on the second report.
You, the business owner and the decision maker, will ask, "So what? What does this mean? What value is this graph to the decisions I will make?"
There has to be an economic purpose behind the R-squared values. What is the answer to "so what?"(37 votes)
- Knowing the extent to which the model "fits" reality (represented by the points we have actually observed, the plotted points) will help the business owner in your example assess the likelihood of a value predicted by the model (all the other points on the line) being actually true.
The business owner distributes grains which he buys from farmers and sells to a breakfast cereal company. He suspects that the amount of rainfall has something to do with the price the farmer charges him for the grains (rain affects the crops and therefore the supply). He asks the statistician to build a model and report its R^2 along with it. The statistician has access to 10 years of rainfall and grain price data, so he plots the prices against the rainfall and builds a model. The R^2 for this model is 88%.
The next year, the business owner measures the rainfall and uses the model to predict the price of grain and gets a rather accurate result. He's quite pleased with the statistician's work so he asks him to build a model relating the number of births of ten years past to the number of cereal boxes sold that year (he assumes more kids means more parents buying cereals). The statistician builds a model and comes back with an R^2 of 42%.
The business owner decides it is probably best not to try to predict the demand for his cereals from the births 10 years past. He ignores the model.
High R^2 = good model, probably profitable
Low R^2 = bad model, probably dangerous
Hope this helps.(108 votes)
- How is the variation in x the same as the regression line, when calculating r2?(12 votes)
- I think you're just wondering why he's using the term "variation in X" at all. It helps to think of it as though the x-axis is time and the y-axis shows results taken at different times. Each result at each time has some bit of error associated with it based on the line you're measuring against (the vertical lines drawn from each point to the line are the residuals).
The regression line attempts to change where you draw your residuals to so that a y value of 10 might have lots of error at one value of x (at one time), but if you were to get that same value of y=10 at a different value of x it will have a different amount of error (due to the slope of the regression line).
So the regression line changes the requirement for "error" as X varies (aka the variation of X).
I hope that helped a little. I'm no Sal.(13 votes)
- My statistics textbook suggests that the total error would be the sum of the explained and the unexplained error which in this case would be 2.74 + 22.75. The book then calculates r squared as the explained error divided by the total error which in this case would be 22.75/(2.74+22.75) = 0.89. Are the two methods equivalent (i.e. this method and the one described in the lecture) ?(3 votes)
- You're missing something that the video didn't fully explain. There are three ways to categorize the error here:
1. Total error
2. Explained error
3. Unexplained error
R^2 is then (Explained Error) / (Total Error) = 1 - (Unexplained Error) / (Total Error)
The total error is the sum of (Y-Ybar)^2, so in the video this is the 22.75.
The unexplained error is the sum of (Y-Y*)^2, so in the video this value is 2.74.
He never actually calculated the explained error, but it would be the difference, 22.75 - 2.74 = 20.01. You could get this by taking the sum of (Y*-Ybar)^2. He hints at this when he says that the 12% is the "percent of error NOT explained by variation in X," and subtracts that value from 1 to get the percent of error that IS explained by variation in X. So, that hints that the Unexplained and Explained errors must add to the Total error.
Anyway, after this, you appear to have followed the formulas in your statistics textbook correctly. Using the new values, we'd have:
R^2 = (Explained Error / Total Error) = 20.01/22.75 = 0.879
R^2 = 1 - (Unexplained Error / Total Error) = 1 - 2.74/22.75 = 0.879(15 votes)
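The three-way split above can be checked numerically. This is a minimal sketch, assuming the four data points and the fitted line y = (41/42)x - 5/21 from the video; the variable names are illustrative and not from the original.

```python
# Data points and fitted line taken from the video.
xs = [-2, -1, 1, 4]
ys = [-3, -1, 2, 3]
slope, intercept = 41 / 42, -5 / 21

y_mean = sum(ys) / len(ys)
y_pred = [slope * x + intercept for x in xs]  # Y* in the comment above

total = sum((y - y_mean) ** 2 for y in ys)                   # sum of (Y - Ybar)^2
unexplained = sum((y - p) ** 2 for y, p in zip(ys, y_pred))  # sum of (Y - Y*)^2
explained = sum((p - y_mean) ** 2 for p in y_pred)           # sum of (Y* - Ybar)^2

print(round(total, 2), round(unexplained, 2), round(explained, 2))
print(explained / total, 1 - unexplained / total)
```

Both ratios should come out at roughly 0.88. Note that explained + unexplained = total is a property of the least squares line with an intercept; it is not expected to hold for an arbitrary line.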
- At 7:14, Sal's explanation seems to imply that the standard error from the line can be seen as a fraction of the standard error from the mean. The SE of the line is the 'unexplained variation' and the SE of the mean is the 'explained variation'. But as far as I can tell, the SE of the line is measuring a completely different type of variation, right? How does the variation from the regression line in any way contribute or relate to the variation from the mean?(5 votes)
- Be sure to watch the next video all the way to the end. Sal pulls it all together, and I think his explanation will answer your question.(3 votes)
- Why do we square the error on the line?(3 votes)
- It makes positive and negative differences both positive, so they don't cancel each other out in the sum. Also, it makes large deviations from the line have a disproportionally large effect on the total error. There are other reasons why this is the convention, but I don't know them well enough to comment more.(6 votes)
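The cancellation mentioned above is easy to see with numbers. This is a small sketch, assuming the data points and least squares line from the video; the variable names are illustrative.

```python
# Data points and fitted line taken from the video.
xs = [-2, -1, 1, 4]
ys = [-3, -1, 2, 3]
slope, intercept = 41 / 42, -5 / 21

# Signed vertical error of each point relative to the line.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

signed_sum = sum(residuals)                   # positives and negatives cancel
squared_sum = sum(r ** 2 for r in residuals)  # every term is non-negative

print("sum of signed errors: ", signed_sum)
print("sum of squared errors:", squared_sum)
```

For this line the signed errors add up to essentially zero (up to floating-point rounding), even though no point sits exactly on the line, while the squared errors add up to about 2.74.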
- How does the squared error from the mean explain the "total" variation? I understand the squared error of the line, but do not understand the squared error of the mean(4 votes)
- Consider if we tried to fit the model y=b instead of y=mx+b, basically limiting ourselves to using a horizontal line. The line of best fit would be a horizontal line at the mean of all y values, because that minimizes the total squared vertical distance between the line and the points. That's why we use the squared error from y_mean as the denominator in R-squared. Allowing a slope always gives a least squares fit that is at least as good, and R-squared is a measure of how much better.(2 votes)
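The claim that the best horizontal line sits at the mean of the y values can be checked by brute force. This is a rough sketch, assuming the y values from the video; the candidate grid and the helper `squared_error` are hypothetical choices made for illustration.

```python
# y values of the four data points from the video.
ys = [-3, -1, 2, 3]

def squared_error(b):
    """Sum of squared vertical distances from the horizontal line y = b."""
    return sum((y - b) ** 2 for y in ys)

y_mean = sum(ys) / len(ys)

# Try every horizontal line from y = -3.00 to y = 3.00 in steps of 0.01.
candidates = [b / 100 for b in range(-300, 301)]
best_b = min(candidates, key=squared_error)

print(best_b, y_mean, squared_error(best_b))
```

The search lands on the mean, and the squared error there is the 22.75 used as the denominator of R-squared.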
- If you draw a line which is extremely far away from all the points, e.g.
The Squared Error from the line will be much higher than the Squared Error from the mean.
so SEl/SEy is greater than 1
and R^2, 1-SEl/SEy will be negative
Is this a real thing?
I realise that this never happens when you try to actually draw a good regression line, but
it means that "R^2" can be a negative number?(2 votes)
- Yes, that's true, but it's also violating the basic premise of the model. The reason R^2 = 1-SEl/SEy works is because we assume that the total sum of squares, the SSy, is the total variation of the data, so we can't get any more variability than that. When we intentionally make the regression line bad like that, it's making one of the other sum of square terms larger than the total variation.(4 votes)
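A quick numerical check of the exchange above, as a sketch only: it assumes the data points from the video, and the "bad" line y = 10 is a hypothetical example chosen to sit far from every point.

```python
# Data points taken from the video.
xs = [-2, -1, 1, 4]
ys = [-3, -1, 2, 3]

def r_squared(slope, intercept):
    """Compute 1 - SEl/SEy for the line y = slope * x + intercept."""
    y_mean = sum(ys) / len(ys)
    se_line = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    se_y = sum((y - y_mean) ** 2 for y in ys)
    return 1 - se_line / se_y

good = r_squared(41 / 42, -5 / 21)  # least squares line from the video
bad = r_squared(0, 10)              # hypothetical line far above all the points

print(good, bad)
```

The least squares line gives about 0.88, while the deliberately bad line gives a negative value, because its squared error is larger than the squared error from the mean.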
- Perhaps go into different statistic tests for ANOVA and explain what passing/failing a test means, briefly?(2 votes)
- There are other videos specifically discussing ANOVA that can also be found under the Probability + Statistics topic.(2 votes)
- As R^2 gets closer to 1 that indicates that the variation in data points is explained by the variation in x, meaning that the regression line is an increasingly better fit for the data, as explained at 9:04. Is this correct? For research purposes and reporting on studies what value is considered "good enough" to make the statement that x and y are correlated?(2 votes)
- That's a great question. Unfortunately, like so many great questions, the answer is "it depends" :)
In something like a physics or chemistry experiment, where you are able to tightly control all the variables and use high-quality sensors, you can get R-squared values like 0.999 or even higher. If you are expecting a value like this and get something like R-squared = 0.9, you might start rethinking your hypothesis or the design of your experiment.
However, if the data is less precise or a bit noisier - perhaps you're plotting self-reported happiness versus self-reported height - then an R-squared value of less than 0.9 might still be enough to demonstrate a correlation. Ultimately, it all comes down to how much random variation you can expect in your data.(2 votes)
- Is r^2 the same as r? The correlation coefficient intuition module leads me to believe this.
Also, what does it mean to say that "x% of the total variation of the y values is explained by the variation in x"? Are we talking about the variation of the x values in each of the ordered pairs?(1 vote)
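On the first part of this question: r and r^2 are different numbers, but for a least squares line with a single x variable, squaring the correlation coefficient r gives the same value as the R-squared from the video. The sketch below assumes the four data points from the video and computes r from its usual definition; the variable names are illustrative.

```python
import math

# Data points taken from the video.
xs = [-2, -1, 1, 4]
ys = [-3, -1, 2, 3]

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

# Sums of products and squares of the deviations from the means.
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
sxx = sum((x - x_mean) ** 2 for x in xs)
syy = sum((y - y_mean) ** 2 for y in ys)

r = sxy / math.sqrt(sxx * syy)  # correlation coefficient

print(r, r ** 2)
```

Here r is about 0.94 and r squared is about 0.88, matching the R-squared computed in the video.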
In the last video, we were able to find the equation for the regression line for these four data points. What I want to do in this video is figure out the r squared for these data points. Figure out how well this line fits the data. Or even better, figure out the percentage, which is really the same thing, of the variation of these data points, especially the variation in y, that is due to, or that can be explained by, variation in x.
And to do that, I'm actually going to get a spreadsheet out. I've actually tried to do this with a calculator and it's much harder. So hopefully this doesn't confuse you too much to use a spreadsheet. And I'm going to make a couple of columns here. And spreadsheets actually have functions that'll do all of this automatically, but I really want to do it so that you could do it by hand if you had to.
So I'm going to make a couple of columns here. This is going to be my x column. This is going to be my y column. This is going to be the column, I'll call this y star, this'll be the y value that our line predicts based on our x value. This is going to be the error with the line. Let me call it the squared error with the line. I don't want us to take up too much space. And then the next one, I'm going to have the squared variation for that y value from the mean y. And I think these columns by themselves will be enough for us to do everything.
So let's first put all the data points in. So we had negative 2 comma negative 3. That was one data point. Negative 1 comma negative 1. And we had 1 comma 2. Then we have 4 comma 3.
Now, what does our line predict? Well our line says, you give me an x value, I'm going to tell you what y value I'll predict. So when x is equal to negative 2, the y value on the line is going to be the slope. So this is going to be equal to 41 divided by 42 times our x value. And I just selected that cell. And just a little bit of a primer on spreadsheets, I'm selecting the cell D2.
I was able to just move my cursor over and select that. But that tells me the x value. Minus 5/21. Minus 5 divided by 21. Just like that. So just to be clear of what we're even doing. This y star here, I got negative 2.19. That tells us that this point right over here is negative 2.19.
So when we figure out the error, we're going to figure out the distance between negative 3, that's our y value, and negative 2.19. So let's do that. So the error is just going to be equal to our y value. That's cell E2. Minus the value that our line would predict. So just that value is the actual error. But we want to square it.
And then, the next thing we want to do is the squared distance. So this is equal to the squared distance of our y value from the y's mean. So what's the mean of the y's? Mean of the y's is 1/4. So minus 0.25, which is the same thing as 1/4. And we also want to square that.
Now, this is what's fun about spreadsheets. I can apply those formulas to every row now. And notice what it did when I did that. Now all of a sudden, this is the y value that my line would predict, it's now using this x value and sticking it over here. It's now figuring out the squared distance from the line using what the line would predict and using the y value, this one. And then it does the same thing over here. It figures out the squared distance of this y value from the mean.
So what is the total squared error with the line? So let me just sum this up. The total squared error with the line is 2.73. And then the total variation from the mean, the squared distances from the mean of the y, is 22.75. So let me be very clear what this is. So let me write these numbers down. I'll write it up here so we can keep looking at this actual graph. So our squared error versus our line, our total squared error, we just computed to be 2.74. I rounded a little bit. And what that is, is you take each of these data points' vertical distance to the line.
So this distance squared, plus this distance squared, plus this distance squared, plus this distance squared. That's all we just calculated on Excel. And that total squared variation to the line is 2.74. Or total squared error with the line.
And then the other number we figured out was the total distance from the mean. So the mean here is y is equal to 1/4. So that's going to be right over here. This is 1/2. So right over here. So this is our mean y value. Or the central tendency for our y values. And so what we calculated next was the total error, the squared error, from the mean of our y values. That's what we calculated over here in the spreadsheet. You see in the formula. It is this number, E2, minus 0.25, which is the mean of our y's, all squared. That's exactly what we calculated. We calculated it for each of the y values. And then we summed them all up. It's 22.75. It is equal to 22.75.
So this first number, 2.74, is essentially the error that the line does not explain. This second number, 22.75, is the total error, the total variation of the numbers. So if you wanted to know the percentage of the total variation that is not explained by the line, you could take this number divided by this number. So 2.74 over 22.75. This tells us the percentage of total variation not explained by the line or by the variation in x.
And so what is this number going to be? I can just use Excel for this. So I'm just going to divide this number by this number right over there. I get 0.12. So this is equal to 0.12. Or another way to think about it is 12% of the total variation is not explained by the variation in x. That share of the total squared distance of the points from the mean, their kind of spread, their variation, is not explained by the variation in x. So if you want the amount that is explained by the variation in x, you just subtract that from 1. So let me write it right over here.
So we have our r squared, which is the percent of the total variation that is explained by x, is going to be 1 minus that 0.12 that we just calculated. Which is going to be 0.88. So our r squared here is 0.88. It's very, very close to 1. The highest number it can be is 1.
So what this tells us, or a way to interpret this, is that 88% of the total variation of these y values is explained by the line or by the variation in x. And you can see that it looks like a pretty good fit. Each of these isn't too far. Each of these points is definitely much closer to the line than it is to the mean line. In fact, all of them are closer to our actual line than to the mean.
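The spreadsheet walk-through can be reproduced in a few lines of code. This is a minimal sketch, assuming the four data points and the line y = (41/42)x - 5/21 from the video; the column labels and variable names are illustrative stand-ins for the spreadsheet columns, not the actual spreadsheet formulas.

```python
# Data points and fitted line taken from the video.
points = [(-2, -3), (-1, -1), (1, 2), (4, 3)]
slope, intercept = 41 / 42, -5 / 21
y_mean = sum(y for _, y in points) / len(points)

# One row per data point: x, y, y*, squared error with the line,
# and squared distance of y from the mean of the y's.
rows = []
for x, y in points:
    y_star = slope * x + intercept
    rows.append((x, y, y_star, (y - y_star) ** 2, (y - y_mean) ** 2))

print(f"{'x':>4} {'y':>4} {'y*':>8} {'err^2':>8} {'dev^2':>8}")
for x, y, y_star, err_sq, dev_sq in rows:
    print(f"{x:>4} {y:>4} {y_star:>8.2f} {err_sq:>8.2f} {dev_sq:>8.2f}")

se_line = sum(row[3] for row in rows)  # total squared error with the line
se_y = sum(row[4] for row in rows)     # total squared distance from the mean
r_squared = 1 - se_line / se_y

print(f"SE line = {se_line:.2f}, SE mean = {se_y:.2f}, R^2 = {r_squared:.2f}")
```

The two column sums correspond to the 2.74 and 22.75 in the video, and the final value rounds to 0.88.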