
# Calculating R-squared

Calculating R-Squared to see how well a regression line fits data. Created by Sal Khan.

## Want to join the conversation?

• Great explanation! I understand everything, but I have a question: what is the practical value of looking at a graph with the R-squared value?

For example, you have a business and your statistician gives you two reports about two unrelated projects / ideas with a large R-squared value on one report, and a small R-squared value on the second report.

You, the business owner and the decision maker, will ask, "So what? What does this mean? What value is this graph to the decisions I will make?"

There has to be an economic purpose behind the R-squared values. What is the answer to "so what?"

• Knowing the extent to which the model "fits" reality (represented by the points we have actually observed, the plotted points) will help the business owner in your example assess the likelihood that a value predicted by the model (all the other points on the line) is actually true.

For example:

The business owner distributes grains which he buys from farmers and sells to a breakfast cereal company. He suspects that the amount of rainfall has something to do with the price the farmer charges him for the grains (rain affects the crops and therefore the supply). He asks the statistician to build a model and report its R^2 along with it. The statistician has access to 10 years of rainfall and grain price data, so he plots the prices against the rainfall and builds a model. The R^2 for this model is 88%.

The next year, the business owner measures the rainfall and uses the model to predict the price of grain and gets a rather accurate result. He's quite pleased with the statistician's work so he asks him to build a model relating the number of births of ten years past to the number of cereal boxes sold that year (he assumes more kids means more parents buying cereals). The statistician builds a model and comes back with an R^2 of 42%.

The business owner decides it is probably best not to try to predict the demand for his cereals from the births of 10 years past. He ignores the model.

Finally:

High R^2 = good model, probably profitable
Low R^2 = bad model, probably dangerous

Hope this helps.
• How is the variation in x the same as the regression line when calculating R^2?

• I think you're just wondering why he's using the term "variation in X" at all. It helps to think of the x-axis as time and the y-axis as results taken at different times. Each result at each time has some error associated with it, measured against the line (the vertical segments drawn from each point to the line are known as residuals).

The regression line changes where you measure your residuals from: a y value of 10 might have a large error at one value of x (at one time), but that same value y = 10 at a different value of x will have a different amount of error (due to the slope of the regression line).

So the regression line changes the requirement for "error" as X varies (aka the variation of X).

I hope that helped a little. I'm no Sal.
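The point about the same y value getting different residuals at different x values can be made concrete with a few lines of Python. The slope and intercept below are made-up numbers for illustration, not values from the video.

```python
def residual(x, y, slope=2.0, intercept=1.0):
    """Vertical distance from the point (x, y) to the line y = slope*x + intercept."""
    return y - (slope * x + intercept)

# Same y = 10, different x: the line predicts different values,
# so the residual (the "error" against the line) differs too.
print(residual(2.0, 10.0))  # 10 - 5 = 5.0
print(residual(4.0, 10.0))  # 10 - 9 = 1.0
```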
• My statistics textbook suggests that the total error is the sum of the explained and the unexplained error, which in this case would be 2.74 + 22.75. The book then calculates R squared as the explained error divided by the total error, which in this case would be 22.75/(2.74 + 22.75) = 0.89. Are the two methods equivalent (i.e., this method and the one described in the lecture)?

• You're missing something that the video didn't fully explain. There are three ways to categorize the error here:
1. Total error
2. Explained error
3. Unexplained error
R^2 is then (Explained Error) / (Total Error) = 1 - (Unexplained Error) / (Total Error)

The total error is the sum of (Y-Ybar)^2, so in the video this is the 22.75.
The unexplained error is the sum of (Y-Y*)^2, so in the video this value is 2.74.
He never actually calculated the explained error, but it would be the difference, 22.75 - 2.74 = 20.01. You could get this by taking the sum of (Y*-Ybar)^2. He hints at this when he says that the 12% is the "percent of error NOT explained by variation in X," and subtracts that value from 1 to get the percent of error that IS explained by variation in X. So, that hints that the Unexplained and Explained errors must add to the Total error.

Anyway, after this, you appear to have followed the formulas in your statistics textbook correctly. Using the new values, we'd have:
R^2 = (Explained Error / Total Error) = 20.01/22.75 = 0.879
or
R^2 = 1 - (Unexplained Error / Total Error) = 1 - 2.74/22.75 = 0.879
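The arithmetic above can be sanity-checked in a few lines. This sketch uses the two totals quoted from the video (22.75 and 2.74), not the raw data points, and shows that both formulas agree.

```python
# The video's two measured quantities (the raw data points are not shown here):
total_error = 22.75       # sum of (Y - Ybar)^2, squared error from the mean
unexplained_error = 2.74  # sum of (Y - Y*)^2, squared error from the line

# Explained error is the difference:
explained_error = total_error - unexplained_error  # 20.01

# Both formulas give the same R^2:
r2_a = explained_error / total_error
r2_b = 1 - unexplained_error / total_error
print(round(r2_a, 3), round(r2_b, 3))  # both ~0.88
```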
• At , Sal's explanation seems to imply that the standard error from the line can be seen as a fraction of the standard error from the mean. The SE of the line is the 'unexplained variation' and the SE of the mean is the 'explained variation'. But as far as I can tell, the SE of the line is measuring a completely different type of variation, right? How does the variation from the regression line in any way contribute or relate to the variation from the mean?

• How does the squared error from the mean explain the "total" variation? I understand the squared error of the line, but I do not understand the squared error of the mean.

• Consider if we tried to fit the model y = b instead of y = mx + b, basically limiting ourselves to a horizontal line. The line of best fit would then be a horizontal line at the mean of all the y values, because that minimizes the vertical distance between the line and the points. That's why the squared error from the mean serves as the denominator in R-squared. Adding a slope can only fit the data as well or better, and R-squared is a measure of how much better.
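The claim that the mean is the best horizontal fit can be checked numerically. This is a sketch using made-up data points: among a grid of candidate constants b, none beats the mean of y.

```python
# Made-up data, purely for illustration.
ys = [1.0, 2.0, 2.0, 3.0, 7.0]
mean_y = sum(ys) / len(ys)  # 3.0

def sse(b):
    """Squared error of the horizontal line y = b against the data."""
    return sum((y - b) ** 2 for y in ys)

# Scan candidate constants from 0.0 to 10.0: the winner is the mean itself.
best = min(range(0, 101), key=lambda k: sse(k / 10.0)) / 10.0
print(best, mean_y)  # 3.0 3.0
```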
• If you draw a line which is extremely far away from all the points, e.g.
y=4000+x
The Squared Error from the line will be much higher than the Squared Error from the mean.
SEl>>>SEy
so SEl/SEy is greater than 1
and R^2, 1-SEl/SEy will be negative
Is this a real thing?
I realise that this never happens when you try to actually draw a good regression line, but
it means that "R^2" can be a negative number?

• Yes, that's true, but it's also violating the basic premise of the model. The reason R^2 = 1 - SEl/SEy works is that we assume the total sum of squares, SSy, is the total variation of the data, so we can't get any more variability than that. When we intentionally make the regression line that bad, we make one of the other sum-of-squares terms larger than the total variation.
• Perhaps go into different statistical tests for ANOVA and briefly explain what passing or failing a test means?

• As R^2 gets closer to 1, that indicates that the variation in the data points is explained by the variation in x, meaning that the regression line is an increasingly better fit for the data, as explained at . Is this correct? For research purposes and reporting on studies, what value is considered "good enough" to state that x and y are correlated?

• That's a great question. Unfortunately, like so many great questions, the answer is "it depends" :)

In something like a physics or chemistry experiment, where you can tightly control all the variables and use high-quality sensors, you can get R-squared values like 0.999 or even higher. If you are expecting a value like this and get something like R-squared = 0.9, you might start rethinking your hypothesis or the design of your experiment.

However, if the data is less precise or a bit noisier - perhaps you're plotting self-reported happiness versus self-reported height - then an R-squared value of less than 0.9 might still be enough to demonstrate a correlation. Ultimately, it all comes down to how much random variation you can expect in your data. 