
Standard deviation of residuals or root-mean-square deviation (RMSD)

Calculating the standard deviation of the residuals (also called the root-mean-square error (RMSE) or root-mean-square deviation (RMSD)) to measure the disagreement between a linear regression model and a set of data.

Want to join the conversation?

  • dmytrokalinin:
    You calculated the standard deviation on all the points. Why do we use a SAMPLE standard deviation in this case?
    (15 votes)
    • Britton Winterrose:
      This comes down to the difference between the population mean and the sample mean.

      The error of an observed value is the deviation of the observed value from the (unobservable) true value of the quantity of interest, i.e., its deviation from the population mean.

      The residual of an observed value is the difference between the observed value and the estimated value of that quantity, i.e., its deviation from the sample mean. The small simulation after this thread makes the distinction concrete.
      (3 votes)
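
    Below is a minimal Python sketch with made-up data: it generates points from a known "true" line, fits a least-squares line, and compares the unobservable errors (deviations from the true line) with the computable residuals (deviations from the fitted line).

        import numpy as np

        rng = np.random.default_rng(0)

        # A known "true" relationship (unobservable in practice): y = 2.5x - 2, plus noise.
        x = rng.uniform(0, 5, size=20)
        true_y = 2.5 * x - 2
        y = true_y + rng.normal(0, 1, size=20)

        # Fit a least-squares line to the observed data.
        slope, intercept = np.polyfit(x, y, deg=1)
        fitted_y = slope * x + intercept

        errors = y - true_y        # deviations from the true line (the unobservable errors)
        residuals = y - fitted_y   # deviations from the fitted line (always computable)

        print("errors:   ", np.round(errors[:5], 3))
        print("residuals:", np.round(residuals[:5], 3))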
  • Per48edjes:
    At , Sal notes that we should divide by (n − 1), but I've seen (n − 2) in the denominator elsewhere, since estimating y-hat costs us two degrees of freedom: one for the intercept term and one for the slope term of our regression line.

    Source: https://onlinecourses.science.psu.edu/stat501/node/254

    Is the difference because here we are only looking at error variance due to intercept only (hence + and - error lines that are 'shifted' above and below the regression line) rather than slope and intercept (which would include + and - 'shifts' as well as 'rotation' around the coordinate (x-bar, y-bar))?

    I'm not entirely convinced of my attempted explanation since the Wikipedia article on errors/residuals has this tidbit:

    In regression analysis, the distinction between errors and residuals is subtle and important, and leads to the concept of studentized residuals. Given an unobservable function that relates the independent variable to the dependent variable – say, a line – the deviations of the dependent variable observations from this function are the unobservable errors. If one runs a regression on some data, then the deviations of the dependent variable observations from the fitted function are the residuals.

    However, a terminological difference arises in the expression mean squared error (MSE). The mean squared error of a regression is a number computed from the sum of squares of the computed residuals, and not of the unobservable errors. If that sum of squares is divided by n, the number of observations, the result is the mean of the squared residuals. Since this is a biased estimate of the variance of the unobserved errors, the bias is removed by dividing the sum of the squared residuals by df = n − p − 1, instead of n, where df is the number of degrees of freedom (n minus the number of parameters p being estimated - 1). This forms an unbiased estimate of the variance of the unobserved errors, and is called the mean squared error.


    Source: https://en.wikipedia.org/wiki/Errors_and_residuals#Regressions

    Does p = 0 in this case? Or does p = 1 since y-hat is attempting to estimate the parameter that is the true y at each x?
    (10 votes)
    • Britton Winterrose:
      The section you reference on Wikipedia is about creating an unbiased estimate by adjusting for the degrees of freedom.

      In regression analysis, the term mean squared error is sometimes used to refer to the unbiased estimate of error variance: the residual sum of squares divided by the number of degrees of freedom. This definition for a known, computed quantity differs from the above definition for the computed MSE of a predictor in that a different denominator is used.

      In our case we are using the standard (n − 1) correction. The numerical difference between the two conventions is shown in the sketch after this thread.
      (2 votes)
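
    To see the numerical difference between the two conventions, here is a minimal Python sketch using the four points and the line y-hat = 2.5x − 2 from the video: dividing the residual sum of squares by n − 1 reproduces the video's answer, while dividing by n − 2 (counting both slope and intercept as estimated parameters) gives the regression standard error.

        import numpy as np

        x = np.array([1, 2, 2, 3])
        y = np.array([1, 2, 3, 6])
        y_hat = 2.5 * x - 2                 # regression line from the video

        ss_res = np.sum((y - y_hat) ** 2)   # sum of squared residuals = 1.5
        n = len(x)

        print(np.sqrt(ss_res / (n - 1)))    # ~0.707, the value computed in the video
        print(np.sqrt(ss_res / (n - 2)))    # ~0.866, the regression standard error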
  • dfbarbour:
    It's confusing that r is used both for the residual and for the correlation coefficient. I sometimes have to stop and think about which he is talking about in a given context.
    (7 votes)
    • daniella:
      It can indeed be confusing that "r" is used for both the residual and the correlation coefficient. Pay attention to whether "r" refers to a residual (the difference between an observed and a predicted value) or to the correlation coefficient (a measure of the strength and direction of the linear relationship between two variables). The surrounding discussion or equation will usually make clear which is meant.
      (1 vote)
  • Ann:
    I don't understand why, at , we divide by (n − 1) instead of the total n. :(
    (4 votes)
    • daniella:
      At , we divide by (n − 1) instead of the total n because we are calculating the sample standard deviation, not the population standard deviation. When we calculate the standard deviation from a sample, we use (n − 1) in the denominator (Bessel's correction) to provide an unbiased estimate of the population variance. This adjustment corrects for the fact that we are using the sample mean instead of the population mean, which slightly underestimates the variability in the population. By dividing by (n − 1), we produce a standard deviation that is more representative of the variability in the entire population. The simulation after this thread illustrates the bias directly.
      (1 vote)
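
    A quick simulation (a hypothetical sketch in Python, not part of the video) shows the bias: with repeated small samples from a known population, the n-denominator estimate underestimates the true variance on average, while the (n − 1) version does not.

        import numpy as np

        rng = np.random.default_rng(1)
        # Population: normal with standard deviation 2, so the true variance is 4.0.

        biased, unbiased = [], []
        for _ in range(100_000):
            sample = rng.normal(0, 2, size=5)
            sq_dev = (sample - sample.mean()) ** 2
            biased.append(sq_dev.sum() / 5)     # divide by n
            unbiased.append(sq_dev.sum() / 4)   # divide by n - 1 (Bessel's correction)

        print(np.mean(biased))    # ~3.2, systematically below the true 4.0
        print(np.mean(unbiased))  # ~4.0, unbiased on average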
  • Alam Ashraf:
    Thank you for the nice explanation. How do you calculate the RMSE of a probability distribution, e.g., a normal distribution?
    (2 votes)
  • Akira:
    Do we divide by (n-1) because it's a sample?
    If it's the entire population, do we divide by n?
    (2 votes)
  • Felipe Oliveira:
    Why do we have two points at x = 2 (y = 2 and y = 3)?
    (2 votes)
  • Sidhusan Devamanoharan:
    Is it technically correct to call this the standard deviation of the residuals? We are calculating the deviation of the actual values from the predicted values, not the deviation of the error values from their own mean.
    To explain myself further: if we were really finding the standard deviation of the residuals, we would find the mean of the residuals, square the difference between that mean and each individual residual, and compute the deviation from those squared differences.
    (2 votes)
    • daniella:
      It is technically correct to refer to it as the standard deviation of the residuals in the context of linear regression, but you make a valid point that we are calculating the deviation of the actual values from the predicted values. The term "residuals" refers specifically to the differences between observed and predicted values, which are the errors in the regression model's predictions. When we calculate the standard deviation of the residuals, we are measuring the variability of these errors around the regression line.

      If we calculated the deviation of the residuals as you describe, we would find the mean of the residuals and then calculate the deviation of each residual from that mean. In linear regression, however, the focus is on the variability of the errors themselves rather than on their deviations from the residual mean; and, as the sketch after this thread shows, for a least-squares line with an intercept the two coincide anyway, because the residuals average to zero.
      (1 vote)
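
    One supporting detail: for a least-squares line fitted with an intercept, the residuals always sum to zero, so their mean is zero and the two calculations coincide. A minimal Python check with made-up data:

        import numpy as np

        rng = np.random.default_rng(2)
        x = rng.uniform(0, 10, size=50)
        y = 1.5 * x + 3 + rng.normal(0, 2, size=50)

        # Ordinary least-squares fit with an intercept.
        slope, intercept = np.polyfit(x, y, deg=1)
        residuals = y - (slope * x + intercept)

        print(residuals.mean())   # ~0, up to floating-point error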
  • Mez Cooper:
    Standard deviation of residuals? Isn't the line meant to give the average values? Standard deviation involves subtracting a mean from each value, squaring the differences, summing all the squared differences, dividing by n, and taking the square root.
    (1 vote)
    • Rishav:
      "Standard deviation of the residuals" is not entirely accurate; RMSD is the technically sound term in this context. I think "standard deviation of the residuals" was used to point out the involvement of the residuals and the fact that the calculation looks similar to the standard deviation formula.
      (3 votes)
  • Bryan:
    Why do we divide by 3 (n − 1) and not 4 (n) at ? I've seen other sites write the formula with just n in the denominator, not n − 1. Are we only dealing with a sample here? What does that mean in terms of residuals?
    (2 votes)
    • daniella:
      The reason we divide by (n − 1) instead of n here is consistent with the explanation above. Yes, in this context we are dealing with a sample of data points rather than the entire population. Dividing by (n − 1) instead of n adjusts for the fact that we are estimating parameters from a sample, which might not perfectly represent the whole population. When dealing with the population as a whole, you would divide by n; when working with a sample, dividing by (n − 1) provides a better estimate of the population parameter.
      (1 vote)

Video transcript

- [Instructor] What we're going to do in this video is calculate a typical measure of how well the actual data points agree with a model, in this case a linear model, and there are several names for it. We could consider this to be the standard deviation of the residuals, and that's essentially what we're going to calculate. You could also call it the root-mean-square error, and you'll see why it's called this, because the name really describes how we calculate it. So what we're going to do is look at the residuals for each of these points, and then we're going to find the standard deviation of them.

Just as a bit of review, the ith residual is going to be equal to the ith y value for a given x minus the predicted y value for that x. Now, when I say y-hat right over here, this just says what the linear regression would predict for a given x, and this is the actual y for a given x.

So, for example, and we've done this in other videos, this is all review: for the residual here, when x is equal to one, we have y equal to one, but what was predicted by the model is 2.5 times one minus two, which is 0.5. So this residual is equal to one minus 0.5, which is equal to 0.5, and it's a positive 0.5; if the actual point is above the model, you're going to have a positive residual.

Now, for the residual over here, you also have the actual point being higher than the model, so this is also going to be a positive residual. Once again, when x is equal to three, the actual y is six, and the predicted y is 2.5 times three, which is 7.5, minus two, which is 5.5. So you have six minus 5.5; the residual is equal to six minus 5.5, which is equal to 0.5. Once again, a positive residual.

Now, for this point that sits right on the model, the actual is the predicted: when x is two, the actual is three, and what was predicted by the model is three, so the residual here is equal to three minus three, which is zero.

And then, last but not least, you have this data point, where the residual is going to be the actual, which is two when x is equal to two, minus the predicted. Well, when x is equal to two, you have 2.5 times two, which is equal to five, minus two, which is equal to three. So two minus three is equal to negative one. And so, when your actual is below your regression line, you're going to have a negative residual; this is going to be negative one right over there.

Now we can calculate the standard deviation of the residuals. We're going to take this first residual, which is 0.5, and we're going to square it. We're going to add it to the second residual right over here, which is zero, and square that. Then we have this third residual, which is negative one, so plus negative one squared. And then, finally, we have that fourth residual, which is 0.5, squared. So, once again, we took each of the residuals, which you could view as the distance between the points and what the model would predict, and we are squaring them. When you take a typical standard deviation, you're taking the distance between a point and the mean.

Here we're taking the distance between a point and what the model would have predicted. We're squaring each of those residuals and adding them all up together, and, just like we do with the sample standard deviation, we are now going to divide by one less than the number of residuals we just squared and added. We have four residuals, so we're going to divide by four minus one, which is, of course, three. You could view this part as a mean of the squared errors, and now we're going to take the square root of it.

So, let's see, this is going to be equal to the square root of: this is 0.25, this is just zero, this is going to be positive one, and then this 0.5 squared is going to be 0.25, all of that over three. Now, this numerator is going to be 1.5, and 1.5 is exactly half of three, so we could say this is equal to the square root of one half, which is one divided by the square root of two, which gets us, if we round to the nearest thousandth, roughly 0.707. So, approximately 0.707.

And if you wanted to visualize that, one standard deviation of the residuals below the line would look like this, and one standard deviation above the line, for any given x value, would go one standard deviation of the residuals above it; it would look something like that. This is obviously just a hand-drawn approximation, but you do see that this does seem to be roughly indicative of the typical residual.

Now, it's worth noting, sometimes people will say it's the average residual, and it depends how you think about the word "average," because we are squaring the residuals, so outliers, things that are really far from the line, are going to have a disproportionate impact here when squared. If you didn't want that behavior, we could have done something like find the mean of the absolute residuals; that, in some ways, would have been the simpler one. But this is a standard way of people trying to figure out how much a model disagrees with the actual data, and so you can imagine that the lower this number is, the better the fit of the model.
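
The whole calculation from the video can be reproduced in a few lines. Here is a minimal Python sketch using the four data points and the model y-hat = 2.5x − 2 from the video:

    import numpy as np

    # The four data points from the video and the model's predictions.
    x = np.array([1, 2, 2, 3])
    y = np.array([1, 2, 3, 6])
    y_hat = 2.5 * x - 2

    residuals = y - y_hat                                 # [0.5, -1.0, 0.0, 0.5]
    rmsd = np.sqrt(np.sum(residuals ** 2) / (len(x) - 1))

    print(residuals)   # [ 0.5 -1.   0.   0.5]
    print(rmsd)        # 0.7071..., i.e., approximately 0.707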