
Impact of removing outliers on regression lines

Impact of removing outliers on slope, y-intercept and r of least-squares regression lines.


  • G.Gulzt
    At that point in the video, I am confused about answer B, and I failed a similar problem in the practice exercise.

    I understand that the coefficient of the slope decreases, but the slope itself increases, right? The angle of the line relative to the x-axis gets bigger in the negative direction. To me the formulation of the answer is ambiguous.
    (17 votes)
    • Caleb Man
      You are right that the angle of the line relative to the x-axis gets bigger, but that does not mean that the slope increases. The absolute value of the slope gets bigger, but the slope is changing in the negative direction, so its value is getting smaller. For example, -6 is smaller than -1, even though the absolute value of -6 (which is 6) is greater than the absolute value of -1 (which is 1).
      (14 votes)
  • tokjonathan
    Why would the slope decrease? Arguably, the line tilts more, so doesn't the slope increase? Or is slope measured by which side of the line goes up or down, rather than by its steepness in either direction? That is to say, does the left side of the line going downward mean the slope is positive, and vice versa?
    (3 votes)
    • daniella
      The slope of the least-squares regression line represents the change in the dependent variable for a one-unit increase in the independent variable, and it is a signed quantity: "decrease" refers to that signed value, not to steepness. In this example, the outlier is flattening the line, making the slope less negative than it would otherwise be. Removing it lets the line tilt further downward to fit the remaining points, so the slope becomes more negative. A more negative number is a smaller number, which is why the slope decreases even though the line actually gets steeper.
      (1 vote)
  • Trevor Clack
    r and r^2 always have magnitudes < 1 correct?

    So if we remove an outlier and r^2 drops, doesn't r (the square root of r^2) get larger and vice versa?
    (2 votes)
  • Neel Nawathey
    How do you know if the outlier increases or decreases the correlation? Or do outliers decrease the correlation by definition?
    (2 votes)
  • papa.jinzu
    For the first example, how would the slope increase?
    (1 vote)
    • Shashi G
      Imagine the regression line as a physical stick. If you tie a stone (the outlier) to the end of the stick with a thread, the stick dips down a bit. Now cut the thread: the stick springs back up.
      Mathematically, the regression line tries to come close to all of the points, so if one point is far below the others, the line bends down toward it. Remove the outlier and the line no longer needs to bend down, which means the slope increases.
      (2 votes)
  • Tridib Roy Chowdhury
    How is r (the correlation coefficient) related to r^2 (the coefficient of determination)?
    (1 vote)
  • YamaanNandolia
    Suppose there is a negative correlation and an outlier in the bottom right of the graph, but above the LSRL, that has to be removed. How will removing it affect the correlation and the slope of the LSRL?
    (1 vote)
    • daniella
      Since the outlier sits above the least-squares regression line (LSRL) at the far right, it is pulling the right-hand end of the line upward, which makes the slope less negative than it would otherwise be. Removing it lets the line tilt back down toward the remaining points: the slope becomes more negative (the line gets steeper), and the correlation moves closer to -1, so the negative relationship looks stronger.
      (1 vote)
  • Mohamed Ibrahim
    So this outlier is causing trouble predicting the data with or without the regression line, isn't it?
    (1 vote)
    • daniella
      Yes, that's correct. The outlier is causing trouble in predicting the data both with and without the regression line. With the outlier present, the regression line is influenced by its extreme position, leading to a less accurate representation of the overall trend in the data.
      (1 vote)
  • Shashi G
    Why does R^2 always increase or stay the same when new variables are added? Why doesn't it ever get worse?
    (1 vote)
  • pkannan.wiz
    Since r^2 is simply a measure of how much of the variation in the data the line of best fit accounts for, would it be true that removing any outlier increases the value of r^2?
    (1 vote)
    • daniella
      Not necessarily. While removing outliers can sometimes increase R^2 by improving the fit of the regression line to the remaining data points, there are cases where removing outliers may not significantly impact R^2 or even decrease it. It depends on the specific characteristics of the data and how the outliers are influencing the relationship between the variables.
      (1 vote)

Video transcript

- [Instructor] The scatterplot below displays a set of bivariate data along with its least-squares regression line. Consider removing the outlier (95, 1). So (95, 1), we're talking about that outlier right over there. And calculating a new least-squares regression line. What effects would removing the outlier have? Choose all answers that apply. Like always, pause this video and see if you can figure it out.

Well, let's see. Even with this outlier here, we have an upward-sloping regression line, so it looks like our r is already going to be greater than zero. And of course, it's going to be less than one. We know it's not going to be equal to one, because then the line would go perfectly through all of the dots, and it's clear that this point right over here is indeed an outlier. The residual between this point and the line is quite high; we have a pretty big distance right over here. It would be a negative residual, so this point is definitely bringing down the r, and it's definitely bringing down the slope of the regression line.

If we were to remove this point, we're more likely to have a line that looks something like this, in which case it looks like we would get a much better fit. The only reason the line isn't doing that is that it's trying to get close to this point right over here. So if we remove this outlier, our r would increase, and the slope of our line would increase as well. We'd have a better fit to this positively correlated data, and we would no longer have this point dragging the slope down.

So let's see which choices apply. The coefficient of determination r squared would increase. Well, if r would increase, then squaring that value would increase as well, so I will circle that. The correlation coefficient r would get close to zero.
No, in fact, it would get closer to one, because we would have a better fit here. And so I will rule that out. The slope of the least-squares regression line would increase. Yes, indeed. This outlier's pulling it down; if you take it out, it'll allow the slope to increase. So I will circle that as well.

Let's do another example. The scatterplot below displays a set of bivariate data along with its least-squares regression line. Same idea. Consider removing the outlier (10, -18), so we're talking about that point there, and calculating a new least-squares regression line. So what would happen this time?

As is, without removing this outlier, we have a negative slope for the regression line, so we're dealing with a negative r. So we already know that negative one is less than r, which is less than zero, without even removing the outlier. We know it's not going to be negative one; if r were exactly negative one, then it would be a downward-sloping line that went exactly through all of the points.

But if we remove this point, what's going to happen? Well, this least-squares regression line is being pulled down here by this outlier. So if you remove this point, the least-squares regression line could move up on the left-hand side, and so you'll probably have a line that looks more like that. And I'm just hand-drawing it, but even what I hand drew looks like a better fit for the leftover points. And so, clearly, the new line that I drew after removing the outlier has a more negative slope. So removing the outlier would decrease r; r would get closer to negative one, closer to being a perfect negative correlation. And it would also decrease the slope.

Which choices match that? The coefficient of determination r squared would decrease. So let's be very careful. r was already negative. If we decrease it, it's going to become more negative.
If you square something that is more negative, it's not going to become smaller. Say that before you remove the data point, r was, I'm just going to make up a value, negative 0.4, and after removing the outlier, r becomes more negative, say negative 0.5. Well, if you square the first, you get positive 0.16, while the second gives positive 0.25. So if r is already negative and you make it more negative, that would not decrease r squared; it would actually increase r squared. So I will rule this one out.

The slope of the least-squares regression line would increase. No, it's going to decrease; it's going to be a stronger negative correlation. Rule that one out.

The y-intercept of the least-squares regression line would increase. Yes: by getting rid of this outlier, you can think of it as the left side of this line going up. Or, another way to think about it, the slope of this line is going to decrease, becoming more negative. We know that the least-squares regression line will always go through the mean of both variables, so we're just going to pivot around the mean of both variables, which means the y-intercept will go higher. So I will fill that in.