## Assessing the fit in least-squares regression

# Impact of removing outliers on regression lines

AP Stats: DAT‑1 (EU), DAT‑1.I (LO), DAT‑1.I.1 (EK), DAT‑1.I.3 (EK)

## Video transcript

- [Instructor] The scatterplot below displays a set of bivariate data along with its least-squares regression line. Consider removing the outlier, 95 comma one. So 95 comma one, we're talking about that outlier right over there, and calculating a new least-squares regression line. What effects would removing the outlier have? Choose all answers that apply. Like always, pause this video and see if you can figure it out.

Well, let's see. Even with this outlier here, we have an upward-sloping regression line, so it looks like our r is already going to be greater than zero. And of course, it's going to be less than one. So our r is going to be greater than zero and less than one. We know it's not going to be equal to one, because then the line would go perfectly through all of the dots, and it's clear that this point right over here is indeed an outlier. The residual between this point and the line is quite high; we have a pretty big distance right over here. It would be a negative residual, and so this point is definitely
bringing down the r, and it's definitely bringing down the slope of the regression line.

If we were to remove this point, we're more likely to have a line that looks something like this, in which case it looks like we would get a much, much better fit. The only reason the line isn't doing that is that it's trying to get close to this point right over here. So if we remove this outlier, our r would increase, and the slope of our line would increase too. We'd have a better fit to this positively correlated data, and we would no longer have this point dragging the slope down.

So let's see which choices apply. The coefficient of determination, r squared, would increase? Well, if r would increase, then squaring that value would increase as well, so I will circle that. The correlation coefficient r would get close to zero? No, in fact, it would get closer to one, because we would have a better fit here, so I will rule that out. The slope of the least-squares regression line would increase? Yes, indeed. This outlier is pulling it down; if you take it out, it'll allow the slope to increase. So I will circle that as well.

Let's do another example. The scatterplot below displays a set of bivariate data along with its least-squares regression line. Same idea. Consider removing the outlier, ten comma negative 18, so we're talking about that point there, and calculating a new least-squares regression line. So what would happen this time? As is, without removing this outlier, we have a negative slope
for the regression line, so we're dealing with a negative r. So we already know that negative one is less than r, which is less than zero, without even removing the outlier. We know it's not going to be negative one; if r were exactly negative one, then we would have a downward-sloping line that went exactly through all of the points.

But if we remove this point, what's going to happen? Well, this least-squares regression line is being pulled down here by this outlier. So if you remove this point, the least-squares regression line could move up on the left-hand side, and you'll probably have a line that looks more like that. I'm just hand-drawing it, but even what I hand-drew looks like a better fit for the leftover points. And so, clearly, the new line that I drew after removing the outlier has a more negative slope. So removing the outlier would decrease r; r would get closer to negative one, closer to being a perfect negative correlation. And it would also decrease the slope.

Which choices match that? The coefficient of determination, r squared, would decrease? Let's be very careful here. R was already negative, and if we decrease it, it's going to become more negative; if you square something that is more negative, it's not going to become smaller. Let's say that before you remove the data point, r was, and I'm just going to make up a value, negative 0.4, and after removing the outlier, r becomes more negative, say negative 0.5. Well, if you square these, the first is positive 0.16 while the second is positive 0.25. So if r is already negative and you make it more negative, that would not decrease r squared; it would actually increase r squared. So I will rule this one out.

The slope of the least-squares regression line would increase? No, it's going to decrease; it's going to be a stronger negative correlation. Rule that one out. The y-intercept of the least-squares regression line would increase? Yes. By getting rid of this outlier, you could think of it as the left side of this line going up. Or, another way to think about it: the slope of this line is going to decrease, it's going to become more negative, and we know that the least-squares regression line will always go through the mean of both variables. So we're just going to pivot around the mean of both variables, which means that the y-intercept will go higher. So I will fill that in.
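Both examples can be checked numerically. The sketch below is illustrative only: the video's actual data points aren't given, so the lists are made-up points that merely mimic the shapes of the two scatterplots (a positive trend with a low outlier near 95 comma one, and a negative trend with a low outlier at ten comma negative 18), and `least_squares` is a small helper written here, not part of any course library.

```python
# Illustrative sketch: made-up data mimicking the two scatterplots.
from math import sqrt

def least_squares(xs, ys):
    """Return (slope, intercept, r) for the least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx   # the line passes through (mx, my)
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r

# Example 1: positive trend, outlier (95, 1) listed last.
x1, y1 = [10, 30, 50, 70, 90, 95], [20, 35, 55, 70, 85, 1]
s_with, _, r_with = least_squares(x1, y1)
s_wo, _, r_wo = least_squares(x1[:-1], y1[:-1])
# Removing the outlier raises the slope, r, and r squared.
assert s_wo > s_with and r_wo > r_with and r_wo**2 > r_with**2

# Example 2: negative trend, outlier (10, -18) listed first.
x2, y2 = [10, 20, 40, 60, 80, 100], [-18, 14, 8, 3, -4, -10]
s2_with, b2_with, r2_with = least_squares(x2, y2)
s2_wo, b2_wo, r2_wo = least_squares(x2[1:], y2[1:])
# Removing the outlier makes the slope and r more negative, raises
# the y-intercept, and (since r was already negative) still raises
# r squared.
assert s2_wo < s2_with and r2_wo < r2_with
assert b2_wo > b2_with and r2_wo**2 > r2_with**2
```

Because the intercept is computed as `my - slope * mx`, the fitted line pivots around the point of means, which is exactly why a more negative slope in the second example forces the y-intercept up.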