Impact on median & mean: removing an outlier

AP.STATS: UNC‑1 (EU), UNC‑1.K (LO), UNC‑1.K.2 (EK)
Sal thinks through the effects of removing a low outlier from a data set. What will happen to the mean and median?

Want to join the conversation?

  • kristofer
    I remember a lot about the mean, but not so much about the rest. Can someone fill me in?
    (4 votes)
    • YH
      Mean: Add all the numbers together and divide the sum by the number of data points in the data set.
      Example: Data set: 1, 2, 2, 9, 8. Mean: (1 + 2 + 2 + 9 + 8) / 5 = 22 / 5 = 4.4

      Median: Arrange all the data points from smallest to largest and choose the number in the middle position. If there is an even number of data points, take the two numbers in the middle positions and find the mean of those two numbers.
      Example: Data set: 1, 2, 2, 9, 8, 10. Smallest to largest: 1, 2, 2, 8, 9, 10. The two middle numbers are 2 and 8, so the median is (2 + 8) / 2 = 5.

      Mode: The mode is the number that appears most frequently in a data set. A data set can have more than one mode.
      Example: Data set: 1, 2, 2, 9, 4, 10, 4. Modes: 2 and 4
      (13 votes)
  • Tom Wang
    Why does the mean have to go up?
    (1 vote)
    • Jerry Nilsson
      80 is the lowest score.
      All the other four scores are greater than 80, so they can be written as
      80 + 𝑎, 80 + 𝑏, 80 + 𝑐, and 80 + 𝑑, for some positive values 𝑎, 𝑏, 𝑐, 𝑑.

      The mean of these five scores is
      (80 + (80 + 𝑎) + (80 + 𝑏) + (80 + 𝑐) + (80 + 𝑑))∕5 =
      = (5 ∙ 80 + 𝑎 + 𝑏 + 𝑐 + 𝑑)∕5 = 80 + (𝑎 + 𝑏 + 𝑐 + 𝑑)∕5

      If we remove the lowest score, then the new mean will be
      ((80 + 𝑎) + (80 + 𝑏) + (80 + 𝑐) + (80 + 𝑑))∕4 =
      = (4 ∙ 80 + 𝑎 + 𝑏 + 𝑐 + 𝑑)∕4 = 80 + (𝑎 + 𝑏 + 𝑐 + 𝑑)∕4

      𝑎, 𝑏, 𝑐, 𝑑 > 0 ⇒ 𝑎 + 𝑏 + 𝑐 + 𝑑 > 0 ⇒
      ⇒ (𝑎 + 𝑏 + 𝑐 + 𝑑)∕4 > (𝑎 + 𝑏 + 𝑐 + 𝑑)∕5, and thereby the new mean must be greater than the previous mean.
      (5 votes)
  • Redapple8787
    Won't removing an outlier be manipulating the data set? This video shows how the mean and median can change when the outlier is removed. So, if a scientist does some tests and gets an outlier, he/she can remove it to change the results to what he/she wants. So, I ask again, won't removing an outlier be unfairly changing the results?
    (2 votes)
    • Howard Bradley
      Depends. You're right that a scientist can't just arbitrarily discard a result, but if she'd been getting consistent results previously an outlier would suggest some kind of experimental error. If she can identify the source of that error then she is justified in removing the data.
      In the video, it turned out that the score of 80 was the result of "cheating", so we are right to discount it.
      (4 votes)
  • misteralejandro777
    Why does the mean increase? There were still 5 games. Shouldn't the lowest score become 0, and shouldn't we still divide by 5?
    (3 votes)
  • Coolpanda
    What does outlier mean?
    (1 vote)
    • Cavan P
      An outlier is a value that lies an abnormal distance away from the rest of your data. A common rule of thumb treats a value as an outlier if it lies more than 1.5 * IQR (interquartile range) below the first quartile or above the third quartile.
      (3 votes)
  • aaliyah rivera
    How will the removal of the lowest round affect the mean and median?
    (1 vote)
  • carrie.m.nicholson
    this isn't what my teacher told me
    ha ha ha ha ha ha
    can somebody explain plz
    (3 votes)
  • Max Colthart
    How does Sal get the 2/5 bit? I'm confused.
    (1 vote)
  • DarkSolar
    What is SERIHPGW4IEHWRhihiwnig+eragjeadargr
    (1 vote)
  • Justin Mahe Vea
    If this data set (1, 2, 2, 3, 4, 4, 5, 6) has modes of 2 and 4, then it has two modes. Should you take the value in between those numbers?
    (1 vote)
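
The definitions of mean, median, and mode given in the comments can be checked with a short Python sketch. It assumes Python 3.8 or later and uses the standard-library `statistics` module; the variable names are illustrative, and the data sets are the ones quoted in the comments.

```python
from statistics import mean, median, multimode

# Example data sets quoted in the comments above.
mean_data = [1, 2, 2, 9, 8]
median_data = [1, 2, 2, 9, 8, 10]
mode_data = [1, 2, 2, 9, 4, 10, 4]
two_mode_data = [1, 2, 2, 3, 4, 4, 5, 6]

mean_value = mean(mean_data)        # (1 + 2 + 2 + 9 + 8) / 5
median_value = median(median_data)  # mean of the two middle values, 2 and 8
modes = multimode(mode_data)        # every value tied for most frequent
two_modes = multimode(two_mode_data)

print(mean_value)    # 4.4
print(median_value)  # 5.0
print(modes)         # [2, 4]
print(two_modes)     # [2, 4]
```

When two values tie for most frequent, as in the last data set, both are reported as modes; they are not averaged into a value in between.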

Video transcript

- "Ana played five rounds of golf and her lowest score was an 80. The scores of the first four rounds and the lowest round are shown in the following dot plot." And we see it right over here. The lowest round she scores an 80, she also scores a 90 once, a 92 once, a 94 once, and a 96 once. "It was discovered that Ana broke some rules when she scored 80, so that score", so I guess cheating didn't help her, "so that score will be removed from the data set." So they removed that 80 right over there. We're just left with the scores from the other four rounds. "How will the removal of the lowest round affect the mean and the median?"
So let's actually think about the median first. So the median is the middle number. So over here when you had five data points the middle data point is gonna be the one that has two to the left and two to the right. So the median up here is going to be 92. The median up there is 92. And what's the median once you remove this? Now you only have four data points. When you're trying to find the median of an even number of numbers you look at the middle two numbers. So that's a 92 and a 94. And then you take the average of them. You go halfway between them to figure out the median. So the median here is going to be, let me do that a little bit clearer. The median over here is going to be halfway between 92 and 94 which is 93. So the median, the median is 93. Median is 93. So removing the lowest data point in this case increased the median. So the median, let me write it down here. So the median increased by a little bit. The median increases.
Now what's going to happen to the mean? What's going to happen to the mean? Well one way to think about it without having to do any calculations is if you remove a number that is lower than the mean, lower than the existing mean, and I haven't calculated what the existing mean is, but if you remove that the mean is going to go up. The mean is going to go up.
So hopefully that gives you some intuition. If you removed a number that's larger than the mean your mean is, your mean is going to go down cause you don't have that large number anymore. If you remove a number that's lower than the mean, well you take that out, you don't have that small number bringing the average down and so the mean will go up. But let's verify it mathematically. So let's calculate the mean over here. So we're gonna add 80, plus 90, plus 92, plus 94, plus 96. Those are our data points. And that gets us: two plus four is six, plus six is 12. And then we have one plus eight is nine, and this is, so these are nine and then you have another nine, another nine, another nine, another nine. You essentially have, this is five nines right over here. So this is going to be 452. So that's the sum of the scores of these five rounds, and then you divide it by the number of rounds you have. So it would be 452 divided by five. So 452 divided by five is going to give us, five goes into, it doesn't go into four, it goes into 45 nine times. Nine times five is 45, you subtract, get zero, bring down the two. Five goes into two zero times, zero times five is, zero times five is zero, subtract. You have two left over, so you can say that the mean here, the mean here is 90 and 2/5. Not nine and 2/5, 90 and 2/5. So the mean is right around here. So that's the mean of these data points right over there. And if you remove it what is the mean going to be? So here we're just going to take our 90, plus our 92, plus our 94, plus our 96, add 'em together. So let's see, two plus four plus six is 12. And then you add these together you're gonna get 37. 372 divided by four, cause I have four data points now, not five. Four goes into, let me do this in a place where you can see it. So four goes into 372, goes into 37 nine times. Nine times four is 36, subtract, you get a one. Bring down the two, it goes exactly three times. Three times four is 12. You have no remainder. 
So the median and the mean here are both, so this is also the mean. The mean here is also 93. So you see that the median, the median went from 92 to 93, it increased. The mean went from 90 and 2/5 to 93. So the mean increased by more than the median. They both increased but the mean increased by more. And it makes sense cause this number was way, way below all of these over here. So you could imagine if you take this out the mean should increase by a good amount. But let's see which of these choices are what we just described. "Both the mean and the median will decrease", nope. "Both the mean and the median will decrease", nope. "Both the mean and the median will increase, but the mean will increase by more than the median." That's exactly, that's exactly, what happened. The mean went from 90 and 2/5 or 90.4, went from 90.4 or 90 and 2/5 to 93. And then the median only increased by one. So this is the right answer.
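
The arithmetic in the transcript can be reproduced with a short Python sketch. The scores come from the dot plot described in the video; the variable names and the use of the standard-library `fractions` and `statistics` modules are illustrative choices, not part of the original lesson. The final check relies on a general identity: removing a value x from n values with mean m changes the mean by (m - x) / (n - 1), so removing a value below the mean always raises the mean.

```python
from fractions import Fraction
from statistics import mean, median

# Ana's five golf scores from the dot plot; 80 is the round that gets removed.
scores = [80, 90, 92, 94, 96]
removed = 80
remaining = [s for s in scores if s != removed]

# Fractions keep the arithmetic exact (452/5 rather than a rounded float).
old_mean = mean([Fraction(s) for s in scores])
new_mean = mean([Fraction(s) for s in remaining])
old_median = median(scores)
new_median = median(remaining)

print(float(old_mean), float(new_mean))  # 90.4 93.0
print(old_median, new_median)            # 92 93.0

# Removing x from n values with mean m shifts the mean by (m - x) / (n - 1).
n = len(scores)
mean_change = new_mean - old_mean
assert mean_change == (old_mean - removed) / (n - 1)
print(float(mean_change))                # 2.6
```

The mean rises by 2.6 while the median rises by 1, which matches the answer chosen in the video.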