Impact on median & mean: removing an outlier
Sal thinks through the effects of removing a low outlier from a data set. What will happen to the mean and median?
- Won't removing an outlier be manipulating the data set? This video shows how the mean and median can change when the outlier is removed. So, if a scientist does some tests and gets an outlier, he/she can remove it to change the results to what he/she wants. So, I ask again, won't removing an outlier be unfairly changing the results?(27 votes)
- Depends. You're right that a scientist can't just arbitrarily discard a result, but if she'd been getting consistent results previously an outlier would suggest some kind of experimental error. If she can identify the source of that error then she is justified in removing the data.
In the video, it turned out that the score of 80 was the result of "cheating", so we are right to discount it.(44 votes)
- I remember a lot about the mean, but not so much about the rest. Can someone fill me in?(4 votes)
- Mean: Add all the numbers together and divide the sum by the number of data points in the data set.
Example: Data set; 1, 2, 2, 9, 8. (1 + 2 + 2 + 9 + 8) / 5 = 22 / 5 = 4.4
Median: Arrange all the data points from small to large and choose the number that is physically in the middle. If there is an even number of data points, then choose the two numbers in the (physical) middle and find the mean of the two numbers.
Example: Data set; 1, 2, 2, 9, 8, 10. Small to Large; 1, 2, 2, 8, 9, 10. Find the mean of 2 & 8: (2 + 8) / 2 = 5.
Mode: The mode is the number that appears most frequently in a data set.
Example: Data set; 1, 2, 2, 9, 4, 10, 4. Mode: 2 and 4(18 votes)
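The three definitions above can be checked with a short Python sketch. It assumes Python 3.8 or later (for `statistics.multimode`), and the variable names are only illustrative; the data sets are the ones from the examples in this reply.

```python
from statistics import mean, median, multimode

# Data sets copied from the three examples above.
mean_data = [1, 2, 2, 9, 8]
median_data = [1, 2, 2, 9, 8, 10]
mode_data = [1, 2, 2, 9, 4, 10, 4]

data_mean = mean(mean_data)        # sum of the values divided by their count
data_median = median(median_data)  # even count, so the mean of the two middle values
data_modes = multimode(mode_data)  # every value tied for most frequent

print(data_mean, data_median, data_modes)
```

`multimode` is used rather than `mode` because the last data set has two values tied for most frequent.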
- At 2:05: if removing a number that is larger than the mean will make the mean itself go down, what will then happen with the median in this case (when removing a number larger than the median)?(5 votes)
- It depends on the data set. Removing a number larger than the median either lowers the median or leaves it unchanged; it cannot raise it. However, if you simply alter a number without moving it across the median, then the mean will change but the median will not.(6 votes)
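How the median responds to removing a value above it depends on the data, and one way to explore that is to try a few small cases. The sketch below is only an illustration: `median_after_removal` is a hypothetical helper, and the last data set is made up for the example.

```python
from statistics import median


def median_after_removal(values, removed):
    """Return the median after taking one occurrence of `removed` out of `values`."""
    remaining = list(values)
    remaining.remove(removed)
    return median(remaining)


golf_scores = [80, 90, 92, 94, 96]  # median is 92
tied_values = [1, 5, 5, 5, 9]       # made-up data, median is 5

# Remove a value that is larger than the median in each case.
golf_result = median_after_removal(golf_scores, 96)  # median of 80, 90, 92, 94
tied_result = median_after_removal(tied_values, 9)   # median of 1, 5, 5, 5

print(median(golf_scores), golf_result)
print(median(tied_values), tied_result)
```

In the first case the median drops; in the second it stays where it was, because the middle values are tied.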
- From 1:58 to 2:10, how does Sal find the mean without calculating? I thought about it and still couldn't understand how the mean increases, because removing one number means decreasing the total. If he removed 80, the original mean would drop.(2 votes)
- Actually, Sal is correct: if you remove a number that is lower than the mean, the mean will increase. Remember that you are not only removing the 80, which decreases the total, but also removing one of the data points, so the denominator drops from 5 to 4. The total falls from 452 to 372, but dividing the sum of the four higher scores by 4 gives a larger mean than dividing all five by 5.(4 votes)
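A small sketch of this point, using the five golf scores from the video. The variable names are illustrative, and the second removal (taking out the 96 instead) is an extra case added here only for comparison.

```python
def mean(values):
    """Arithmetic mean: the total divided by the number of data points."""
    return sum(values) / len(values)


scores = [80, 90, 92, 94, 96]
old_mean = mean(scores)  # total 452 over 5 scores

# Removing a score changes both the total and the count.
without_low = [s for s in scores if s != 80]   # total 372 over 4 scores
without_high = [s for s in scores if s != 96]  # total 356 over 4 scores

mean_without_low = mean(without_low)
mean_without_high = mean(without_high)

print(old_mean, mean_without_low, mean_without_high)
```

Removing the score below the mean raises the mean, while removing the score above the mean lowers it.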
- How can you remember all of this?(3 votes)
- At 1:59, why does the mean have to go up?(1 vote)
- 80 is the lowest score.
All the other four scores are greater than 80, so they can be written as
80 + 𝑎, 80 + 𝑏, 80 + 𝑐, and 80 + 𝑑, for some positive values 𝑎, 𝑏, 𝑐, 𝑑.
The mean of these five scores is
(80 + (80 + 𝑎) + (80 + 𝑏) + (80 + 𝑐) + (80 + 𝑑))∕5 =
= (5 ∙ 80 + 𝑎 + 𝑏 + 𝑐 + 𝑑)∕5 = 80 + (𝑎 + 𝑏 + 𝑐 + 𝑑)∕5
If we remove the lowest score, then the new mean will be
((80 + 𝑎) + (80 + 𝑏) + (80 + 𝑐) + (80 + 𝑑))∕4 =
= (4 ∙ 80 + 𝑎 + 𝑏 + 𝑐 + 𝑑)∕4 = 80 + (𝑎 + 𝑏 + 𝑐 + 𝑑)∕4
𝑎, 𝑏, 𝑐, 𝑑 > 0 ⇒ 𝑎 + 𝑏 + 𝑐 + 𝑑 > 0 ⇒
⇒ (𝑎 + 𝑏 + 𝑐 + 𝑑)∕4 > (𝑎 + 𝑏 + 𝑐 + 𝑑)∕5, and thereby the new mean must be greater than the previous mean.(5 votes)
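The argument above can be checked against the actual scores, where 𝑎, 𝑏, 𝑐, 𝑑 are 10, 12, 14 and 16. This is a minimal sketch using exact fractions; the variable names are illustrative only.

```python
from fractions import Fraction

lowest = 80
offsets = [10, 12, 14, 16]   # a, b, c, d for the scores 90, 92, 94, 96
total_offset = sum(offsets)  # a + b + c + d

# Means written the same way as in the derivation above.
old_mean = lowest + Fraction(total_offset, 5)  # 80 + (a + b + c + d)/5
new_mean = lowest + Fraction(total_offset, 4)  # 80 + (a + b + c + d)/4

# The difference is (a + b + c + d)/4 - (a + b + c + d)/5 = (a + b + c + d)/20.
increase = new_mean - old_mean

print(old_mean, new_mean, increase)
```

Because the offsets are positive, the increase is positive, which is the conclusion of the derivation.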
- Pretty useful but how will we solve for the mean if it has a negative number?(3 votes)
- Why does the mean increase? There were still 5 games. Shouldn't the lowest score become 0, and shouldn't we still divide by 5?(3 votes)
- Since Ana "cheated" in that last game, the score didn't count, and you calculate the total as if she sat out that round.(3 votes)
- How does Sal get the 2/5 bit? I'm confused.(2 votes)
- It's the remainder of straight division. In the 1st group of 5 scores, Sal sums them as 80+90+92+94+96=452. To get the mean, Sal then divides 452 by 5, the number of scores in the dataset. 452/5 = 90 2/5 = 90.40(3 votes)
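The same division can be sketched in Python, where `divmod` returns the whole-number quotient and the remainder together. The variable names are illustrative.

```python
# Sum of the five scores from the video.
total = 80 + 90 + 92 + 94 + 96

# Whole-number quotient and remainder of dividing by the 5 scores.
quotient, remainder = divmod(total, 5)

mixed_number = f"{quotient} {remainder}/5"  # quotient plus remainder over the divisor
as_decimal = total / 5

print(total, mixed_number, as_decimal)
```

The remainder over the divisor is the fractional part of the mixed number.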
- At 0:25, Sal said that cheating didn't help her, but in golf, low scores are better. Can the Khan Academy people fix that?(3 votes)
- Cheating didn't help her because the score where she cheated got thrown out.(0 votes)
"Ana played five rounds of golf and her lowest score was an 80. The scores of the first four rounds and the lowest round are shown in the following dot plot." And we see it right over here. In the lowest round she scores an 80; she also scores a 90 once, a 92 once, a 94 once, and a 96 once. "It was discovered that Ana broke some rules when she scored 80, so that score" (so I guess cheating didn't help her) "will be removed from the data set." So they removed that 80 right over there. We're just left with the scores from the other four rounds. "How will the removal of the lowest round affect the mean and the median?" So let's actually think about the median first. The median is the middle number. Over here, when you had five data points, the middle data point is gonna be the one that has two to the left and two to the right. So the median up here is going to be 92. And what's the median once you remove this? Now you only have four data points. When you're trying to find the median of an even number of numbers, you look at the middle two numbers. So that's a 92 and a 94. And then you take the average of them. You go halfway between them to figure out the median. The median over here is going to be halfway between 92 and 94, which is 93. So removing the lowest data point in this case increased the median. Let me write it down here: the median increases. Now what's going to happen to the mean? Well, one way to think about it without having to do any calculations is that if you remove a number that is lower than the existing mean (and I haven't calculated what the existing mean is), the mean is going to go up.
So hopefully that gives you some intuition. If you removed a number that's larger than the mean, your mean is going to go down, 'cause you don't have that large number anymore. If you remove a number that's lower than the mean, well, you take that out, you don't have that small number bringing the average down, and so the mean will go up. But let's verify it mathematically. So let's calculate the mean over here. We're gonna add 80, plus 90, plus 92, plus 94, plus 96. Those are our data points. In the ones place, two plus four is six, plus six is 12, so we write the two and carry the one. In the tens place, one plus eight is nine, and then you have another nine, another nine, another nine, another nine. You essentially have five nines right over here, which is 45. So this is going to be 452. That's the sum of the scores of these five rounds, and then you divide it by the number of rounds you have. So it would be 452 divided by five. Five doesn't go into four; it goes into 45 nine times. Nine times five is 45, you subtract, get zero, bring down the two. Five goes into two zero times, zero times five is zero, subtract. You have two left over, so you can say that the mean here is 90 and 2/5. Not nine and 2/5, 90 and 2/5. So the mean is right around here. That's the mean of these data points right over there. And if you remove the 80, what is the mean going to be? Here we're just going to take our 90, plus our 92, plus our 94, plus our 96, and add 'em together. So let's see, two plus four plus six is 12. And then, carrying the one, the tens add up to 37, so the sum is 372. 372 divided by four, 'cause I have four data points now, not five. Let me do this in a place where you can see it. Four goes into 37 nine times. Nine times four is 36, subtract, you get a one. Bring down the two, and four goes into 12 exactly three times. Three times four is 12. You have no remainder, so 372 divided by four is 93.
So the median and the mean here are both 93. You see that the median went from 92 to 93; it increased. The mean went from 90 and 2/5 to 93. So the mean increased by more than the median. They both increased, but the mean increased by more. And it makes sense, 'cause this number was way, way below all of these over here. So you could imagine that if you take this out, the mean should increase by a good amount. But let's see which of these choices are what we just described. "Both the mean and the median will decrease", nope. "Both the mean and the median will decrease", nope. "Both the mean and the median will increase, but the mean will increase by more than the median." That's exactly what happened. The mean went from 90 and 2/5, or 90.4, to 93. And then the median only increased by one. So this is the right answer.
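The whole comparison from the video can be reproduced in a few lines. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the original exercise.

```python
from statistics import mean, median

before = [80, 90, 92, 94, 96]  # all five rounds
after = [90, 92, 94, 96]       # the 80 removed

# How far each measure moves once the lowest round is dropped.
mean_change = mean(after) - mean(before)
median_change = median(after) - median(before)

print(f"mean:   {mean(before):.1f} -> {mean(after):.1f}")
print(f"median: {median(before):.1f} -> {median(after):.1f}")
print("mean increased by more:", mean_change > median_change)
```

Both changes are positive, and the mean moves further than the median, matching the answer chosen in the video.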