## Measuring spread in quantitative data

# Mean and standard deviation versus median and IQR

## Video transcript

- [Narrator] So we have
nine students who recently graduated from a small school
that has a class size of nine, and they wanna figure out
what the central tendency is for salaries one year after graduation. And they also wanna have a
sense of the spread around that central tendency one
year after graduation. So they all agree to put
their salaries into a computer, and so these are their salaries. They're measured in thousands. So one makes 35,000, three make
50,000, one makes 56,000, two make 60,000, one makes
75,000, and one makes 250,000. So she's doing very well for herself, and the computer spits
out a bunch of parameters based on this data here. So it spits out two typical
measures of central tendency. The mean is roughly 76.2. The computer would calculate
it by adding up all of these numbers, these nine numbers,
and then dividing by nine, and the median is 56, and median
is quite easy to calculate. You just order the numbers and you take the middle number here which is 56. Now what I want you to
do is pause this video and think about, for this data set, for this population of
salaries, which measure of central
tendency is a better measure? All right, so let's think
about this a little bit. I'm gonna plot it on a line here. I'm gonna plot my data
so we get a better sense,
so we don't just see things as numbers, but we see
where those numbers sit relative to each other. So let's say this is zero. Let's say this is, let's see,
one, two, three, four, five. So this would be 250, this
is 50, 100, 150, 200, and let's see. Let's say if this is 50
then this would be roughly 40 right here, and I just wanna get rough. So this would be about 60,
70, 80, 90, close enough. I could draw this
a little bit neater, but, 60, 70, 80, 90. Actually, let me just clean
this up a little bit more too. This one right over here would be a little bit closer to this one. Let me just put it right around here. So that's 40, and then
this would be 30, 20, 10. Okay, that's pretty good. So let's plot this data. So, one student makes 35,000,
so that is right over there. Two make 50,000, or three make 50,000, so one, two, and three. I'll put it like that. One makes 56,000 which would
put them right over here. One makes 60,000, or
actually, two make 60,000, so it's like that. One makes 75,000, so
that's 60, 70, 75,000. So it's gonna be right around there, and then one makes 250,000. So one's salary is all
the way around there, and then when we
calculate the mean as 76.2 as our measure of central tendency, 76.2 is right over there. So is this a good measure
of central tendency? Well to me it doesn't feel that good, because our measure of central
tendency is higher than all of the data points except for
one, and the reason is that
our data is skewed significantly by this
data point at $250,000. It is so far from the
rest of the distribution, from the rest of the data,
that it has skewed the mean, and this is something
that you see in general. If you have data that is skewed,
and especially things like salary data where
most people are making 50, 60, $70,000, but someone
might make two million dollars, that will skew the
average, or skew the mean I should say, when you add them all
up and divide by the number of data points you have. In this case, especially when
you have data points that would skew the mean,
median is much more robust. The median at 56 sits right
over here, which seems to be much more indicative of central tendency. And think about it. Even if, instead of 250,000, you made this 250,000
thousand, which would be 250 million dollars, which is
a ginormous amount of money to make, it would
skew the mean incredibly, but it actually would not
even change the median, because for the median it doesn't matter how high this number gets. This could be a trillion dollars. This could be a quadrillion dollars. The median is going to stay the same. So the median is much more robust if you have a skewed data set. Mean makes a little bit more
sense if you have a symmetric data set, or
where things are roughly evenly spread
above and below the mean, or things aren't skewed
incredibly in one direction, especially by a handful of data points like we have right over here. So in this example, the median is a much better measure of central tendency. And so what about spread? Well you might say, well,
Sal you already told us that the mean is not so good and the standard deviation
is based on the mean. You take each of these data
points, find their distance from the mean, square that
number, add up those squared distances, divide by the
number of data points if we're taking the population standard
deviation, and then you take the
square root of the whole thing. And so since this is based on
the mean, which isn't a good measure of central tendency
in this situation, that is also going to skew
the standard deviation. This is going to be a lot larger than you'd expect if you wanted an indication of the actual spread. Yes, you have this one data
point that's way far away from either the mean or
the median depending on how you wanna think about it, but
most of the data points seem much closer, and so for that situation, not only are we using the median, but the interquartile range
is once again more robust. How do we calculate the
interquartile range? Well, you take the median
and then you take the bottom group of numbers and
calculate the median of those. So that's 50 right over here
and then you take the top group of numbers, the
upper group of numbers, and the median there is halfway between
60 and 75, so it's 67.5. If this looks unfamiliar
we have many videos on interquartile range and calculating standard deviation and median and mean. This is just a little bit of a review, and then the difference
between these two is 17.5, and notice, this distance
between these two, this 17.5, this isn't going to change, even if this is 250 billion dollars. So once again, both of
these measures are more robust when you have a skewed data set. So the big takeaway here is
mean and standard deviation, they're not bad if you have
a roughly symmetric data set, if you don't have any
significant outliers, things that really skew the data set, mean and standard deviation
can be quite solid. But if you're looking at
something that could get really skewed by a handful of data
points, median and interquartile range might be better:
median for central tendency, interquartile range for spread
around that central tendency, and that's why you'll see when
people talk about salaries they'll often talk about
median, because you can have some skewed salaries,
especially on the up side. When we talk about things
like home prices you'll see median reported
more often than mean, because in a
neighborhood, or in a city, a lot of the
houses might be in the 200,000, $300,000 range, but maybe
there's one ginormous mansion that is 100 million dollars,
and if you calculated the mean, that mansion would skew it and give a
false impression of the average or the central tendency
of prices in that city.
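
The comparison in the transcript can be checked with a short Python sketch. It is only an illustration under a few assumptions: the nine salaries are entered in thousands as read out in the video, the standard deviation is the population version described above, and the interquartile range is found the way the narrator does it, by taking the medians of the lower and upper halves with the middle value left out. The names `iqr_by_halves`, `summarize`, and `extreme` are made up for this example and do not come from the video.

```python
from statistics import mean, median, pstdev


def iqr_by_halves(values):
    """IQR as (median of upper half) - (median of lower half).

    When the count is odd, the middle value is left out of both halves,
    matching the method used in the transcript. Assumes at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]      # values below the median position
    upper = ordered[-half:]     # values above the median position
    return median(upper) - median(lower)


def summarize(values):
    """Return the four summary measures discussed in the video."""
    return {
        "mean": mean(values),
        "median": median(values),
        "pstdev": pstdev(values),   # population standard deviation
        "iqr": iqr_by_halves(values),
    }


# Salaries in thousands of dollars, as listed in the transcript.
salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]

# Hypothetical variant: the top salary becomes 250,000 thousand (250 million).
extreme = salaries[:-1] + [250_000]

for label, data in (("original", salaries), ("extreme outlier", extreme)):
    stats = summarize(data)
    print(label, {name: round(value, 1) for name, value in stats.items()})
```

For both lists the median stays at 56 and the interquartile range stays at 17.5, while the mean and the standard deviation grow enormously once the top salary is inflated. That is the sense in which the median and interquartile range are called more robust here.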