
## Grade 8 (Virginia)


Lesson 2: Comparing data sets

# Example: Comparing distributions

Compare distributions using the features of shape, center, spread, and outliers.

## Want to join the conversation?

- Wouldn't the final round have the higher center since, as far as swim times are concerned, the smaller the time it took to finish the race, the better?

Or does higher always refer to the larger of the two regardless of context? (19 votes)
- I think Sal misinterpreted the data at 4:57. Finalists **did** swim faster than semifinalists because less time means faster. But I think saying higher center for finalists is ok. We just need to be careful about how we interpret data because more isn't always better. (37 votes)

- I think there is a problem with the structure of this section. We are not introduced to mean/median/deviation, yet we're expected to be able to calculate it.(9 votes)
- i believe mean is found by adding all the data points and dividing to find the average, while median is the point in the middle most or between the middle 2 points. (i just watched a bunch of videos since i didn't find this stuff here either)(3 votes)

- when would you know to use mean or median for the center? would there be a reason to use one instead of the other?(7 votes)
- I still don't understand the concept. Could you try explaining it in the comments so I can get a better understanding?(6 votes)
- Okay, so what we do in this is basically seeing what data set has a higher mean or median, or range and mean absolute deviation. We calculate those then you answer the question.(2 votes)

- they seem to have forgotten to introduce box plots in this unit even though the quiz asked for an interpretation of one.(5 votes)
- What is standard deviation?(5 votes)
- I watched this video three times and i didn't understand it(4 votes)
- Sal got it backwards, I think, and the final round would have the higher center.(2 votes)

- So I'm still confused. Could this be explained in the comments?(1 vote)
- There is no new concept introduced in this video. Instead, we just use our recently acquired knowledge on "Shapes of distributions" and "Clusters, gaps, peaks & outliers" to compare two distributions.

As a side note, comparing distributions is at the heart of controlled experiments, like those testing a new drug's efficacy (compared against a placebo) or a new keyboard layout design (compared against a QWERTY keyboard). (6 votes)

- Wouldn't the final round have the higher center since, as far as swim times are concerned, the smaller the time it took to finish the race, the better? (3 votes)
- What about the temperature one?(3 votes)

## Video transcript

- [Instructor] What we're going to do in this video is start to compare distributions. So, for example here, we have two distributions that show the various temperatures different cities get during the month of January. This is the distribution for Portland. For example, they get eight days between one and four degrees Celsius. They get 12 days between four and seven degrees Celsius, and so forth, and so on, and then this is the distribution for Minneapolis.

Now, when we make these comparisons, what we're going to focus on is the center of the distributions, to compare that, and also the spread. Sometimes people will talk about the variability of the distributions. And so these are the things that we're going to compare. And in making the comparison, we're actually going to try to eyeball it. We're not going to try to pick a measure of central tendency, say the mean or the median, and then calculate precisely what those numbers are for these. We might wanna do that if they're close. But if we can eyeball it, that would be even better. Similar for the spread and variability.

In either of these cases, there are multiple measures in our statistical toolkit. For center, the mean and the median are valid measures of the center. For spread/variability, the range, the interquartile range, the mean absolute deviation, the standard deviation. These are all measures. But sometimes, you can just kind of gauge it by looking.

So, in this first comparison,
which distribution has a higher center, or are they comparable? Well, if you look at the distribution for Portland, the center of this distribution, let's say if we were to just think about the mean, although I think the mean and the median would be reasonably close right over here. It seems like it would be around seven or maybe a little bit lower than seven. So it would be kind of in that range. Maybe between five and seven would be our central tendency. It would be either our mean or our median, while for Minneapolis, it looks like our center is much closer to maybe negative two or negative three degrees Celsius.

So here, even though we don't know precisely what the mean or the median is of each of these distributions, you can say that the Portland distribution has a higher center, however you wanna measure it, either mean or median.

Now, what about the spread or variability? Well, if you just superficially
thought about range, you see here that there's nothing below one degree Celsius and nothing above 13. So you have about a 13-degree range at most right over here. In fact, what might be contributing to this first column? It might be a bunch of things at three degrees or even 3.9 degrees, and similarly, what's contributing to this last column might be a bunch of things at 10.1 degrees. But at most, you have a 12-degree range right over here, while over here, it looks like you have, well, it looks like it's approaching a 27-degree range.

So, based on that, and even if you just eyeball it, this is just, we're using the same scales for our horizontal axes here, the temperature axes, and this is just a much wider distribution than what you see over here. And so you would say that the Minneapolis distribution has more spread or a higher spread or more variability. So, higher spread right over here.

Let's do another example. And we'll use a different
representation for the data here. So we're told at the Olympic games, many events have several rounds of competition. One of these events is the men's 100-meter backstroke. The upper dot plot shows the times in seconds of the top eight finishers in the final round of the 2012 Olympics. So that's the green right over here, the final round. The lower dot plot shows the times of the same eight swimmers but in the semi-final round.

So, given these distributions, which one has a higher center? Well, once again, and here, you can actually, it's a little bit easier to eyeball here what the median might be. The mean, I would probably have to do a little bit more mathematics. But let's say the median... Let's see. There's one, two, three, four, five, six, seven, eight data points. So the median is gonna sit between the lower four and the upper four. So the central tendency right over here for the final round looks like it's around 57.1 seconds, especially if we think about the median, while the central tendency for the semi-final round, let's see. One, two, three, four, five, six, seven, eight. Looks like it is right about there. So this is about 57, more than 57.3 seconds.

So, the semi-final round seems to have a higher central tendency, which is a little bit counterintuitive. You would expect the finalists to be running faster on average than the semi-finalists, but that's not what this data is showing. So the semi-final round has higher center. Higher, higher center. And I just eyeballed the median. And I suspect that the mean would also be higher in
this second distribution.

And now what about variability? Well, once again, if you just looked at range, these are both at the same scale. If you just visually look, the variability here, the range for the final round is larger than the range for the semi-final round. So you would say that the final round has higher variability. It has a higher range. Eyeballing it, it looks like it has a higher spread, and there are of course times where one distribution could have a higher range but then it might have a lower standard deviation.

For example, you could have data that's like, you know, two data points that are really far apart, but then all the other data just sits really, really closely packed. So, for example, a distribution like this... I'll draw the horizontal axis here, just so you can imagine it as a distribution. A distribution like this might have a higher range but lower standard deviation than a distribution like this. I'm just drawing a very rough example. A distribution like this has a lower range but actually might have a higher standard deviation than the one above it. In fact, I can make that even better. A distribution like this would have a lower range but it would also have a higher standard deviation.

So it's not always the case that just by looking at one of these measures, the range or the standard deviation, you'll know for sure, but in cases like this, it's safe to say when you're looking at it by inspection that, look, this green, the final round data, does seem to have a higher range, higher variability, and so I feel pretty good. This is a very high-level comparison.
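
The closing point, that a distribution can have a higher range but a lower standard deviation, can be checked numerically. Below is a minimal sketch in Python. The two data sets are made up for illustration (they are not the ones drawn in the video), and the choice of the population standard deviation (`pstdev`) over the sample version is also an assumption.

```python
from statistics import median, pstdev

# Hypothetical data sets, invented for illustration only.
# Two far-apart points, with everything else tightly packed in the middle.
wide_but_packed = [0, 5, 5, 5, 5, 5, 5, 5, 5, 10]
# A smaller overall range, but the points sit at the two extremes.
narrow_but_split = [1, 1, 1, 1, 1, 9, 9, 9, 9, 9]


def summarize(data):
    """Return (median, range, population standard deviation) for a list of numbers."""
    return median(data), max(data) - min(data), pstdev(data)


for name, data in [
    ("wide_but_packed", wide_but_packed),
    ("narrow_but_split", narrow_but_split),
]:
    center, data_range, sd = summarize(data)
    print(f"{name}: median={center}, range={data_range}, sd={sd:.2f}")
```

With these numbers both data sets have the same median (5), but the first has the larger range (10 versus 8) while the second has the larger standard deviation (4 versus about 2.24). This is why a single measure of spread does not always settle the comparison.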