© 2024 Khan Academy

# Clusters, gaps, peaks & outliers

This lesson explores the features of distributions in data sets, like clusters, gaps, and peaks. We learn how to identify outliers, which are data points far from the rest. We also discuss how to spot peaks and clusters in data.

## Want to join the conversation?

- What is an outlier?

What is a range?

What is an interquartile range?

What is mean?

What is median?

What is mode?

What is a lower quartile?

What is an upper quartile?

I have them all mixed and am so confused.

PLEASE HELP! 😕(7 votes)
- Outlier - a data value that is very different from the rest of the data.

Range - the highest number minus the lowest number.

Interquartile range - Q3 minus Q1.

Mean - the average of the data (add up all the numbers, then divide by how many numbers you added).

Median - the number in the middle of the data once the numbers are in order. If there is an even number of values, it is the average of the two middle numbers.

Mode - whichever number occurs most often.

Lower quartile - Q1 - the middle of the bottom half of the data. If you take the median, it's the middle of the data below (to the left of) the median. It's basically the number at the 1st quarter of the data.

Upper quartile - Q3 - the middle of the data above the median, the value at the 3rd quarter of the data.(39 votes)
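The definitions above can be tried out on a small list of numbers. The following is only a sketch: the data is made up, the helper names `quartiles` and `summarize` are hypothetical, and it assumes the "median of each half" convention for quartiles described in the answer (other conventions can give slightly different values).

```python
from statistics import mean, median, multimode


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: with an odd number of values, the overall median is
    left out of both halves.
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the median
    upper = ordered[-half:]   # values above the median
    return median(lower), median(upper)


def summarize(values):
    """Compute the measures named in the answer above."""
    q1, q3 = quartiles(values)
    return {
        "range": max(values) - min(values),  # highest minus lowest
        "mean": mean(values),                # sum divided by count
        "median": median(values),            # middle value when sorted
        "modes": multimode(values),          # most frequent value(s)
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,                      # Q3 minus Q1
    }


data = [1, 2, 2, 3, 4, 6, 7, 8, 9]  # made-up example data
summary = summarize(data)
print(summary)
```

For this made-up list the median is 4, the only mode is 2, Q1 is 2, and Q3 is 7.5, so the interquartile range is 5.5.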

- What is the exact meaning of an outlier?(5 votes)
- 1) A data point that is distinctly separate from the rest of the data.

2) Any data point more than 1.5 interquartile ranges (IQRs) below the first quartile or above the third quartile.

www.mathwords.com/o/outlier.htm(25 votes)

- Can you please explain peak?(5 votes)
- A peak, like he said in the video, is the highest point of the distribution: the value with the most data points, at least compared with the values around it.(5 votes)

- A gap in a distribution refers to an interval where there are no data points present. It represents a break or absence in the continuity of the data.(3 votes)
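For whole-number data like a dot plot, that description of a gap can be turned into a quick check. This is a sketch under that assumption (integer values only); `find_gaps` is a hypothetical helper and the data is made up.

```python
def find_gaps(values):
    """Return (start, end) runs of whole numbers between the minimum
    and maximum that have no data points."""
    present = set(values)
    gaps = []
    run_start = None
    for x in range(min(values), max(values) + 1):
        if x not in present:
            if run_start is None:
                run_start = x          # a new empty run begins here
        elif run_start is not None:
            gaps.append((run_start, x - 1))  # the empty run just ended
            run_start = None
    return gaps


data = [0, 0, 1, 1, 1, 4, 5, 6, 6, 10]  # made-up dot plot values
print(find_gaps(data))
```

Here the values 2 to 3 and 7 to 9 have no dots, so those two intervals are the gaps.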

- What's an outlier?(4 votes)
- An outlier is a piece of data that is far away from other data.(6 votes)

- What is a cluster? Please explain.(6 votes)
- A cluster is a group of data points bunched close together. For example, if the plot ran from 4 to 9 and the values 6 through 8 each had 3 dots, the cluster would be from 6 to 8.(1 vote)

- What's spread?(3 votes)
- In statistics, spread is a measure of the variation of the data. Examples are the range (difference between the maximum and minimum values), the mean absolute deviation (average distance between each point and the mean), and the interquartile range (distance between the lower and upper quartiles).(0 votes)
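Of these, the mean absolute deviation is the one most often miscalculated, so a short sketch may help. It assumes the usual definition, the average distance of each point from the mean; the function name and the data are made up.

```python
from statistics import mean


def mean_absolute_deviation(values):
    """Average distance between each value and the mean of the values."""
    center = mean(values)
    return mean([abs(v - center) for v in values])


data = [2, 4, 4, 6, 9]  # made-up data; the mean is 5
mad = mean_absolute_deviation(data)
data_range = max(data) - min(data)
print(mad, data_range)
```

The distances from the mean are 3, 1, 1, 1, and 4, which average to 2, while the range of the same data is 7.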

- Is the peak the mode of a frequency distribution?

Is the outlier the tail of the distribution?(4 votes)
- The mode is the value that occurs most often. A peak doesn't have to be the mode: a distribution can have several peaks, and only the tallest one sits at the mode. An outlier is not the tail. An outlier is an outlier. I kind of think of it as being the odd one out of the bunch. For example, if everyone else was wearing pink, purple, yellow, blue, or green T-shirts and you wore a black T-shirt, you would be the outlier. A tail is the part of the distribution where the data gradually thins out toward one side. If it thins out toward the left, the distribution is left-tailed.

Hope this helps,

Aliana(4 votes)

- Is an outlier a small set of data separated from all the big clusters?(4 votes)
- It's usually only one data point (I think)(3 votes)

- A few questions about outliers:

1. Let's say there are two clusters on the graph with a huge gap in between.

Would the data in one cluster be considered outliers with respect to the other cluster? Or does this distribution not have any outliers at all?

2. Let's say that this time there is a cluster on one side of the graph. After the cluster, the counts are just low, but there are no gaps. Then, after a while, one value has an abnormally high number of data points.

Is this considered an outlier?

(shown below)

.*________*.____

...*_______*.____

...*_______*.____

.......................*__*-> space/blank (the comment box won't accept more than one space in a row, which is why I used underscores)

. -> data point(5 votes)
- I think that you would not consider it an outlier, since it is a significant part of your data.(2 votes)

## Video transcript

- [Voiceover] In this video, I wanna do some examples looking at distributions, in particular, different features in distributions like clusters, gaps, and peaks. So over here, I wanna do some examples.

Which of the following are accurate descriptions of the distribution below? Select all that apply. So the first statement is the distribution has an outlier. So an outlier is a data point that's way off of where the other data points are, it's way larger or way smaller than where all of the other data points seem to be clustered, and if we look over here, we have a lot of data points between zero and six. And let's just think about what they're measuring: this is shelf time for each apple at Gorg's Grocier. So, for example, we see there's one, two, three, four, five, six, seven apples that have a shelf life of zero days, so (laughs), they're about to go bad. You see you have one, two, three, four, five, six, seven, eight apples that are gonna be good for another day. You have two apples that are gonna be good for another six days, and you have one apple that's gonna be good for 10 days, and this is unusual. This is an outlier here, it has a way larger shelf life than all of the other data, so I would say this definitely does have an outlier. We just have this one data point sitting all the way to the right, way larger, way more shelf life than everything else, so it definitely has an outlier, and this one would be the outlier.

The distribution has a cluster from four to six days. And we indeed do see a cluster from four to six days. A cluster, you can imagine, it's a grouping of data that's sitting there, or you have a grouping of apples that have a shelf life between four and six days, and you definitely do see that cluster there. And since I already selected two things, I'm definitely not gonna select none of the above. Let me check my answer.

Let me do a few more of these. Which of the following are accurate descriptions of the distribution below? And once again we're going to select all that apply. So the distribution has an outlier. So let's see this distribution. I do have a data point here that is at the high end and I have another data point here that's at the low end, but I don't have any data points that are sitting far above or far below the bulk of the data. If I had a data point that was out here, then yeah, I would say that was an outlier to the right, or a positive outlier. If I had a data point way to the left off the screen over here, maybe that would be an outlier, but I don't really see any obvious outliers. All of the data, it's pretty clustered together. So I would not say that the distribution has an outlier.

The distribution has a peak at 22 degrees. Yeah, it does indeed look like we have, and let's just look at what we're actually measuring: high temperature each day in Edgeton, Iowa in July. So it does indeed look like we have the most number of days that had a high temperature at 22, most number of days in July had a high temperature at 22 degrees Celsius, so that is a peak. You can see it, if you imagine this as kind of a mountain, this is a peak right here, this is a high point. You have, at least locally, the most number of days at 22 degrees Celsius. So I would say it definitely has a peak there. Since I selected something, I'm not gonna select none of the above.

Let's do a couple more of these. Which of the following are accurate descriptions of the distribution below? So the first one, the distribution has an outlier. So... number of guests by day at Seth's Sandwich Shop. So, let's see, the lowest... They have no days... No days where he had between zero and 19 guests, no days where he had between 20 and 39 guests, looks like there's about nine days where he had between 40 and 59 guests, looks like 20 days where he had between 60 and 79 guests, all the way where it looks like maybe 8 days that he had between 180 and 199 guests. But the question of outliers, there doesn't seem to be any day where he had an unusual number of guests. There's not a day that's way out here, where he had, like, 500 guests. So I would say this distribution does not have an outlier.

The distribution has a cluster from zero to 39 guests. So zero to 39 guests is right over here, zero to 39 guests. And there are no days where he had between zero and 39 guests, neither zero to 19, nor 20 to 39. So there's definitely not a cluster there. I would say that the cluster would be between days that had between 40 and 199 guests. Definitely not zero and 39, there were no days that were between zero and 39 guests. So I would say none of the above very confidently.

Let's do one more of these. Which of the following are accurate descriptions of the distribution below? (laughs) Alright. The distribution has a peak from 12 to 13 points. Let me see what this is measuring, what this data is about. Test scores by student in Mrs. Frine's class. So you had one student who got between a zero and a one on a 20-point scale, so got between, I guess out of 20 questions, got between zero and one point. And then you see that no students got between two and three, or four and five, or six and seven. Then we have another student who got between eight and nine, looks like three students got between 10 and 11, and then we keep increasing, this looks like about 12 students got either a 16 or a 17, or something in between maybe, if you could get decimal points on that test. And then it looks like 10 students got from 18 to 19. Alright, so this says the distribution has a peak from 12 to 13 points, 12 to 13 points, there were five students, but this isn't a peak. If you just go to 14 to 15 points, you have more students. So this is definitely not a peak. If you were looking at this as a mountain of some kind, you definitely wouldn't describe this point as a peak. You would say this distribution has a peak, it has the most number of students who got between 16 and 17 points, so that's the peak right there, not 12 to 13 points. So I would not select that first choice.

The distribution has an outlier. Well, yeah, look at this: you have this outlier. Most of the students scored between eight and 19 points, and then you have this one student who got between zero and one, it's really an outlier. You even see this when you look at it visually, it's not even connected to the rest of the distribution. It's way to the left. If something is way to the left or way to the right, that's an outlier if it's unusually low or unusually high. So I would say this distribution definitely does have an outlier, and I'm not gonna pick none of the above since I found a choice. And I think we're all done.
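The peak reasoning in the last example (a bin is a peak only if it is taller than the bins next to it) can be sketched in a few lines. The counts below are read from the transcript where it states them; the 14-15 count is an assumption, since the transcript only says it is higher than the five students at 12-13, and the 16-17 count is described as "about 12". The helper names are hypothetical.

```python
# Bin labels for the test-score histogram described in the transcript.
labels = ["0-1", "2-3", "4-5", "6-7", "8-9",
          "10-11", "12-13", "14-15", "16-17", "18-19"]

# ASSUMPTION: the 8 for the 14-15 bin is a placeholder (the transcript
# only says it is more than 5), and the 12 for 16-17 is approximate.
counts = [1, 0, 0, 0, 1, 3, 5, 8, 12, 10]


def is_peak(counts, i):
    """True if bin i is taller than both neighbors.

    A missing neighbor at either end is treated as a count of 0.
    """
    left = counts[i - 1] if i > 0 else 0
    right = counts[i + 1] if i < len(counts) - 1 else 0
    return counts[i] > left and counts[i] > right


def tallest_bin(labels, counts):
    """Label of the bin with the highest count (first one if tied)."""
    return labels[counts.index(max(counts))]


peak_12_13 = is_peak(counts, labels.index("12-13"))
peak_16_17 = is_peak(counts, labels.index("16-17"))
print(peak_12_13, peak_16_17, tallest_bin(labels, counts))
```

With these counts the 12-13 bin is not a peak because the 14-15 bin next to it is taller, while the 16-17 bin is taller than both of its neighbors, which matches the conclusion in the video.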