Statistics and probability
Learn to choose the "preferred" measures of center and spread when outliers are present in a set of data.
- Take 1, 2, 3, 1000, 2000, 10000, 20000.
The median is 1000.
It just tries to stay in between.
The mean is like finding a point that is closest to all the values, but it gets skewed.
If the mean is a poor summary of a distribution, then so is the SD, obviously.
Standard deviation measures how far points typically deviate from the mean.
For two datasets, the one with a bigger range is more likely to be the more dispersed one.
IQR is like focusing on the middle portion of the sorted data, so it doesn’t get skewed.
Why not use the IQR range only?
Or compute the standard deviation using the median instead of the mean?
Or create levels expanding from the IQR range: level 1, level 2?
Is it a good idea? (8 votes)
- When you perform an exploratory data analysis you may be interested in the range.
There is no such thing as an IQR range. The IQR is itself a form of range (the interquartile range).
There is no such thing as levels in the IQR. But perhaps you can create a new feature if you feel it is necessary. (2 votes)
- How about mode? Wouldn't that often be more reliable? Like when calculating the average salary in a large population - would the amount most people make not seem the most representative? (4 votes)
- If median and IQR are preferred when there are outliers, doesn't that imply that they are more accurate when there is any variance at all?
The only case where mean and standard deviation are going to be as accurate as median and IQR is if there is no variance at all in the data.
With that being said, is there any situation where mean and standard deviation would be preferable? (4 votes)
- What does the standard deviation have to do with the IQR? (1 vote)
- They are both measures of how far the typical data point is from the center, either the mean or the median, depending on which you use. (7 votes)
- Why can't we mix and match? Since we figured out that the median captures central tendency better, why can't we use the median in the standard deviation formula? Wouldn't that better capture the total variance/spread in the data set? (3 votes)
- Interesting idea,
and it would somewhat reduce the distortion caused by a biased mean.
But the skew, and thus the bias from an outlier, remains even if you use the median to calculate the standard deviation, because the outlier's own squared distance still enters the sum.
I think that's why it is better to rely on the IQR in that type of situation, as it can simply ignore overly extreme cases. (1 vote)
- I have 2 questions. The first one is on variance: why did the previous video refer to it as the biased sample variance, and what does that mean? The second question is about the term skew: what does it mean here? Thank you. (2 votes)
- Would the mean be robust if there are outliers on both sides of the main group of data points? (1 vote)
- Still no, because it is unknown how drastically the outliers differ from each other. For example, if most of the data were from 50-60, one of the outliers could be 30 while another outlier is 200. Thus, as a general rule, if there are any outliers, use the median. (3 votes)
- If the mean is 80, how far away is 60 and in what direction? (2 votes)
- What is the minimum number of points for using each of the choices, standard deviation or IQR? (1 vote)
- Look at the spread and use your own intuitive reasoning. If you feel the data is roughly symmetrical around the mean, then use standard deviation; otherwise go with IQR. (2 votes)
- [Narrator] So we have nine students who recently graduated from a small school that has a class size of nine, and they wanna figure out what the central tendency is for salaries one year after graduation. And they also wanna have a sense of the spread around that central tendency one year after graduation. So they all agree to put their salaries into a computer, and these are their salaries. They're measured in thousands. So one makes 35,000, three make 50,000, one makes 56,000, two make 60,000, one makes 75,000, and one makes 250,000. So she's doing very well for herself.
The computer spits out a bunch of parameters based on this data. It spits out two typical measures of central tendency. The mean is roughly 76.2. The computer would calculate it by adding up all nine of these numbers and then dividing by nine. The median is 56, and the median is quite easy to calculate. You just order the numbers and take the middle number, which here is 56. Now what I want you to do is pause this video and think about, for this data set, for this population of salaries, which measure of central tendency is a better measure?
All right, so let's think about this a little bit. I'm gonna plot my data on a line so we don't just see things as numbers, but we see where those numbers sit relative to each other. So let's say this is zero. Let's see, one, two, three, four, five. So this is 50, 100, 150, 200, and this would be 250. And if this is 50, then this would be roughly 40 right here, and I just wanna get rough. So this would be about 60, 70, 80, 90, close enough. I could draw this a little bit neater. Actually, let me clean this up a little bit more. This one right over here would be a little bit closer to this one. Let me just put it right around here.
So that's 40, and then this would be 30, 20, 10. Okay, that's pretty good. So let's plot this data. One student makes 35,000, so that is right over there. Three make 50,000, so one, two, and three. I'll put it like that. One makes 56,000, which would put them right over here. Two make 60,000, so it's like that. One makes 75,000, so that's 60, 70, 75,000. So it's gonna be right around there, and then one makes 250,000. So one salary is all the way around there.
And then when we calculate the mean as 76.2 as our measure of central tendency, 76.2 is right over there. So is this a good measure of central tendency? Well, to me it doesn't feel that good, because our measure of central tendency is higher than all of the data points except for one, and the reason is that our data is skewed significantly by this data point at $250,000. It is so far from the rest of the distribution, from the rest of the data, that it has skewed the mean, and this is something that you see in general. If you have data that is skewed, especially things like salary data where most people are making 50, 60, $70,000 but someone might make two million dollars, that will skew the average, or skew the mean I should say, when you add them all up and divide by the number of data points you have.
In this case, especially when you have data points that would skew the mean, the median is much more robust. The median at 56 sits right over here, which seems to be much more indicative of central tendency. And think about it. Even if instead of 250,000 you made this 250,000 thousand, which would be 250 million dollars, which is a ginormous amount of money to make, it would skew the mean incredibly, but it actually would not even change the median, because for the median it doesn't matter how high this number gets. This could be a trillion dollars.
This could be a quadrillion dollars. The median is going to stay the same. So the median is much more robust if you have a skewed data set. The mean makes a little bit more sense if you have a symmetric data set, where things sit roughly evenly above and below the mean, or where things aren't skewed incredibly in one direction, especially by a handful of data points like we have right over here. So in this example, the median is a much better measure of central tendency.
And so what about spread? Well, you might say, Sal, you already told us that the mean is not so good, and the standard deviation is based on the mean. You take each of these data points, find their distance from the mean, square that number, add up those squared distances, divide by the number of data points if we're taking the population standard deviation, and then you take the square root of the whole thing. And since this is based on the mean, which isn't a good measure of central tendency in this situation, the standard deviation is going to be skewed as well. It is going to be a lot larger than you would want for an indication of the actual spread. Yes, you have this one data point that's way far away from either the mean or the median, depending on how you wanna think about it, but most of the data points seem much closer, and so for that situation, not only are we using the median, but the interquartile range is once again more robust.
How do we calculate the interquartile range? Well, you take the median, and then you take the bottom group of numbers, the ones below the median, and calculate the median of those. So that's 50 right over here. And then you take the top group of numbers, the upper group of numbers, and the median there is halfway between 60 and 75, so it's 67.5. If this looks unfamiliar, we have many videos on interquartile range and calculating standard deviation and median and mean.
This is just a little bit of a review, and then the difference between these two is 17.5, and notice, this distance between these two, this 17.5, isn't going to change, even if this is 250 billion dollars. So once again, both of these measures are more robust when you have a skewed data set.
So the big takeaway here is that mean and standard deviation are not bad if you have a roughly symmetric data set. If you don't have any significant outliers, things that really skew the data set, mean and standard deviation can be quite solid. But if you're looking at something that could get really skewed by a handful of data points, median and interquartile range might be better: median for central tendency, interquartile range for spread around that central tendency. That's why when people talk about salaries they'll often talk about the median, because you can have some skewed salaries, especially on the up side. When we talk about things like home prices, you'll often see the median used rather than the mean, because in a neighborhood or in a city a lot of the houses might be in the 200,000, $300,000 range, but maybe there's one ginormous mansion that is 100 million dollars, and if you calculated the mean, that would skew it and give a false impression of the average or the central tendency of prices in that city.
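As a rough check of the numbers in the transcript, here is a minimal Python sketch that recomputes the mean, median, population standard deviation, and interquartile range for the nine salaries, then repeats the calculation with the top salary raised to 250 million. The helper names `iqr` and `summarize` are hypothetical and chosen only for this illustration. The quartiles are taken as the medians of the lower and upper halves, leaving out the middle value when the count is odd, which is the method described in the video; other quartile conventions can give slightly different results.

```python
from statistics import mean, median, pstdev


def iqr(values):
    """Interquartile range via the median-of-halves method (an assumption
    matching the transcript; the middle value is excluded for odd counts)."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values to form two halves")
    half = len(ordered) // 2
    lower = ordered[:half]                  # values below the median position
    upper = ordered[len(ordered) - half:]   # values above the median position
    return median(upper) - median(lower)


def summarize(values):
    """Return the four summary statistics discussed in the transcript."""
    return {
        "mean": mean(values),
        "median": median(values),
        "pstdev": pstdev(values),  # population SD: divide by n, then square root
        "iqr": iqr(values),
    }


# Salaries in thousands of dollars, as listed in the transcript.
salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]

# Same data, but the top salary becomes 250,000 thousand (250 million dollars).
extreme = salaries[:-1] + [250_000]

for label, data in (("original", salaries), ("extreme", extreme)):
    stats = summarize(data)
    print(label, {name: round(value, 1) for name, value in stats.items()})
```

Running this should show the mean and standard deviation changing enormously between the two data sets while the median stays at 56 and the interquartile range stays at 17.5, which is the robustness argument made in the video.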