Variance and standard deviation of a population
- [Narrator] So we have nine students who recently graduated from a small school that has a class size of nine, and they wanna figure out what the central tendency is for salaries one year after graduation. They also wanna have a sense of the spread around that central tendency. So they all agree to put their salaries into a computer, and these are their salaries, measured in thousands. One makes 35,000, three make 50,000, one makes 56,000, two make 60,000, one makes 75,000, and one makes 250,000. So she's doing very well for herself. The computer spits out a bunch of parameters based on this data. It spits out two typical measures of central tendency. The mean is roughly 76.2. The computer would calculate it by adding up all nine of these numbers and then dividing by nine. The median is 56, and the median is quite easy to calculate. You just order the numbers and take the middle number, which here is 56. Now what I want you to do is pause this video and think about, for this data set, for this population of salaries, which measure of central tendency is the better measure.
All right, so let's think about this a little bit. I'm gonna plot my data on a line so we get a better sense, so we don't just see things as numbers, but we see where those numbers sit relative to each other. So let's say this is zero. Let's see, one, two, three, four, five. So this would be 250, and this is 50, 100, 150, 200. Let's say if this is 50, then this would be roughly 40 right here, and I just wanna get rough. So this would be about 60, 70, 80, 90, close enough. I could draw this a little bit neater, but 60, 70, 80, 90. Actually, let me just clean this up a little bit more too. This one right over here would be a little bit closer to this one. Let me just put it right around here.
So that's 40, and then this would be 30, 20, 10. Okay, that's pretty good. So let's plot this data. One student makes 35,000, so that is right over there. Three make 50,000, so one, two, and three. I'll put it like that. One makes 56,000, which would put them right over here. Two make 60,000, so it's like that. One makes 75,000, so that's 60, 70, 75,000. So it's gonna be right around there. And then one makes 250,000, so that one salary is all the way over there. And then when we calculate the mean, 76.2, as our measure of central tendency, 76.2 is right over there.
So is this a good measure of central tendency? Well, to me it doesn't feel that good, because our measure of central tendency is higher than all of the data points except for one. The reason is that our data is skewed significantly by this data point at $250,000. It is so far from the rest of the distribution, from the rest of the data, that it has skewed the mean, and this is something that you see in general with skewed data, especially things like salary data, where most people are making 50, 60, $70,000, but someone might make two million dollars. That will skew the average, or skew the mean I should say, when you add them all up and divide by the number of data points you have. In this case, and especially whenever you have data points that would skew the mean, the median is much more robust. The median at 56 sits right over here, which seems to be much more indicative of central tendency. And think about it. Even if instead of 250,000 you made this 250,000 thousand, which would be 250 million dollars, which is a ginormous amount of money to make, it would skew the mean incredibly, but it actually would not even change the median, because for the median it doesn't matter how high this number gets. This could be a trillion dollars.
This could be a quadrillion dollars. The median is going to stay the same. So the median is much more robust if you have a skewed data set. The mean makes a little bit more sense if you have a symmetric data set, where things are roughly balanced above and below the mean, or where things aren't skewed incredibly in one direction, especially by a handful of data points like we have right over here. So in this example, the median is a much better measure of central tendency.
And so what about spread? Well, you might say, Sal, you already told us that the mean is not so good, and the standard deviation is based on the mean. You take each of these data points, find its distance from the mean, square that number, add up those squared distances, divide by the number of data points if we're taking the population standard deviation, and then you take the square root of the whole thing. Since this is based on the mean, which isn't a good measure of central tendency in this situation, this one data point is also going to skew the standard deviation. It comes out a lot larger than what you'd want as an indication of the spread of most of the data. Yes, you have this one data point that's way far away from either the mean or the median, depending on how you wanna think about it, but most of the data points seem much closer together. So for this situation, not only are we using the median, but the interquartile range is once again more robust.
How do we calculate the interquartile range? Well, you take the median, and then you take the bottom group of numbers, the ones below the median, and calculate the median of those. So that's 50 right over here. And then you take the top group of numbers, the upper group of numbers, and the median there is halfway between 60 and 75, so it's 67.5. If this looks unfamiliar, we have many videos on interquartile range and on calculating standard deviation, median, and mean.
This is just a little bit of a review. And then the difference between these two is 17.5, and notice, this distance between these two, this 17.5, isn't going to change even if this is 250 billion dollars. So once again, both of these measures are more robust when you have a skewed data set.
So the big takeaway here is that mean and standard deviation are not bad if you have a roughly symmetric data set. If you don't have any significant outliers, things that really skew the data set, mean and standard deviation can be quite solid. But if you're looking at something that could get really skewed by a handful of data points, you might prefer median and interquartile range: median for central tendency, interquartile range for spread around that central tendency. That's why, when people talk about salaries, they'll often talk about the median, because you can have some skewed salaries, especially on the upside. When we talk about things like home prices, you'll see the median reported more often than the mean, because in a neighborhood or in a city a lot of the houses might be in the $200,000 to $300,000 range, but maybe there's one ginormous mansion that is 100 million dollars, and if you calculated the mean, that would skew it and give a false impression of the average, or the central tendency, of prices in that city.
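If you wanna check these numbers yourself, here is a minimal Python sketch of the calculations described in the video, using the nine salaries in thousands. The function names `iqr_by_halves` and `summarize` are made up for this illustration, not something from the video. The interquartile range is computed the way it is described above, as the median of the upper half minus the median of the lower half, leaving out the middle value when the count is odd; other quartile conventions exist and can give different answers.

```python
import statistics

# Salaries in thousands of dollars, as listed in the video.
salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]


def iqr_by_halves(values):
    """Interquartile range as described in the video (hypothetical helper).

    Takes the median of the lower half and the median of the upper half,
    excluding the overall median when the number of values is odd.
    Assumes at least two values.
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the median position
    upper = ordered[-half:]   # values above the median position
    return statistics.median(upper) - statistics.median(lower)


def summarize(values):
    """Return the four measures discussed in the video (hypothetical helper)."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # pstdev divides by the number of data points (population version).
        "pstdev": statistics.pstdev(values),
        "iqr": iqr_by_halves(values),
    }


original = summarize(salaries)

# Replace the 250 (thousand) salary with 250,000 thousand, i.e. 250 million.
extreme = summarize(salaries[:-1] + [250_000])

for name in ("mean", "median", "pstdev", "iqr"):
    print(f"{name:>6}: original={original[name]:.1f}  extreme={extreme[name]:.1f}")
```

Running this, the median stays at 56 and the interquartile range stays at 17.5 in both cases, while the mean and the population standard deviation both jump when the top salary gets bigger, which is the robustness point made above.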