Variance as a measure of, on average, how far the data points in a population are from the population mean. Created by Sal Khan.
Video transcript
Let's say I'm trying to judge how many years of experience we have at the Khan Academy. Or, on average, how many years of experience we have. And the particular type of average we'll focus on is the arithmetic mean. So I go and I survey the folks there. And let's say this was when Khan Academy was a smaller organization, when there were only five people in the organization. And I'm surveying the entire population, because that's what I care about: years of experience at our organization, at Khan Academy.
We're now 36 people-- I don't want to date this video too much-- but this was when we had five people. So let's say I go, and I say, OK, there's one person straight out of college, or recently out of college, with one year of experience, somebody with three years of experience, someone with five years of experience, someone with seven years of experience, and someone very experienced, or reasonably experienced, with 14 years of experience. So these data points are our population for years of experience. I'm assuming that we only have five people in the organization at this point.
What would be the population mean for the years of experience? What is the mean years of experience for my population? Well, we can just calculate that. Our mean experience-- and I'm going to denote it with mu, because we're talking about the population now; this is a parameter for the population-- is going to be equal to the sum, from our first data point all the way to, in this case, data point five-- we have five data points-- of each data point. So we're going to take all of them, from the first data point, the second data point, the third data point, all the way to the fifth.
So this is going to be equal to x1, plus x2, plus x3, plus x4, plus x sub 5-- and I'm going to divide it all by the number of data points I have-- all of that over 5. And as we said, this is a very fancy way of saying, I'm going to sum up all of these things and then divide by the number of things we have. So let's do that. Get the calculator out. So I'm going to add them all up, 1 plus 3 plus 5-- I really don't need a calculator for this-- plus 7 plus 14. So that's five data points. And I'm going to divide by 5. And I get 6. So the population mean, for years of experience at my organization, is 6. 6 years of experience.
Well, that's, I guess, interesting. But now I want to ask another question. I want to get some measure of how much spread there is around that mean. Or how much the data points vary around that mean. And obviously, I can give someone all the data points. But instead, I actually want to come up with a parameter that somehow represents how much all of these things, on average, are varying from this number right here. And I will call that thing the variance.
And this is a population variance that I'm talking about; just to be clear, it's a parameter. The population variance I'm going to denote with the Greek letter sigma, lowercase sigma-- this is capital sigma-- lowercase sigma squared. And I'm going to say, well, I'm going to take the distance from each of these points to the mean. And just so I get a positive value, I'm going to square it. And then, I'm going to add those up and divide by the number of data points that I have. So essentially, I'm going to find the average squared distance. Now that might sound very complicated, but let's actually work it out. So I'll take my first data point and I will subtract our mean from it. So this is going to give me a negative number. But if I square it, it's going to be positive.
So it's, essentially, going to be the squared distance between 1 and my mean. And then, to that, I'm going to add the squared distance between 3 and my mean. And to that, I'm going to add the squared distance between 5 and my mean. And since I'm squaring, it doesn't matter if I do 5 minus 6, or 6 minus 5. When I square it, I'm going to get a positive result regardless. And then, to that I'm going to add the squared distance between 7 and my mean. So 7 minus 6 squared. In all of these, this 6 is my population mean that I'm finding the difference from. And then, finally, the squared difference between 14 and my mean. And then, I'm going to find, essentially, the mean of these squared distances. So I have five squared distances right over here. So let me divide by 5.
So what will I get when I make this calculation, right over here? Well, let's figure this out. This is going to be equal to: 1 minus 6 is negative 5, negative 5 squared is 25. 3 minus 6 is negative 3, now if I square that, I get 9. 5 minus 6 is negative 1, if I square it, I get positive 1. 7 minus 6 is 1, if I square it, I get positive 1. And 14 minus 6 is 8, if I square it, I get 64. And then, I'm going to divide all of that by 5. And I don't need to use a calculator, but I tend to make a lot of careless mistakes when I do things while making a video. So I get 25 plus 9 plus 1 plus 1 plus 64, which is 100, divided by 5. So I get 20.
So the average squared distance, or the mean squared distance, from our population mean is equal to 20. You may say, wait, these things aren't 20 away. Remember, it's the squared distance away from my population mean. So I squared each of these things. I like it, because it makes it positive. And we'll see later it has other nice properties.
Now the last thing is, how can we represent this mathematically? We already saw that we know how to represent a population mean, and a sample mean, mathematically like this, and hopefully, we don't find it that daunting anymore.
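
Before moving on to the notation, the arithmetic just worked through can be checked with a few lines of Python. This is only a minimal sketch: the variable names (`years`, `mu`, `squared_distances`, `variance`) are made up for illustration and are not from the video. The final line compares the hand-built result with `statistics.pvariance` from Python's standard library.

```python
import statistics

# Years of experience for the five-person population in the example.
years = [1, 3, 5, 7, 14]

# Population mean: sum of all data points divided by how many there are.
mu = sum(years) / len(years)

# Squared distance of each data point from the population mean.
squared_distances = [(x - mu) ** 2 for x in years]

# Population variance: the mean of those squared distances.
variance = sum(squared_distances) / len(years)

print(mu)                 # 6.0
print(squared_distances)  # [25.0, 9.0, 1.0, 1.0, 64.0]
print(variance)           # 20.0

# Cross-check against the standard library's population variance.
assert variance == statistics.pvariance(years)
```

Squaring is what turns the negative differences (such as 1 minus 6) into positive contributions, which is why the list of squared distances has no negative entries.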
But how would we do the exact same thing? How would we denote what we did, right over here? Well, let's just think it through. We're just saying that the population variance-- we're taking the sum of each-- so we're going to take each item; we'll start with the first item, and we're going to go to the n-th item in our population. We're talking about a population here. And we're not going to just take the item-- this would just be the item-- but we're going to take the item, and from that, we're going to subtract the population mean. And we're going to square it. So the way I've written it right now, this would just be the numerator. I've just taken the sum of the squared differences between each data point and the population mean. If I really want to get the variance the way I figured it out right over here, I have to divide the whole thing by the number of data points we have.
So this might seem very daunting, and very intimidating. But all it says is: first, figure out your population mean. And then, from each data point in your population, subtract out that population mean, square it, take the sum of all of those things, and then just divide by the number of data points you have. And you will get your population variance.
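
Since the transcript describes the notation without showing it, here is one way it might be written out. This is a sketch under assumptions: the symbol N for the number of data points in the population and the index i are choices made here (the transcript only says "the n-th item" and "x sub 5"), while mu and sigma squared follow the transcript. The right-hand sides plug in the five data points from the example.

```latex
\begin{align*}
\mu      &= \frac{1}{N}\sum_{i=1}^{N} x_i
          = \frac{1 + 3 + 5 + 7 + 14}{5} = 6 \\
\sigma^2 &= \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2
          = \frac{25 + 9 + 1 + 1 + 64}{5} = 20
\end{align*}
```

The sum in the second line is the numerator described above, and dividing it by N gives the population variance.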