# Sample variance

Thinking about how we can estimate the variance of a population by looking at the data in a sample. Created by Sal Khan.
Video transcript
Let's say that you're curious about people's TV watching habits. And in particular, how much TV do people in the country watch? So what you are concerned with, if we imagine the entire country-- and we've already talked about this-- especially if we're talking about a country like the United States, but pretty much any country, is a very large population. In the United States, we're talking about on the order of 300 million people. So ideally, if you could somehow magically do it, you would survey or somehow observe all 300 million people and take the mean of how many hours of TV they watch on a given day. And then that will give you the parameter, the population mean.

But we've already talked about how, in a case like this, that's very impractical. Even if you tried to do it, by the time you did it, your data might be stale because some people might have passed away, other people might have been born. Who knows what might have happened. And so this is a truth that is out there. There is a theoretical population mean for the mean hours of TV watched per day by Americans. There is a truth here at any given point in time. It's just pretty much impossible to come up with the exact answer, to come up with this exact truth.

But you don't give up. You say, well, maybe I don't have to survey all 300 million or observe all 300 million. Instead, I'm just going to observe a sample, right over here. And let's say, to make the computation simple, you do a sample of six. And we'll talk about later why six might not be as large of a sample as you would like. But you survey how much TV these folks watch. And you find one person who watched 1 and 1/2 hours. Another person watched 2 and 1/2 hours. Another person watched 4 hours. And then you get one person who watched 2 hours. And you get two people who watched 1 hour each. So given this data from your sample, what do you get as your sample mean?
Well, the sample mean, which we would denote by lowercase x with a bar over it, is just the sum of all of these divided by the number of data points we have. So let's see, we have 1.5 plus 2.5 plus 4 plus 2 plus 1 plus 1. And all of that divided by 6, which gives-- let's see, the numerator: 1.5 plus 2.5 is 4, plus 4 is 8, plus 2 is 10, plus 1 plus 1 is 12. So it's going to be 12 over 6, which is equal to 2 hours of television. So at least for your sample, you say, my sample mean is 2 hours of television. It's an estimate. It's a statistic that is trying to estimate this parameter, this thing that's very hard to know. But it's our best shot. Maybe we get a better answer if we get more data points. But this is what we have so far.

Now the next question you ask yourself is, well, I don't want to just estimate my population mean. I also want to estimate another parameter. I am also interested in estimating my population variance. So once again, since we can't survey everyone in the population, this is pretty much impossible to know. But we're going to attempt to estimate this parameter. We attempted to estimate the mean. Now we will also attempt to estimate this variance parameter.

So how would you do it? Well, reasonable logic would say, well, maybe we'll do the same thing with a sample as we would have done with the population. When you're doing the population variance, you would take each data point in the population, find the distance between that and the population mean, take the square of that difference, and then add up all the squares of those differences, and then divide by the number of data points you have. So let's try that over here. So let's take each of these data points-- let me do that in a different color-- and find the difference between that data point and our sample mean-- not the population mean, we don't know what the population mean is-- the sample mean.
So that's the first data point, 1.5 minus 2 squared, plus the second data point-- so it's 4 minus 2 squared-- plus 1 minus 2 squared. And this is what you would have done if you were taking a population variance. If this was your entire population, this is how you would find a population mean here. And you would find the squared distances from each of those data points and then divide by the number of data points. So let's just think about this a little bit. After 1 minus 2 squared, you have 2.5 minus 2-- 2 being the sample mean-- squared. Let me use this green color. Plus 2 minus 2 squared. Plus 1 minus 2 squared. And then maybe you would divide by the number of data points that you have. So in this case, we're dividing by 6.

And what would we get in this circumstance? Well, if we just do the computation, 1.5 minus 2 is negative 0.5. We square that. This becomes a positive 0.25. 4 minus 2 squared is going to be 2 squared, which is 4. 1 minus 2 squared-- well, that's negative 1 squared, which is just 1. 2.5 minus 2 is 0.5, and that squared is 0.25. 2 minus 2 squared-- well, that's just 0. And then 1 minus 2 squared is negative 1 squared. So we just get 1. And if we add all of this up-- let me add the whole numbers first. 4 plus 1 is 5, plus 1 is 6, and then we have two 0.25s. So this is going to be equal to 6.5-- let me write this in a neutral color. So this is going to be 6.5 over this 6 right over here. Well, there's a couple of ways we could write this, but I'll just get the calculator out and we can just calculate it. So 6.5 divided by 6 gets us-- if we round, it's approximately 1.08. So this calculation gives approximately 1.08.

Now what we have to think about is whether this is the best calculation, whether this is the best estimate for the population variance, given the data that we have. You can always argue that we could have more data.
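The arithmetic so far can be checked with a few lines of Python. This is only a sketch for illustration, not something shown in the video; the variable names (`hours`, `sum_sq`, `variance_div_n`) are made up here, and the data points are listed in the order they are squared above.

```python
# Hours of TV watched by the six people in the sample.
hours = [1.5, 4, 1, 2.5, 2, 1]

n = len(hours)                    # 6 data points
sample_mean = sum(hours) / n      # 12 / 6 = 2.0

# Squared distance of each data point from the sample mean.
squared_deviations = [(x - sample_mean) ** 2 for x in hours]
sum_sq = sum(squared_deviations)  # 0.25 + 4 + 1 + 0.25 + 0 + 1 = 6.5

# Divide by the number of data points, as in the calculation above.
variance_div_n = sum_sq / n       # 6.5 / 6

print(sample_mean, sum_sq, round(variance_div_n, 2))  # prints: 2.0 6.5 1.08
```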
But given the data we have, is this the best calculation that we can make to estimate the population variance? And I'll have you think about that for a second. Well, it turns out that this is close, this is close to the best calculation, the best estimate that we can make, given the data we have. And sometimes this will be called the sample variance. But it's a particular type of sample variance where we just divide by the number of data points we have. And so people will write just an n over here. So this is one way to define a sample variance in an attempt to estimate our population variance.

But it turns out-- and in the next video I'll give you an intuitive explanation of why it turns out this way, and I would also like to write a computer simulation that, at least experimentally, makes you feel a little bit better-- you're going to get a better estimate for your population variance if you don't divide by 6, if you don't divide by the number of data points you have, but you divide by one less than the number of data points you have. It's a little bit weird and voodooish when you first think about it.

So how would we do that? And we can denote that as sample variance. So when most people talk about the sample variance, they're talking about the sample variance where you do this calculation, but instead of dividing by 6 you divide by 5. So they would say you divide by n minus 1. So what would we get in those circumstances? Well, the top part is going to be the exact same thing. We're going to get 6.5. But then our denominator-- our n is 6. We have 6 data points. But we're going to divide by 1 less than 6. We're going to divide by 5. And 6.5 divided by 5 is equal to 1.3. So when we calculate our sample variance with this technique, which is the more mainstream technique, we get 1.3. And it seems voodoo.
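To see the two denominators side by side, here is a small Python sketch. It is an illustration only, not the simulation mentioned for the next video; the variable names are assumptions. It also cross-checks against `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n minus 1) from Python's standard library.

```python
import math
import statistics

# Hours of TV watched by the six people in the sample.
hours = [1.5, 4, 1, 2.5, 2, 1]

n = len(hours)
sample_mean = sum(hours) / n
sum_sq = sum((x - sample_mean) ** 2 for x in hours)  # 6.5

divide_by_n = sum_sq / n                # 6.5 / 6, about 1.08
divide_by_n_minus_1 = sum_sq / (n - 1)  # 6.5 / 5 = 1.3

# The standard library offers both versions:
# pvariance divides by n, variance divides by n - 1.
assert math.isclose(divide_by_n, statistics.pvariance(hours))
assert math.isclose(divide_by_n_minus_1, statistics.variance(hours))

print(round(divide_by_n, 2), round(divide_by_n_minus_1, 2))  # prints: 1.08 1.3
```

The top part of the fraction is the same in both cases; only the denominator changes, so the n minus 1 version is always the larger of the two.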
Why are we dividing by n minus 1, whereas for a population variance we divide by n? But remember, we're trying to estimate the population variance. And it turns out that this is a better estimate. Because the calculation where we divide by n tends to underestimate what the population variance is, dividing by n minus 1 is a better estimate. We don't know for sure what it is. These both could be way off. It could be just by chance what we happen to sample. But over many samples-- and there are many ways to think about it-- this is going to be a better calculation. It's going to give you a better estimate.

And so how would we write this down? How would we write this down with mathematical notation? Well, remember, we're taking the sum. And we're taking each of the data points. So we'll start with the first data point and go all the way to the nth data point. This lowercase n says that, hey, we're looking at the sample. If I have an uppercase N, that usually denotes that we're trying to sum up everything in the population. Here we're looking at a sample of size lowercase n. And we're taking each data point, so each x sub i, and from it we're subtracting the sample mean. And then we're squaring it. We're taking the sum of the squared distances. And then we're dividing, not by the number of data points we have, but by 1 less than the number of data points we have.

So this calculation, where we just summed up all of this and then we divided by 5, not by 6, this is the standard definition of sample variance. So I'll leave you there. In the next video, I will attempt to give you an intuition of why we're dividing by n minus 1 instead of dividing by n.
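Written out, the two calculations described in this transcript look roughly like this. The subscripts on s squared are a labeling choice made here to tell the two versions apart, so treat them as an assumption rather than fixed notation; x sub i is each data point, x bar is the sample mean, and lowercase n is the sample size.

```latex
% Dividing by the number of data points (the first calculation):
s_{n}^{2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}
          = \frac{6.5}{6} \approx 1.08

% Dividing by one less than the number of data points
% (the standard definition of sample variance):
s_{n-1}^{2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
            = \frac{6.5}{5} = 1.3
```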