Video transcript
Here is a simulation created by Khan Academy user Justin Helps that once again tries to give us an understanding of why we divide by n minus 1 to get an unbiased estimate of the population variance when we calculate the sample variance. In this simulation, the population has a uniform distribution. He says, I used a flat probabilistic distribution from 0 to 100 for my population. Then we start sampling from that population. We're going to use samples of size 50. And for each of those samples, we calculate the sample variance three ways: dividing by n, dividing by n minus 1, and dividing by n minus 2. And as we take more and more samples, we take the mean of the variances calculated in each of those ways, and we figure out what those means converge to. So that's a sample. Here's another sample. Here's another sample. If I sample here, then I'm now adding a bunch and I'm sampling continuously. And we see something very interesting happen. When I divide by n, even when I'm taking the mean of many, many, many sample variances that I've already taken, I'm still underestimating the true variance. When I divide by n minus 1, it looks like I'm getting a pretty good estimate. The mean of all of my sample variances has really converged to the true variance. When I divide by n minus 2, just for kicks, it's pretty clear that the mean of my sample variances overestimates the true variance. So this gives us a pretty good sense that dividing by n minus 1 is the right thing to do.

Now this is another interesting way of visualizing it. On the horizontal axis right over here, each point is one of our samples, and how far to the right it sits is how much greater that sample's mean is than the true mean. And when we go to the left, it's how much less the sample mean is than the true mean. So for example, this sample right over here, it's all the way over to the right.
The sample mean there was a lot more than the true mean. The sample mean here was a lot less than the true mean. The sample mean here was only a little bit more than the true mean. On the vertical axis, using this denominator, dividing by n, we calculate two different variances. For one variance, we use the sample mean. For the other variance, we use the population mean. And on the vertical axis, we plot the difference between the variance calculated with the sample mean and the variance calculated with the population mean. So for example, at this point right over here, when we calculate our variance with our sample mean, which is the normal way we do it, it significantly underestimates what the variance would have been if somehow we knew what the population mean was and could calculate it that way. And you get this really interesting shape. And it's something to think about. He recommends thinking about why, and what kind of shape this actually is. The other interesting thing is that when you look at it this way, it's pretty clear this entire graph is sitting below the horizontal axis. So when we calculate our sample variance using this formula, when we use the sample mean to do it, which we typically do, we're always getting a lower variance than when we use the population mean. Now this over here, when we divide by n minus 1, we're not always underestimating. Sometimes we are overestimating. And when you take the mean of all of these variances, you converge. And here, dividing by n minus 2, we're overestimating a little bit more.

And just to be clear about what we're talking about in these three graphs, let me take a screenshot and explain it in a little bit more depth. So just to be clear, in this red graph right over here, let me do this in a color that's at least close. So in this orange graph, this distance, for each of these samples, is the sample variance calculated using the sample mean. And in this case, we are using n as our denominator.
That's this case right over here. And from that we're subtracting the sample variance, or I guess you could call this some kind of pseudo sample variance, calculated as if we somehow knew the population mean. This isn't something that you see a lot in statistics. But it's a gauge of how much our sample variance underestimates, given that we don't have the true population mean at our disposal. And so this is the distance we're calculating. And you see that here we're always underestimating. Here we overestimate a little bit, and we also underestimate. But when you take the mean, when you average them all out, it converges to the actual value. So here we're dividing by n minus 1, and here we're dividing by n minus 2.
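
The experiment described in the video can be roughly reproduced in a few lines of Python. This is only a minimal sketch, not Justin Helps's actual simulation. It assumes the "flat" population is a continuous uniform distribution on 0 to 100 (so the true variance is 100²/12), and the function names, the seed, and the number of samples are illustrative choices. It averages the sample variance for the three denominators, and it also computes the vertical-axis quantity from the second plot: the variance about the sample mean minus the pseudo variance about the population mean, both divided by n.

```python
import random


def variance_about(sample, center, denominator):
    """Sum of squared deviations from `center`, divided by `denominator`."""
    return sum((x - center) ** 2 for x in sample) / denominator


def simulate(num_samples=5000, n=50, low=0.0, high=100.0, seed=0):
    """Average the sample variance over many samples for three denominators."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    true_mean = (low + high) / 2
    # Assumption: continuous uniform population on [low, high].
    true_variance = (high - low) ** 2 / 12

    totals = {"n": 0.0, "n-1": 0.0, "n-2": 0.0}
    largest_difference = float("-inf")
    worst_identity_error = 0.0

    for _ in range(num_samples):
        sample = [rng.uniform(low, high) for _ in range(n)]
        sample_mean = sum(sample) / n

        totals["n"] += variance_about(sample, sample_mean, n)
        totals["n-1"] += variance_about(sample, sample_mean, n - 1)
        totals["n-2"] += variance_about(sample, sample_mean, n - 2)

        # Vertical axis of the dividing-by-n plot: variance about the sample
        # mean minus the "pseudo" variance about the population mean.
        difference = (variance_about(sample, sample_mean, n)
                      - variance_about(sample, true_mean, n))
        largest_difference = max(largest_difference, difference)

        # Algebraically, this difference equals -(sample_mean - true_mean)**2,
        # so we track how far the computed value strays from that.
        identity_error = abs(difference + (sample_mean - true_mean) ** 2)
        worst_identity_error = max(worst_identity_error, identity_error)

    return {
        "true_variance": true_variance,
        "mean_variance": {k: t / num_samples for k, t in totals.items()},
        "largest_difference": largest_difference,
        "worst_identity_error": worst_identity_error,
    }


if __name__ == "__main__":
    result = simulate()
    print("true variance:", result["true_variance"])
    for name, value in result["mean_variance"].items():
        print(f"mean of sample variances, dividing by {name}: {value}")
    print("largest (sample-mean minus population-mean) difference:",
          result["largest_difference"])
```

Under these assumptions, the mean for the n denominator should settle below the true variance, the n minus 1 mean close to it, and the n minus 2 mean above it. The difference on the vertical axis is never positive, because it works out to minus the square of the gap between the sample mean and the population mean. That is one way to see why the dividing-by-n graph sits entirely below the horizontal axis.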