
Simulation providing evidence that (n-1) gives us unbiased estimate

UNC‑1 (EU)
UNC‑1.J (LO)
UNC‑1.J.3 (EK)
UNC‑3 (EU)
UNC‑3.I (LO)
UNC‑3.I.1 (EK)

Video transcript

Here's a simulation created by Khan Academy user TETF. I can assume that's pronounced tet f. And what it allows us to do is give us an intuition as to why we divide by n minus 1 when we calculate our sample variance and why that gives us an unbiased estimate of population variance. So the way this starts off, and I encourage you to go try this out yourself, is that you can construct a distribution. It says build a population by clicking in the blue area. So here, we are actually creating a population. So every time I click, it increases the population size. And I'm just randomly doing this, and I encourage you to go onto this scratch pad-- it's on Khan Academy Computer Science-- and try to do it yourself. So here I could stop at some point. So I've constructed a population. I can throw out some random points up here. So this is our population, and as you saw while I was doing that, it was calculating parameters for the population. It was calculating the population mean at 204.09 and also the population standard deviation, which is derived from the population variance. This is the square root of the population variance, and it's at 63.8. It was also plotting the population variance down here. You see it's 63.8, which is the standard deviation, and it's a little harder to see, but it says it's squared. So essentially, 63.8 squared is the population variance. That's interesting by itself, but it really doesn't tell us a lot so far about why we divide by n minus 1. And this is the interesting part. We can now start to take samples, and we can decide what sample size we want. I'll start with the smallest sample size that makes any sense, a sample of two. And what the simulation is going to do is, every time I take a sample, it's going to calculate the variance.
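The population parameters the video is computing can be sketched in a few lines of Python. The clicked-in population from the video isn't available, so this uses a made-up random population (the 204.09 and 63.8 figures above are specific to the video's run); the key point is that the population variance divides by N, since we have every member:

```python
import random

# Hypothetical stand-in for the population clicked in by hand in the video.
random.seed(0)
population = [random.uniform(100, 300) for _ in range(500)]

N = len(population)
pop_mean = sum(population) / N

# Population variance divides by N (not N - 1): no estimation is happening,
# because we can see every member of the population.
pop_var = sum((x - pop_mean) ** 2 for x in population) / N
pop_sd = pop_var ** 0.5  # the SD is the square root of the variance

print(f"population mean = {pop_mean:.2f}")
print(f"population SD   = {pop_sd:.2f}  (variance = SD squared = {pop_var:.2f})")
```

As in the video, the variance shown is just the standard deviation squared.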
So the numerator is going to be the sum, over each of the data points in my sample, of the data point minus my sample mean, squared. And then it's going to divide that by n plus a, and it's going to vary a. It's going to divide it by anywhere between n plus negative 3, so n minus 3, all the way to n plus 3. And we're going to do this many, many, many times. We're going to essentially take the mean of those variances for each a and figure out which a gives us the best estimate. So if I just generate one sample right over there, we see this curve: when we have high values of a, we are underestimating the population variance, and when we have lower values of a, we are overestimating it. But that was just for one sample, not really that meaningful. It's one sample of size two. Let's generate a bunch of samples and then average over many of them. And you see when you look at many, many, many samples, something interesting is happening. When you average together those curves from all of those samples, you see that our best estimate is when a is pretty close to negative 1, when this is n plus negative 1, or n minus 1. Anything less than negative 1-- if we did n minus 1.05 or n minus 1.5-- we start overestimating the variance. Anything greater than negative 1-- so if we divide by n plus 0, that is, by n, or by n plus 0.05, or whatever it might be-- we start underestimating the population variance. And you can do this for samples of different sizes. Let me try a sample size of 6. And here you go once again, as I press-- I'm just keeping Generate Sample pressed down-- as we generate more and more and more samples-- and for all the a's we essentially take the average across those samples of the variance, depending on how we calculate it-- you'll see that once again, our best estimate is pretty darn close to negative 1.
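The divide-by-(n plus a) experiment can be sketched the same way. This is not TETF's actual program (that lives on a Khan Academy Computer Science scratchpad); it's a minimal Python re-creation, assuming a made-up uniform population and averaging each estimator over many random samples of size two. The ratio of the average estimate to the true population variance comes out near 1 only around a = -1:

```python
import random

# Hypothetical stand-in population (not the video's hand-clicked one).
random.seed(1)
population = [random.uniform(100, 300) for _ in range(500)]
N = len(population)
pop_mean = sum(population) / N
pop_var = sum((x - pop_mean) ** 2 for x in population) / N

def biased_variance(sample, a):
    """Variance of `sample` with divisor n + a; a = -1 is the n - 1 rule."""
    n = len(sample)
    m = sum(sample) / n
    return sum((x - m) ** 2 for x in sample) / (n + a)

n = 2           # the smallest sample size that makes sense, as in the video
trials = 20000  # how many samples we average over for each value of a
a_values = [-1.5, -1.0, -0.5, 0.0]

for a in a_values:
    est = sum(biased_variance(random.sample(population, n), a)
              for _ in range(trials)) / trials
    # Ratio < 1 means we underestimate on average; > 1 means we overestimate.
    print(f"a = {a:+.1f}: mean estimate / true variance = {est / pop_var:.3f}")
```

Values of a below -1 (larger than n - 1 in the denominator's reduction) push the ratio above 1, and dividing by n or more pulls it below 1, mirroring the curve the simulation draws.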
And if you were to get this to millions of samples generated, you'll see that your best estimate is when a is negative 1 or when you're dividing by n minus 1. So once again, thanks TETF, tet f, for this. I think it's a really interesting way to think about why we divide by n minus 1.