# Simulation showing bias in sample variance

## Video transcript

Voiceover: This right here is a
simulation that was created by Peter Collingridge using the Khan Academy computer science scratch pad to better understand why
we divide by n minus one when we calculate an
unbiased sample variance, that is, when we are trying to estimate
the true population variance in an unbiased way. So what this simulation does is first it constructs a population distribution, a random one, and every time you go to it, it will be a different
population distribution. This one has a population of 383, and then it calculates the parameters for that population directly from it. The mean is 10 point nine and the
variance is 25 point five. Then it uses that population and samples from it, and it does samples of size two, three, four,
five, all the way up to 10, and it keeps sampling from it and calculates the statistics
for those samples, so the sample mean and
the sample variance, in particular the biased sample variance. It starts telling us some things that give us some intuition. You can actually click on
each of these and zoom in to really be able to study
these graphs in detail. I have already taken a screenshot of this and put it on my little doodle pad, so you can really delve
into some of the math and the intuition of what
this is actually showing us. So here I took a screenshot, and you see for this case right over here, the population size was 529. The population mean was 10 point six, and down here in this chart, he plots the population mean
right here at 10 point six, right over there, and you see that the population variance
is at 36 point eight, and right here he plots that
right over here, 36 point eight. This first chart on the bottom left tells us a couple of interesting things. Just to be clear, this is
the biased sample variance that he is calculating. It is calculated
over all of the data points in each sample: starting with the first data
point, going to the nth data point in the sample,
you take each data point, subtract out the
sample mean, square it, add those squares up, and then divide the whole
thing, not by n minus one, but by lower case n. This tells us several interesting things. The first thing it shows us is that the cases where we are
significantly underestimating the population variance, where we are getting sample
variances close to zero, these are also the cases, or
they are disproportionately the cases, where the
means for those samples are way far off from the true population mean. Or we could say that the other way around: in the cases where the sample mean is way
far off from the population mean, it seems like you're much
more likely to underestimate the population variance. The other thing that might pop out at you is the realization that the pinker dots are the ones for smaller sample sizes, while the bluer dots are the
ones for larger sample sizes. You see here that these two little, I guess, tails, so
to speak, of this hump, these ends, are
more of a reddish color, and that most of the bluish
or purplish dots are focused right in the
middle right over here, where they are giving us better estimates. There are some red ones here, and that's why it gives
us that purplish color, but out here on these tails, it's almost purely red. Every now and then by happenstance you get a little blue one, but it's disproportionately far more red. That really makes sense: when you have a smaller sample size, you are more likely to get a sample mean that is a bad estimate
of the population mean, one that's far from the population mean, and you're more likely to significantly underestimate
the population variance. Now this next chart really
gets to the meat of the issue, because what this is telling us is that for each of these sample sizes, so this right over here
for sample size two, if we keep taking samples of size two, and we keep calculating
the biased sample variances and dividing them by
the population variance, and finding the mean over all of those, you see that over many, many, many trials, and many, many samples of size two, the mean of the biased sample variance
over the population variance is approaching one half. On average, the biased sample variance is only half of the
true population variance.
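
The procedure just described can be sketched in a few lines of Python. This is only a minimal sketch, not Peter's actual simulation: the population (400 random integers from 1 to 20), the fixed seed, the trial count, drawing samples with replacement, and the helper names `biased_variance` and `mean_variance_ratio` are all assumptions made for illustration.

```python
import random

def biased_variance(values):
    """Variance that divides by n (not n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def mean_variance_ratio(population, n, trials, rng):
    """Average of (biased sample variance / population variance)
    over many samples of size n."""
    # Dividing by N is the right thing for a whole population.
    pop_var = biased_variance(population)
    total = 0.0
    for _ in range(trials):
        # Assumption: each sample is drawn with replacement.
        sample = [rng.choice(population) for _ in range(n)]
        total += biased_variance(sample) / pop_var
    return total / trials

rng = random.Random(0)  # fixed seed so the run is repeatable
population = [rng.randint(1, 20) for _ in range(400)]  # hypothetical population

ratios = {n: mean_variance_ratio(population, n, 20000, rng) for n in (2, 3, 4)}
for n, ratio in ratios.items():
    # Print the simulated ratio next to (n - 1) / n for comparison.
    print(n, round(ratio, 3), "vs", round((n - 1) / n, 3))
```

With enough trials, each simulated ratio should land close to the fraction printed next to it.
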
true population variance. When sample size is three,
it's approaching 2/3, 66 point six percent, of the
true population variance. When sample size is four, it's approaching 3/4 of the
true population variance. So we can come up with the
general theme that's happening. When we use the biased estimate, we're not approaching
the population variance. We're approaching n minus one over n times the population variance. When n was two, this approached 1/2. When n is three, this is 2/3. When n is four, this is 3/4. So this is giving us a biased estimate. So how would we unbias this? Well, if we really want
to get our best estimate of the true population variance, not n minus one over n times
the population variance, we would want to multiply, I'll do this in a color
I haven't used yet, we would want to multiply
times n over n minus one to get an unbiased estimate. Here, these cancel out
and you are just left with your population variance. That's what we want to estimate. Over here you are left
with our unbiased estimate of population variance, our unbiased sample variance, which is equal to the sum of the squared differences from the sample mean, divided by n minus one. This is what we saw in the last several videos and what
you see in statistics books. Sometimes it's confusing why, but hopefully Peter's simulation
gives you a good idea of why, or at least convinces
you that it is the case. So you would want to
divide by n minus one.
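
Written out, the pattern from the simulation and the correction look roughly like this. The notation is an assumption, since the transcript only describes the formulas in words: $S_n^2$ is the biased sample variance, $S_{n-1}^2$ the unbiased one, $\bar{x}$ the sample mean, and $\sigma^2$ the population variance. The expected-value statements hold for independent draws from the population.

```latex
\begin{align*}
S_n^2 &= \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2,
  & E\!\left[S_n^2\right] &= \frac{n-1}{n}\,\sigma^2 \\[4pt]
S_{n-1}^2 &= \frac{n}{n-1}\,S_n^2
  = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2,
  & E\!\left[S_{n-1}^2\right] &= \frac{n}{n-1}\cdot\frac{n-1}{n}\,\sigma^2 = \sigma^2
\end{align*}
```

Multiplying by n over n minus one turns the 1/n in front of the sum into 1/(n minus one), which is where the division by n minus one comes from.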