Here is a simulation created
by Khan Academy user Justin Helps that once again tries to
give us an understanding of why we divide by n minus 1 to get an
unbiased estimate of population variance when we're trying to
calculate the sample variance. So what he does here in the simulation is start with a population that has a uniform distribution. He says, "I used a flat probabilistic distribution from 0 to 100 for my population." Then we start sampling
from that population. We're going to use
samples of size 50. And what we do is, for each of those samples, calculate the sample variance three ways: dividing by n, dividing by n minus 1, and dividing by n minus 2. And as we keep taking more
and more and more samples, we take the mean of the
variances calculated in different ways. And we figure out what
those means converge to. So that's a sample. Here's another sample. Here's another sample. If I sample here, I'm now adding a bunch of samples at once, and then I'm sampling continuously. And he saw something very interesting happen. When I divide by n, even when I'm taking the mean of many, many, many sample variances that I've already taken, I'm still underestimating the true variance. When I divide by n
minus 1, it looks like I'm getting a pretty good estimate: the mean of all of my sample variances has really converged to the true variance. When I divided by n
minus 2, just for kicks, it's pretty clear that I overestimated: the mean of my sample variances came out above the true variance. So this gives us a pretty good sense that n minus 1 is the right thing to do.
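If you want to replay the experiment yourself, here is a minimal sketch of the same idea in Python. This is my own code, not Justin Helps' actual simulation; the sample size matches the one described above, and the number of samples is an arbitrary choice:

```python
# Sample repeatedly from a flat distribution on [0, 100], compute each
# sample's variance with denominators n, n-1, and n-2, and report what
# the running means of those variances converge to.
import random

SAMPLE_SIZE = 50       # n, as in the video
NUM_SAMPLES = 100_000  # arbitrary; more samples -> tighter convergence

true_variance = (100 - 0) ** 2 / 12  # variance of Uniform(0, 100)

totals = {"n": 0.0, "n-1": 0.0, "n-2": 0.0}
for _ in range(NUM_SAMPLES):
    sample = [random.uniform(0, 100) for _ in range(SAMPLE_SIZE)]
    mean = sum(sample) / SAMPLE_SIZE
    ss = sum((x - mean) ** 2 for x in sample)  # sum of squared deviations
    totals["n"] += ss / SAMPLE_SIZE
    totals["n-1"] += ss / (SAMPLE_SIZE - 1)
    totals["n-2"] += ss / (SAMPLE_SIZE - 2)

print(f"true variance: {true_variance:.2f}")
for divisor, total in totals.items():
    print(f"divide by {divisor}: mean of sample variances = {total / NUM_SAMPLES:.2f}")
```

With enough samples, dividing by n should settle near 49/50 of the true variance (about 817 here, versus the true 833.33), dividing by n minus 1 near the true variance itself, and dividing by n minus 2 above it.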
Now, this is another interesting way of visualizing it. On the horizontal axis right over here, each dot is one of our samples. How far a dot sits to the right shows how much more that sample's mean is than the true mean, and how far it sits to the left shows how much less the sample mean is than the true mean. So for example, this
sample right over here, it's all the way over to the right: its sample mean was a lot more than the true mean. The sample mean here was a lot less than the true mean. The sample mean here was only a little bit more than the true mean. On the vertical axis, using
this denominator, dividing by n, we calculate two
different variances. For one variance, we use the sample mean. For the other variance, we use the population mean. And on the vertical axis, we plot the difference between the variance calculated with the sample mean and the variance calculated with the population mean. So for example, at this point right over here, when we calculate our variance with our sample mean, which is the normal way we do it, it significantly underestimates what the variance would have been if somehow we knew the population mean and could calculate it that way. And you get this really interesting shape, and it's something to think about. He recommends thinking about why, and about what kind of shape this actually is. The other interesting thing is
when you look at it this way, it's pretty clear that this entire graph is sitting below the horizontal axis. So when we calculate our sample variance using this formula, using the sample mean to do it, which we typically do, we're always getting a lower variance than when we use the population mean.
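A quick bit of algebra, not worked through in the video, explains both observations, assuming the horizontal coordinate of a dot is $\bar{x}-\mu$ (sample mean minus population mean) and the vertical coordinate is the difference of the two divide-by-$n$ variances:

$$\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2 \;-\; \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2 \;=\; -(\bar{x}-\mu)^2$$

So every dot lies on the downward-opening parabola $y=-x^2$, which is why the cloud never rises above the axis. And since $E[(\bar{x}-\mu)^2]=\sigma^2/n$, the divide-by-$n$ variance falls short of $\sigma^2$ by $\sigma^2/n$ on average, which is exactly the bias that dividing by $n-1$ corrects.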
Now, this over here, when we divide by n minus 1, we're not always underestimating; sometimes we're overestimating. And when you take the mean of all of these variances, it converges to the true variance. And here, we're overestimating it a little bit more. And just to be clear what we're
talking about in these three graphs, let me take a screenshot and explain it in a little more depth. So just to be clear, in this red graph right over here, or this orange, a color close to it at least: what this distance is, for each of these samples, is the following. We're calculating the sample variance using the sample mean, and in this case, using n as our denominator. And from that, we're subtracting the sample variance, or I guess you could call it some kind of pseudo sample variance, that we would get if we somehow knew the population mean. This isn't something you see a lot in statistics, but it's a gauge of how much we are underestimating our sample variance given that we don't have the true population mean at our disposal. And so this is the distance. This is the distance
we're calculating. And you see we're always underestimating. Here we overestimate a little bit, and here we underestimate. But when you take the mean, when you average them all out, it converges to the actual value. So here we're dividing by n minus 1, and here we're dividing by n minus 2.
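And here, as a sketch in code (again my own, with the same caveats as before), is the pair of numbers each dot in the divide-by-n graph represents: the mean error on the horizontal axis, and the variance about the sample mean minus the pseudo variance about the known population mean on the vertical axis:

```python
# For a handful of samples, compute a dot's coordinates in the
# divide-by-n scatter plot: x = mean error, y = variance gap.
import random

SAMPLE_SIZE = 50
POP_MEAN = 50.0  # known, since we built the Uniform(0, 100) population

for _ in range(5):
    sample = [random.uniform(0, 100) for _ in range(SAMPLE_SIZE)]
    sample_mean = sum(sample) / SAMPLE_SIZE
    # Variance about the sample mean (the normal way), divisor n.
    var_sample_mean = sum((x - sample_mean) ** 2 for x in sample) / SAMPLE_SIZE
    # "Pseudo" variance about the true population mean, divisor n.
    var_pop_mean = sum((x - POP_MEAN) ** 2 for x in sample) / SAMPLE_SIZE
    x = sample_mean - POP_MEAN          # horizontal coordinate
    y = var_sample_mean - var_pop_mean  # vertical coordinate, always <= 0
    print(f"x = {x:+8.3f}   y = {y:+9.3f}   -x^2 = {-x * x:+9.3f}")
```

Every printed y should match -x^2 up to floating-point rounding, which is exactly the parabola from the identity above.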