Why we divide by n - 1 in variance
Another visualization providing evidence that dividing by n-1 truly gives an unbiased estimate of population variance. Simulation at: http://www.khanacademy.org/cs/unbiased-variance-visualization/1167453164. Created by Sal Khan.
Want to join the conversation?
- Is there a concrete logical or mathematical proof using simple maths behind this?(15 votes)
- There is a concrete mathematical proof. Whether or not it uses "simple" math depends on what you think is simple vs. not-simple math. The proof requires you to understand the following:
1. Expected values of probability distributions.
2. Expected values of sums of independent random variables.
If you are comfortable with these two things, the proof is easily accessible. If you are not, the proof may seem like picking things out of thin air. In the math below, I'm going to use a dot (•) to represent multiplication; the asterisk sometimes causes formatting issues, and × is too easily confused with the variable x.
Our goal with the sample variance is to provide an estimate of the population variance that will be correct on average. Taking different samples will result in different values of s², but if we take a lot of samples and record s² each time, we want that distribution to be centered on σ². Since s² is a random variable (different samples will result in different values), we write this mathematically by saying that the expected value of s² should be equal to σ²:
E[ s² ] = σ²
One thing we need to assume is that all observations are independent and identically distributed, meaning that they all come from a population with the same mean µ and the same variance σ².
First, we're going to need a little side derivation. For a random variable X with E[X] = µ, the variance is
σ² = E[ (X - µ)² ]
Expanding the square, we get:
σ² = E[ X² - 2•X•µ + µ² ]
σ² = E[X²] - 2µ•E[X] + µ²
σ² = E[X²] - 2µ² + µ²
σ² = E[X²] - µ²
E[X²] = µ² + σ²
We will get to a point where we need E[X²], so just keep this in your back pocket for the moment. Now let's get back to E[ s² ]. To start, just substitute in the definition of the sample variance:
E[ s² ] = E[ Σ (xi - xbar)² / (n-1) ]
Since (n-1) is a constant, it can be pulled out of the expected value:
E[ s² ] = (1/(n-1)) E[ Σ (xi - xbar)² ]
Next, expand the square:
E[ s² ] = (1/(n-1)) E[ Σ (xi² - 2•xi•xbar + xbar²) ]
Since summation distributes across addition and subtraction, we get three separate sums:
E[ s² ] = (1/(n-1)) E[ Σxi² - Σ2•xi•xbar + Σxbar² ]
Now, xbar and xbar² are constant with respect to their summations, so they can be pulled out (and summing the constant xbar² over n terms gives n•xbar²):
E[ s² ] = (1/(n-1)) E[ Σxi² - 2•xbar•Σxi + n•xbar² ]
Also, note that since
xbar = (1/n) Σxi
we can multiply each side by n to get
Σxi = n•xbar
This is a useful little trick. Substituting it in:
E[ s² ] = (1/(n-1)) E[ Σxi² - 2•xbar•n•xbar + n•xbar² ]
Combine the second and third terms:
E[ s² ] = (1/(n-1)) • E[ Σxi² - n•xbar² ]
Now, the expected value distributes over addition and subtraction to give us:
E[ s² ] = (1/(n-1)) • [ ΣE[xi²] - n•E[xbar²] ]
Remember that little identity we derived earlier and put in our back pocket? We need it now. We have two squared random variables, xi and xbar, for which we need expectations. For the first, the identity gives E[xi²] = µ² + σ² directly. The second is a little different, because we need the mean and variance of the sampling distribution of the sample mean, which are µ and σ²/n, respectively. So for the second term we have:
E[xbar²] = µ² + σ²/n
Substituting these values in above, we have:
E[ s² ] = (1/(n-1)) • [ Σ (µ² + σ²) - n•(µ² + σ²/n) ]
We can do this because E[xi²] is the same for every xi (we assumed earlier that the x's are independent and identically distributed, so E[xi²] doesn't depend on i). Now nothing inside the summation depends on i anymore; we are just adding the same constant n times, so the sum becomes multiplication by n:
E[ s² ] = (1/(n-1)) • [ n•(µ² + σ²) - n•(µ² + σ²/n) ]
Then distribute the multiplication by n over the parentheses:
E[ s² ] = (1/(n-1)) • [ n•µ² + n•σ² - n•µ² - n•σ²/n ]
And simplify:
E[ s² ] = (1/(n-1)) • [ n•σ² - σ² ]
E[ s² ] = (1/(n-1)) • [ (n-1)•σ² ]
E[ s² ] = σ²
Voila! We are done, and we have proven that E[ s² ] = σ². If, going back to the beginning, we had divided by n in the denominator instead of by n-1, that factor would have carried through to the end, and the result would have been:
E[ s² ] = [(n-1)/n] • σ²
This is not exactly equal to σ²; it is slightly smaller, because the ratio (n-1)/n is less than 1.(74 votes)
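As a quick numerical sanity check on this result, here is a small NumPy sketch (not from the video; the population parameters, sample size, and trial count are arbitrary choices for illustration). Averaged over many samples, the n-1 version should come out near σ², and the divide-by-n version near (n-1)/n • σ².

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 50.0, 10.0          # population mean and standard deviation (arbitrary)
n, trials = 10, 100_000         # small samples, many repetitions

samples = rng.normal(mu, sigma, size=(trials, n))
s2_unbiased = samples.var(axis=1, ddof=1)   # divide by n - 1
s2_biased   = samples.var(axis=1, ddof=0)   # divide by n

print(np.mean(s2_unbiased))   # ≈ 100.0 = sigma²
print(np.mean(s2_biased))     # ≈  90.0 = (n-1)/n • sigma²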
- I get these various "intuitions" about why n-1 is better, but I have two questions:
1.) Who figured this out in the first place, and how did they do so? Presumably they didn't run computer simulations; did they tediously do a lot of simulations by hand?
2.) Is there a mathematical proof that n-1 is better, or is it all based on intuition and empirically experimenting with different ways of getting the least biased sample variance?(18 votes)
- I kind of feel some of these videos for this section are not ordered correctly.(16 votes)
- I think this process of n-1 'unbiases' the estimation of variation of data because of the nature of collecting data. In real life, there is going to be much more variation of things than you will ever see in a sample group. Take height, for instance.
There are so many possible heights, from very short to amazingly tall. However, if we wanted to find out what the average height is, the actual odds that we will meet the extremely tall and extremely short people are low, because they are what people call 'statistical outliers'. Because these people are rare, you probably won't be able to include them in your list of people's heights, so naturally you're going to meet a less diverse group of people. This will mean your data has less variation than real life does. Therefore your data underestimates the variation of real life, because the odds of finding certain types of people are greater or smaller.(9 votes)
- A reasonable thought, but it's not really the reason. The reason dividing by n-1 corrects the bias is that we are using the sample mean, instead of the population mean, to calculate the variance. Since the sample mean is based on the data, it will get drawn toward the center of mass of the data. In other words, using the sample mean to calculate the variance is too specific to the dataset. If we were able to use the population mean instead of the sample mean, there would be no bias. (See the short identity just below this answer.)(9 votes)
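One way to make this precise, as a short worked identity added here for illustration (using the same notation as the proof above, with c standing for any fixed number):
Σ (xi - c)² = Σ (xi - xbar)² + n•(xbar - c)²
because the cross term 2•(xbar - c)•Σ(xi - xbar) is zero. Setting c = µ shows that Σ (xi - µ)² is always at least Σ (xi - xbar)², so the variance computed around the sample mean can never exceed the one computed around the true mean. That built-in shortfall is exactly the downward bias that dividing by n-1 compensates for.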
- Initially I found this confusing, but here's a restatement:
Makes perfect sense (once I read Justin Helps' explanation, which is available at https://www.khanacademy.org/computer-programming/unbiased-variance-visualization/1167453164). Basically, when the sample mean is the same as the population mean (the center of the charts), then the sample variance is also the same as the "sample variance calculated using the true population mean" (a weird statistic, but it lets you see why n-1 works). However, when the sample variance is calculated using the sample mean (the normal way), it differs increasingly (by a negative amount, as the variance is being underestimated) from the "sample variance calculated using the true population mean" (the weird statistic Sal refers to as the "pseudo-sample variance"). Thus, the charts on the left show sample variance against true population variance, and the charts on the right show sample variance against the "pseudo-sample variance", a statistic that is a hybrid of true population variance and sample variance.(6 votes)
- At 4:12, when he is subtracting the term that uses the population mean, shouldn't the denominator be N instead of n, since that part of the formula is about the population and not just the sample?(5 votes)
- He says he is actually subtracting from the sample variance a "pseudo-sample variance," using the true mean but changing nothing else. Therefore, the denominator is n for both expressions in the red graph (and would be n-1 for the blue one and n-2 for the green one).
If he was finding the difference between the sample and population variances, you would be correct. But for this simulation, he uses the "pseudo-sample variance" to best demonstrate the unbiased estimate.(2 votes)
- So, what is the significance of the number 1 with respect to unbiased sample variance? What I mean is, it just seems a bit odd that it would conveniently converge to a whole number such as 1. Is there a simple answer or is it a mystical property akin to Euler's identity?(6 votes)
- If you look at the sample variance for lots and lots of random samples, and take the average of all those different variances, that average will tend to agree with the true population variance. It will tend to agree more as you consider more samples. Formally, we say that the "expectation value" of the sample variance is equal to the population variance. This Wikipedia article shows a proof of why this is true: https://en.wikipedia.org/wiki/Variance#Sample_variance That proof also shows where the factor of n-1 comes from.(1 vote)
- I still don't understand how the computer program calculates the "pseudo-sample variance" at 4:10 if we don't know mu's value. Can someone please explain?(2 votes)
- In real life we generally don't know the value of μ. However, in a simulation, we are making up the data, and we do in fact know μ. What we're doing is:
1. Set μ and create some data from a distribution with that mean.
2. Pretend that we don't know μ, and calculate the mean and standard deviation.
3. Remember that we know μ, and perform the calculations shown in the video.(8 votes)
- didn't understand a thing(4 votes)
- At 4:23, it should have been Σ(xi - x̅)^2/N - Σ(xi - x̅)^2/n, shouldn't it?(2 votes)
- Σ(xi - x̅)^2/n - Σ(xi - μ)^2/N, yep. Sal makes a lot of typos.(1 vote)
Video transcript
Here is a simulation created
by Khan Academy user Justin Helps that once again tries to
give us an understanding of why we divide by n minus 1 to get an
unbiased estimate of population variance when we're trying to
calculate the sample variance. So what he does
here, the simulation, it has a population that
has a uniform distribution. So he says, I used a flat
probabilistic distribution from 0 to 100 for my population. Then we start sampling
from that population. We're going to use
samples of size 50. And what we do is for
each of those samples, we calculate the sample
variance based on dividing by n, by dividing by n
minus 1 and n minus 2. And as we keep having more
and more and more samples, we take the mean of the
variances calculated in different ways. And we figure out what
those means converge to. So that's a sample. Here's another sample. Here's another sample. If I sample here then
I'm now adding a bunch and I'm sampling continuously. And he saw something
very interesting happen. When I divide by n, my sample variance, even when I'm taking the mean of many, many, many sample variances that I've already taken, is still underestimating the true variance. When I divide by n
minus 1, it looks like I'm getting a
pretty good estimate, the mean of all of
my sample variances is really converged
to the true variance. When I divided by n
minus 2 just for kicks, it's pretty clear
that I overestimated with my mean of my
sample variances, I overestimated
the true variance. So this gives us a pretty
good sense that n minus 1 is the right thing to do. Now this is another interesting way of visualizing it. On the horizontal axis right over here, each plot is one of our samples, and how far to the right it sits is how much more that sample mean is than the true mean; how far to the left, how much less a sample mean is than the true mean. So for example, this
sample right over here, it's all the way
over to the right. It's the sample mean there was
a lot more than the true mean. Sample mean here was a lot
less than the true mean. Sample mean here only a little
bit more than the true mean. On the vertical axis, using this denominator, dividing by n, we calculate two different variances. For one variance, we use the sample mean. For the other variance, we use the population mean. And on the vertical axis, we compare the difference between the variance calculated with the sample mean versus the variance calculated with the population mean. So for example, this point right over here: when we calculate our variance with our sample mean, which is the normal way we do it, it significantly underestimates what the variance would have been if somehow we knew what the population mean was and we could calculate it that way. And you get this really
interesting shape. And it's something
to think about. And he recommends some
thinking about why or what kind of a shape this actually is. The other interesting thing is
when you look at it this way, it's pretty clear
this entire graph is sitting below
the horizontal axis. So when we calculate our sample variance using this formula, when we use the sample mean to do it, which we typically do, we're
always getting a lower variance than when we use
the population mean. Now this over here, when
we divide by n minus 1, we're not always
underestimating. Sometimes we are
overestimating it. And when you take the mean
of all of these variances, you converge. And here we're overestimating
it a little bit more. And just to be clear what we're
talking about in these three graphs, let me take
a screen shot of it and explain it in a
little bit more depth. So just to be clear, in this
red graph right over here, let me use a color close to it, so this orange. What this distance is, for each of these samples: we're calculating the sample variance using the sample mean, and in this case we are using n as our denominator, right over here. And from that we're subtracting
the sample variance, or I guess you could call this
some kind of pseudo sample variance, if we somehow
knew the population mean. This isn't something that
you see a lot in statistics. But it's a gauge of how much we
are underestimating our sample variance given that we don't
have the true population mean at our disposal. And so this is the distance. This is the distance
we're calculating. And you see we're
always underestimating. Here we overestimate
a little bit. And we also underestimate. But when you take the mean,
when you average them all out, it converges to
the actual value. So here we're
dividing by n minus 1, here we're dividing
by n minus 2.
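For readers who want to poke at this themselves, here is a minimal Python sketch of the experiment the transcript describes (this is not Justin Helps' original program; the number of samples drawn and the random seed are arbitrary choices): draw many samples of size 50 from a flat distribution on 0 to 100, average the variances computed with divisors n, n-1, and n-2, and compare the usual sample-mean-based formula with the "pseudo-sample variance" computed around the true mean.

import numpy as np

rng = np.random.default_rng(1)
low, high = 0.0, 100.0                    # flat population, as in the video
mu = (low + high) / 2                     # true population mean = 50
sigma2 = (high - low) ** 2 / 12           # true population variance ≈ 833.3
n, num_samples = 50, 20_000               # sample size 50 from the video; sample count is arbitrary

samples = rng.uniform(low, high, size=(num_samples, n))
xbar = samples.mean(axis=1)
ss_sample_mean = ((samples - xbar[:, None]) ** 2).sum(axis=1)   # Σ(xi - xbar)² for each sample

for divisor in (n, n - 1, n - 2):
    avg = np.mean(ss_sample_mean / divisor)
    print(f"divide by {divisor:>2}: mean of variances = {avg:7.1f}   (true variance = {sigma2:.1f})")

# "Pseudo-sample variance": the same sum of squares, but around the true mean mu instead of xbar.
pseudo = ((samples - mu) ** 2).sum(axis=1) / n
diff = ss_sample_mean / n - pseudo
print("largest (sample-mean version) - (true-mean version):", diff.max())   # never above 0, up to rounding

The divide-by-n average should settle below the true variance, the divide-by-(n-1) average should settle on it, and the divide-by-(n-2) average should overshoot, matching what the transcript describes; the last line shows that the sample-mean version never exceeds the true-mean version, which is the "always below the horizontal axis" behavior in the second visualization.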