Why we divide by n - 1 in variance
Another visualization providing evidence that dividing by n-1 truly gives an unbiased estimate of population variance. Simulation at: http://www.khanacademy.org/cs/unbiased-variance-visualization/1167453164. Created by Sal Khan.
- Is there a concrete logical or mathematical proof using simple maths behind this?(15 votes)
- There is a concrete mathematical proof. Whether or not it uses "simple" math depends on what you think is simple vs. not-simple math. The proof requires you to understand the following:
1. Expected values of probability distributions.
2. Expected values of sums of independent random variables.
If you are comfortable with these ideas, the proof is easily accessible. If you are not comfortable with them, the proof may seem like picking things out of thin air. In the math, I'm going to use a dot (•) to represent multiplication. The asterisk sometimes causes issues with formatting, and the × is too easily confused with an x.
Our goal with the sample variance is to provide an estimate of the population variance that will be correct on average. Taking different samples will result in different values of s², but if we take a lot of samples, and record s² each time, we want that distribution to be centered on σ². Since s² is a random variable (different samples will result in different values), we write this mathematically as saying that the expected value of s² should be equal to σ²:
E[ s² ] = σ²
One thing we need to assume is that all observations are independent, and identically distributed - meaning that they all come from a population with the same mean µ and the same variance σ².
First, we're going to need a little side-derivation. For a random variable X, the variance,
σ² = E[ (X - µ)² ], where E[X]=µ. Expanding the square, we get:
σ² = E[X² - 2•X•µ + µ² ]
σ² = E[X²] - 2µE[ X ] + µ²
σ² = E[X²] - µ²
E[X²] = µ² + σ²
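As a quick sanity check of this side-derivation, here is a minimal Python sketch that verifies E[X²] = µ² + σ² exactly for one concrete case. The fair six-sided die and the helper name `moments` are illustrative assumptions, not part of the original proof.

```python
from fractions import Fraction

def moments(values):
    """Return (mu, sigma2, E[X^2]) for X uniform on `values` (hypothetical helper)."""
    n = len(values)
    mu = Fraction(sum(values), n)                      # E[X]
    ex2 = Fraction(sum(v * v for v in values), n)      # E[X^2]
    sigma2 = sum((v - mu) ** 2 for v in values) / n    # E[(X - mu)^2]
    return mu, sigma2, ex2

# Assumed example population: a fair six-sided die.
mu, sigma2, ex2 = moments(range(1, 7))
print(ex2, mu ** 2 + sigma2)  # prints: 91/6 91/6
```

Exact fractions are used so the two sides can be compared with `==` rather than with a floating-point tolerance.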
We will get to a point where we need E[X²], so just keep this in your back pocket for the moment. Now let's get back to E[ s² ]. To start, just substitute in the definition for the sample variance:
E[ s² ] = E[ Σ (xi - xbar)² / (n-1) ]
Now, since (n-1) is a constant, it can be pulled out of the expected value. I'm also going to expand the squared term.
E[ s² ] = (1/(n-1)) E[ Σ (xi - xbar)² ]
First, expand the square:
E[ s² ] = (1/(n-1)) E[ Σ xi² - 2•xi•xbar + xbar² ]
Summations can be distributed across addition and subtraction, so we get three separate sums:
E[ s² ] = (1/(n-1)) E[ Σ xi² - Σ2•xi•xbar + Σxbar² ]
Now, xbar and xbar² are constant with respect to their summations, so they can get pulled out:
E[ s² ] = (1/(n-1)) E[ Σxi² - 2•xbar•Σxi + n•xbar² ]
Also, note that since
xbar = (1/n) Σ xi, we can multiply each side by n to get
Σ xi = n•xbar. This is a useful little trick.
E[ s² ] = (1/(n-1)) E[ Σxi² - 2•xbar•n•xbar + n•xbar² ]
Combine the second and third terms:
E[ s² ] = (1/(n-1)) • E[ Σxi² - n•xbar² ]
Now, the expected value can distribute over addition and subtraction to get us:
E[ s² ] = (1/(n-1)) • [ ΣE[xi²] - n•E[xbar²] ]
Remember that little thing we derived earlier and put in our back pocket? We need it now. We have two random variables, xi and xbar, that are squared, and for which we need the expectation. So: E[X²] = µ² + σ². The second one is a little different, because we need the mean and variance of the sampling distribution of the sample mean. These are µ and σ²/n, respectively. So for the second term we have:
E[xbar²] = µ² + σ²/n.
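For readers who want to see where µ and σ²/n come from, the step can be written out as follows. This is a sketch using only the i.i.d. assumption stated earlier; independence is what lets the variance of the sum split into a sum of variances in the second line.

```latex
\begin{align*}
E[\bar{x}] &= \frac{1}{n}\sum_{i=1}^{n} E[x_i] = \frac{1}{n}\cdot n\mu = \mu \\
\operatorname{Var}(\bar{x}) &= \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(x_i)
  = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n} \\
E[\bar{x}^2] &= \left(E[\bar{x}]\right)^2 + \operatorname{Var}(\bar{x})
  = \mu^2 + \frac{\sigma^2}{n}
\end{align*}
```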
Substituting these values in above, we have:
E[ s² ] = (1/(n-1)) • [ Σ (µ² + σ²) - n•(µ² + σ²/n) ]
We can do this because E[xi²] is the same for every xi (we assumed earlier that the x's are independent and identically distributed, so E[xi²] doesn't depend on the i part). Now nothing depends on the summation index anymore; we are just adding the same constant n times, so we can just multiply by n:
E[ s² ] = (1/(n-1)) • [n•(µ² + σ²) - n•(µ² + σ²/n) ]
Then distribute the n multiplication over the parentheses:
E[ s² ] = (1/(n-1)) • [n•µ² + n•σ² - n•µ² - n•σ²/n ]
E[ s² ] = (1/(n-1)) • [ n•σ² - σ² ]
E[ s² ] = (1/(n-1)) • [ (n-1)•σ² ]
E[ s² ] = σ²
Voila! We are done, and we have proven that
E[ s² ] = σ². If, going back to the beginning, we had divided by n in the denominator instead of by n-1, that would have carried through to the end, and the result would have been:
E[ s² ] = [(n-1)/n] • σ²
Which is not exactly equal to σ², it is slightly smaller, because the ratio (n-1)/n is less than 1.(74 votes)
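The end result can also be checked exactly on a small example. The sketch below is an illustration under assumed choices (a fair six-sided die as the population, samples of size n = 3, and a hypothetical helper `expected_variance_estimate`). It enumerates every equally likely sample and averages the variance estimate for each denominator.

```python
from fractions import Fraction
from itertools import product

def expected_variance_estimate(values, n, denominator):
    """Exact E[ sum((xi - xbar)^2) / denominator ] over all equally likely
    i.i.d. samples of size n drawn from `values` (hypothetical helper)."""
    values = list(values)
    total = Fraction(0)
    count = 0
    for sample in product(values, repeat=n):   # every possible ordered sample
        xbar = Fraction(sum(sample), n)
        total += sum((x - xbar) ** 2 for x in sample) / denominator
        count += 1
    return total / count

die = range(1, 7)   # assumed population; its variance is 35/12
n = 3
print(expected_variance_estimate(die, n, n - 1))  # prints: 35/12
print(expected_variance_estimate(die, n, n))      # prints: 35/18
```

Dividing by n - 1 recovers σ² = 35/12 exactly, while dividing by n gives 35/18, which is (n-1)/n = 2/3 of σ², matching the formula above.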
- I get these various "intuitions" about why n-1 is better, but I have two questions:
1.) Who figured this out in the first place, and how did they do so? Presumably they didn't run computer simulations; did they tediously do a lot of simulations by hand?
2.) Is there a mathematical proof that n-1 is better, or is it all based on intuition and empirically experimenting with different ways of getting the least biased sample variance?(18 votes)
- I kind of feel some of these videos for this section are not ordered correctly.(16 votes)
- I think this process of n-1 'unbiases' the estimation of variation of data because of the nature of collecting data. In real life, there is going to be much more variation of things than you will ever see in a sample group. Take height, for instance.
There are so many possible heights, from very short to amazingly tall. However, if we wanted to find out what the average height is, we are unlikely to meet the extremely tall and extremely short people, because they are what people call 'statistical outliers'. Because these people are rare, you probably won't be able to include them in your list of people's heights, so naturally you're going to meet a less diverse group of people. This will mean your data has less variation than real life does. Therefore your data is underestimating the variation of real life, because some types of people are more or less likely to be found than others.(9 votes)
- A reasonable thought, but it's not really the reason. The reason dividing by n-1 corrects the bias is that we are using the sample mean, instead of the population mean, to calculate the variance. Since the sample mean is based on the data, it will get drawn toward the center of mass for the data. In other words, using the sample mean to calculate the variance is too specific to the dataset. If we were able to use the population mean instead of the sample mean, there would be no bias.(9 votes)
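The claim in this reply can be illustrated with a small exact calculation. The sketch below assumes a fair six-sided die as the population and samples of size 3 (both illustrative choices, as is the helper name `expected_mean_square`). It always divides by n and only changes which mean the deviations are measured from.

```python
from fractions import Fraction
from itertools import product

def expected_mean_square(values, n, use_population_mean):
    """Exact E[ sum((xi - c)^2) / n ], where c is the population mean or the
    sample mean, over all equally likely samples of size n (hypothetical helper)."""
    values = list(values)
    mu = Fraction(sum(values), len(values))
    total, count = Fraction(0), 0
    for sample in product(values, repeat=n):
        center = mu if use_population_mean else Fraction(sum(sample), n)
        total += sum((x - center) ** 2 for x in sample) / n
        count += 1
    return total / count

die = range(1, 7)  # assumed population; its variance is 35/12
print(expected_mean_square(die, 3, use_population_mean=True))   # prints: 35/12
print(expected_mean_square(die, 3, use_population_mean=False))  # prints: 35/18
```

With the population mean, dividing by n is already unbiased. The shortfall appears only when the sample mean is substituted for it.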
- Initially I found this confusing, but here's a restatement:
Makes perfect sense (once I read Justin Helps's explanation, which is available at https://www.khanacademy.org/computer-programming/unbiased-variance-visualization/1167453164). Basically, when the sample mean is the same as the population mean (center of the charts), the sample variance is also the same as the “sample variance calculated using the true population mean” (this is a weird statistic, but it allows you to see why n-1 works). However, when the sample variance is calculated using the sample mean (the normal way), it differs increasingly (by a negative amount, as variance is being underestimated) from the “sample variance calculated using the true population mean” (the weird statistic, which Sal refers to as a “pseudo sample variance”) as the sample mean moves away from the population mean. Thus, the charts on the left show sample variance against true population variance, and the charts on the right show sample variance against the “pseudo sample variance”, a statistic that is a hybrid of true population variance and sample variance.(6 votes)
- At 4:12, when he is subtracting by the population mean, shouldn't the denominator be N instead of n, since in this part of the formula we are talking about the population and not just the sample?(5 votes)
- He says he is actually subtracting from the sample variance a "pseudo-sample variance," using the true mean but changing nothing else. Therefore, the denominator is n for both expressions for the red graph (and would be n-1 for the blue's and n-2 for the green's).
If he was finding the difference between the sample and population variances, you would be correct. But for this simulation, he uses the "pseudo-sample variance" to best demonstrate the unbiased estimate.(2 votes)
- So, what is the significance of the number 1 with respect to unbiased sample variance? What I mean is, it just seems a bit odd that it would conveniently converge to a whole number such as 1. Is there a simple answer or is it a mystical property akin to Euler's identity?(6 votes)
- If you look at the sample variance for lots and lots of random samples, and take the average of all those different variances, that average will tend to agree with the true population variance. It will tend to agree more as you consider more samples. Formally, we say that the "expectation value" of the sample variance is equal to the population variance. This Wikipedia article shows a proof of why this is true: https://en.wikipedia.org/wiki/Variance#Sample_variance. That proof also shows where the factor of n-1 comes from.(1 vote)
- I still don't understand how the computer program calculates the "pseudo-sample variance" at 4:10 if we don't know mu's value. Can someone please explain?(2 votes)
- In real life we generally don't know the value of μ. However, in a simulation, we are making up the data, and we do in fact know μ. What we're doing is:
1. Set μ and create some data from a distribution with that mean.
2. Pretend that we don't know μ, and calculate the mean and standard deviation.
3. Remember that we know μ, and perform the calculations shown in the video.(8 votes)
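The three steps above can be sketched in Python roughly as follows. This is not the code behind the actual visualization; the uniform 0 to 100 population and sample size 50 are taken from the video, while the function name `one_trial` and its parameters are assumptions made for illustration.

```python
import random

def one_trial(rng, low=0.0, high=100.0, n=50):
    """Return (xbar - mu, variance using xbar - variance using mu), both with
    denominator n, for one simulated sample (hypothetical helper)."""
    mu = (low + high) / 2.0                                   # step 1: we choose mu ...
    data = [rng.uniform(low, high) for _ in range(n)]         # ... and make up the data
    xbar = sum(data) / n                                      # step 2: pretend mu is unknown
    var_with_xbar = sum((x - xbar) ** 2 for x in data) / n
    var_with_mu = sum((x - mu) ** 2 for x in data) / n        # step 3: "pseudo-sample variance"
    return xbar - mu, var_with_xbar - var_with_mu

rng = random.Random(0)  # fixed seed so the run is repeatable
for _ in range(5):
    dx, dv = one_trial(rng)
    # The third column, -(xbar - mu)^2, matches the second up to rounding.
    print(f"{dx:8.3f} {dv:10.3f} {-(dx ** 2):10.3f}")
```

Algebraically, Σ(xi - x̅)²/n - Σ(xi - μ)²/n = -(x̅ - μ)², so with denominator n the difference is never positive. Plotted against x̅ - μ, it traces a downward-opening parabola.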
- didn't understand a thing(4 votes)
- At 4:23, it should have been Σ(xi - x̅)^2/N - Σ(xi - x̅)^2/n, shouldn't it?(2 votes)
- Σ(xi - x̅)^2/n - Σ(xi - μ)^2/N, yep. Sal makes a lot of typos.(1 vote)
Here is a simulation created by Khan Academy user Justin Helps that once again tries to give us an understanding of why we divide by n minus 1 to get an unbiased estimate of population variance when we're trying to calculate the sample variance. So what he does here, in the simulation, is it has a population that has a uniform distribution. So he says, "I used a flat probabilistic distribution from 0 to 100 for my population." Then we start sampling from that population. We're going to use samples of size 50. And what we do is, for each of those samples, we calculate the sample variance based on dividing by n, by dividing by n minus 1, and by n minus 2. And as we keep having more and more and more samples, we take the mean of the variances calculated in different ways. And we figure out what those means converge to.

So that's a sample. Here's another sample. Here's another sample. If I sample here, then I'm now adding a bunch and I'm sampling continuously. And we see something very interesting happen. When I divide by n, even when I'm taking the mean of many, many, many, many sample variances that I've already taken, I'm still underestimating the true variance. When I divide by n minus 1, it looks like I'm getting a pretty good estimate; the mean of all of my sample variances has really converged to the true variance. When I divide by n minus 2, just for kicks, it's pretty clear that with my mean of my sample variances, I overestimate the true variance. So this gives us a pretty good sense that n minus 1 is the right thing to do.

Now this is another interesting way of visualizing it. On the horizontal axis right over here, each point is one of our samples, and how far to the right it is is how much more that sample mean is than the true mean. And when we go to the left, it's how much less the sample mean is than the true mean. So for example, this sample right over here, it's all the way over to the right. The sample mean there was a lot more than the true mean. The sample mean here was a lot less than the true mean. The sample mean here was only a little bit more than the true mean.

On the vertical axis, using this denominator, dividing by n, we calculate two different variances. For one variance, we use the sample mean. For the other variance, we use the population mean. And on the vertical axis, we compare the difference between the variance calculated with the sample mean versus the variance calculated with the population mean. So for example, this point right over here, when we calculate our variance with our sample mean, which is the normal way we do it, it significantly underestimates what the variance would have been if somehow we knew what the population mean was and we could calculate it that way. And you get this really interesting shape. And it's something to think about. And he recommends some thinking about why this is, or what kind of a shape this actually is.

The other interesting thing is, when you look at it this way, it's pretty clear this entire graph is sitting below the horizontal axis. So when we calculate our sample variance using this formula, when we use the sample mean to do it, which we typically do, we're always getting a lower variance than when we use the population mean. Now this over here, when we divide by n minus 1, we're not always underestimating. Sometimes we are overestimating it. And when you take the mean of all of these variances, you converge. And here, we're overestimating it a little bit more.

And just to be clear what we're talking about in these three graphs, let me take a screenshot of it and explain it in a little bit more depth. So just to be clear, in this red graph right over here, let me do this in a color close to it, at least. So this orange. What this distance is, for each of these samples, is we're calculating the sample variance using the sample mean. And in this case, we are using n as our denominator. And from that we're subtracting the sample variance, or I guess you could call this some kind of pseudo sample variance, that we'd get if we somehow knew the population mean. This isn't something that you see a lot in statistics. But it's a gauge of how much we are underestimating our sample variance given that we don't have the true population mean at our disposal. And so this is the distance we're calculating. And you see we're always underestimating. Here we overestimate a little bit. And we also underestimate. But when you take the mean, when you average them all out, it converges to the actual value. So here we're dividing by n minus 1, here we're dividing by n minus 2.
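A rough Python sketch of the experiment described in this transcript is given below. It is not Justin Helps's program; the uniform 0 to 100 population, sample size 50, and the three denominators come from the video, while the function name `mean_of_variances`, the number of samples, and the seed are assumptions chosen for illustration.

```python
import random

def mean_of_variances(num_samples=20000, n=50, low=0.0, high=100.0, seed=0):
    """Average the variance estimates over many samples, once for each
    denominator n, n - 1 and n - 2 (hypothetical helper)."""
    rng = random.Random(seed)                  # fixed seed for a repeatable run
    totals = {n: 0.0, n - 1: 0.0, n - 2: 0.0}  # running sums keyed by denominator
    for _ in range(num_samples):
        sample = [rng.uniform(low, high) for _ in range(n)]
        xbar = sum(sample) / n
        squared_devs = sum((x - xbar) ** 2 for x in sample)
        for denominator in totals:
            totals[denominator] += squared_devs / denominator
    return {d: total / num_samples for d, total in totals.items()}

true_variance = (100.0 - 0.0) ** 2 / 12   # variance of a uniform 0..100 population
print(f"true variance: {true_variance:.1f}")
for denominator, mean_var in sorted(mean_of_variances().items(), reverse=True):
    print(f"dividing by {denominator}: mean of sample variances = {mean_var:.1f}")
```

With these settings, the mean for n - 1 should land close to the true variance of about 833.3, with the n version below it and the n - 2 version above it, mirroring what the visualization shows.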