Question 1

Just curious: Was it by simulations like this that statisticians originally figured out the n-1 thing? Or is that conclusion actually really obvious if you just understand the "pure math" underlying it?

Accepted Answer

No, they did it analytically. They probably came up with some intuition of the need to adjust the variance, but intuition cannot tell you why you have to divide exactly by n-1.

There is a geometrical reason for dividing by n-1: it is the number of degrees of freedom. You can see this for the sample variance by counting the independent data points. To compute the sample variance, you first compute the sample mean. Given that sample mean, if someone gives you all the data points except one, you can work out the last data point yourself. So you effectively don't have n independent data points for computing the sample variance, but only n-1.
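To make the "you can figure out the last data point" step concrete, here is a minimal sketch. The data values and the function name `recover_last_point` are made up for illustration.

```python
def recover_last_point(known_points, sample_mean):
    """Recover the one missing data point from the other n-1 points and the sample mean."""
    n = len(known_points) + 1
    # All n points must sum to n * mean, so the missing point is whatever is left over.
    return n * sample_mean - sum(known_points)


data = [4, 7, 1, 8]            # hypothetical sample, n = 4
mean = sum(data) / len(data)   # sample mean of all four points

# Hide the last point and rebuild it from the other three plus the mean.
missing = recover_last_point(data[:-1], mean)
print(missing == data[-1])     # True: the last point was never "free" to vary
```

Once the mean is fixed, only n-1 of the points can vary freely, which is the sense in which one degree of freedom is used up.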

Question 2

I'm sorry, but what does biased and unbiased mean?

Accepted Answer

A biased estimate is one that consistently underestimates or overestimates.

For example, sample estimates using (n) tend to consistently underestimate the population variance. So we say they have a BIAS toward underestimation.

Sample estimates using (n-1), however, do not tend to underestimate or overestimate, so we consider them UNBIASED.

Note that unbiased is not the same thing as accurate. Suppose I use another method that sometimes way underestimates, but at other times way overestimates. This method is not very accurate, but it is still unbiased -- the mean of its errors would be close to zero since the overestimates would "cancel out" the underestimates.
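Here is a rough simulation of that last point. Both "methods" below are invented for illustration: each one returns the true value plus zero-centered random noise, and they differ only in how large the noise is. The true value, noise sizes, trial count, and seed are all arbitrary assumptions.

```python
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
TRUE_VALUE = 10.0        # hypothetical quantity being estimated
TRIALS = 10_000


def simulate(noise_sd):
    """Return (mean error, mean absolute error) for a method with zero-centered noise."""
    estimates = [TRUE_VALUE + rng.gauss(0.0, noise_sd) for _ in range(TRIALS)]
    errors = [est - TRUE_VALUE for est in estimates]
    mean_error = sum(errors) / TRIALS                       # measures bias
    mean_abs_error = sum(abs(e) for e in errors) / TRIALS   # measures typical miss
    return mean_error, mean_abs_error


steady_bias, steady_miss = simulate(noise_sd=0.5)
wild_bias, wild_miss = simulate(noise_sd=5.0)

print(f"steady method: mean error {steady_bias:+.3f}, typical miss {steady_miss:.3f}")
print(f"wild method:   mean error {wild_bias:+.3f}, typical miss {wild_miss:.3f}")
```

Both mean errors should come out near zero (unbiased), while the typical miss of the "wild" method is much larger (not accurate).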

Question 3

These explanations are based on empirical evidence. Is there a theoretical explanation for dividing by n-1?

Accepted Answer

For me, this Wikipedia article is more detailed and easier to understand, though it covers more or less the same ground as the "Sample variance" page. It is still a bit different, so it may be worth checking for those who need more information on the "n-1" question after reading the Sample variance article: https://en.wikipedia.org/wiki/Bias_of_an_estimator
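For anyone who wants the analytic argument spelled out, here is a short sketch of the standard derivation. It assumes the data points x_1, ..., x_n are independent draws from a population with mean \mu and variance \sigma^2. The notation and the independence assumption are mine, not taken from the linked article.

```latex
\begin{align*}
\sum_{i=1}^{n}(x_i-\bar{x})^2 &= \sum_{i=1}^{n}(x_i-\mu)^2 - n(\bar{x}-\mu)^2 \\
E\left[\sum_{i=1}^{n}(x_i-\mu)^2\right] &= n\sigma^2 \\
E\left[n(\bar{x}-\mu)^2\right] &= n\,\mathrm{Var}(\bar{x}) = n\cdot\frac{\sigma^2}{n} = \sigma^2 \\
E\left[\sum_{i=1}^{n}(x_i-\bar{x})^2\right] &= n\sigma^2 - \sigma^2 = (n-1)\sigma^2
\end{align*}
```

Since the expected sum of squared deviations from the sample mean is (n-1)\sigma^2, dividing that sum by n-1 gives an expected value of exactly \sigma^2, while dividing by n gives \sigma^2 (n-1)/n.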

Question 4

When does a question about variance call for n-1, and when is it just n? Thank you. I would really appreciate a clear answer.

Accepted Answer

Use n-1 when you have drawn a sample from the population.
Use n when you have counted the entire population.
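As a small illustration, Python's standard `statistics` module offers both versions: `pvariance` divides by n (population) and `variance` divides by n-1 (sample). The data values below are made up.

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data, n = 8

# Treat the eight values as the ENTIRE population: divide by n.
pop_var = statistics.pvariance(scores)

# Treat the same eight values as a SAMPLE from a larger population: divide by n-1.
samp_var = statistics.variance(scores)

print("population variance (divide by n):  ", pop_var)
print("sample variance (divide by n-1):    ", samp_var)
```

The sample version is always a bit larger for the same data, because the same sum of squared deviations is divided by a smaller number.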

Question 5

Isn't the relative size of the sample compared to the population relevant when calculating the sample variance? I mean, if we calculate the variance of 99 elements out of a population of 100 elements, won't the variance of this sample be more accurately described by N, and not (N-1)? Is there a threshold for a sample to be described by (N-1)?

Accepted Answer

That’s an excellent question, and I’m not sure about the answer.

But if our sample size is only one or two less than our population size, we might as well look at every element in the population instead. Sampling is used when it is not practical to take information from the whole population, so there is usually a good portion of the population left over. So, this situation isn’t practical, but it is interesting to think about theoretically.

Question 6

I understand that n-1 provides a more accurate estimation. However, if we know our population N value, couldn't we just subtract the n/N ratio from n instead? For example, if N=20 and n=10, we would know the ratio is 0.5. Therefore, we could find an even better estimate from n-0.5.

Accepted Answer

The number that we subtract has nothing to do with the size of the population. It's not just that it makes the estimate "more accurate"; it's that it makes it what statisticians call "unbiased."

Think back to the sampling distribution of the sample mean. Suppose we repeated an experiment over and over again and recorded the sample mean from each of the repeated experiments. The mean of the sampling distribution of the sample mean -- what Sal sometimes refers to as the "mean of means" -- happens to be equal to the mean of the original distribution. Because of this, we say that the sample mean is "unbiased": it doesn't systematically overestimate or underestimate the population mean.

This is not the case with the variance. If we calculate the variance over and over again, using n in the denominator, the "mean of variances" (a strange concept, but it's the proper one to think about) will not be equal to σ^2, it will be σ^2 * (n-1)/n. By dividing by n-1 instead of n, we fix this problem. Using n, the sample variance is biased, because it tends to underestimate the population variance. Using n-1, the sample variance is unbiased.
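The "mean of variances" claim can be checked exactly on a tiny example instead of by simulation. This is only a sketch under my own assumptions: the population is the six faces of a fair die, and samples of size 3 are drawn with replacement, so every possible sample is equally likely and can be listed.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]   # hypothetical population: faces of a fair die
n = 3                             # sample size


def sum_sq_dev(values):
    """Sum of squared deviations from the mean of `values`, as an exact fraction."""
    m = Fraction(sum(values), len(values))
    return sum((v - m) ** 2 for v in values)


# Population variance (divide by the population size).
sigma2 = sum_sq_dev(population) / len(population)

# Every equally likely sample of size n, drawn with replacement (6**3 = 216 samples).
samples = list(product(population, repeat=n))

# Average the variance estimate over all samples, once per choice of denominator.
mean_var_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
mean_var_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print("population variance:        ", sigma2)
print("mean of variances with n:   ", mean_var_n)
print("mean of variances with n-1: ", mean_var_n_minus_1)
print("sigma^2 * (n-1)/n:          ", sigma2 * Fraction(n - 1, n))
```

With n in the denominator the average lands on σ^2 * (n-1)/n, and with n-1 it lands exactly on σ^2.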

So in this sense, it's not possible to get a better estimate for the variance. Subtracting 1, and specifically 1, is the best we can do: dividing by anything other than n-1 brings the bias back. Now, there are other criteria we might look at which may make a different estimate of the variance seem "better," but if we're just talking about bias and the denominator we're using, n-1 can't be beat.

Question 7

Why don't we divide our sample mean by n-1? Is it not a biased estimator?

Accepted Answer

Different sample means will scatter around the population mean and can be both higher and lower, but sample variances computed with n will tend to be lower than the population variance.
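A one-line sketch of why the sample mean needs no correction, assuming each data point x_i is drawn from a population with mean \mu (the notation is mine):

```latex
E[\bar{x}] = E\left[\frac{1}{n}\sum_{i=1}^{n} x_i\right]
           = \frac{1}{n}\sum_{i=1}^{n} E[x_i]
           = \frac{1}{n}\cdot n\mu
           = \mu
```

Dividing by n already gives an expected value of exactly \mu, so switching to n-1 would introduce a bias rather than remove one.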

Simulation providing evidence that (n-1) gives us unbiased estimate