Question 1

I would love to see another simulation comparing the biased estimator to the unbiased estimator (two of the right hand corner graph, one with each type) to appreciate the real difference, has anyone done this?
thanks!

Accepted Answer

Done: https://www.khanacademy.org/cs/unbiased-estimate-of-population-variance/1552640497

Basically the bars for the unbiased estimate tend to 100% (which is what you'd expect).

Question 2

i understand how the n-1 can be derived through simulation, but from a logical standpoint, why could it not be n-2 or n-3 ?

Accepted Answer

I think thats getting into the integral formulations territory.

Question 3

Is n-1 only an unbiased sample if the underling population is normal distribution or is it always true?

Accepted Answer

It doesn't matter what the underlying population distribution is. If you look at the simulation, the top graph shows the population distribution and it is not normally distributed. Central limit theorem tells us that the distribution of sample means will be normally distributed regardless of the population distribution.

https://www.khanacademy.org/math/probability/statistics-inferential/sampling_distribution/v/central-limit-theorem

https://www.khanacademy.org/math/probability/statistics-inferential/sampling_distribution/v/central-limit-theorem

Question 4

How do we know when the sample size is small enough to warrant using n-1 as opposed to n?

Accepted Answer

we need to use "n-1" instead of "n" as long as the sample size is smaller than the population (which is, mostly all the time)... doesn't matter if the sample is really large or really small, either ways it's less than than the population so we need to compensate by using "n-1"... and the value of  "n-1" changes when the size of the sample changes so the compensation is in proportion to the size of the sample.. hope this was helpful..

Question 5

I understand how Sal shows the population variance to equal sum(xi-x[bar])^2/n-1.  What i do not quite understand is the relationship between the unbiased calculated population variance [sigma^2] and the sample variance [s^2].  Are they the same? Do we use [sigma^2] when using calculating p-values?
Thanks

Accepted Answer

I think you may have had a typo in there. If we have population data (all the data possible), we could calculate the population variance exactly, because we have the population: `σ^2 = sum(xi-μ)^2/n`.

If we only have a sample, then we need to estimate the population variance. It's a foregone conclusion that we can't _calculate_ it, since we don't have all the possible data, but we can estimate it. And this is where the "unbiased" version comes into play: not all estimates are created equal. Going back to the population version (above), we could simply replace the μ with an xbar and call it a day:
`s^2 = sum(xi-xbar)^2/n`
However, this is what Sal is showing is a biased estimate. Since we're simulating data, we know the true variance, but we pretend as if we don't and so calculate the sample variance. If this formula gave us "good" results, the ratio of these two values, `s^2 / σ^2`, should be about 1. Sometimes it would be larger, other times smaller, but that's what it would average out to be. Unfortunately, this doesn't work. The formula above gives biased results - it tends to miss the mark by underestimating a little bit. Thankfully, we can figure out how to correct for this, and it's just a scaling factor. So we replace the formula above with
`s^2 = sum(xi-xbar)^2/(n-1)`
When we use this formula, the ratio of the sample variance to the population variance will tend to be 1, so we have an "unbiased" estimate of the sample variance.

The estimated sample variance is not "the same" as the population variance, it's our best guess at what the population variance is. This is the same way that the sample mean is not the same as the population mean, but it's our best guess.

In terms of what we use: if we have σ^2, we should use it. If we don't, then we use s^2. In the case of just variances, since the formulas are so similar, this basically means the following: If we have population data, we divide by n. If we only have a sample, then we divide by n-1. There are some additional consequences of this further down the line (e.g., if we have σ^2, then we'd use a Z-test instead of a t-test later on).

Question 6

So i understand how and why the correction is necessary. What i dont understand is: what is the reason for the bias to be (n-1)/n?

It makes sense that the variance will be different in a sample. But why this pattern of (n-1)/n?
If the points i take from the sample are close to each other the variance will be smaller than from the total population. If they are at the extrem ends or just few points spread out over maximum distances the variance could be even greater than the one of the total population. Why dont the samples cancel each other out? Why (n-1)/n?

Accepted Answer

If you look at some of the other questions, you may find people asking the same question as you. I wrote an answer to this a while back, here's a link to the question, and my answer is below it.

https://www.khanacademy.org/math/probability/descriptive-statistics/variance_std_deviation/v/simulation-showing-bias-in-sample-variance?qa_expand_key=ag5zfmtoYW4tYWNhZGVteXJfCxIIVXNlckRhdGEiQXVzZXJfaWRfa2V5X2h0dHA6Ly9nb29nbGVpZC5raGFuYWNhZGVteS5vcmcvMTE1OTg0NTAzMjE5NjQ2MjU2MzA3DAsSCEZlZWRiYWNrGMGgCgw

Question 7

what actually biased and unbiased tells us, what they actually means?

Accepted Answer

here biased means we are underestimating the sample variance, thus it is not predicting the population variance. dividing by n-1, we are compensating the underestimation, and making it unbiased.

Question 8

can someone explain this differintly for me?

Accepted Answer

the SD of a sample is an estimate of the entire population based on the mean drawn from a sample of that population. So the SD is always going to be biased towards differences based on the average of the sample. n-1 corrects for that bias.

Course: AP®︎/College Statistics > Unit 3

Simulation showing bias in sample variance

Want to join the conversation?

Video transcript