AP.STATS: UNC‑1 (EU), UNC‑1.J (LO), UNC‑1.J.3 (EK), UNC‑3 (EU), UNC‑3.I (LO), UNC‑3.I.1 (EK)

Video transcript

Here is a simulation created by Khan Academy user Justin Helps that once again tries to give us an understanding of why we divide by n minus 1 to get an unbiased estimate of population variance when we're calculating the sample variance.

What he does here is a simulation. It has a population with a uniform distribution. He says, "I used a flat probabilistic distribution from 0 to 100 for my population." Then we start sampling from that population. We're going to use samples of size 50, and for each of those samples we calculate the sample variance three ways: dividing by n, dividing by n minus 1, and dividing by n minus 2. As we keep taking more and more samples, we take the mean of the variances calculated in each of those ways, and we figure out what those means converge to.

So that's a sample. Here's another sample. Here's another sample. If I sample here, then I'm now adding a bunch, and I'm sampling continuously. And you saw something very interesting happen. When I divide by n, even when I'm taking the mean of many, many sample variances, I'm still underestimating the true variance. When I divide by n minus 1, it looks like I'm getting a pretty good estimate: the mean of all of my sample variances has really converged to the true variance. When I divided by n minus 2, just for kicks, it's pretty clear that the mean of my sample variances overestimated the true variance. So this gives us a pretty good sense that n minus 1 is the right thing to do.

Now, this is another interesting way of visualizing it. On the horizontal axis right over here, each point is one of our samples. How far to the right it is shows how much more that sample mean is than the true mean, and how far to the left it is shows how much less the sample mean is than the true mean. So, for example, this sample right over here is all the way over to the right: the sample mean there was a lot more than the true mean. The sample mean here was a lot less than the true mean. The sample mean here was only a little bit more than the true mean.

On the vertical axis, using this denominator, dividing by n, we calculate two different variances. For one variance we use the sample mean; for the other we use the population mean. And on the vertical axis we plot the difference between the variance calculated with the sample mean and the variance calculated with the population mean. So, for example, at this point right over here, when we calculate our variance with our sample mean, which is the normal way we do it, it significantly underestimates what the variance would have been if somehow we knew the population mean and could calculate it that way.

And you get this really interesting shape, and it's something to think about. I recommend some thinking about why, or what kind of shape this actually is. The other interesting thing is that when you look at it this way, it's pretty clear this entire graph is sitting below the horizontal axis. So when we calculate our sample variance using this formula, using the sample mean, which we typically do, we're always getting a lower variance than when we use the population mean.

Now over here, when we divide by n minus 1, we're not always underestimating. Sometimes we're overestimating, and when you take the mean of all of these variances, you converge. And here we're overestimating it a little bit more.

Just to be clear about what we're talking about in these three graphs, let me take a screenshot and explain it in a little more depth. In this red graph right over here (let me do this in a color close to it, at least, so this orange), what this distance is, for each of these samples, is this: we're calculating the sample variance using the sample mean, and in this case we are using n as our denominator. And from that we're subtracting the sample variance, or I guess you could call it some kind of pseudo sample variance, that we would get if we somehow knew the population mean. This isn't something that you see a lot in statistics, but it's a gauge of how much we are underestimating our sample variance given that we don't have the true population mean at our disposal. And so this is the distance we're calculating. You see we are always underestimating.

Here we overestimate a little bit, and we also underestimate, but when you take the mean, when you average them all out, it converges to the actual value. So here we're dividing by n minus 1; here we're dividing by n minus 2.
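
The experiment described in this transcript can be approximated in a few lines of Python. This is not Justin Helps's actual simulation, only a minimal sketch under stated assumptions: the function name `simulate`, the number of samples (20,000), and the fixed seed are illustrative choices that do not come from the video. The population (uniform from 0 to 100) and the sample size (50) do. A uniform distribution on 0 to 100 has variance 100²/12, roughly 833.33, which serves as the "true variance" to compare against. The sketch also checks one candidate answer to the question about the shape of the first graph: with n as the denominator, the variance around the sample mean minus the variance around the population mean works out to minus the squared distance between the two means, which would trace a downward-opening parabola.

```python
import random

def simulate(num_samples=20000, n=50, low=0.0, high=100.0, seed=0):
    """Average the sample variance over many samples, for three denominators.

    num_samples and seed are illustrative assumptions; n, low, and high
    follow the values mentioned in the transcript.
    """
    rng = random.Random(seed)
    mu = (low + high) / 2  # population mean of a uniform distribution
    totals = {"n": 0.0, "n-1": 0.0, "n-2": 0.0}
    max_gap_error = 0.0

    for _ in range(num_samples):
        xs = [rng.uniform(low, high) for _ in range(n)]
        xbar = sum(xs) / n

        # Sum of squared deviations from the sample mean and from the true mean.
        ss_sample = sum((x - xbar) ** 2 for x in xs)
        ss_pop = sum((x - mu) ** 2 for x in xs)

        totals["n"] += ss_sample / n
        totals["n-1"] += ss_sample / (n - 1)
        totals["n-2"] += ss_sample / (n - 2)

        # Vertical-axis quantity from the first graph (denominator n):
        # variance using the sample mean minus variance using the true mean.
        gap = ss_sample / n - ss_pop / n
        # Track how far the gap ever strays from -(xbar - mu)^2.
        max_gap_error = max(max_gap_error, abs(gap + (xbar - mu) ** 2))

    means = {key: total / num_samples for key, total in totals.items()}
    return means, max_gap_error

true_variance = (100.0 - 0.0) ** 2 / 12  # variance of uniform(0, 100)
means, max_gap_error = simulate()

print(f"true variance:       {true_variance:.2f}")
for key, value in means.items():
    print(f"mean of SS/({key}):".ljust(21) + f"{value:.2f}")
print(f"largest deviation of gap from -(xbar - mu)^2: {max_gap_error:.2e}")
```

Running it should show the mean for the n denominator settling below the true variance, the n minus 1 mean settling close to it, and the n minus 2 mean settling above it. The last line should print a value near zero, which is consistent with the first graph never rising above the horizontal axis.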