# Simulation showing bias in sample variance

AP.STATS: UNC‑1 (EU), UNC‑1.J (LO), UNC‑1.J.3 (EK), UNC‑3 (EU), UNC‑3.I (LO), UNC‑3.I.1 (EK)

## Video transcript

This right here is a simulation that was created by Peter Collingridge, using the Khan Academy computer science scratchpad, to better understand why we divide by n minus 1 when we calculate an unbiased sample variance, when we are trying, in an unbiased way, to estimate the true population variance.

What this simulation does is first construct a population distribution, a random one, and every time you go to it, it'll be a different population distribution. This one has a population of 383, and it calculates the parameters for that population directly from it: the mean is 10.9 and the variance is 25.5. Then it uses that population and samples from it. It takes samples of size 2, 3, 4, 5, all the way up to 10, and it keeps sampling and calculating the statistics for those samples, so the sample mean and the sample variance, in particular the biased sample variance. And it starts telling us some things that give us some intuition. You can actually click on each of these and zoom in to really study these graphs in detail.

I've already taken a screenshot of this and put it on my little doodle pad so we can really delve into some of the math and the intuition of what this is actually showing us. Here I took a screenshot, and you see that for this case the population was 529 and the population mean was 10.6. Down here in this chart he plots the population mean right here at 10.6. And you see that the population variance is 36.8, and he plots that right over here at 36.8.

So this first chart on the bottom left tells us a couple of interesting things. And just to be clear, this is the biased sample variance that he's calculating. It is being calculated like this: for each of our data points, starting with the first data point in each of our samples and going to the nth data point in the sample, you take that data point, subtract out the sample mean, and square it. Then you add all of those up and divide the whole thing not by n minus 1 but by lowercase n.

And this tells us several interesting things. The first thing it shows us is that the cases where we are significantly underestimating the population variance, where we're getting sample variances close to zero, are also disproportionately the cases where the means for those samples are way far off from the true population mean. Or you could view it the other way around: in the cases where the sample mean is way far off from the population mean, it seems like you're much more likely to underestimate the population variance.

The other thing that might pop out at you is the realization that the pinker dots are the ones for smaller sample sizes, while the bluer dots are the ones for larger sample sizes. And you see here that at these two little, I guess, tails, so to speak, of this hump, at these ends, it's disproportionately more of a reddish color. Most of the bluish or purplish dots are focused right in the middle, right over here; they are giving us better estimates. There are some red ones here too, and that's why it gives us that purplish color. But out here on these tails it's almost purely red. Every now and then, by happenstance, you get a little blue one, but it's disproportionately far more red. And that really makes sense: when you have a smaller sample size, you're more likely to get a sample mean that is a bad estimate of the population mean, that's far from the population mean, and you're more likely to significantly underestimate the population variance.

Now this next chart really gets to the meat of the issue. What it's telling us is that for each of these sample sizes, so this right over here for sample size 2, if we keep taking samples of size 2, and we keep calculating the biased sample variances and dividing them by the population variance, and finding the mean over all of those, you see that over many, many trials, many samples of size 2, that ratio of biased sample variance to population variance approaches one half. The biased sample variance is approaching half of the true population variance. When the sample size is 3, it's approaching two thirds, 66.6 percent, of the true population variance. When the sample size is 4, it's approaching three fourths of the true population variance.

So we can come up with a general theme of what's happening. When we use the biased estimate, we're not approaching the population variance. We're approaching, let me write this down, n minus 1 over n times the population variance. When n was 2, this approached one half. When n is 3, this is two thirds. When n is 4, this is three fourths. So this is giving us a biased estimate.

So how would we unbias it? Well, if we really want our best estimate of the true population variance, not n minus 1 over n times the population variance, we would want to multiply, and I'll do this in a color I haven't used yet, by n over n minus 1 to get an unbiased estimate. Here, these cancel out and you are left with just your population variance; that's what we want to estimate. And over here you are left with our unbiased estimate of the population variance, our unbiased sample variance, which is equal to the sum of the squared differences from the sample mean, divided by n minus 1. This is what we saw in the last several videos and what you see in statistics books, and sometimes it's confusing why. Hopefully Peter's simulation gives you a good idea of why, or at least convinces you that it is the case, that you would want to divide by n minus 1.
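
If you want to try the experiment yourself, here is a minimal Python sketch of the same idea. It is not Peter's code, and it makes a few assumptions that are not in the video: the population is random integers from 0 to 20, samples are drawn with replacement, and the names `biased_variance`, `mean_variance_ratio`, and `run_simulation` are made up for this illustration. For each sample size from 2 to 10 it averages the biased sample variance over many samples, divides by the population variance, and prints that ratio next to n minus 1 over n.

```python
import random
import statistics


def biased_variance(sample):
    """Sample variance with divisor n (the biased estimator)."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / n


def mean_variance_ratio(population, n, trials, rng):
    """Mean of (biased sample variance / population variance) over many samples."""
    pop_var = statistics.pvariance(population)  # population variance, divisor N
    total = 0.0
    for _ in range(trials):
        sample = rng.choices(population, k=n)  # assumption: sample with replacement
        total += biased_variance(sample)
    return total / trials / pop_var


def run_simulation(seed=0, pop_size=500, trials=20000):
    """Build a random population and return {sample size: mean ratio}."""
    rng = random.Random(seed)
    # Assumption: a made-up population of random integers between 0 and 20.
    population = [rng.randint(0, 20) for _ in range(pop_size)]
    return {n: mean_variance_ratio(population, n, trials, rng)
            for n in range(2, 11)}


if __name__ == "__main__":
    for n, ratio in run_simulation().items():
        corrected = ratio * n / (n - 1)  # multiply by n over (n - 1) to unbias
        print(f"n={n:2d}  biased/population={ratio:.3f}  "
              f"(n-1)/n={(n - 1) / n:.3f}  corrected={corrected:.3f}")
```

With enough trials, the second column should sit close to the third, and the corrected column should sit close to 1, which is the same pattern the chart in the video shows.
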
AP® is a registered trademark of the College Board, which has not reviewed this resource.