Review and intuition why we divide by n-1 for the unbiased sample variance

AP.STATS: UNC‑1 (EU), UNC‑1.J (LO), UNC‑1.J.3 (EK), UNC‑3 (EU), UNC‑3.I (LO), UNC‑3.I.1 (EK)

Video transcript

What I want to do in this video is review much of what we've already talked about, and then hopefully build some intuition for why we divide by n minus 1 if we want an unbiased estimate of the population variance when we're calculating the sample variance.

So let's think about a population. Let's say this is the population right over here, and it is of size capital N. We also have a sample of that population, and its size is lowercase n data points. Let's think about all of the parameters and statistics that we know about so far.

The first is the idea of the mean. If we're trying to calculate the mean for the population, is that going to be a parameter or a statistic? Well, when we calculate it on the population, we are calculating a parameter. So let me write this down: for the population, we are calculating a parameter. And when we attempt to calculate something for a sample, we call that a statistic.

So how do we think about the mean for a population? First of all, we denote it with the Greek letter mu. We take the sum of every data point in our population: we start at the first data point and go all the way to the capital-N-th data point. This is the i-th data point, so x sub 1 plus x sub 2, all the way to x sub capital N. And then we divide by the total number of data points we have, capital N.

How do we calculate the sample mean? We do a very similar thing with the sample, and we denote it with an x with a bar over it. That's going to be taking every data point in the sample, so going up to lowercase n, adding them up, and then dividing by the number of data points that we actually had, lowercase n.

Now, the
other thing that we're trying to calculate for the population, which was a parameter, and that we'll also try to calculate for the sample and use to estimate the population's value, was the variance, which is a measure of how dispersed the data points are, or how much they vary from the mean. So let's write variance right over here.

How do we denote and calculate variance for a population? For a population, we say that the variance, for which we use the Greek letter sigma squared, is equal to, and you could view it as, the mean of the squared distances from the population mean. What we do is take, for each data point, so i equal 1 all the way to capital N, that data point and subtract from it the population mean. So if you want to calculate this, you'd want to figure out the population mean first. That's one way to do it; we'll see there are other ways where you can calculate them at the same time, but the most intuitive is to calculate the mean first. Then, for each of the data points, take the data point, subtract the mean from it, and square it. Add those up, and then divide by the total number of data points you have, capital N.

Now we get to the interesting part: sample variance. When people talk about sample variance, there are several tools in their toolkits, several ways to calculate it. One way is the biased sample variance, the biased estimator of the population variance, and that's usually denoted by s with a subscript lowercase n. How would we calculate the biased estimator? We would calculate it very similarly to how we calculated the variance right over here, but we would do it for our sample, not our population. So for every data point in our sample, and we have lowercase n of them, we take that data point, subtract our sample mean from it, and square it. We add those up and then divide by the number of data points that we have, lowercase n.

But we already talked
about, in the last video, how we would find our best unbiased estimate of the population variance. That is usually what we're trying to get at: an unbiased estimate of the population variance. In the last video we saw what to do if we want an unbiased estimate, and here in this video I want to give you a sense of the intuition for why. We take the sum: we go through every data point in our sample, take that data point, subtract the sample mean from it, and square that. But instead of dividing by n, we divide by n minus 1.

We're dividing by a smaller number, and when you divide by a smaller number you get a larger value. So the version that divides by n minus 1 is going to be larger, and the version that divides by n is going to be smaller. The n minus 1 version we refer to as the unbiased estimate, and the n version we refer to as the biased estimate. If people just write "sample variance", it's a good idea to clarify which one they're talking about. But if you had to guess, and people give you no further information, they're probably talking about the unbiased estimate of the variance, so you would probably divide by n minus 1.

But let's think about why the estimate that divides by n would be biased, and why we might want an estimate like the n minus 1 version, which is larger. And then maybe in the future we could have a computer program or something that really makes us feel better that dividing by n minus 1 gives us a better estimate of the true population variance.

So let's imagine all of the data in a population, and I'm just going to plot them on a number line. This is my number line, and let me plot all of the data points in my population. So this is some data, here's some data, and here is some data here,
and I can do as many points as I want. These are just points on the number line. So this is my entire population. Let's see, I have 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 points. So in this case, what would be my capital N? Capital N would be 14.

Now let's say I take a sample, and my sample size, lowercase n, is 3. Well, before I even think about that, let's think about roughly where the mean of this population would sit. The way I drew it, and I'm not going to calculate it exactly, it looks like the mean might sit someplace roughly right over here. So the true population mean, the parameter, is going to sit right over here.

Now let's think about what happens when we sample. I'm going to use a very small sample size just to give us the intuition, but this is true of any sample size. So let's say we have a sample size of 3. There is some possibility, when we take our sample of 3, that we happen to sample in a way that our sample mean is pretty close to our population mean. For example, if we sample that point, that point, and that point, I can imagine our sample mean might actually sit pretty close to our population mean.

But there's a distinct possibility that when I take a sample, I sample that, that, and that. And the key idea here is that when you take a sample, your sample mean is always going to sit within your sample. So there is a possibility that when you take your sample, the true population mean could even be outside of the sample. This situation is just to give you an intuition. Here your sample mean is going to be sitting someplace in there. And so if you were to just calculate the distance from each of these points to the sample mean, this distance and that distance, square it, and divide by the number of data points you have,
this is going to be a much lower estimate than the true variance, measured from the actual population mean, from which these points are much, much further away.

Now, you're not always going to have the true population mean outside of your sample, but it's possible that you do. So in general, when you just take your points and find the squared distance to your sample mean, which is always going to sit inside of your data, even though the true population mean could be outside of it, or at one end of your data, however you want to think about it, you are likely to be underestimating the true population variance. So this right over here, dividing by n, is an underestimate.

And it does turn out that if, instead of dividing by n, you divide by n minus 1, you'll get a slightly larger sample variance, and this is an unbiased estimate. In the next video, and I might not get to it immediately, I would like to generate some type of computer program that is more convincing that dividing by n minus 1 gives a better estimate of the population variance than dividing by n does.
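
As a rough illustration of the argument above, here is a minimal Python sketch. It is not the program from the video series. The 14 population values are invented for this example, the helper `variance` is a hypothetical name, and the sketch assumes samples are drawn with replacement, so every ordered sample of size 3 is equally likely. It averages the divide-by-n and divide-by-(n minus 1) estimates over all such samples and compares both averages with the population variance.

```python
from itertools import product
from statistics import fmean

# Made-up population of N = 14 points (illustrative values, not from the video).
population = [1, 2, 2, 3, 5, 6, 6, 7, 9, 10, 11, 13, 14, 15]
n = 3  # sample size, matching the small sample used for intuition


def variance(data, center, divisor):
    """Sum of squared distances from `center`, divided by `divisor`."""
    return sum((x - center) ** 2 for x in data) / divisor


# Population parameters: mean (mu) and variance (sigma squared, divide by N).
mu = fmean(population)
sigma_sq = variance(population, mu, len(population))

biased, unbiased = [], []
# Assumption: sampling WITH replacement, so we enumerate every ordered
# sample of size n (14**3 = 2744 of them), each equally likely.
for sample in product(population, repeat=n):
    x_bar = fmean(sample)  # sample mean, always inside the sample's range
    biased.append(variance(sample, x_bar, n))        # divide by n
    unbiased.append(variance(sample, x_bar, n - 1))  # divide by n - 1

avg_biased = fmean(biased)
avg_unbiased = fmean(unbiased)

print(f"population variance:        {sigma_sq:.4f}")
print(f"average of divide-by-n:     {avg_biased:.4f}")
print(f"average of divide-by-(n-1): {avg_unbiased:.4f}")
```

Under these assumptions, the average of the divide-by-n estimates comes out to (n minus 1)/n of the population variance, which is two-thirds of it for n equal to 3, while the average of the divide-by-(n minus 1) estimates matches the population variance up to floating-point rounding. Any individual sample can still land above or below the true value; "unbiased" is a statement about the average over samples.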