# Sample variance

AP.STATS: UNC‑1 (EU), UNC‑1.J (LO), UNC‑1.J.3 (EK)

## Video transcript

Let's say that you're curious about people's TV-watching habits, in particular how much TV people in the country watch. What you are concerned with, if we imagine the entire country, is a very large population. We're talking about a country like the United States here, but this is true of pretty much any country. In the United States, we're talking about on the order of 300 million people. So ideally, if you could somehow magically do it, you would survey or somehow observe all 300 million people and take the mean of how many hours of TV they watch on a given day. That would give you the parameter, the population mean.

But we've already talked about how, in a case like this, that's very impractical. Even if you tried to do it, by the time you finished, your data might be stale, because some people might have passed away and other people might have been born. So there is a truth out there: a theoretical population mean for the hours of TV watched per day by Americans at any given point in time. It's just pretty much impossible to come up with this exact truth.

But you don't give up. You say, well, maybe I don't have to survey or observe all 300 million. Instead, I'm just going to observe a sample. Let's say, to keep the computation simple, you take a sample of 6. We'll talk later about why 6 might not be as large a sample as you would like. You survey how much TV these folks watch, and you find one person who watched 1.5 hours, another who watched 2.5 hours, another who watched 4 hours, one person who watched 2 hours, and two people who watched 1 hour each. So given this data from your sample, what
do you get as your sample mean? Well, the sample mean, which we denote by a lowercase x with a bar over it, is just the sum of all of these divided by the number of data points we have. So we have 1.5 plus 2.5 plus 4 plus 2 plus 1 plus 1, all of that divided by 6. In the numerator, 1.5 plus 2.5 is 4, plus 4 is 8, plus 2 is 10, plus 2 more is 12. So this is going to be 12 over 6, which is equal to 2 hours of television. So at least for your sample, you say my sample mean is 2 hours of television. It's a statistic that is trying to estimate this parameter, this thing that's very hard to know, but it's our best shot. Maybe we'll get a better answer if we get more data points, but this is what we have so far.

Now the next question you ask yourself is: I don't want to just estimate my population mean. I'm also interested in estimating another parameter, my population variance. Once again, since we can't survey everyone in the population, this is pretty much impossible to know exactly. We have attempted to estimate the mean; now we will also attempt to estimate this variance parameter.

So how would you do it? Reasonable logic would say, well, maybe we'll do the same thing with the sample that we would have done with the population. When you're computing the population variance, you take each data point in the population, find the distance between it and the population mean, square that difference, add up all those squared differences, and then divide by the number of data points you have. So let's try that over here. Take each of these data points and find the difference between that
data point and our sample mean. Not the population mean, which we don't know, but the sample mean. So that's (1.5 minus 2) squared, plus (4 minus 2) squared, plus (1 minus 2) squared, plus (2.5 minus 2) squared, 2 being the sample mean, plus (2 minus 2) squared, plus (1 minus 2) squared. This is what you would have done if you were taking a population variance: if this were your entire population, you would use the population mean here, find the squared distances from each of those data points, and then divide by the number of data points. So maybe you would divide by the number of data points that you have, which in this case means dividing by 6.

What would we get in this circumstance? Well, if we just do the computation: 1.5 minus 2 is negative 0.5, and when we square that it becomes a positive 0.25. (4 minus 2) squared is 2 squared, which is 4. (1 minus 2) squared is negative 1 squared, which is just 1. 2.5 minus 2 is 0.5, and that squared is 0.25. (2 minus 2) squared is 0. And then (1 minus 2) squared is negative 1 squared, so we just get 1. If we add all of this up, let me add the whole numbers first: 4 plus 1 is 5, plus 1 is 6. And then we have two 0.25s, so this is going to be equal to 6.5. So this is 6.5 over 6. There are a couple of ways that we could write this, but I'll just get the calculator out: 6.5 divided by 6 is, if we round, approximately 1.08. So this calculation comes to approximately 1.08. Now, what
we have to think about is whether this is the best calculation, the best estimate for the population variance given the data that we have. You can always argue that we could have more data, but given the data we have, is this the best calculation that we can make to estimate the population variance? I'll have you think about that for a second.

Well, it turns out that this is close to the best estimate that we can make given the data we have. Sometimes this will be called the sample variance, but it's a particular type of sample variance where we just divide by the number of data points we have, and so people will write just an n over here. So this is one way to define a sample variance in an attempt to estimate our population variance.

But it turns out that you're going to get a better estimate for your population variance if you don't divide by 6, the number of data points you have, but instead divide by one less than the number of data points you have. It seems a little bit weird and voodoo-ish when you first think about it. In the next video I'll give you an intuitive explanation of why it turns out this way, and I would also like to write a computer simulation that at least experimentally makes you feel a little bit better about it.

So how would we do that? When most people talk about the sample variance, they're talking about the sample variance where you do this calculation but, instead of dividing by 6, you divide by 5. That is, you divide by n minus 1. So what would we get in those circumstances? Well, the top part is going to be the exact same thing: we're going to get 6.5. But for our denominator, our n is 6, since we have six data points, and we're going to divide by one less than 6, so we're going
to divide by 5, and 6.5 divided by 5 is equal to 1.3. So that's our sample variance using this technique, which is the more mainstream technique. I know it seems like voodoo: why are we dividing by n minus 1, whereas for the population variance we divide by n? But remember, we're trying to estimate the population variance, and it turns out that this is a better estimate, because the divide-by-n calculation tends to underestimate what the population variance is. We don't know for sure what the population variance is. Both of these could be way off, just by chance, because of what we happened to sample. But over many samples, and there are many ways to think about it, this is going to give you a better estimate.

So how would we write this down with mathematical notation? Well, remember, we're taking a sum over each of the data points, starting with the first data point and going all the way to the nth data point. This lowercase n says that we're looking at the sample. An uppercase N usually denotes that we're summing up everything in the population; here we're looking at a sample of size lowercase n. We take each data point, each x sub i, and from it we subtract the sample mean, and then we square it. We take the sum of those squared distances, and then we divide not by the number of data points we have but by one less than the number of data points we have. So this calculation, where we summed up all of this and then divided by 5, not by 6, is the standard definition of sample variance.

I'll leave you there. In the next video I will attempt to give you an intuition for why we are dividing by n minus 1 instead of dividing by n.
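
The notation described in words above can be written out roughly as follows, using the six data points from the transcript. The labels $s_n^2$ and $s_{n-1}^2$ are assumptions chosen here to tell the two versions apart; the video itself only describes them verbally.

```latex
\begin{align*}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i
         = \frac{1.5 + 2.5 + 4 + 2 + 1 + 1}{6} = \frac{12}{6} = 2 \\[4pt]
s_n^2 &= \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2
       = \frac{6.5}{6} \approx 1.08 \\[4pt]
s_{n-1}^2 &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
           = \frac{6.5}{5} = 1.3
\end{align*}
```

The numerator, 6.5, is the same in both versions; only the divisor changes.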
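
As a quick check of the arithmetic in the transcript, here is a minimal Python sketch. The variable and function names (`hours`, `sum_squared_deviations`, and so on) are made up for illustration. The last two lines compare against the standard library's `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n minus 1).

```python
import statistics

# Hours of TV watched by the six people in the sample (from the transcript).
hours = [1.5, 2.5, 4.0, 2.0, 1.0, 1.0]


def sum_squared_deviations(data):
    """Sum of squared distances between each data point and the mean of the data."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data)


n = len(hours)
sample_mean = sum(hours) / n                       # 12 / 6
squared_total = sum_squared_deviations(hours)      # the 6.5 in the numerator
variance_divide_by_n = squared_total / n           # 6.5 / 6
variance_divide_by_n_minus_1 = squared_total / (n - 1)  # 6.5 / 5

print(sample_mean)                       # 2.0
print(squared_total)                     # 6.5
print(round(variance_divide_by_n, 2))    # 1.08
print(variance_divide_by_n_minus_1)      # 1.3

# Cross-check against the standard library.
print(round(statistics.pvariance(hours), 2))  # 1.08 (divides by n)
print(round(statistics.variance(hours), 2))   # 1.3  (divides by n - 1)
```

Note that `statistics.variance` divides by n minus 1 by default, which matches the transcript's point that this is the more mainstream definition of sample variance.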
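
The claim that dividing by n tends to underestimate the population variance can also be explored experimentally. The sketch below is not the simulation from the video. It is a small illustration under made-up assumptions: a hypothetical population of 10,000 values drawn uniformly between 0 and 5 hours, samples of size 6 drawn with replacement, and a fixed random seed so the run is repeatable.

```python
import random


def sum_squared_deviations(data):
    """Sum of squared distances between each data point and the mean of the data."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data)


rng = random.Random(0)  # fixed seed (an arbitrary choice) for repeatability

# Hypothetical population: 10,000 made-up "hours of TV" values between 0 and 5.
population = [rng.uniform(0.0, 5.0) for _ in range(10_000)]
true_variance = sum_squared_deviations(population) / len(population)

n = 6            # sample size, as in the transcript
trials = 20_000  # number of repeated samples

total_divide_by_n = 0.0
total_divide_by_n_minus_1 = 0.0
for _ in range(trials):
    sample = rng.choices(population, k=n)  # sample with replacement
    squared_total = sum_squared_deviations(sample)
    total_divide_by_n += squared_total / n
    total_divide_by_n_minus_1 += squared_total / (n - 1)

avg_divide_by_n = total_divide_by_n / trials
avg_divide_by_n_minus_1 = total_divide_by_n_minus_1 / trials

print("population variance:        ", round(true_variance, 3))
print("average of divide-by-n:     ", round(avg_divide_by_n, 3))
print("average of divide-by-(n-1): ", round(avg_divide_by_n_minus_1, 3))
```

Averaged over many samples, the divide-by-n estimates should come out noticeably below the population variance, while the divide-by-(n minus 1) estimates should land close to it. Any single sample can still be far off either way.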
AP® is a registered trademark of the College Board, which has not reviewed this resource.