Sample variance
Thinking about how we can estimate the variance of a population by looking at the data in a sample.
- Let's say that you're curious about people's TV watching habits.
- And, in particular, how much TV do people in the country watch.
- So what you're concerned with is the entire country,
- and as we've already discussed, especially for a country like the United States, pretty much any country has a large population.
- In the United States we're talking about on the order of 300 million people.
- Ideally, if you could somehow magically do it, you'd survey or somehow observe all 300 million people and take the mean of how many hours of TV they watch on average.
- On a given day.
- And that will give you the parameter: the population mean.
- But as we've already discussed, in a case like this that's very impractical. Even if you tried to do it, by the time you finished your data might be stale.
- Some people might have passed away, other people might have been born. Who knows what might have happened?
- So there is a truth out there: there is a theoretical population mean for the average hours of TV watched per day
- by Americans.
- There is a truth here at any given point in time.
- It's just pretty much impossible to come up with the exact answer.
- Come up with this exact truth.
- But you don't give up. You say, "Well, maybe I don't have to survey or observe all 300 million. Instead I'm just going to observe a sample right over here."
- And let's say, just to make the computation simple, you take a sample of 6.
- We'll talk about later why 6 might not be as large of a sample as you'd like.
- But you survey how much TV these folks watch.
- And you find one person who watched 1 and a half hours, another who watched 2 and a half hours, and another who watched 4 hours.
- And then you get one person who watched 2 hours, and then you get 2 people who watched 1 hour each.
- So given this data from your sample, what do you get as your sample mean?
- Well, the sample mean, which we would denote by a lowercase x with a bar over it, is just the sum of all these divided by the number of data points we have.
- So let's see: we have 1.5 + 2.5 + 4 + 2 + 1 + 1.
- And all of that divided by six, which gives us, let's see.
- In the numerator, 1.5 + 2.5 is 4, plus 4 is 8, plus 2 is 10, plus 2 more is 12. So this is going to be 12 over 6,
- which is equal to 2 hours of television.
- So at least for your sample you say, "My sample mean is 2 hours of television." It's an estimate: a statistic that is trying to estimate this parameter, this thing that is very hard to know.
- But it's our best shot. Maybe we'll get a better answer if we have more data points, but it's what we have so far.
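The sample mean calculation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the video; the variable names `hours` and `sample_mean` are made up for this sketch.

```python
# Hours of TV watched per day by the six people in the sample.
hours = [1.5, 2.5, 4.0, 2.0, 1.0, 1.0]

# Sample mean: the sum of the data points divided by how many there are.
sample_mean = sum(hours) / len(hours)

print(sample_mean)  # 2.0
```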
- Now the next question you ask yourself is, "Well, I don't just want to estimate my population mean. I also want to estimate another parameter."
- "I am also interested in estimating my population variance."
- So once again, because we can't survey everyone in the population, this is pretty much impossible to know.
- But we're going to attempt to estimate this parameter. We attempted to estimate the mean; now we will also attempt to estimate this variance.
- So how would you do it?
- Well, reasonable logic would say, "Maybe we'll do the same thing to the sample as we would have done with the population."
- When you compute the population variance, you take each data point in the population and find the difference between it and the population mean.
- You take the square of that difference, add up the squares of all those differences, and then divide by the number of data points you have.
- So let's try that over here. Take each of these data points and find the difference
- (I'm going to do that in a different color)
- between that data point and our sample mean.
- Not our population mean; we don't know what the population mean is. The sample mean.
- So that's the first data point, 1.5 minus 2, squared, plus the second data point.
- So that's 4 minus 2 squared plus 1 minus 2 squared.
- And this is what you would have done if you were taking a population variance. If this were your population, this is how you would find your population variance.
- You would find the squared distance for each of those data points and then divide by the number of data points.
- Let's just think about this for a little bit.
- 1 minus 2, squared, then you have 2.5 minus 2, squared, 2 being the sample mean, plus 2 minus 2, squared, plus 1 minus 2, squared. And then maybe you would divide by the number of data points you have.
- In this case we have 6 data points, so we're dividing by 6.
- And what would we get in this circumstance?
- Well, if we just do the computation: 1.5 - 2 is -0.5, and we square that, so this becomes a positive 0.25. 4 - 2, squared, is 2 squared, so that's 4.
- 1 - 2, squared, is -1 squared, so that's 1. 2.5 - 2 is 0.5, which squared is 0.25. 2 - 2, squared, is just 0. And then 1 - 2, squared,
- is negative 1 squared, so we just get 1. If we add all of this up, let me add the whole numbers first:
- 4 + 1 is 5, plus 1 is 6, and we have two 0.25s, so this is going to be 6.5
- over this 6 right over here. There are a couple of ways we could write this, but I'll just get the calculator out and calculate it.
- So 6.5 divided by 6, if we round, is approximately 1.08.
- So this calculation gives approximately 1.08.
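As a rough check of the arithmetic just described, here is a minimal Python sketch that squares each difference from the sample mean and divides by the number of data points. The names `squared_diffs`, `sum_sq`, and `variance_div_n` are made up for this illustration.

```python
# The six observations from the sample, in hours of TV per day.
hours = [1.5, 2.5, 4.0, 2.0, 1.0, 1.0]
n = len(hours)
sample_mean = sum(hours) / n

# Squared difference between each data point and the sample mean.
squared_diffs = [(x - sample_mean) ** 2 for x in hours]
sum_sq = sum(squared_diffs)

# Divide by the number of data points, as you would for a population.
variance_div_n = sum_sq / n

print(sum_sq)                    # 6.5
print(round(variance_div_n, 2))  # 1.08
```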
- Now, what we have to think about is whether this is the best calculation, whether this is the best estimate for the population variance, given the data we have.
- You can always argue that we could have more data, but given the data we have, is this the best calculation we can make to estimate the population variance?
- And I'll have you think about that for a second.
- Well, it turns out this is close to the best calculation,
- the best estimate we can make given the data we have. Sometimes this will be called the sample variance.
- But it's a particular type of sample variance, where we just divide by the number of data points we have.
- So people will write an n over here. This is one way to define a sample variance in an attempt to estimate our population variance.
- But it turns out, and in the next video I'll give you an intuitive explanation of why it turns out this way, and then I'd also like to write a computer simulation so that experimentally it makes you feel a little better.
- But it turns out you're going to get a better estimate, and it's a little weird and voodoo-ish when you first think about it.
- You're going to get a better estimate for your population variance if you don't divide by 6.
- If you don't divide by the number of data points you have, but you divide by one less than the number of data points you have.
- So how would we do that? We can denote that as the sample variance.
- When most people talk about the sample variance, they're talking about the sample variance where you do this calculation, but instead of dividing by 6 you divide by 5.
- So they'd say you divide by n - 1.
- So what would we get in those circumstances? Well, the top part is going to be exactly the same: we're going to get 6.5. But in the denominator, our n is 6, since we have six data points, and we're going to divide by one less than six.
- We're going to divide by 5. And 6.5 divided by 5 is equal to 1.3.
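Here is a small Python sketch of the same calculation with n - 1 in the denominator. The hand-written version uses made-up names; the cross-check uses `statistics.variance` from Python's standard library, which also divides by n - 1.

```python
import statistics

hours = [1.5, 2.5, 4.0, 2.0, 1.0, 1.0]
n = len(hours)
sample_mean = sum(hours) / n

# Same numerator as before: the sum of squared differences from the sample mean.
sum_sq = sum((x - sample_mean) ** 2 for x in hours)

# Divide by one less than the number of data points.
variance_div_n_minus_1 = sum_sq / (n - 1)
print(variance_div_n_minus_1)  # 1.3

# Cross-check against the standard library's sample variance.
library_variance = statistics.variance(hours)
```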
- So when we calculate our sample variance by this technique, which is the more mainstream technique, we get 1.3. But it seems voodoo.
- Why are we dividing by n - 1, when for the population variance we'd divide by N?
- But remember, we're trying to estimate the population variance, and it turns out this is a better estimate.
- Because this calculation, the one where we divide by n, tends to underestimate what the population variance is.
- This is a better estimate. We don't know what the true value is. Both of these could be way off, just by chance, depending on what we happened to sample.
- Over many samples, and there's many ways to think about it, this is going to be a better calculation.
- It's going to give you a better estimate.
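The "over many samples" claim can be explored with a quick simulation. The sketch below is not the simulation from the video series; it is a minimal illustration under made-up assumptions: a hypothetical population in which daily TV hours are uniformly distributed between 0 and 6 (so the true variance is 6²/12 = 3), samples of size 6, and 20,000 repeated samples. The function name `variance_estimates` is also just for this sketch.

```python
import random

def variance_estimates(sample):
    """Return (divide-by-n, divide-by-(n-1)) variance estimates for one sample."""
    n = len(sample)
    mean = sum(sample) / n
    sum_sq = sum((x - mean) ** 2 for x in sample)
    return sum_sq / n, sum_sq / (n - 1)

# Assumption for illustration only: hours are uniform on [0, 6],
# so the true population variance is (6 - 0)**2 / 12 = 3.0.
TRUE_VARIANCE = 6 ** 2 / 12

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 20000
sample_size = 6

total_by_n = 0.0
total_by_n_minus_1 = 0.0
for _ in range(trials):
    sample = [rng.uniform(0, 6) for _ in range(sample_size)]
    by_n, by_n_minus_1 = variance_estimates(sample)
    total_by_n += by_n
    total_by_n_minus_1 += by_n_minus_1

# Average each estimator over all the samples.
avg_by_n = total_by_n / trials
avg_by_n_minus_1 = total_by_n_minus_1 / trials

print("true variance:          ", TRUE_VARIANCE)
print("average dividing by n:  ", avg_by_n)
print("average dividing by n-1:", avg_by_n_minus_1)
```

Under these assumptions, the divide-by-n average should land noticeably below 3 (near 2.5), while the divide-by-(n - 1) average should land close to 3.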
- So how would we write this down? How would we write this down with Mathematical notation?
- Well, remember, we're taking the sum over each of the data points, so we start with the first data point and go all the way to the nth data point.
- This lowercase n says, 'Hey, we're looking at the sample.' If I'd written uppercase N, that would usually denote that we're looking at the whole population.
- But here we're looking at a sample of size lower case n.
- And we're taking each data point. So each x sub i, and from it we're subtracting, we're subtracting the sample mean.
- And then we're squaring it, we're taking the sum of the square distances, and then we're dividing not by number of data points we have, but by one less than the number.
- Of data points we have.
- So this calculation, where we summed up all of this, then we divided by 5, not by 6.
- This is the standard definition of sample variance.
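Written out, the notation described here would look something like the following, where s² is a commonly used symbol for the sample variance (the symbol itself is an assumption; the transcript does not name it), x̄ is the sample mean, and n is the sample size:

$$
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
$$

For the six data points in this sample, that is 6.5 / 5 = 1.3.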
- So I'll leave you there, and in the next video I'll attempt to give you an intuition of why we're dividing by n - 1.