Review and intuition why we divide by n-1 for the unbiased sample variance
Reviewing the population mean, sample mean, population variance, and sample variance, and building an intuition for why we divide by n-1 for the unbiased sample variance
- What I want to do in this video is review much of what we've already talked about.
- And then hopefully build the intuition on why we divide by n-1
- if we want to have an unbiased estimate of the population variance
- when we're calculating the sample variance.
- So let's think about a population.
- So let's say this is the population right over here.
- And it is of size capital N.
- And we also have a sample of that population.
- And its size is lower case n data points.
- So let's talk about all the parameters and statistics
- we know about so far.
- So the first is the idea of the mean.
- So if we're trying to calculate the mean for the population.
- Is that going to be a parameter or a statistic?
- Well, when we're trying to calculate it on the population
- we are calculating a parameter.
- So let me write this down.
- So this is going to be, for the population
- It is a parameter.
- And when we calculate, when we attempt to calculate something for a sample
- we would call that a statistic.
- So how do we think about the mean for a population?
- Well, first of all, we denote it with the Greek
- letter mu.
- And we essentially take every data point in our population.
- So we take the sum of every data point.
- We start at the first data point
- and we go all the way to the capital Nth data point.
- For every data point we add up.
- So this is the ith data point:
- so x sub 1 plus x sub 2 all the way to x sub capital N.
- And then we divide by the total number of data points we have.
- Well, how do we calculate the sample mean?
- Well for the sample mean we do a very similar thing but for the sample.
- We denote it with an X with a bar over it.
- And that's going to be taking every data point in the sample
- so going up to lower case n
- adding them up
- the sum of all the data points in our sample
- then dividing by the number of data points that we actually had.
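The two formulas just described can be sketched in a few lines of code; the data values here are made up purely for illustration:

```python
# Population mean (mu): sum every data point, x_1 through x_N,
# then divide by capital N, the size of the population.
population = [2, 4, 4, 4, 5, 5, 7, 9]   # capital N = 8 (made-up data)
N = len(population)
mu = sum(population) / N                 # mu = 5.0

# Sample mean (x-bar): the same idea, but over the lower case n
# data points in the sample.
sample = [4, 5, 9]                       # lower case n = 3 (made-up data)
n = len(sample)
x_bar = sum(sample) / n                  # x_bar = 6.0
```

Note that mu is a parameter (computed from the whole population) while x-bar is a statistic (computed from the sample), even though the arithmetic is the same.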
- The other thing that we were trying to calculate for the population,
- which also was a parameter,
- and that we're also going to calculate for the sample
- and use to estimate for the population,
- was the variance.
- Which was a measure of how dispersed or how
- much the data points vary from the mean.
- So let's write variance:
- How do we denote and calculate variance for a population?
- well, for a population we'd say that the variance - we use the Greek letter sigma squared -
- is equal to the average of the squared distances from the population mean.
- But what we do is we take
- for each data point
- so i equal 1 all the way to N
- we take that data point, subtract from it the population mean
- so if you want to calculate this you'd want to
- figure this out. Well that's one way to do it
- we'll see there are other ways to do it
- where you can kind of calculate them at the same time
- but the easiest or the most intuitive is to calculate
- this first, then for each of the data points take the data point
- and subtract from that the mean, square it and then
- divide by the total number of data points we have.
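As a sketch, the two-pass calculation just described looks like this (same made-up population as before): first compute mu, then average the squared distances from it.

```python
# Population variance (sigma squared): for each data point,
# subtract the population mean, square the result, then
# divide the total by capital N.
population = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data
N = len(population)
mu = sum(population) / N                 # 5.0
sigma_squared = sum((x - mu) ** 2 for x in population) / N
# squared deviations: 9, 1, 1, 1, 0, 0, 4, 16 -> sum 32, / 8 = 4.0
```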
- Now we get to the interesting part:
- Sample variance.
- There's several ways, when people talk about the sample variance,
- there's several tools in their toolkits, there's several ways to calculate it.
- One way is the biased sample variance, which is not an unbiased estimator
- of the population variance, and that's denoted
- usually by an S squared with subscript n
- and what is the biased estimator?
- How do we calculate it?
- Well, we would calculate it very similarly to how we would calculate it over here.
- But we would do it for our sample not our population.
- So for every data point in our sample, so we have n of them
- we take that data point, from it we subtract our sample mean
- square it and then divide by the number of data points that we have.
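The biased sample variance just described can be sketched like this (sample values are made up):

```python
# Biased sample variance (s_n squared): squared deviations from
# the sample mean, divided by n itself.
sample = [4, 5, 9]                       # made-up sample, n = 3
n = len(sample)
x_bar = sum(sample) / n                  # 6.0
s_n_squared = sum((x - x_bar) ** 2 for x in sample) / n
# deviations -2, -1, 3 -> squares 4, 1, 9 -> sum 14, / 3 = 4.67 (approx.)
```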
- But we already talked about in the last video
- how we would find our best unbiased estimate of the population variance.
- We're trying to find an unbiased estimate of the population variance.
- Well in the last video we talked about
- if we want to have an unbiased estimate and
- here in this video I want to give you a sense, an intuition why
- we would take the sum, so we're going to go through
- every data point in our sample, we're going to take that data point
- subtract the sample mean, square that, but
- instead of dividing by n, we will divide by n minus 1.
- We're dividing by a smaller number and
- when you divide by a smaller number,
- you're going to get a larger value.
- So this is going to be larger
- this is going to be smaller.
- and this one we refer to as the unbiased estimate
- and this one we refer to as the biased estimate.
- If people just write this, they're talking about the sample variance,
- and it's a good idea to clarify which one they're talking about,
- but if you had to guess and people would give you no further information
- they're probably talking about the unbiased estimate.
- So you'd probably divide by n minus 1.
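Side by side, dividing the same sum of squared deviations by n - 1 instead of n always gives the larger value (same made-up sample as before):

```python
sample = [4, 5, 9]                       # made-up sample, n = 3
n = len(sample)
x_bar = sum(sample) / n
sum_sq = sum((x - x_bar) ** 2 for x in sample)   # 14.0

biased = sum_sq / n          # divide by n: 4.67 (approx.)
unbiased = sum_sq / (n - 1)  # divide by n - 1: 7.0, slightly larger
```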
- But let's think about why this estimate would be
- biased and why we might want to have an estimate like this
- that is larger.
- and maybe in the future we can have a computer program or something
- that really makes us feel better that dividing by n-1 gives us
- a better estimate of the true population variance.
- So let's imagine all of the data in a population, and I'm just going to plot them on
- a number line. So this is my number line. And let me plot all of the data points in my population.
- So this is some data, this is some data, here is some data and here is some data here
- and I can just do as many points as I want.
- So these are just points on the number line. Now let's say I take
- a sample of this. So this is my entire population, so let's
- see, I have 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
- so in this case what would be my big N?
- my big N would be 14.
- Now let's say I take a sample. A lower case n of, let's say my sample size is 3.
- I could take... before I even think about that, let's
- think about roughly where the mean of this population would sit.
- So the way I drew it, I'm not going to calculate it exactly, it looks like
- the mean might sit someplace roughly right over here.
- So the mean, the true population mean, the parameter is going to sit
- right over here. Now let's think what happens when we sample
- and I'm going to do just a very small sample size just to give us an intuition, but
- this is true of any sample size. So let's say we have a sample size of 3.
- So there is some possibility that when we take our sample size of 3 that
- we happen to sample in a way that our sample mean is pretty close to our population mean.
- So e.g. if we sample that point, that point and that point I could imagine our sample mean
- might actually sit pretty close to our population mean.
- But there's a distinct possibility that maybe when I take
- a sample, I sample that, that and that. And the key
- idea here is that when you take a sample, your sample mean is always going to sit within your sample.
- So there is a possibility that when you take your sample, the true population mean could even be outside of the
- sample. And so in this situation, and this is just to give you an intuition,
- your sample mean is going to be sitting someplace in there.
- And so if you were to just calculate the distance from
- each of these points to the sample mean, so
- this distance, that distance, and you square it and
- then divide by the number of data points you have,
- this is going to be a much lower estimate than the true variance from the actual population mean.
- Where these things are much, much, much further.
- Now you're not always going to have the true population mean outside of your sample,
- but it's possible you do.
- So in general, if you just take your points,
- find the squared distance to the sample mean, which
- is always going to sit inside of your data, even though
- the true population mean could be outside of it,
- or it could be at one end of your data, however you might want to think about it,
- then you are likely to be underestimating the true population variance.
- So this right over here is an underestimate.
- And it does turn out that if you just instead of
- dividing by n divide by n-1
- you'll get a slightly larger sample variance
- and this is an unbiased estimate.
- And in the next video, which I might not get to immediately,
- I would like to generate some type of computer program that is more convincing
- that this is a better estimate
- of the population variance than this is.
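In that spirit, here is one possible version of such a program: repeatedly draw small samples from a fixed population and compare the long-run average of the divide-by-n and divide-by-(n-1) estimates against the true population variance. The population values and sample size below are made up for illustration.

```python
import random

random.seed(0)  # make the simulation reproducible

population = list(range(1, 15))   # N = 14 data points, as in the example
N = len(population)
mu = sum(population) / N
true_variance = sum((x - mu) ** 2 for x in population) / N   # 16.25

n, trials = 3, 100_000
biased_total = 0.0
unbiased_total = 0.0
for _ in range(trials):
    sample = [random.choice(population) for _ in range(n)]
    x_bar = sum(sample) / n
    sum_sq = sum((x - x_bar) ** 2 for x in sample)
    biased_total += sum_sq / n          # divide by n
    unbiased_total += sum_sq / (n - 1)  # divide by n - 1

# On average, dividing by n lands well below the true variance,
# while dividing by n - 1 lands very close to it.
print(true_variance, biased_total / trials, unbiased_total / trials)
```

With these made-up numbers, the divide-by-n average comes out near two-thirds of the true variance (since on average it is smaller by a factor of (n-1)/n), while the divide-by-(n-1) average hovers right around the true value.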