# Sample variance

Thinking about how we can estimate the variance of a population by looking at the data in a sample. Created by Sal Khan.

Video transcript

Let's say that you're
curious about people's TV watching habits. And in particular, how much TV
do people in the country watch? So what you are concerned
with, if we imagine the entire country-- and
we've already talked about-- especially if we're talking
about a country like the United States, but pretty
much any country, is a very large population. In the United
States, we're talking about on the order of
300 million people. So ideally, if you could
somehow magically do it, you would survey or somehow
observe all 300 million people and take the mean of
how many hours of TV they watch on a given day. And then that will give you the
parameter, the population mean. But we've already
talked about how, in a case like this, that's
very impractical. Even if you tried to do
it, by the time you did it, your data might be stale because
some people might have passed away, other people
might have been born. Who knows what
might have happened. And so this is a truth
that is out there. There is a theoretical
population mean for the number of
hours of TV watched per
day by Americans. There is a truth here at
any given point in time. It's just pretty much
impossible to come up with the exact answer, to
come up with this exact truth. But you don't give up. You say, well, maybe I don't
have to survey all 300 million or observe all 300 million. Instead, I'm just going
to observe a sample, right over here. And let's say, to make
the computation simple, you do a sample of six. And we'll talk about
later why six might not be as large of a sample
as you would like. But you survey how much
TV these folks watch. And you find one person who
watched 1 and 1/2 hours. Another person watched
2 and 1/2 hours. Another person watched 4 hours. And then you get one
person who watched 2 hours. And you get two people
who watched 1 hour each. So given this data
from your sample, what do you get as
your sample mean? Well, the sample mean, which
we would denote by lowercase x with a bar over
it, is just the sum of all of these divided by the
number of data points we have. So let's see, we have 1.5
plus 2.5 plus 4 plus 2 plus 1 plus 1. And all of that
divided by 6, which gives-- let's see, in the numerator,
1.5 plus 2.5 is 4, plus 4 is 8, plus 2 is 10, plus 1 plus 1 is 12. So it's going to
be 12 over 6, which is equal to 2 hours
of television. So at least for your
sample, you say, my sample mean is two
hours of television. It's an estimate. It's a statistic that
is trying to estimate this parameter, this thing
that's very hard to know. But it's our best shot. Maybe we get a better answer
if we get more data points. But this is what we have so far. Now the next question
you ask yourself is, well, I don't want to just
estimate my population mean. I also want to estimate
another parameter. I also am interested in
estimating my population variance. So once again, since
we can't survey everyone in the
population, this is pretty much
impossible to know. But we're going to attempt to
estimate this parameter. We attempted to
estimate the mean. Now we will also
attempt to estimate this parameter, this
variance parameter. So how would you do it? Well, reasonable logic would
say, well, maybe we'll do the same thing
with a sample as we would have done
with the population. When you're doing the
population variance, you would take each data
point in the population, find the distance between that
and the population mean, take the square of
that difference, and then add up all the
squares of those differences, and then divide by the number
of data points you have. So let's try that over here. So let's try to find-- take
each of these data points, and find the difference--
let me do that in a different color--
each of these data points, and find the difference
between that data point and our sample mean--
not the population mean, we don't know what
the population mean is-- the sample mean. So for that first data
point, it's 1.5 minus 2 squared, plus the second data point-- so it's 4 minus 2
squared-- plus 1 minus 2 squared. And this is what
you would have done if you were taking a
population variance. If this was your
entire population, this is how you would
find a population variance here, if this was your
entire population. And you find the
squared distances from each of those data
points and then divide by the number of data points. So let's just think
about this a little bit. 1 minus 2 squared. Then you have 2.5
minus 2-- 2 being the sample mean-- squared. Let me see, this green color. Plus 2 minus 2 squared. Plus 1 minus 2 squared. And then maybe you would divide
by the number of data points that you have.
So in this case,
we're dividing by 6. And what would we get
in this circumstance? Well, if we just do the
computation, 1.5 minus 2 is negative 0.5. We square that. This becomes a positive 0.25. 4 minus 2 squared is going
to be 2 squared, which is 4. 1 minus 2 squared--
well, that's negative 1 squared, which is just 1. 2.5 minus 2 is 0.5
squared, is 0.25. 2 minus 2 squared--
well, that's just 0. And then 1 minus 2 squared is
1, it's negative 1 squared. So we just get 1. And if we add all
of this up-- let me add the whole numbers first. 4 plus 1 is 5, plus 1 is 6,
and then we have two 0.25s. So this is going to
be equal to 6.5-- let me write this
in a neutral color. So this is going to be 6.5
over this 6 right over here. Well, there's a couple of
ways we could write this, but I'll just get
the calculator out and we can just calculate it. So 6.5 divided by 6
gets us-- if we round, it's approximately 1.08. So this calculation
is approximately 1.08. Now what we have
to think about is whether this is the best
calculation, whether this is the best estimate for the
population variance, given the data that we have. You can always argue that
we could have more data. But given the data we have,
is this the best calculation that we can make to estimate
the population variance? And I'll have you think
about that for a second. Well, it turns out
that this is close, this is close to the best
calculation, the best estimate that we can make,
given the data we have. And sometimes this will be
called the sample variance. But it's a particular
type of sample variance where we just divide by the
number of data points we have. And so people will write
just an n over here. So this is one way to define a
sample variance in an attempt to estimate our
population variance. But it turns out--
and in the next video I'll give you an
intuitive explanation of why it turns out this way. And then I would also like to
write a computer simulation that, at least
experimentally, makes you feel a little bit better. But it turns out, you're going
to get a better estimate-- and it's a little bit weird
and voodooish at first when you first think
about it-- you're going to get a better estimate
for your population variance if you don't divide by
6, if you don't divide by the number of
data points you have but you divide by one less
than the number of data points you have. So how would we do that? And we can denote that
as sample variance. So when most people talk
about the sample variance, they're talking about
the sample variance where you do this calculation,
but instead of dividing by 6 you were to divide by 5. You would divide by 5. So they would say you
divide by n minus 1. So what would we get
in those circumstances? Well, the top part is going
to be the exact same thing. We're going to get 6.5. But then our
denominator, our n is 6. We have 6 data points. But we're going to
divide by 1 less than 6. We're going to divide by 5. And 6.5 divided by
5 is equal to 1.3. So when we calculate our sample
variance with this technique, which is the more
mainstream technique, we get 1.3. And it seems like voodoo. Why are we dividing
by n minus 1, whereas for a population
variance we divide by n? But remember we're trying
to estimate the population variance. And it turns out that
this is a better estimate. Because dividing by n
tends to underestimate what the population variance
is, dividing by n minus 1 is a better estimate. We don't know for
sure what it is. These both could be way off. It could be just by chance
what we happen to sample. But over many samples--
and there's many ways to think about
it-- this is going to be a better calculation. It's going to give
you a better estimate. And so how would
we write this down? How would we write this down
with mathematical notation? Well, remember,
we're taking the sum. And we're taking each
of the data points. So we'll start with
the first data point all the way to the
nth data point. This lowercase n says that, hey,
we're looking at the sample. If I have an uppercase
N, that usually denotes that we're
trying to sum up everything in the population. Here we're looking at a
sample of size lowercase n. And we're taking each data
point, so each x sub i, and from it we're
subtracting the sample mean. And then we're squaring it. We're taking the sum of
the squared distances. And then we're dividing, not
by the number of data points we have, but by 1 less than the
number of data points we have. So this calculation, where
we just summed up all of this and then we divided
by 5, not by 6, this is the standard
definition of sample variance. So I'll leave you there. In the next video,
I will attempt to give you an intuition of
why we're dividing by n minus 1 instead of dividing by n.
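
The arithmetic worked through above can be checked with a short script. What follows is a minimal Python sketch, not something shown in the video. The function names (`sample_mean`, `sum_of_squared_deviations`, `variance_divide_by_n`, `variance_divide_by_n_minus_1`) are made up for illustration, and the six values are the ones from the sample of TV-watching hours. It computes the sample mean, then the sum of squared distances from that mean, and then divides that sum by n and by n minus 1 so the two estimates can be compared side by side.

```python
import statistics

# The six observations from the sample in the transcript (hours of TV per day).
hours = [1.5, 4, 1, 2.5, 2, 1]


def sample_mean(data):
    """Sum of the data points divided by the number of data points."""
    return sum(data) / len(data)


def sum_of_squared_deviations(data):
    """Add up (x_i - sample mean) squared over every data point."""
    mean = sample_mean(data)
    return sum((x - mean) ** 2 for x in data)


def variance_divide_by_n(data):
    """Divide by n: the first estimate tried in the transcript."""
    return sum_of_squared_deviations(data) / len(data)


def variance_divide_by_n_minus_1(data):
    """Divide by n - 1: the standard definition of sample variance."""
    return sum_of_squared_deviations(data) / (len(data) - 1)


print(sample_mean(hours))                      # 2.0
print(sum_of_squared_deviations(hours))        # 6.5
print(round(variance_divide_by_n(hours), 2))   # 1.08
print(variance_divide_by_n_minus_1(hours))     # 1.3

# Cross-check against Python's standard library.
print(statistics.pvariance(hours))  # divides by n
print(statistics.variance(hours))   # divides by n - 1
```

Python's `statistics` module provides both versions: `pvariance` divides by the number of data points and `variance` divides by one less, so the two should agree with the hand-written functions.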