# Statistics: Sample variance

Video transcript

This video, here, is a
groundbreaking video, for multiple reasons. One, I'm going to introduce
you to the variance of a sample, which is
interesting in its own right. And I'm attempting to
record this video in HD, and hopefully you can see it
bigger and clearer than ever before. But we'll see how
all of that goes. So this is a bit of an
experiment, so bear with me. But so, just before we go
into the variance of a sample, I think it's
instructive to review the variance of the population, and
we can compare their formulas. The variance of a
population-- and it's this Greek letter, sigma. Lowercase sigma squared. That means variance. I know it's weird that
a variable already has a squared in it. You're not squaring
the variable. This is the variable. Sigma squared means variance. Actually, let me
write that down. That equals variance. And that is equal to--
you take each data point-- and we'll call them x sub i. You take each data
point, find out how far it is from the
mean of the population-- the mean of the population. You square it, and then you take
the average of all of those. So you take the average. You sum them all up. You go from i is equal to 1. So from the first point all
the way to the N-th point. And then to average,
you sum them all up, and then you divide by N. So the variance is the average
of the squared distances of each point from the mean. And just to give you
the intuition again, it essentially says, on
average, roughly how far away are each of the points
from the middle? That's the best way to
think about the variance. Now what if we're dealing-- this
was for a population, right? And we said, if we
wanted to figure out the variance of men's
heights in the country, it would be very
hard to figure out the variance for the population. You would have to go
and, essentially, measure everyone's height--
250 million people. Or what if it's for some
population where it's just completely impossible
to have the data, or some random variable? And we'll go more
into that later. So a lot of times,
you actually want to estimate this
variance by taking the variance of a sample. Same way that you could never
get the mean of a population, but maybe you want
to estimate it by getting the mean of a sample. And we learned that
in that first video. This is-- if that's the
whole population, that's millions of data points-- or
even data points in the future that you'll never
be able to get, because it's a random variable. So this is the population. You might just want to estimate
things by looking at a sample. And this is actually what
most of inferential statistics is all about. Figuring out descriptive
statistics about the sample, and making inferences
about the population. Let me try this
drug on 100 people. And if it seems to have
statistically significant results, this drug
will probably work on the population as a whole. So that's what it's all about. So it's really
important to understand this notion of a sample
versus a population, and being able to
find statistics on a sample that,
for the most part, can describe the population or
help us estimate-- they call them parameters for
the population. So what's the mean of a-- let
me rewrite these definitions. What's the mean of a population? I'll do it in that purple--
purple for population. The mean of a
population, you just take each of the data points. So you take each of the data
points in the population-- xi. You sum them up. You start with the
first data point, and you go all the way
to the N-th data point, and you divide by N.
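Written out in symbols, the population mean just described, and the population variance from earlier in the video, would look roughly like this. The notation is reconstructed from the spoken description (x sub i for each data point, capital N for the size of the population), so treat it as a sketch of what is on screen rather than an exact copy:

$$
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad\qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} \left(x_i - \mu\right)^2
$$
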
You sum them all up and divide them by
N. That's the mean. So then you plug it
into this formula, and you just can see
how far each point is from that central
point, from that mean. And you get the variance. Now what happens if
we do it for a sample? Well, if we want to estimate
the mean of a population by somehow calculating a mean
for a sample, the best thing I can think of--
and really these are kind of engineered formulas. These are human
beings saying, well, what is the best
way to sample it? Well, all we can do is really
take an average of our sample. And that's the sample mean. And we learned in
the first video that notation-- the formula's
almost identical to this. It's just the
notation is different. Instead of writing mu, you
write x with a line over it. Sample mean is equal to-- Once again, you take each of the
data points now in the sample, not in the whole population. You sum them up, from the first
one and then to the n-th one, right? They're saying that there are
n data points in this sample. And then you divide it by the
number of data points you have. Fair enough. It's really the same formula. The way I took the
mean of a population, I said, well, if I
just have a sample, let me just take the
mean the same way. And it might-- it's
probably a good estimate of the mean of the population. Now, it gets interesting
when we talk about variance. So your natural reaction
is, OK, I have this sample. If I want to estimate the
variance of the population, why don't I just apply this
same formula, essentially, to the sample? So I could say-- and this is
actually a sample variance. They use the notation s squared. So sigma is kind of the
Greek-letter equivalent of s. So now when we're
dealing with a sample, we just write the s there. So this is sample variance. Let me write that down. Sample variance. So we might just say,
well, maybe a good way to take the sample variance
is, do it the same way. Let's take the distance of each
of the points in the sample, find out how far it is
from our sample mean. Right here, we used
the population mean. But now we'll just
use the sample mean, because that's
all we can have. We don't know what
the population mean is without looking at
the whole population. Take the square of that. That makes it positive. It has other properties
which we'll go over later. And then take the average of
all of these squared distances. So you take it from--
you sum them all up. And there's n of them to
sum up, right-- lowercase n. And you divide by lowercase n. You say, well, you know,
this is a good estimate. Whatever this variance
is, that might be a good estimate for
the population as a whole. And actually this is what
some people often refer to, when they talk about
sample variance. And sometimes it'll actually
be referred to as this. They'll put a little
lowercase n there. And the reason why I do that
is because we divided by n. So you say, Sal, what's
the problem here? And then the problem--
and I'll give you the intuition, because
this is actually something that used
to boggle my mind. And I'm still,
frankly, struggling with the intuition behind it. Well, I have the
intuition, but more of kind of rigorously
proving it to myself, that this is
definitely the case. But think about this. If I have a bunch of numbers--
and I'll draw a number line, here. If I draw a number
line here-- let's say I have a bunch of
numbers in my population. So let's say-- I'm just
going to randomly put a bunch of numbers
in my population. And the ones to the right
are bigger than the ones to the left. And if I were to take a
sample of them, right? Maybe I take-- and the
sample, it's random. You actually want to
take a random sample. You don't want to be
skewed in any way. So maybe I take this one, this
one, this one, and that one, right? And then if I were to take
the mean of that number, that number, that
number, and that number, it'll be someplace
in the middle. It might be
someplace over there. And then, if I wanted to figure
out the sample variance using this formula, I'd
say, OK, this distance squared plus this distance
squared plus this distance squared plus that distance
squared, and average them all out. And then I would
get this number. And that probably would be
a pretty good approximation for the variance of
this entire population. The mean of
the population is probably going to be-- I don't know, it
might be pretty close to this. If we actually took all of the
data points and averaged them, maybe they're, like,
here someplace. And then, if you figure
out the variance, it probably would be pretty
close to the average of all of these lines, right, of the
sample variance distances. Fair enough. So you say, hey, Sal, this
looks pretty good now. But there's one little catch. What if-- I mean, there's always
a probability that, instead of picking these fairly
well-distributed numbers in my sample, what
if I happened to pick this number, this number, and
that number, and, let's say, that number, as my sample? Well, whatever your sample
is, your sample mean's always going to be in
the middle of it, right? So in this case, your sample
mean might be right here. So all of these numbers,
you might say, OK, this number's not too
far from that number. That number's not too far. And then that
number's not too far. So your sample variance,
when you do it this way, it might turn out
a little bit low. Because all of these numbers,
they're almost, by definition, going to be pretty close
to the mean of each other. But in this case, your
sample is kind of skewed, and the actual mean
of the population is out here someplace. So the actual variance
of the sample, if you had actually
known the mean-- I know this is all a little
confusing-- if you had actually known the population mean, you
would've said, oh, wow. You would have found
these distances, which would have
been a lot more. The whole point
of what I'm saying is, when you take
a sample, there's some chance that
your sample mean is pretty close to the
population mean. Maybe your sample mean is
here, and your population mean is here. And then this formula
would probably work out pretty well, at least
given your sample data points, of figuring out what
the variance is. But there's a reasonable
chance that your sample mean-- your sample
mean is always going to be within your
data sample, right? It's always going to be the
center of your data sample. But it's completely possible
that the population mean is outside of your data sample. It might have just
been-- you know, you just happened
to pick ones that don't contain the
actual population mean. And then this sample
variance, calculated this way, will actually underestimate
the actual population variance, because they're always going
to be closer to their own mean than they are to
the population mean. And if you're understanding,
frankly, even 10% of this, you are a very advanced
statistics student. But I'm saying all of
this to just give you, hopefully, some
intuition to realize that this will
often underestimate. This formula will
often underestimate the actual population variance. And there's a formula--
and this is actually proven more rigorously
than I'll do it-- that is considered to be
a better-- or, they'll call it, unbiased-- estimate of
the population variance, or the unbiased sample variance. And sometimes it's just
denoted by the s squared again. Sometimes it's denoted by
this-- s n minus 1 squared-- and I'll show you why. It's almost the same thing. You take each of
the data points, figure out how far they
are from the sample mean, you square them, and then
you take the average of those squared distances, except for
one slight difference. i equals 1 to i equals n. Instead of dividing
by n, you divide by a slightly smaller number. You divide by n minus 1. So when you divide by n minus
1 instead of dividing by n, you're going to get a
slightly larger number here. And it turns out
that this is actually a much better estimate. And one day, I'm going to
write a computer program to at least prove it to
myself, experimentally, that this is a better estimate
of the population variance. And you would calculate
it the same way. You just divide by n minus 1. The other way to think about
it-- and actually-- no, no. I'm all out of time. I'll leave you there, now,
and then, in the next video, we'll do a couple
calculations, just so you don't get too
overwhelmed with these ideas, because we're getting
a little bit abstract. See you in the next video.
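
The experiment mentioned near the end of the video (a computer program to check that dividing by n minus 1 gives a better estimate than dividing by n) could be sketched roughly as follows. This is only an illustration under made-up assumptions, not a program from the video: the population is 10,000 simulated heights, each sample has 5 points, and the function names (`variance_dividing_by_n`, `variance_dividing_by_n_minus_1`, `average_estimates`) are hypothetical.

```python
import random


def variance_dividing_by_n(values):
    """Average squared distance from the mean of `values`, dividing by n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def variance_dividing_by_n_minus_1(values):
    """Same sum of squared distances, but divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def average_estimates(population, sample_size, trials, seed=0):
    """Draw many random samples and average both estimates over all of them."""
    rng = random.Random(seed)
    total_n = 0.0
    total_n_minus_1 = 0.0
    for _ in range(trials):
        # choices() draws with replacement, so every draw is independent.
        sample = rng.choices(population, k=sample_size)
        total_n += variance_dividing_by_n(sample)
        total_n_minus_1 += variance_dividing_by_n_minus_1(sample)
    return total_n / trials, total_n_minus_1 / trials


# Made-up population (an assumption for this sketch): 10,000 "heights"
# centered near 170 with a spread near 10.
population_rng = random.Random(42)
population = [population_rng.gauss(170, 10) for _ in range(10_000)]

# Over the whole population, dividing by N gives the population variance itself.
population_variance = variance_dividing_by_n(population)

avg_n, avg_n_minus_1 = average_estimates(population, sample_size=5, trials=20_000)

print(f"population variance:           {population_variance:.2f}")
print(f"average estimate, divide by n: {avg_n:.2f}")
print(f"average estimate, divide n-1:  {avg_n_minus_1:.2f}")
```

With samples this small, the divide-by-n average should land noticeably below the population variance (in expectation, about (n - 1)/n of it), while the divide-by-(n - 1) average should land close to it. Changing `sample_size` shows how the gap shrinks as the sample grows.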