# Sample standard deviation and bias

Sal shows an example of calculating the sample standard deviation and explains why it is a biased estimate. Created by Sal Khan.

Video transcript

Let's say that you're
a watermelon farmer, and you want to study
how dense the seeds are in your watermelons. Perhaps you want to do this
because over time, you're trying to breed watermelons
that have fewer seeds, and you want to see whether you
are actually making progress. And you don't want to
cut open every watermelon in your watermelon farm
or patch or whatever it might be called, because
you want to sell most of them. You just want to sample
a few watermelons, and then take samples
of those watermelons to figure out how dense the
seeds are, and hope that you can calculate statistics
on those samples that are decent estimates of the
parameters for the population. So let's start doing that. So let's say that you take these
little cubic inch chunks out of a random sample
of your watermelons. And then you count the
number of seeds in them. And you have 8
samples like this. So in one of them,
you found 4 seeds. In the next, you found
3, 5, 7, 2, 9, 11, and 7. So this is a
sample, just to make sure we're visualizing it right. If this is the population
of all of the chunks-- I guess we could view
this as a cubic inch-- the cubic inch chunks in
my entire watermelon farm, I'm sampling a very
small sample of them. Maybe I could have had
a million over here. A million chunks
of watermelon could have been produced from
my farm, but I'm only sampling-- so capital
N would be 1 million, lowercase n is equal to 8. And once again, you might
want to have more samples, but this'll make our math easy. Now, let's think about what
statistics we can measure. Well, the first one
that we often do is a measure of
central tendency. And that's the arithmetic mean. But here, we're trying to
estimate the population mean by coming up with
the sample mean. So what is the sample
mean going to be? Well, all we have to do
is add up these points, add up these measurements,
and then divide by the number of
measurements we have. So let's get our
calculator out for that. Actually, maybe I don't
need my calculator. Let's see. So 4 plus 3 is 7. 7 plus 5 is 12. 12 plus 7 is 19. 19 plus 2 is 21, plus 9 is 30,
plus 11 is 41, plus 7 is 48. So I'm going to get
48 over 8 data points. So this worked out quite well. 48 divided by 8 is equal to 6. So our sample mean is 6. It's our estimate of what
the population mean might be. But we also want to estimate
how much spread there is in our population, or how
much our measurements
vary from this mean. So there, we say, well, we can
try to estimate the population variance by calculating
the sample variance. And we're going to calculate
the unbiased sample variance. Hopefully, we're fairly
convinced at this point why we divide by n minus 1. So we're going to calculate
the unbiased sample variance. And if we do that,
what do we get? I'll do this in a
different color. It's going to be 4 minus 6
squared plus 3 minus 6 squared plus 5 minus 6 squared
plus 7 minus 6 squared plus 2 minus 6 squared
plus 9 minus 6 squared plus 11 minus 6 squared plus
7 minus 6 squared, all of that divided by-- not by 8. Remember, we want the
unbiased sample variance. We're going to divide
it by 8 minus 1. So we're going to divide by 7. Let me give myself a little
bit more real estate. The unbiased sample
variance-- and I could even denote it by this to
make it clear that we're dividing by lowercase
n minus 1-- is going to be equal to-- let's see,
4 minus 6 is negative 2. That squared is positive 4. So I did that one. 3 minus 6 is negative 3. That squared is going to be 9. 5 minus 6 squared is
negative 1 squared, which is 1. 7 minus 6 squared is
1 squared, which is once again 1. 2 minus 6, negative
4 squared is 16. 9 minus 6 squared, well,
that's going to be 9. 11 minus 6 squared, that is 25. And then finally, 7 minus 6
squared, that's another 1. And we're going
to divide it by 7. Let's see if we can add
this up in our heads. 4 plus 9 is 13, plus 1 is
14, 15, 31, 40, 65, 66. So this is going to
be equal to 66 over 7. And we could write
that as 9 and 3/7. Or if we want to write
that as a decimal, I could just take
66 divided by 7 gives us 9 point--
I'll just round it. So it's approximately 9.43. Now, that gave us our
unbiased sample variance. Well, how could we calculate
a sample standard deviation? We want to somehow get an
estimate of what the population standard deviation might be. Well, the logic, I guess,
is reasonable to say, well, this is our unbiased
sample variance. It's our best estimate of
what the true population variance is. When we think about
population parameters to get the population
standard deviation, we just take the square root
of the population variance. So if we want to get an
estimate of the population standard deviation from our sample, why
don't we just take the square root of the
unbiased sample variance? So that's what we'll do. So we'll define it that way. We'll call it the sample
standard deviation. We're going to define it to
be equal to the square root of the unbiased sample variance. It's going to be the square
root of this quantity, and we can take
our calculator out. It's going to be the square
root of what I just typed in. I can do 2nd answer. It'll be the last entry here. So the square root of that
is-- and I'll just round. It's approximately
equal to 3.07. Now, I'm going to
tell you something very counterintuitive. Or at least initially
it's counterintuitive, but hopefully you'll
appreciate this over time. This we've already talked
about in some depth. People have even
created simulations to show that this is an unbiased
estimate of population variance when we divide it by n minus 1. And that's a good
starting point if we're going to take the
square root of anything. But let's look at
this sample standard deviation, and
this is how it tends to be defined. It is
the square root of our unbiased sample variance, so the square root of the sum from
i equals 1 to n of each squared distance from the sample mean,
divided by n minus 1. This is how we literally define
our sample standard deviation. Because the square root
function is nonlinear, it turns out that this is
not an unbiased estimate of the true population
standard deviation. And I encourage people to
make simulations of that if they're interested. But then you might say, well,
we went through great pains to divide by n minus
1 here in order to get an unbiased estimate
of the population variance. Why don't we go
through similar pains and somehow figure out a
formula for an unbiased estimate of the population
standard deviation? And the reason why
that's difficult is that to unbias the
sample variance, we just have to divide by
n minus 1 instead of n. And that works for any
probability distribution for our population. It turns out that
doing the same thing for the standard deviation is not that easy. The correction actually depends on how
that population is distributed. So in statistics, we just define
the sample standard deviation. And the one that
we typically use is based on the square root of
the unbiased sample variance. But when you take
that square root, it does give you a
biased result when you're trying to use this
to estimate the population standard deviation. But it's the simplest,
best tool we have.
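
If you'd like to check these numbers yourself, or try the kind of simulation mentioned above, here is a minimal Python sketch. The first part redoes the watermelon calculation with the eight seed counts from the video. The second part is only an illustration under assumptions that are not in the video: it draws many samples of size 8 from a normal population whose true standard deviation is 1, with an arbitrary trial count and random seed. The function names `unbiased_variance`, `sample_sd`, and `average_estimates` are made up for this example.

```python
import math
import random

# Seed counts from the eight one-cubic-inch chunks in the video.
seeds = [4, 3, 5, 7, 2, 9, 11, 7]


def unbiased_variance(data):
    """Sum of squared distances from the sample mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


def sample_sd(data):
    """Sample standard deviation: square root of the unbiased sample variance."""
    return math.sqrt(unbiased_variance(data))


sample_mean = sum(seeds) / len(seeds)  # 48 / 8
print(sample_mean, round(unbiased_variance(seeds), 2), round(sample_sd(seeds), 2))
# prints: 6.0 9.43 3.07


def average_estimates(n, trials, sigma=1.0, seed=0):
    """Average the variance and standard deviation estimates over many samples.

    Assumption for illustration only: the population is Normal(0, sigma).
    """
    rng = random.Random(seed)
    total_var = 0.0
    total_sd = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        total_var += unbiased_variance(sample)
        total_sd += sample_sd(sample)
    return total_var / trials, total_sd / trials


mean_var, mean_sd = average_estimates(n=8, trials=20000)
print("average unbiased variance:", mean_var)  # should land close to sigma**2 = 1
print("average sample std dev:", mean_sd)      # tends to land below sigma = 1
```

In this setup, the averaged variance estimates should sit close to the true variance of 1, while the averaged sample standard deviations tend to come out a little below the true standard deviation of 1. That gap is the bias introduced by taking the square root, and how big it is depends on the sample size and on how the population is distributed.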