# Small sample size confidence intervals

Constructing small sample size confidence intervals using t-distributions. Created by Sal Khan.

Video transcript

7 patients' blood pressures
have been measured after having been given a new
drug for 3 months. They had blood pressure
increases of, and they give us seven data points right here--
who knows, that's in some blood pressure units. Construct a 95% confidence
interval for the true expected blood pressure increase for all
patients in a population. So there's some population
distribution here. It's a reasonable assumption
to think that it is normal. It's a biological process. So if you gave this drug to
every person who has ever lived, that will result in some
mean increase in blood pressure, or who knows, maybe
it actually will decrease. And there's also going to be
some standard deviation here. It is a normal distribution. And the reason why it's
reasonable to assume that it's a normal distribution
is because it's a biological process. It's going to be the sum of many
thousands and millions of random events. And things that are sums of
millions and thousands of random events tend to be
normally distributed. So this is a population
distribution. And we don't know anything
really about it outside of the sample that we have here. Now, what we can do is, and this
tends to be a good thing to do, when you do have a
sample just figure out everything that you can
figure out about that sample from the get-go. So we have our seven
data points. And you could add them up and
divide by 7 and get your sample mean. So our sample mean
here is 2.34. And then you can also
calculate your sample standard deviation. Find the squared distance from
each of these points to your sample mean, add them up, divide
by n minus 1, because it's a sample, then take the
square root, and you get your sample standard deviation. I did this ahead of time
just to save time. Sample standard deviation
is 1.04. And when you don't know anything
about the population distribution, the thing that
we've been doing from the get-go is estimating that
parameter with our sample standard deviation. So we've been estimating the
true standard deviation of the population with our sample
standard deviation. Now in this problem, this exact
problem, we're going to run into a problem. We're estimating our standard
deviation with an n of only 7. So this is probably going to
be a not so good estimate because-- let me just write--
because n is small. As a rule of thumb, this is considered
a bad estimate if n is less than 30. Above 30 you're dealing
in the realm of pretty good estimates. So the whole focus of this video
is when we think about the sampling distribution, which
is what we're going to use to generate our interval,
instead of assuming that the sampling distribution is normal
like we did in many other videos using the central
limit theorem and all of that, we're going to tweak the
sampling distribution. We're not going to assume it's
a normal distribution because this is a bad estimate. We're going to assume that
it's something called a t-distribution. And the best way to
think about a t-distribution is that it's almost
engineered so it gives a better estimate of your
confidence intervals and all of that when you do have
a small sample size. It looks very similar to
a normal distribution. It has some mean, so this is
your mean of your sampling distribution still. But it also has fatter tails. And the way I think about why
it has fatter tails is when you make an assumption that this
is a standard deviation for-- let me take
one more step. So normally what we do is we
find the estimate of the true standard deviation, and then
we say that the standard deviation of the sampling
distribution is equal to the true standard deviation of our
population divided by the square root of n. In this case, n is equal to 7. And then we say OK, we never
know the true standard, or we seldom know-- sometimes you do
know-- we seldom know the true standard deviation. So if we don't know that, the
best thing we can put in there is our sample standard
deviation. And this right here, this is the
whole reason why we don't say that this is just a 95%
probability interval. This is the whole reason why
we call it a confidence interval because we're making
some assumptions. This thing is going to change
from sample to sample. And in particular, this is going
to be a particularly bad estimate when we have a
small sample size, a size less than 30. So when you are estimating the
standard deviation where you don't know it, you're estimating
it with your sample standard deviation, and your
sample size is small, and you're going to use this to
estimate the standard deviation of your sampling
distribution, you don't assume your sampling distribution
is a normal distribution. You assume it has
fatter tails. And it has fatter tails because
you're essentially underestimating-- you're
underestimating the standard deviation over here. Anyway, with all of that said,
let's just actually go through this problem. So we need to think about a 95%
confidence interval around this mean right over here. So a 95% confidence interval,
if this was a normal distribution you would just
look it up in a Z-table. But it's not, this is
a t-distribution. We're looking for a 95%
confidence interval. So some interval around
the mean that encapsulates 95% of the area. For a t-distribution you use
a t-table, and I have a t-table ready right over here. And what you want to do is use
the two-sided row for what we're doing right over here. And the best way to think
about it is that we're symmetric around the mean. And that's why they
call it two-sided. It would be one-sided if it
was kind of a cumulative percentage up to some
critical threshold. But in this case, it's
two-sided, we're symmetric. Or another way to think
about it is we're excluding the two sides. So we want the 95%
in the middle. And this is a sampling
distribution of the sample mean for n is equal to 7. And I won't go into the details
here, but when n is equal to 7 you have 6 degrees
of freedom, or n minus 1. And the way that t-tables are
set up, you go and find the degrees of freedom. So you don't go to the n,
you go to the n minus 1. So you go to the 6 right here. So if you want to encapsulate
95% of this right over here, and you have 6 degrees of freedom, you
have to go 2.447 standard deviations in each direction. And this t-table assumes that
you are approximating that standard deviation using your
sample standard deviation. So another way to think of it is that
you have to go 2.447 of these approximated standard
deviations. Let me write it right here. So you have to go 2.447-- this
distance right here is 2.447 times this approximated
standard deviation. And sometimes you'll see this
in some statistics books. This thing right here,
this exact number, is shown like this. They put a little hat on top of
the standard deviation to show that it has been
approximated using the sample standard deviation. So we'll put a little hat over
here, because frankly, this is the only thing that
we can calculate. So this is how far you have
to go in each direction. And we know what
this value is. We know what the sample
standard deviation is. So let's get our
calculator out. So we know our sample standard
deviation is 1.04. And we want to divide that
by the square root of 7. So we get 0.39. So this right here is 0.39. And so if we want to find
the distance around this population mean that
encapsulates 95% of the sampling
distribution, we have to multiply 0.39 times 2.447,
so let's do that. So times 2.447 is
equal to 0.96. So this is equal to-- so this
distance right here is 0.96, and then this distance
right here is 0.96. So if you take a random sample,
and that's exactly what we did when we collected
these 7 data points. When we took these 7 data points and
took their mean, that mean can be viewed as a random
sample from the sampling distribution. And so
we could say that there's a 95% chance-- and
we have to actually caveat everything with the word "confident,"
because we're doing all of these estimations here. So it's not a true precise
95% chance. We're just confident about it.
Our random sample mean right here, that 2.34--
we just picked that 2.34 from this distribution
right here. So there's a 95% chance that
2.34 is within 0.96 of the true sampling distribution mean,
which we know is also the same thing as the
population mean. Or we can just rearrange the
sentence and say that there is a 95% chance that the mean, the
true mean, which is the same thing as a sampling
distribution mean, is within 0.96 of our sample
mean, of 2.34. So at the low end, if you go
2.34 minus 0.96-- that's the low
end of our confidence interval, 1.38. And the high end of our
confidence interval, 2.34 plus 0.96 is equal to 3.3. So our 95% confidence interval
is from 1.38 to 3.3.
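
The arithmetic in this problem can be retraced with a short Python sketch. It is only an illustration, not part of the video: the seven raw data points are not listed in the transcript, so the code starts from the quoted summary statistics (n = 7, sample mean 2.34, sample standard deviation 1.04) and from the two-sided 95% t-table value for 6 degrees of freedom (2.447). The function name `mean_interval` and the variable names are made up for this example. For comparison, it also builds the interval you would get from the normal critical value, computed with the standard library's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

# Summary statistics quoted in the transcript (the raw data points are not listed).
n = 7
sample_mean = 2.34
sample_sd = 1.04

# Two-sided 95% critical value for n - 1 = 6 degrees of freedom,
# hardcoded here as it is read off the t-table in the video.
t_crit = 2.447


def mean_interval(mean, sd, n, crit):
    """Hypothetical helper: return (low, high, margin) for mean +/- crit * sd / sqrt(n)."""
    standard_error = sd / math.sqrt(n)  # approximated SD of the sampling distribution
    margin = crit * standard_error      # how far to go in each direction
    return mean - margin, mean + margin, margin


low, high, margin = mean_interval(sample_mean, sample_sd, n, t_crit)
print(f"standard error: {sample_sd / math.sqrt(n):.2f}")
print(f"margin:         {margin:.2f}")
print(f"95% interval:   {low:.2f} to {high:.2f}")

# Same calculation with the normal (Z-table) critical value, for comparison only.
z_crit = NormalDist().inv_cdf(0.975)  # about 1.96
z_low, z_high, z_margin = mean_interval(sample_mean, sample_sd, n, z_crit)
print(f"normal-based margin: {z_margin:.2f}")
```

The t-based margin comes out wider than the normal-based one, which is the fatter tails at work: with only 7 data points, you go further in each direction to capture the same 95%.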