
## Statistics and probability

### Unit 11: Lesson 4

# Small sample size confidence intervals

Constructing small sample size confidence intervals using t-distributions. Created by Sal Khan.

## Want to join the conversation?

- In this series of videos, I don't think Sal explains why for n<30 we should use the t-distribution. Where does the magic number 30 come from? Also, shouldn't the sample size that approximates a normal distribution depend on the population size?(39 votes)
- Positing that the sample distribution adheres to a normal distribution is an assumption. In small data sets, that isn't necessarily true. The t-distribution (also known as the Student t-distribution) is the correction to the normal for small sample sizes. The bigger tails indicate the higher frequency of outliers which come with a small data set. Although as the sample size, n, increases, the t-distribution approaches the normal distribution. At n = 30, the distributions are practically the same, and hence we can use the normal distribution. See the graphical demonstration at the wiki page: http://en.wikipedia.org/wiki/Student%27s_t-distribution, it helps provide the intuition.(43 votes)
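
To see how close the two distributions get, the two-sided 95% critical values can be compared directly. The following is only a rough sketch using the Python standard library: it integrates the t density numerically (Simpson's rule) and finds the critical value by bisection. The function names `t_pdf`, `central_area`, and `t_critical` are made up for this illustration, and the search bracket assumes at least 2 degrees of freedom.

```python
import math
from statistics import NormalDist


def t_pdf(t: float, df: int) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    # lgamma keeps the normalising constant stable for large df
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(t * t / df))


def central_area(x: float, df: int, steps: int = 1000) -> float:
    """P(-x <= T <= x), via Simpson's rule on [0, x] and symmetry."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3


def t_critical(df: int, confidence: float = 0.95) -> float:
    """Two-sided critical value, found by bisection."""
    lo, hi = 0.0, 20.0  # assumption: df >= 2, so the answer lies below 20
    for _ in range(50):
        mid = (lo + hi) / 2
        if central_area(mid, df) < confidence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


z_crit = NormalDist().inv_cdf(0.975)  # normal critical value, about 1.960
for n in (3, 7, 15, 30, 100):
    print(f"n={n:>3}  t*={t_critical(n - 1):.3f}  z*={z_crit:.3f}")
```

For n = 7 this gives about 2.447, the table value used in the video. For n = 30 it gives about 2.045 against 1.960 for the normal, so the two are close at that point but not identical.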

- What have I missed? You have a 95% chance of being between 1.4 and 3.3, but two of the values used to calculate that are outside that interval (0.9 and 3.9). About 29% of the data set is outside the 95% confidence interval.(7 votes)
- It's not that there's a 95% chance that any sample will be between 1.4 and 3.3, but that there's a 95% chance that the **mean** of any **group of samples** will be in that range; individual samples may well be outside that range, depending on the sample variance.(29 votes)

- Isn't it incorrect to say "There is a 95% CHANCE that the true value of mu is within...."? It's not a 95% chance... mu is either in the range we calculate, or it's not. Wouldn't it be more accurate to say "We can say with 95% confidence that....."(7 votes)
- From Wikipedia article on 'confidence interval':

"A 95% confidence interval does not mean that for a given realised interval calculated from sample data there is a 95% probability the population parameter lies within the interval, nor that there is a 95% probability that the interval covers the population parameter. Once an experiment is done and an interval calculated, this interval either covers the parameter value or it does not, it is no longer a matter of probability. The 95% probability relates to the reliability of the estimation procedure, not to a specific calculated interval."

At 9:50, Sal says "There is a 95% chance that our random sampling mean is within 0.96 of the population mean." What he should say is that if this procedure were repeated many times, about 95% of the intervals calculated this way would contain the population mean.(10 votes)
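
The "reliability of the procedure" reading can be checked with a small simulation. This is a sketch under stated assumptions: the population is taken to be normal with a hypothetical mean of 2.0 and standard deviation of 1.0 (these are not values from the video), samples have n = 7 as in the video, and 2.447 is the t critical value for 6 degrees of freedom.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

MU, SIGMA = 2.0, 1.0   # hypothetical "true" population values (assumption)
N = 7                  # sample size, as in the video
T_CRIT = 2.447         # two-sided 95% t critical value for 6 degrees of freedom
TRIALS = 10_000

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)        # sample SD, divides by n - 1
    margin = T_CRIT * s / N ** 0.5
    if x_bar - margin <= MU <= x_bar + margin:
        covered += 1

coverage = covered / TRIALS
print(f"{coverage:.3f} of the intervals contain mu")
```

Each individual interval either contains mu or it does not; it is the long-run fraction of intervals containing mu that should come out near 0.95.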

- I cannot grasp how there is a '95% chance mu is within +/- 0,96 of 2,34'. For example if mu is 1,05 then 2,34 is not within 0,96 of mu... I can understand that there is '95% chance 2,34 is within 0,96 of mu' but the logic behind reversing this statement is not clear to me.(4 votes)
- If there was a 95% probability that a given interval around mu (we don't know mu, but it has some particular value) contains our sample mean (which we know), then wouldn't there also be the same probability that the same interval around our sample mean contains mu?

(Does mu change because we change our statement? They still have the same relationship to each other and our probability is still the same.)(1 vote)
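
Written out, the two statements describe the same event, which is why the rearrangement is allowed. Here $\bar{x}$ for the sample mean and $\mu$ for the population mean are notation chosen for this note:

$$
|\bar{x} - \mu| \le 0.96 \iff \bar{x} - 0.96 \le \mu \le \bar{x} + 0.96
$$

The 95% refers to $\bar{x}$ as a random quantity before the sample is drawn; $\mu$ itself does not move.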

- The sample standard deviation is quoted to be 1.04 which is the answer you get if you plug the numbers in excel and apply the formula. However, if you crunch the numbers manually you get 1.08. Anyone aware of why this discrepancy exists? I have done it several times and can't find errors in my calculations.(3 votes)
- Your 1.08 represents the variance, take the sq root of 1.08 and you'll get the SD which is 1.04.(14 votes)
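
A quick check of that arithmetic, using only the 1.08 quoted in the question:

```python
import math

variance = 1.08                # the value obtained by hand in the question
std_dev = math.sqrt(variance)  # standard deviation is the square root of the variance
print(round(std_dev, 2))       # 1.04
```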

- Why is the population mean is equal to the sample mean?(3 votes)
- It's not. It's what the s a m p l e means are randomly distributed around.

We use this fact to calculate confidence intervals.(1 vote)

- so i have a problem i can't quite figure out. the problem is as follows:

you randomly choose 16 unfurnished one-bedroom apartments from a large number of advertisements in your local newspaper. You calculate that their mean monthly rent is $613 and their standard deviation is $96. What is the standard error of the mean? What are the degrees of freedom for a one sample t statistic?

standard error is the mean/square root of n, or in this case 16, right? which comes out to 24. (96/sqrt 16)

In my book it says to get the one sample t statistic, you take x-bar minus mu divided by the standard error.. but i don't know mu. How do i get mu from just one sample mean?(2 votes)
- The standard error of the mean is the standard deviation divided by the square root of the sample size, or s/√n. You shouldn't need to calculate a test statistic to find the degrees of freedom.(3 votes)
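
For the apartment problem, both quantities asked for follow from n and s alone. A minimal sketch (variable names are just for illustration):

```python
import math

n = 16      # apartments sampled
s = 96.0    # sample standard deviation of monthly rent, in dollars

standard_error = s / math.sqrt(n)   # s / sqrt(n)
degrees_of_freedom = n - 1          # for a one-sample t statistic

print(standard_error, degrees_of_freedom)  # 24.0 15
```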

- Where does the t-table come from? I understand the z-table comes from the definite integral of the normal distribution function but how is the t-distribution defined and why is it that small sample sizes tend to follow a t-distribution model rather than some other model?(2 votes)
- You can give this page a glance to see what the probability density function for the t distribution looks like: http://en.wikipedia.org/wiki/Student%27s_t-distribution#Probability_density_function

It looks a lot more complex than the standard normal density function. However, there are tables for it, which makes it very useful.(2 votes)
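
For reference, the density on that page can be written as follows, where $\nu$ (the degrees of freedom, n − 1 for a one-sample interval) and $\Gamma$ (the gamma function) are the usual notation rather than symbols used in the video:

$$
f(t) = \frac{\Gamma\!\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^{2}}{\nu}\right)^{-\frac{\nu + 1}{2}}
$$

The t-table entries are the points where the area under this curve between −t and t reaches the chosen confidence level.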

- We use the sample mean and standard deviation as an estimate of the population standard deviation. What do we do when we do a number of trials that each generate 7 data points? How do we estimate the mean and standard deviation in that case?(2 votes)
- I think you can group things any way that's most convenient. So you could consider those 7 trials as "Test 1", and consider that your 7-sample set from the larger population.(1 vote)

- I have a question, why did Sal multiply 0.39, the standard deviation of the sampling distribution, by 2.447 to get the distance from mu to the critical value?(2 votes)
- He multiplied because 2.447 is the t-critical value that corresponds to a 95% two-sided confidence interval using a t-distribution. The 2.447 is a standardized value that explains what t-values will contain 95% of the t-distribution. The t-critical value must then be converted back to units of the original question. Multiplying by the standard deviation of the sampling distribution will then result in the distance from the sampling mean (mu).

When finding one-sample t confidence intervals, the general equation x_bar +/- (t critical value)*s/sqrt(n) is used. The multiplication is the (t critical value)*s/sqrt(n).(1 vote)
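
The general equation can be sketched as a small function. The name `t_confidence_interval` is made up for this illustration, and the t critical value is passed in as looked up from a t-table rather than computed:

```python
import math


def t_confidence_interval(x_bar, s, n, t_crit):
    """Return (low, high) for x_bar +/- t_crit * s / sqrt(n)."""
    margin = t_crit * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin


# Values from the video: sample mean 2.34, sample SD 1.04, n = 7,
# and t critical value 2.447 (95%, two-sided, 6 degrees of freedom).
low, high = t_confidence_interval(2.34, 1.04, 7, 2.447)
print(f"{low:.2f} to {high:.2f}")  # 1.38 to 3.30
```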

## Video transcript

7 patients' blood pressures
have been measured after having been given a new
drug for 3 months. They had blood pressure
increases of, and they give us seven data points right here--
who knows, that's in some blood pressure units. Construct a 95% confidence
interval for the true expected blood pressure increase for all
patients in a population. So there's some population
distribution here. It's a reasonable assumption
to think that it is normal. It's a biological process. So if you gave this drug to
every person who has ever lived, that will result in some
mean increase in blood pressure, or who knows, maybe
it actually will decrease. And there's also going to be
some standard deviation here. It is a normal distribution. And the reason why it's
reasonable to assume that it's a normal distribution
is because it's a biological process. It's going to be the sum of many
thousands and millions of random events. And things that are sums of
millions and thousands of random events tend to be
normally distributed. So this is a population
distribution. And we don't know anything
really about it outside of the sample that we have here. Now, what we can do is, and this
tends to be a good thing to do, when you do have a
sample just figure out everything that you can
figure out about that sample from the get-go. So we have our seven
data points. And you could add them up and
divide by 7 and get your sample mean. So our sample mean
here is 2.34. And then you can also
calculate your sample standard deviation. Find the square distance from
each of these points to your sample mean, add them up, divide
by n minus 1, because it's a sample, then take the
square root, and you get your sample standard deviation. I did this ahead of time
just to save time. Sample standard deviation
is 1.04. And when you don't know anything
about the population distribution, the thing that
we've been doing from the get-go is estimating that
character with our sample standard deviation. So we've been estimating the
true standard deviation of the population with our sample
standard deviation. Now in this problem, this exact
problem, we're going to run into a problem. We're estimating our standard
deviation with an n of only 7. So this is probably going to
be a not so good estimate because-- let me just write--
because n is small. In general, this is considered
a bad estimate if n is less than 30. Above 30 you're dealing
in the realm of pretty good estimates. So the whole focus of this video
is when we think about the sampling distribution, which
is what we're going to use to generate our interval,
instead of assuming that the sampling distribution is normal
like we did in many other videos using the central
limit theorem and all of that, we're going to tweak the
sampling distribution. We're not going to assume it's
a normal distribution because this is a bad estimate. We're going to assume that
it's something called a t-distribution. And a t-distribution is
essentially, the best way to think about is it's almost
engineered so it gives a better estimate of your
confidence intervals and all of that when you do have
a small sample size. It looks very similar to
a normal distribution. It has some mean, so this is
your mean of your sampling distribution still. But it also has fatter tails. And the way I think about why
it has fatter tails is when you make an assumption that this
is a standard deviation for-- let me take
one more step. So normally what we do is we
find the estimate of the true standard deviation, and then
we say that the standard deviation of the sampling
distribution is equal to the true standard deviation of our
population divided by the square root of n. In this case, n is equal to 7. And then we say OK, we never
know the true standard, or we seldom know-- sometimes you do
know-- we seldom know the true standard deviation. So if we don't know that the
best thing we can put in there is our sample standard
deviation. And this right here, this is the
whole reason why we don't say that this is just a 95%
probability interval. This is the whole reason why
we call it a confidence interval because we're making
some assumptions. This thing is going to change
from sample to sample. And in particular, this is going
to be a particularly bad estimate when we have a
small sample size, a size less than 30. So when you are estimating the
standard deviation where you don't know it, you're estimating
it with your sample standard deviation, and your
sample size is small, and you're going to use this to
estimate the standard deviation of your sampling
distribution, you don't assume your sampling distribution
is a normal distribution. You assume it has
fatter tails. And it has fatter tails because
you're essentially underestimating-- you're
underestimating the standard deviation over here. Anyway, with all of that said,
let's just actually go through this problem. So we need to think about a 95%
confidence interval around this mean right over here. So a 95% confidence interval,
if this was a normal distribution you would just
look it up in a Z-table. But it's not, this is
a t-distribution. We're looking for a 95%
confidence interval. So some interval around
the mean that encapsulates 95% of the area. For a t-distribution you use
t-table, and I have a t-table ahead of time right over here. And what you want to do is use
the two-sided row for what we're doing right over here. And the best way to think
about it is that we're symmetric around the mean. And that's why they
call it two-sided. It would be one-sided if it
was kind of a cumulative percentage up to some
critical threshold. But in this case, it's
two-sided, we're symmetric. Or another way to think
about it is we're excluding the two sides. So we want the 95%
in the middle. And this is a sampling
distribution of the sample mean for n is equal to 7. And I won't go into the details
here, but when n is equal to 7 you have 6 degrees
of freedom, or n minus 1. And the way that t-tables are
set up, you go and find the degrees of freedom. So you don't go to the n,
you go to the n minus 1. So you go to the 6 right here. So if you want to encapsulate
95% of this right over here, and you have 6 degrees of freedom, you
have to go 2.447 standard deviations in each direction. And this t-table assumes that
you are approximating that standard deviation using your
sample standard deviation. So another way to think of it
you have to go 2.447 of these approximated standard
deviations. Let me write it right here. So you have to go 2.447-- this
distance right here is 2.447 times this approximated
standard deviation. And sometimes you'll see this
in some statistics book. This thing right here,
this exact number, is shown like this. They put a little hat on top of
the standard deviation to show that it has been
approximated using the sample standard deviation. So we'll put a little hat over
here, because frankly, this is the only thing that
we can calculate. So this is how far you have
to go in each direction. And we know what
this value is. We know what the sample
distribution is. So let's get our
calculator out. So we know our sample standard
deviation is 1.04. And we want to divide that
by the square root of 7. So we get 0.39. So this right here is 0.39. And so if we want to find
the distance around this population mean that
encapsulates 95% of the population or of the sampling
distribution, we have to multiply 0.39 times 2.447,
so let's do that. So times 2.447 is
equal to 0.96. So this is equal to-- so this
distance right here is 0.96, and then this distance
right here is 0.96. So if you take a random sample,
and that's exactly what we did when we found
these 7 samples. When we took these 7 samples and
took their mean, that mean can be viewed as a random
sample from the sampling distribution. And so the probability, and so
we can view it, we could say that there's a 95% chance-- and
we have to actually caveat everything with a confident,
because we're doing all of these estimations here. So it's not a true precise
95% chance. We're just confident that
there's a 95% chance that our random population, our random
sampling mean right here, so that 2.34, which we can kind of
use-- we just picked that 2.34 from this distribution
right here. So there's a 95% chance that
2.34 is within 0.96 of the true sampling distribution mean,
which we know is also the same thing as the
population mean. Or we can just rearrange the
sentence and say that there is a 95% chance that the mean, the
true mean, which is the same thing as a sampling
distribution mean, is within 0.96 of our sample
mean, of 2.34. So at the low end, so if you go
2.36 minus-- if you go 2.34 minus 0.96-- that's the low
end of our confidence interval, 1.38. And the high end of our
confidence interval, 2.34 plus 0.96 is equal to 3.3. So our 95% confidence interval
is from 1.38 to 3.3.
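
The arithmetic in the transcript can be summarised in one place. The symbols $\bar{x}$ (sample mean), $s$ (sample standard deviation), and $t^{*}$ (t critical value) are notation chosen for this summary, not used in the video:

$$
\begin{aligned}
\text{df} &= n - 1 = 7 - 1 = 6, \qquad t^{*} = 2.447 \\
\frac{s}{\sqrt{n}} &= \frac{1.04}{\sqrt{7}} \approx 0.39 \\
t^{*} \cdot \frac{s}{\sqrt{n}} &= 2.447 \cdot \frac{1.04}{\sqrt{7}} \approx 0.96 \\
\bar{x} \pm 0.96 &= 2.34 \pm 0.96 \;\Rightarrow\; (1.38,\ 3.30)
\end{aligned}
$$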