Statistics and probability
Small sample size confidence intervals
Constructing small sample size confidence intervals using t-distributions. Created by Sal Khan.
- In this series of videos, I don't think Sal explains why for n<30 we should use the t-distribution. Where does the magic number 30 come from? Also, shouldn't the sample size that approximates a normal distribution depend on the population size?(39 votes)
- Positing that the sample distribution adheres to a normal distribution is an assumption. In small data sets, that isn't necessarily true. The t-distribution (also known as the Student t-distribution) is the correction to the normal for small sample sizes. The bigger tails indicate the higher frequency of outliers which come with a small data set. Although as the sample size, n, increases, the t-distribution approaches the normal distribution. At n = 30, the distributions are practically the same, and hence we can use the normal distribution. See the graphical demonstration at the wiki page: http://en.wikipedia.org/wiki/Student%27s_t-distribution, it helps provide the intuition.(43 votes)
- What have I missed? You have a 95% chance of being between 1.4 and 3.3, but two of the values used to calculate that are outside that interval (0.9 and 3.9). About 29% of the data set is outside the 95% confidence interval.(7 votes)
- It's not that there's a 95% chance that any sample will be between 1.4 and 3.3, but that there's a 95% chance that the mean of any group of samples will be in that range; individual samples may well be outside that range, dependent on the sample variance.(29 votes)
- Isn't it incorrect to say "There is a 95% CHANCE that the true value of mu is within...."? It's not a 95% chance... mu is either in the range we calculate, or it's not. Wouldn't it be more accurate to say "We can say with 95% confidence that....."(7 votes)
- From Wikipedia article on 'confidence interval':
"A 95% confidence interval does not mean that for a given realised interval calculated from sample data there is a 95% probability the population parameter lies within the interval, nor that there is a 95% probability that the interval covers the population parameter. Once an experiment is done and an interval calculated, this interval either covers the parameter value or it does not, it is no longer a matter of probability. The 95% probability relates to the reliability of the estimation procedure, not to a specific calculated interval."
At 9:50, Sal says "There is a 95% chance that our random sampling mean is within 0.96 of the population mean." What he should say is that if this procedure were repeated many times, the results would tend toward this interval with a probability of 95%.(10 votes)
- I cannot grasp how there is a '95% chance mu is within +/- 0,96 of 2,34'. For example if mu is 1,05 then 2,34 is not within 0,96 of mu... I can understand that there is '95% chance 2,34 is within 0,96 of mu' but the logic behind reversing this statement is not clear to me.(4 votes)
- If there was a 95% probability that a given interval around mu (we don't know mu, but it has some particular value) contains our sample mean (which we know), then wouldn't there also be the same probability that the same interval around our sample mean contains mu?
(Does mu change because we change our statement? They still have the same relationship to each other and our probability is still the same.)(1 vote)
- The sample standard deviation is quoted to be 1.04 which is the answer you get if you plug the numbers in excel and apply the formula. However, if you crunch the numbers manually you get 1.08. Anyone aware of why this discrepancy exists? I have done it several times and can't find errors in my calculations.(3 votes)
- Your 1.08 represents the variance, take the sq root of 1.08 and you'll get the SD which is 1.04.(14 votes)
- Why is the population mean equal to the sample mean?(3 votes)
- It's not. It's what the s a m p l e means are randomly distributed around.
We use this fact to calculate confidence intervals.(1 vote)
- so i have a problem i can't quite figure out. the problem is as follows:
you randomly choose 16 unfurnished one-bedroom apartments from a large number of advertisements in your local newspaper. You calculate that their mean monthly rent is $613 and their standard deviation is $96. What is the standard error of the mean? What are the degrees of freedom for a one sample t statistic?
standard error is the mean/square root of n, or in this case 16, right? which comes out to 24. (96/sqrt 16)
In my book it says to get the one sample t statistic, you take x-bar minus mu divided by the standard error.. but i don't know mu. How do i get mu from just one sample mean?(2 votes)
- The standard error of the mean is the standard deviation divided by the square root of the sample size, or s/√n. You shouldn't need to calculate a test statistic to find the degrees of freedom.(3 votes)
- Where does the t-table come from? I understand the z-table comes from the definite integral of the normal distribution function, but how is the t-distribution defined and why is it that small sample sizes tend to follow a t-distribution model rather than some other model?(2 votes)
- You can give this page a glance to see what the probability density function for the t distribution looks like: http://en.wikipedia.org/wiki/Student%27s_t-distribution#Probability_density_function
It looks a lot more complex than the standard normal density function. However, there are tables for it, which makes it very useful.(2 votes)
- We use the sample mean and standard deviation as an estimate of the population standard deviation. What do we do when we do a number of trials that each generate 7 data points? How do we estimate the mean and standard deviation in that case?(2 votes)
- I think you can group things any way that's most convenient. So you could consider those 7 trials as "Test 1", and consider that your 7-sample set from the larger population.(1 vote)
- I have a question: why did Sal multiply 0.39, the standard deviation of the sampling distribution, by 2.447 to get the distance from mu to the critical value?(2 votes)
- He multiplied because 2.447 is the t-critical value that corresponds to a 95% two-sided confidence interval using a t-distribution. The 2.447 is a standardized value that explains what t-values will contain 95% of the t-distribution. The t-critical value must then be converted back to units of the original question. Multiplying by the standard deviation of the sampling distribution will then result in the distance from the sampling mean (mu).
When finding one-sample t confidence intervals, the general equation x_bar +/- (t critical value)*s/sqrt(n) is used. The multiplication is the (t critical value)*s/sqrt(n).(1 vote)
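The general equation quoted in the last answer, x_bar +/- (t critical value)*s/sqrt(n), can be sketched in a few lines of Python. This is only an illustration: the function names `standard_error` and `t_interval` are made up here, and the t critical value is passed in by hand (as if read from a t-table) rather than computed.

```python
import math

def standard_error(s, n):
    """Estimated standard deviation of the sample mean: s / sqrt(n)."""
    return s / math.sqrt(n)

def t_interval(x_bar, s, n, t_crit):
    """Return (low, high) for x_bar +/- t_crit * s / sqrt(n).

    t_crit is assumed to come from a t-table row for n - 1 degrees of freedom.
    """
    margin = t_crit * standard_error(s, n)
    return x_bar - margin, x_bar + margin

# Numbers quoted in the video and comments: mean 2.34, s = 1.04, n = 7, t = 2.447
low, high = t_interval(2.34, 1.04, 7, 2.447)
print(round(low, 2), round(high, 2))

# The apartment-rent question: s = 96, n = 16 (degrees of freedom would be 16 - 1 = 15)
rent_se = standard_error(96, 16)
print(rent_se)
```

Passing the critical value as an argument keeps the sketch independent of any particular table or statistics library.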
Seven patients' blood pressures have been measured after having been given a new drug for 3 months. They had blood pressure increases of, and they give us seven data points right here, in some blood pressure units. Construct a 95% confidence interval for the true expected blood pressure increase for all patients in a population.

So there's some population distribution here. It's a reasonable assumption to think that it is normal, because it's a biological process. If you gave this drug to every person who has ever lived, that would result in some mean increase in blood pressure, or who knows, maybe it actually would decrease. And there's also going to be some standard deviation here. The reason it's reasonable to assume a normal distribution is that a biological process is going to be the sum of many thousands and millions of random events, and things that are sums of thousands and millions of random events tend to be normally distributed. So this is a population distribution, and we don't really know anything about it outside of the sample that we have here.

Now, what we can do, and this tends to be a good thing to do when you have a sample, is figure out everything you can about that sample from the get-go. So we have our seven data points. You could add them up and divide by 7 and get your sample mean. Our sample mean here is 2.34. Then you can also calculate your sample standard deviation: find the squared distance from each of these points to your sample mean, add them up, divide by n minus 1, because it's a sample, then take the square root. I did this ahead of time just to save time. The sample standard deviation is 1.04. And when you don't know anything about the population distribution, the thing that we've been doing from the get-go is estimating that parameter with our sample standard deviation.
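The sample mean and sample standard deviation steps described above can be sketched with Python's standard `statistics` module. The seven readings are not reproduced in this transcript, so the list below is an assumed data set, chosen only because it is consistent with the quoted mean of 2.34 and standard deviation of 1.04; treat it as a placeholder for the actual values.

```python
import statistics

# ASSUMPTION: placeholder data consistent with the quoted summary statistics;
# the actual seven readings are not given in the transcript.
increases = [1.5, 2.9, 0.9, 3.9, 3.2, 2.1, 1.9]

n = len(increases)
sample_mean = statistics.mean(increases)

# statistics.variance and statistics.stdev divide by n - 1 (sample versions)
sample_var = statistics.variance(increases)
sample_sd = statistics.stdev(increases)

print(n, round(sample_mean, 2), round(sample_sd, 2))
```

Note that the variance and the standard deviation are different numbers here; the standard deviation is the square root of the variance.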
So we've been estimating the true standard deviation of the population with our sample standard deviation. Now in this exact problem, we're going to run into a problem. We're estimating our standard deviation with an n of only 7. So this is probably going to be a not-so-good estimate, because n is small. In general, this is considered a bad estimate if n is less than 30. Above 30 you're dealing in the realm of pretty good estimates.

So the whole focus of this video is that when we think about the sampling distribution, which is what we're going to use to generate our interval, instead of assuming that the sampling distribution is normal like we did in many other videos using the central limit theorem and all of that, we're going to tweak the sampling distribution. We're not going to assume it's a normal distribution, because this is a bad estimate. We're going to assume that it's something called a t-distribution. The best way to think about a t-distribution is that it's almost engineered to give a better estimate of your confidence intervals and all of that when you have a small sample size. It looks very similar to a normal distribution. It has some mean, so this is still the mean of your sampling distribution. But it also has fatter tails.

And to explain the way I think about why it has fatter tails, let me take one more step. Normally what we do is find an estimate of the true standard deviation, and then we say that the standard deviation of the sampling distribution is equal to the true standard deviation of our population divided by the square root of n. In this case, n is equal to 7. And then we say OK, we seldom know the true standard deviation (sometimes you do know it, but seldom). So if we don't know it, the best thing we can put in there is our sample standard deviation.
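To make the "fatter tails" idea concrete, here is a small sketch that compares the height of the t-distribution's density with the standard normal density at a point far from the mean. The density formula is the standard one for Student's t; the function names and the choice of x = 3 and of 6 and 30 degrees of freedom are just illustrative.

```python
import math

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def t_pdf(x, df):
    """Student's t density with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

# Ratio of t density to normal density three units from the mean.
# A ratio above 1 means the t curve is higher out in the tail.
for df in (6, 30):
    print(df, t_pdf(3.0, df) / normal_pdf(3.0))
```

The ratio should be well above 1 for 6 degrees of freedom and closer to 1 for 30, which matches the rule of thumb that the two curves become hard to tell apart as n grows.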
And this right here is the whole reason why we don't say that this is just a 95% probability interval. This is the whole reason why we call it a confidence interval: because we're making some assumptions. This thing is going to change from sample to sample. And in particular, this is going to be a particularly bad estimate when we have a small sample size, a size less than 30. So when you don't know the standard deviation, you're estimating it with your sample standard deviation, your sample size is small, and you're going to use this to estimate the standard deviation of your sampling distribution, you don't assume your sampling distribution is a normal distribution. You assume it has fatter tails. And it has fatter tails because you're essentially underestimating the standard deviation over here.

Anyway, with all of that said, let's just actually go through this problem. So we need to think about a 95% confidence interval around this mean right over here. If this was a normal distribution you would just look it up in a Z-table. But it's not, this is a t-distribution. We're looking for a 95% confidence interval, so some interval around the mean that encapsulates 95% of the area. For a t-distribution you use a t-table, and I have a t-table ahead of time right over here. And what you want to do is use the two-sided row for what we're doing right over here. The best way to think about it is that we're symmetric around the mean, and that's why they call it two-sided. It would be one-sided if it was kind of a cumulative percentage up to some critical threshold. But in this case, it's two-sided, we're symmetric. Or another way to think about it is we're excluding the two sides. So we want the 95% in the middle. And this is a sampling distribution of the sample mean for n equal to 7.
And I won't go into the details here, but when n is equal to 7 you have 6 degrees of freedom, or n minus 1. And the way that t-tables are set up, you go and find the degrees of freedom. So you don't go to the n, you go to the n minus 1. So you go to the 6 right here. So if you want to encapsulate 95% of this right over here, and you have 6 degrees of freedom, you have to go 2.447 standard deviations in each direction. And this t-table assumes that you are approximating that standard deviation using your sample standard deviation. So another way to think of it is that you have to go 2.447 of these approximated standard deviations. Let me write it right here. This distance right here is 2.447 times this approximated standard deviation.

And sometimes you'll see this in some statistics books. This thing right here, this exact number, is shown like this: they put a little hat on top of the standard deviation to show that it has been approximated using the sample standard deviation. So we'll put a little hat over here, because frankly, this is the only thing that we can calculate. So this is how far you have to go in each direction. And we know what this value is. We know what the sample standard deviation is. So let's get our calculator out. We know our sample standard deviation is 1.04, and we want to divide that by the square root of 7. So we get 0.39. So this right here is 0.39.

And so if we want to find the distance around this population mean that encapsulates 95% of the sampling distribution, we have to multiply 0.39 times 2.447, so let's do that. Times 2.447 is equal to 0.96. So this distance right here is 0.96, and then this distance right here is 0.96. So if you take a random sample, and that's exactly what we did when we found these 7 samples: when we took these 7 samples and took their mean, that mean can be viewed as a random sample from the sampling distribution.
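For readers curious where a table entry like 2.447 comes from, the following sketch recovers it numerically: it integrates the Student's t density for 6 degrees of freedom and searches for the cutoff that leaves 95% of the area in the middle. This is a rough illustration using Simpson's rule and bisection, not how published tables are necessarily produced; the function names, step count, and search bounds are arbitrary choices made for this example.

```python
import math

def t_pdf(x, df):
    """Student's t density with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def central_area(t, df, steps=2000):
    """Area under the t density between -t and +t.

    Uses Simpson's rule on [0, t] (steps must be even) and doubles the
    result, since the density is symmetric around zero.
    """
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3

def t_critical(confidence, df):
    """Two-sided critical value found by bisection on the central area."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if central_area(mid, df) < confidence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

t_crit = t_critical(0.95, 6)
print(round(t_crit, 3))
```

The result should land very close to the two-sided 95% entry for 6 degrees of freedom used above.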
And so we could say that there's a 95% chance, and we actually have to caveat everything with "confident," because we're doing all of these estimations here. So it's not a true, precise 95% chance. We're just confident that there's a 95% chance that our random sampling mean right here, that 2.34, which we just picked from this distribution right here, is within 0.96 of the true sampling distribution mean, which we know is also the same thing as the population mean. Or we can just rearrange the sentence and say that there is a 95% chance that the true mean, which is the same thing as the sampling distribution mean, is within 0.96 of our sample mean of 2.34.

So at the low end, if you go 2.34 minus 0.96, that's the low end of our confidence interval, 1.38. And the high end of our confidence interval, 2.34 plus 0.96, is equal to 3.3. So our 95% confidence interval is from 1.38 to 3.3.
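One way to get a feel for what the 95% refers to is to repeat the whole procedure many times on a population whose mean is known and count how often the resulting interval covers that mean. The sketch below does this under stated assumptions: the population is normal with a made-up mean of 2.0 and standard deviation of 1.0, samples have n = 7, and 2.447 is the two-sided 95% t-table value for 6 degrees of freedom quoted in the video. The function name `coverage` and the trial count are illustrative.

```python
import math
import random
import statistics

T_CRIT_DF6 = 2.447  # two-sided 95% t-table value for 6 degrees of freedom

def coverage(mu, sigma, n=7, trials=10000, seed=0):
    """Fraction of simulated t-intervals that contain the true mean mu.

    ASSUMPTION: the population is normal(mu, sigma); both values are
    hypothetical and chosen only for the simulation.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = statistics.mean(sample)
        margin = T_CRIT_DF6 * statistics.stdev(sample) / math.sqrt(n)
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials

rate = coverage(mu=2.0, sigma=1.0)
print(rate)
```

Under these assumptions the printed fraction should come out near 0.95, even though each individual interval either covers the true mean or does not.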