# Small sample size confidence intervals

## Video transcript

The blood pressures of 7 patients have been measured after they were given a new drug for 3 months. They had blood pressure increases of, and they give us seven data points right here, in some blood pressure units. Construct a 95% confidence interval for the true expected blood pressure increase for all patients in a population.

So there's some population distribution here. It's a reasonable assumption to think that it is normal, because it's a biological process. If you gave this drug to every person who has ever lived, that would result in some mean increase in blood pressure, or who knows, maybe it actually would decrease. And there's also going to be some standard deviation here. The reason it's reasonable to assume a normal distribution is that a biological process is going to be the sum of many thousands and millions of random events, and things that are sums of thousands and millions of random events tend to be normally distributed. So this is a population distribution, and we don't really know anything about it outside of the sample that we have here.

Now, what we can do, and this tends to be a good thing to do when you have a sample, is just figure out everything that you can about that sample from the get-go. So we have our seven data points. You could add them up and divide by 7 and get your sample mean. So our sample mean here is 2.34. And then you can also calculate your sample standard deviation: find the squared distance from each of these points to your sample mean, add them up, divide by n minus 1, because it's a sample, then take the square root, and you get your sample standard deviation. I did this ahead of time just to save time. The sample standard deviation is 1.04. And when you don't know anything about the population distribution, the thing that we've been doing from the get-go is estimating the population standard deviation with our sample standard deviation.
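
The recipe for the sample mean and the sample standard deviation can be sketched in a few lines of Python. The seven readings are not reproduced in this transcript, so the list below is a made-up placeholder and the function names are hypothetical; only the steps (sum and divide, squared distances, divide by n minus 1, square root) come from the video.

```python
import math
import statistics


def sample_mean(values):
    """Add the data points up and divide by how many there are."""
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation: divide by n - 1 because this is a sample."""
    mean = sample_mean(values)
    squared_distances = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_distances / (len(values) - 1))


# Placeholder values only, NOT the seven readings from the video.
placeholder = [1.0, 2.0, 3.0, 4.0, 5.0]

mean = sample_mean(placeholder)
sd = sample_std(placeholder)

# The standard library's stdev also divides by n - 1, so the two should agree.
library_sd = statistics.stdev(placeholder)
print(mean, sd, library_sd)
```

Swapping the placeholder list for the actual seven readings should give values close to the 2.34 and 1.04 quoted in the video.
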
So we've been estimating the true standard deviation of the population with our sample standard deviation. Now in this exact problem, we're going to run into an issue. We're estimating our standard deviation with an n of only 7. So this is probably going to be a not-so-good estimate, because n is small. As a rule of thumb, this is considered a bad estimate if n is less than 30. Above 30 you're dealing in the realm of pretty good estimates.

So the whole focus of this video is the sampling distribution, which is what we're going to use to generate our interval. Instead of assuming that the sampling distribution is normal, like we did in many other videos using the central limit theorem and all of that, we're going to tweak the sampling distribution. We're not going to assume it's a normal distribution, because this is a bad estimate. We're going to assume that it's something called a t-distribution. The best way to think about a t-distribution is that it's almost engineered to give a better estimate of your confidence intervals and all of that when you have a small sample size. It looks very similar to a normal distribution. It has some mean, so this is still the mean of your sampling distribution. But it also has fatter tails.

And the way I think about why it has fatter tails, let me take one more step. Normally what we do is find an estimate of the true standard deviation, and then we say that the standard deviation of the sampling distribution is equal to the true standard deviation of our population divided by the square root of n. In this case, n is equal to 7. And then we say OK, we seldom know the true standard deviation (sometimes you do know it). So if we don't know it, the best thing we can put in there is our sample standard deviation.
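
To put a number on "fatter tails", here is a small sketch that compares the height of the t-distribution's density with the standard normal density at a few points. It assumes the standard textbook formula for the Student t density, which the video does not state, and the helper name `t_pdf` is made up for illustration.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom (standard formula)."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


standard_normal = NormalDist()  # mean 0, standard deviation 1

# 6 degrees of freedom corresponds to a sample of n = 7.
for x in (0.0, 2.0, 3.0, 4.0):
    print(x, round(t_pdf(x, 6), 4), round(standard_normal.pdf(x), 4))
```

Near the center the t curve sits slightly below the normal curve, and out in the tails it sits above it, which is the "fatter tails" shape described above.
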
And this right here is the whole reason why we don't say that this is just a 95% probability interval. This is the whole reason why we call it a confidence interval: we're making some assumptions. This thing is going to change from sample to sample. And in particular, this is going to be a particularly bad estimate when we have a small sample size, a size less than 30. So when you don't know the standard deviation and you're estimating it with your sample standard deviation, and your sample size is small, and you're going to use this to estimate the standard deviation of your sampling distribution, you don't assume your sampling distribution is a normal distribution. You assume it has fatter tails. And it has fatter tails because you're essentially underestimating the standard deviation over here.

Anyway, with all of that said, let's actually go through this problem. We need to think about a 95% confidence interval around this mean right over here. If this were a normal distribution, you would just look it up in a Z-table. But it's not, this is a t-distribution. We're looking for a 95% confidence interval, so some interval around the mean that encapsulates 95% of the area. For a t-distribution you use a t-table, and I have a t-table ready right over here. What you want to do is use the two-sided row for what we're doing here. The best way to think about it is that we're symmetric around the mean, and that's why they call it two-sided. It would be one-sided if it were a cumulative percentage up to some critical threshold. But in this case it's two-sided, we're symmetric. Or another way to think about it is that we're excluding the two sides, so we want the 95% in the middle. And this is a sampling distribution of the sample mean for n equal to 7.
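
The "two-sided" idea can be made concrete with the normal case mentioned above, the one you would look up in a Z-table. The sketch below uses Python's standard-library `statistics.NormalDist`; the function name `two_sided_z` is hypothetical. Keeping 95% in the middle leaves 2.5% in each tail, so the lookup is really for the 97.5% cumulative point.

```python
from statistics import NormalDist


def two_sided_z(confidence):
    """Critical value that leaves (1 - confidence) / 2 in each tail of a standard normal."""
    tail = (1 - confidence) / 2          # 0.025 in each tail for 95%
    return NormalDist().inv_cdf(1 - tail)


z_star = two_sided_z(0.95)
print(round(z_star, 3))  # about 1.96
```

A t-table's two-sided row follows the same logic, but the critical values come out larger than 1.96 for small samples because of the fatter tails.
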
And I won't go into the details here, but when n is equal to 7 you have 6 degrees of freedom, or n minus 1. The way that t-tables are set up, you go and find the degrees of freedom. So you don't go to the n, you go to the n minus 1. So you go to the 6 right here. If you want to encapsulate 95% of this right over here, and you have 6 degrees of freedom, you have to go 2.447 standard deviations in each direction. And this t-table assumes that you are approximating that standard deviation using your sample standard deviation. So another way to think of it is that you have to go 2.447 of these approximated standard deviations. Let me write it right here. This distance right here is 2.447 times this approximated standard deviation. And sometimes you'll see this in some statistics books: this exact number is shown with a little hat on top of the standard deviation, to show that it has been approximated using the sample standard deviation. So we'll put a little hat over here, because frankly, this is the only thing that we can calculate.

So this is how far you have to go in each direction, and we know what this value is. We know what the sample standard deviation is. So let's get our calculator out. Our sample standard deviation is 1.04, and we want to divide that by the square root of 7. So we get 0.39. So this right here is 0.39. And if we want to find the distance around the mean that encapsulates 95% of the sampling distribution, we have to multiply 0.39 times 2.447. So let's do that: times 2.447 is equal to 0.96. So this distance right here is 0.96, and then this distance right here is 0.96.

So if you take a random sample, and that's exactly what we did when we took these 7 measurements and took their mean, that mean can be viewed as a random sample from the sampling distribution.
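
For readers without a t-table at hand, the 2.447 and the 0.96 can be reproduced with the standard library alone. This is only a sketch: it assumes the textbook Student t density, integrates it numerically with Simpson's rule, and searches for the cutoff by bisection. The helper names are made up, and a statistics library would normally do this lookup for you (for example, `scipy.stats.t.ppf(0.975, 6)` in SciPy should give the same cutoff).

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom (standard formula)."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def central_area(x, df, steps=2000):
    """Area under the t density between -x and +x, via Simpson's rule on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3  # double it: the density is symmetric around 0


def t_critical(confidence, df):
    """Two-sided critical value: the x whose central area equals `confidence`."""
    lo, hi = 0.0, 100.0
    for _ in range(60):  # bisection: halve the search interval each time
        mid = (lo + hi) / 2
        if central_area(mid, df) < confidence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


n = 7
sample_sd = 1.04                           # from the video
df = n - 1                                 # 6 degrees of freedom
t_star = t_critical(0.95, df)              # the table value, about 2.447
standard_error = sample_sd / math.sqrt(n)  # about 0.39
margin = t_star * standard_error           # about 0.96
print(round(t_star, 3), round(standard_error, 2), round(margin, 2))
```

The point of the sketch is only that the table value is the cutoff where the middle of the 6-degrees-of-freedom t curve holds 95% of the area.
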
And so we could say that there's a 95% chance, and we actually have to caveat everything with the word "confident", because we're doing all of these estimations here. So it's not a true, precise 95% chance. We're just confident that there's a 95% chance that our sample mean right here, that 2.34, which we just picked from this distribution, is within 0.96 of the true sampling distribution mean, which we know is the same thing as the population mean. Or we can just rearrange the sentence and say that there is a 95% chance that the true mean, which is the same thing as the sampling distribution mean, is within 0.96 of our sample mean of 2.34.

So at the low end, if you go 2.34 minus 0.96, that's the low end of our confidence interval, 1.38. And the high end of our confidence interval, 2.34 plus 0.96, is equal to 3.30. So our 95% confidence interval is from 1.38 to 3.30.
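
Putting the whole calculation together, a minimal sketch from the summary numbers in the video looks like this. The function name `t_interval` is hypothetical, and the critical value 2.447 is passed in as read from the t-table rather than computed.

```python
import math


def t_interval(sample_mean, sample_sd, n, t_star):
    """Confidence interval: sample mean plus or minus t_star * s / sqrt(n)."""
    standard_error = sample_sd / math.sqrt(n)  # estimated sd of the sampling distribution
    margin = t_star * standard_error           # how far to go in each direction
    return sample_mean - margin, sample_mean + margin


# Sample mean 2.34, sample sd 1.04, n = 7, table value 2.447 for 6 degrees of freedom.
low, high = t_interval(2.34, 1.04, 7, 2.447)
print(f"{low:.2f} to {high:.2f}")  # 1.38 to 3.30
```

Keeping the critical value as an argument means the same function works for other sample sizes or confidence levels, as long as the matching table value is supplied.
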