Sal uses a large sample to test if more than 30% of US households have internet access. Created by Sal Khan.
We want to test the hypothesis that more than 30% of U.S. households have internet access, with a significance level of 5%. We collect a sample of 150 households and find that 57 have access. So to do our hypothesis test, let's establish our null hypothesis and our alternative hypothesis. Our null hypothesis is that the hypothesis is not correct: the proportion of U.S. households that have internet access is less than or equal to 30%. And our alternative hypothesis, which is what our hypothesis actually is, is that the proportion is greater than 30%. We see it over here: we want to test the hypothesis that more than 30% of U.S. households have internet access. That's what's right here. This is what we're testing; we're testing the alternative hypothesis. And the way we're going to do it is we're going to assume a population proportion based on the null hypothesis. Given that assumption, what is the probability that 57 out of 150 households in our sample have internet access? If that probability is less than 5%, less than our significance level, then we're going to reject the null hypothesis in favor of the alternative one. So let's think about this a little bit. We're going to start off assuming the null hypothesis is true. And under that assumption we're going to have to pick a population proportion, or a population mean; we know that for a Bernoulli distribution those are the same thing. What I'm going to do is pick the proportion that maximizes the probability of getting this result over here. And so that we can think about this a little more intelligently, let's first find out what our sample proportion even is. We had 57 households out of 150 with internet access. So our sample proportion is 0.38, so let me write that over here.
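The setup above can be sketched in a few lines of Python (plain arithmetic, standard library only):

```python
# Hypothesis test setup: H0: p <= 0.30, Ha: p > 0.30
n = 150          # sample size (households surveyed)
successes = 57   # households found to have internet access
alpha = 0.05     # significance level

p_hat = successes / n  # sample proportion
print(p_hat)           # 0.38
```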
Our sample proportion is equal to 0.38. So when we assume our null hypothesis to be true, we're going to assume a population proportion that maximizes the probability that we get this result. The highest population proportion that's still within our null hypothesis, and that maximizes the probability of getting this, is right at 30%. So we're going to assume that the population proportion, under our null hypothesis, is 0.3, or 30%. And I want you to understand that: 29% would have been consistent with the null hypothesis; 28% would have been consistent with the null hypothesis. But for 29% or 28%, the probability of getting this sample would have been even lower, so it wouldn't have been as strong a test. By taking the maximum proportion that still satisfies our null hypothesis, we're maximizing the probability of getting this result. So if that probability is still low, if it's still less than 5%, we can feel pretty good about the alternative hypothesis. Just to refresh ourselves, we're going to assume a population proportion of 0.3. And if we just think about the distribution, sometimes it's helpful to draw these things, so I will draw it. This is what the population distribution looks like based on that assumption right over here: 30% have internet access, and I'll signify that with a 1, and the rest, 70%, do not have internet access. This is just a Bernoulli distribution. We know that the mean over here is going to be the same thing as the proportion that has internet access. So the mean is 0.3, the same thing as 30%. This is the population mean, and maybe I should write it this way: the population mean, assuming our null hypothesis, is 0.3. And then the population standard deviation, let me write this over here in yellow.
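To see concretely why the boundary value 0.3 is the conservative choice, here is a small sketch (it uses the normal approximation to the sampling distribution, which the rest of the video develops, and the standard normal CDF via `math.erf`): the probability of seeing a sample proportion of 0.38 or more is largest right at p = 0.30 and only shrinks for smaller assumed proportions.

```python
from math import sqrt, erf

def tail_prob(p, p_hat=0.38, n=150):
    """P(sample proportion >= p_hat) under an assumed population
    proportion p, using the normal approximation."""
    se = sqrt(p * (1 - p) / n)          # standard error under p
    z = (p_hat - p) / se                # standardized distance
    phi = 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF at z
    return 1 - phi                      # upper-tail probability

# Largest tail probability occurs at the boundary p = 0.30, so a
# rejection there implies rejection for every smaller p in H0.
for p in (0.30, 0.29, 0.28):
    print(p, round(tail_prob(p), 4))
```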
The population standard deviation, assuming our null hypothesis, and we've seen this when we first learned about Bernoulli distributions, is going to be the square root of the proportion of the population that has internet access, 0.3, times the proportion of the population that does not have internet access, 0.7 right over here. So this is the square root of 0.21, and we can deal with this later using our calculator. Now, with that out of the way, we want to figure out the probability of getting a sample proportion of 0.38. So let's look at the distribution of sample proportions. You could literally look at every combination of sampling 150 households from this population, and you would actually get a binomial distribution; we've seen this before as well. You'd get a bunch of bars like that. But if your n is suitably large, and here is the test for it: if n times p, and in this case we're saying p is 30%, is greater than 5, and n times 1 minus p is greater than 5, then you can assume that the distribution of the sample proportion is approximately normal. So if you looked at all of the different ways you could sample 150 households from this population, you'd get all of these bars. But since our n is pretty big, it's 150, and 150 times 0.3 is obviously greater than 5, and 150 times 0.7 is also greater than 5, you can approximate this with a normal distribution. So let me do that. This is a normal distribution right over there. Now the mean of this distribution of sample proportions, and remember, we're working under the assumption that the null hypothesis is true, so this mean right here, the mean of our sample proportions, is going to be the same thing as our population mean.
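The normality check and the Bernoulli standard deviation from this step, as a minimal sketch:

```python
from math import sqrt

n, p = 150, 0.30   # sample size, and the proportion assumed under H0

# Rule of thumb for approximating the binomial with a normal:
assert n * p > 5         # 150 * 0.3 = 45
assert n * (1 - p) > 5   # 150 * 0.7 = 105

# Standard deviation of a Bernoulli(p) variable: sqrt(p * (1 - p))
sigma = sqrt(p * (1 - p))
print(round(sigma, 3))   # sqrt(0.21) is about 0.458
```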
So this is going to be 0.3, the same value as that. And the standard deviation, this comes straight from the central limit theorem: the standard deviation of our sample proportions is going to be our population standard deviation, the standard deviation we're assuming under our null hypothesis, divided by the square root of our sample size. In this case our sample size is 150, and we can calculate this. The value on top we just figured out is the square root of 0.21, so this is the square root of 0.21 over the square root of 150. And I can get the calculator out to calculate this. So I'll just do it the way I wrote it: the square root of 0.21, divided by the square root of 150. So it's 0.037. So the standard deviation of the distribution of our sample proportions, let me write this down, I'll scroll over to the right a little bit, is 0.037. I think I'm falling off the screen a little bit, so we'll just say 0.037. Now, to figure out the probability of getting a sample proportion of 0.38, we just have to figure out how many standard deviations that is away from our mean, or essentially calculate a Z-statistic for our sample, because a Z-statistic, or a Z-score, is really just how many standard deviations you are away from the mean. Then we figure out whether the probability of getting that Z-statistic is more or less than 5%. So let's figure out how many standard deviations we are away from the mean. Just to remind ourselves, the sample proportion we got can be viewed as one sample from this distribution of all of the possible sample proportions. So how many standard deviations away from the mean is it?
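The standard-error calculation, exactly as done on the calculator:

```python
from math import sqrt

sigma = sqrt(0.21)   # population standard deviation under H0
n = 150              # sample size

# Standard deviation of the sampling distribution of the proportion,
# from the central limit theorem: sigma / sqrt(n)
se = sigma / sqrt(n)
print(round(se, 3))  # 0.037
```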
So if we take our sample proportion, subtract from it the mean of the distribution of sample proportions, and divide by the standard deviation of the distribution of sample proportions, we get 0.38 minus 0.3, all of that over the value we just figured out, 0.037. So what does that give us? The numerator over here is 0.08, and the denominator is 0.037. So let's figure this out: 0.08 divided by that last number, 0.037, gives us 2.1, I'll just round it to 2.14, standard deviations. So this right here is equal to 2.14 standard deviations. Or we could say that our Z-statistic, and we could call this our Z-score or our Z-statistic, the number of standard deviations we are away from our mean, is 2.14. And to be exact, we're 2.14 standard deviations above the mean, so we're going to care about a one-tailed test. Now, is the probability of getting this more or less than 5%? If it's less than 5%, we're going to reject the null hypothesis in favor of our alternative. So how do we think about that? Well, let's think about a standardized normal distribution, or you could call it the Z-distribution if you want. If you look at a completely standardized normal distribution, its mean is at 0, and each of these values is essentially a Z-score, because a value of 1 literally means you are 1 standard deviation away from the mean over here. So we need to find a critical Z-value right over here, we could call it a critical Z-score or critical Z-value, such that the probability of getting a Z-value higher than that is 5%. So this whole area right here is 5%, and that's because that's what our significance level is. Anything that has less than a 5% chance of occurring will, for us, be justification to reject our null hypothesis.
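Computing the Z-statistic from these numbers:

```python
from math import sqrt

p_hat = 0.38                  # sample proportion
p0 = 0.30                     # proportion assumed under H0
se = sqrt(0.21) / sqrt(150)   # standard error, about 0.037

z = (p_hat - p0) / se         # standard deviations above the mean
print(round(z, 2))            # 2.14
```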
Or another way of thinking about it: that area is 5%, so this whole area right over here is 95%. And once again, this is a one-tailed test, because we only care about values greater than this; Z-values greater than that will make us reject the null hypothesis. To figure out this critical Z-value, you can literally just go to a Z-table. We say, OK, the probability of being at a Z-value less than the critical value should be 95%, and that's exactly the number a Z-table gives us: the cumulative probability of getting a value less than that. So if we just scan this, we're looking for 95%. We have 0.9495 and we have 0.9505, so I'll go with this one, just to make sure we're a little bit closer. That row gives a Z-value of 1.6, and the next digit is 5: 1.65. So this critical Z-value is equal to 1.65. The probability of getting a Z-value less than 1.65 in a completely standardized normal distribution, or in any normal distribution the probability of being less than 1.65 standard deviations above the mean, is going to be 95%. So that's our critical Z-value. Now, the Z-value, or the Z-statistic, for our actual sample is 2.14. Our actual Z-value is 2.14; it's sitting all the way out here someplace. So the probability of getting that was definitely less than 5%. And actually, we could even ask what the probability of getting that result or a more extreme one is. If you figured out this area, and you could do that by looking at a Z-table, you would get the P-value of this result. But anyway, the whole exercise here was just to figure out whether we can reject the null hypothesis at a significance level of 5%. We can: this is a more extreme result than our critical Z-value, so we reject the null hypothesis in favor of our alternative.
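Putting the whole test together, with the P-value mentioned at the end computed from the standard normal CDF (via `math.erf`, so no Z-table lookup is needed):

```python
from math import sqrt, erf

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, successes, p0, alpha = 150, 57, 0.30, 0.05

p_hat = successes / n         # 0.38
se = sqrt(p0 * (1 - p0) / n)  # about 0.037
z = (p_hat - p0) / se         # about 2.14

z_critical = 1.65             # from the Z-table: phi(1.65) is about 0.95
p_value = 1 - phi(z)          # one-tailed P-value, about 0.016

# Reject H0 at the 5% level: z exceeds the critical value
# (equivalently, the P-value is below alpha).
print(z > z_critical, p_value < alpha)   # True True
```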