# Large sample proportion hypothesis testing

Sal uses a large sample to test if more than 30% of US households have internet access. Created by Sal Khan.

## Video transcript

We want to test the hypothesis
that more than 30% of U.S. households have internet access
with a significance level of 5%. We collect a sample of 150
households, and find that 57 have access. So to do our hypothesis test,
let's just establish our null hypothesis and our alternative
hypothesis. So our null hypothesis is that
the hypothesis is not correct. Our null hypothesis is that
the proportion of U.S. households that have internet
access is less than or equal to 30%. And our alternative hypothesis,
which is what our hypothesis actually is, is
that the proportion is greater than 30%. We see it over here. We want to test the hypothesis
that more than 30% of U.S. households have internet
access. That's that right here. This is what we're testing. We're testing the alternative
hypothesis. And the way we're going to do it
is we're going to assume a proportion for the population based on the
null hypothesis. And then, given that assumption,
what is the probability that 57 or more out of the 150 households in our sample
actually have internet access? And if that probability is less
than 5%, if it's less than our significance level,
then we're going to reject the null hypothesis in favor
of the alternative one. So let's think about
this a little bit. So we're going to start off
assuming-- we're going to assume the null hypothesis
is true. And in that assumption we're
going to have to pick a population proportion or a
population mean-- we know that for Bernoulli distributions
they're the same thing. And what I'm going to do is I'm
going to pick a proportion so high that it maximizes
the probability of getting this over here. And we actually don't even
know what that number is. And actually, so that we can
think about it a little more intelligently, let's just find
out what our sample proportion even is. We had 57 people out of 150
having internet access. So 57 households out of 150. So our sample proportion
is 0.38, so let me write that over here. Our sample proportion
is equal to 0.38. So when we assume our null
hypothesis to be true, we're going to assume a population
proportion that maximizes the probability that we get
this over here. So the highest population
proportion that's within our null hypothesis that will
maximize the probability of getting this is actually
if we are right at 30%. So if we say our population
proportion, we're going to assume this is true. This is our null hypothesis. We're going to assume that
it is 0.3 or 30%. And I want you to understand that--
29% would also have satisfied the null hypothesis. 28% would also have
satisfied the null hypothesis. But for 29% or 28%, the
probability of getting this would have been even lower. So it wouldn't have been
as strong of a test. If we take the maximum
proportion that still satisfies our null hypothesis,
we're maximizing the probability that we get this. So if that number is still low,
if it's still less than 5%, we can feel pretty good
about the alternative hypothesis. So just to refresh ourselves
we're going to assume a population proportion of 0.3,
and if we just think about the distribution-- sometimes it's
helpful to draw these things, so I will draw it. So this is what the population
distribution looks like based on our assumption,
based on this assumption right over here. Our population distribution
has-- or maybe I should write 30% have internet access. And I'll signify
that with a 1. And then the rest don't
have internet access. 70% do not have internet
access. This is just a Bernoulli
distribution. We know that the mean over here
is going to be the same thing as the proportion that
has internet access. So the mean over here is going
to be 0.3, same thing as 30%. This is the population mean. And maybe I should
write it this way. The mean assuming our null
hypothesis, the population mean assuming our null
hypothesis is 0.3. And then the population
standard deviation. Let me write this over
here in yellow. The population standard
deviation assuming our null hypothesis. And we've seen this when we
first learned about Bernoulli distributions. It is going to be the square
root of the percentage of the population that has internet
access, so 0.3 times the proportion of the population
that does not have internet access, times 0.7
right over here. So this is the square
root of 0.21. And we could deal with this
later using our calculator. Now, with that out of the way,
we want to figure out the probability of getting a sample proportion of 0.38. So let's look at the
distribution of sample proportions. So you could literally look at
every combination of getting 150 households from this, and
you would actually get a binomial distribution. And we've also seen
this before. You would actually get a
binomial distribution where you'd get a bunch of
bars like that. But if your n is suitably large,
and in particular-- and this is kind of the test for
it-- the test is if n times p-- and in this case we're saying
p is 30%-- if n times p is greater than 5, and n times 1
minus p is greater than 5, you can assume that the distribution
of the sample proportion or the sample
proportion distribution is going to be normal. So if you looked at all of the
different ways you could sample 150 households from this
population, you get all of these bars. But since our n is pretty big,
it's 150, and 150 times 0.3 is obviously greater than 5. 150 times 0.7 is also
greater than 5. You can approximate that with
a normal distribution. So let me do that. So you can approximate it with
a normal distribution. So this is a normal distribution
right over there. Now the mean of the distribution
of the proportion data that we're assuming is a
normal distribution is going to be-- and remember, we're working
under the context that the null hypothesis is true. So this mean is going to be--
this mean right here-- so the mean of our sample proportions
is going to be the same thing as our population mean. So this is going to be 0.3,
same value as that. And the standard deviation--
this comes straight from the central limit theorem. So the standard deviation of
our sample proportions, the standard deviation is going to
be the square root-- let me put it this way-- it's going to
be our population standard deviation, the standard
deviation we're assuming with our null hypothesis divided by
the square root of our sample size. And in this
case our sample has 150 households. It's going to be the square root of 150,
and we can calculate this. This value on top we just
figured out is the square root of 0.21. So this is the square root
of 0.21 over the square root of 150. And I can get the calculator
out to calculate this. So I'll just do it the
way I wrote it. The square root of 0.21-- and
I'm going to divide that, so whatever the answer is I'm going to
divide that by the square root of 150. So it's 0.037. So we figured out the standard
deviation here of our-- or the distribution of our sample
proportions is going to be-- let me write this down, I'll
scroll over to the right a little bit-- it is 0.037. I think I'm falling off the
screen a little bit. So we'll just say 0.037. Now to figure out the
probability of having a sample proportion of 0.38, we just have
to figure out how many standard deviations that is
away from our mean, or essentially calculate a
Z-statistic for our sample, because a Z-statistic or a
Z-score is really just how many standard deviations you
are away from the mean. And then figure out whether
the probability of getting that Z-statistic is more
or less than 5%. So let's figure out how many
standard deviations we are away from the mean. So just to remind ourselves, this
sample proportion we got we can view as just a sample from
this distribution of all of the possible sample
proportions. So how many standard
deviations away from the mean is this? So if we take our sample
proportion, subtract from that the mean of the distribution
of sample proportions and divide it by the standard
deviation of the distribution of the sample proportions, we
get 0.38, 0.38 minus 0.3. All of that over this
value which we just figured out was 0.037. So what does that give us? The numerator over
here is a 0.08. The denominator is 0.037. So let's figure this out. So our numerator is 0.08 divided
by this last number right here, which
is the 0.037. So I'll use the calculator's previous answer, and we get
2.1-- I'll just round it to 2.14 standard deviations. So this is equal to-- this right
here is equal to 2.14 standard deviations. Or we could say that our
Z-statistic, right, we could call this our Z-score or our
Z-statistic, the number of standard deviations we are away
from our mean is 2.14. We're at 2.14, and to be exact,
we're 2.14 standard deviations above the mean. We're going to care about a
one-tailed test. Now is the probability
of getting this more or less than 5%? If it's less than 5% we're
going to reject the null hypothesis in favor of
our alternative. So how do we think about that? Well let's think about just
a normalized normal distribution. Or maybe you could call it the
Z-distribution if you want. If you look at a normal
distribution, a completely normalized normal distribution, its mean is at 0. And essentially each
of these values are essentially Z-scores. Because a value of 1 literally
means you are 1 standard deviation away from this
mean over here. So we need to find a critical
Z-value right over here. Let me call that a critical Z--
we could even say critical Z-score or critical Z-value--
so that the probability of getting a Z-value higher
than that is 5%. So that this whole area
right here is 5%. And that's because that's what
our significance level is. Anything that has a lower than
5% chance of occurring will, for us, be grounds to reject
our null hypothesis. Or another way of thinking about
it is that area's 5%, this whole area right
over here is 95%. And once again, this is a
one-tailed test, because we only care about values
greater than this. Z-values greater than that will
make us reject the null hypothesis. And to figure out this critical
Z-value you can literally just go
to a Z-table. And we say OK, the probability
of getting a Z-value less than that is 95%. And that's exactly the number
that this gives us. The cumulative probability
of getting a value less than that. So if we just scan this,
we're looking for 95%. We have 0.9495, we
have 0.9505. So I'll go with this one, 0.9505, just
to make sure we're a little bit closer. So for this value, the Z-value
here is 1.6, and the next digit is 5, so 1.65. So this critical Z-value
is equal to 1.65. So the probability of getting
a Z-value less than 1.65, or even in a completely
normalized normal distribution, the probability
of getting a value less than 1.65. Or in any normal distribution,
the probability of being less than 1.65 standard deviations
above the mean is going to be 95%. So that's our critical
Z-value. Now our Z-value, or the
Z-statistic, for our actual sample is 2.14. Our actual Z-value
we got is 2.14. It's sitting all the way
out here some place. So the probability of
getting that was definitely less than 5%. And actually we could even say
what's the probability of getting that or a more
extreme result. And if you figured out this
area, and you could actually figure it out by looking at a
Z-table, you could figure out the P-value of this result. But anyway, the whole exercise
here is just to figure out if we can reject the null hypothesis
with a significance level of 5%. We can. This is a more extreme result
than our critical Z-value, so we can reject the null
hypothesis in favor of our alternative.
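
The whole test can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not something from the video: the function name `upper_tail_proportion_z_test` and the keys of the dictionary it returns are made up for this example. It also uses the exact 95th percentile of the standard normal distribution (about 1.645) rather than the 1.65 read off the Z-table, which does not change the conclusion here.

```python
from math import sqrt
from statistics import NormalDist


def upper_tail_proportion_z_test(successes, n, p0, alpha=0.05):
    """Large-sample z-test of H0: p <= p0 against H1: p > p0.

    Hypothetical helper written for illustration only.
    """
    # Rule of thumb from the transcript: n*p and n*(1-p) must both exceed 5
    # before the binomial can be approximated by a normal distribution.
    if n * p0 <= 5 or n * (1 - p0) <= 5:
        raise ValueError("sample too small for the normal approximation")

    p_hat = successes / n                       # sample proportion
    std_error = sqrt(p0 * (1 - p0)) / sqrt(n)   # Bernoulli sd over sqrt(n), under H0
    z = (p_hat - p0) / std_error                # standard deviations above the mean

    std_normal = NormalDist()                   # mean 0, standard deviation 1
    z_critical = std_normal.inv_cdf(1 - alpha)  # cuts off the upper alpha tail
    p_value = 1 - std_normal.cdf(z)             # P(Z >= z) assuming H0

    return {
        "p_hat": p_hat,
        "std_error": std_error,
        "z": z,
        "z_critical": z_critical,
        "p_value": p_value,
        "reject_null": z > z_critical,
    }


# The numbers from the video: 57 of 150 households, H0 boundary at 30%.
result = upper_tail_proportion_z_test(successes=57, n=150, p0=0.3)
for name, value in result.items():
    print(name, value)
```

With these inputs the Z-statistic comes out at about 2.14, which is beyond the critical value, so the sketch reaches the same conclusion as the video: reject the null hypothesis at the 5% significance level. The `p_value` entry is the tail area mentioned near the end, the probability of getting this result or a more extreme one if the null hypothesis were true.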