## Estimating a population proportion

# Margin of error 1

## Video transcript

Say I live in a country of 100
million people and there's a presidential election
coming up. And in that presidential
election there are two candidates. There's candidate A,
and candidate B. And there's some reality-- let's
say I live in a very decisive country and everyone is
going to vote for either-- and everyone participates in
the election and everyone is going to vote for either candidate
A or candidate B. And so there's some percentage,
there's some reality there, that p-- let me
write it over here-- maybe 1 minus p percent-- let me do
the p first. There's some reality that maybe p percent
will vote for B, and I could switch them around
if I wanted. So p percent are going to vote
for B, and the rest of the people are going to vote for A,
so maybe 1 minus p percent are going to vote for A. And you might already recognize
that this is a Bernoulli Distribution. There's one of two values
for a sample I can get. And right here, the values I
said you're either voting for candidate A or you're voting
for candidate B. It's very hard to deal
with those values. You can't calculate a mean
between A and B and all of that-- those are letters,
they're not numbers. So to make it manipulatable
mathematically we're going to say sampling someone who's
going to vote for A is equivalent to sampling a 0,
and sampling someone who's going to vote for B is
equivalent to sampling a 1. And if you do that with a
Bernoulli Distribution, we learned in the video on
Bernoulli Distributions, that the mean of this distribution
right here is going to be equal to p. And it's a pretty
straightforward proof for how we got that. So the mean of this
distribution, which will actually be not a value that
this distribution can take on, is going to be some place over
here and it is going to be equal to p. Now my country has
100 million people. It is practically, or is
definitely impossible for me to be able to go and ask all
100 million people who they are going to vote for. So I won't be able to exactly
figure out what these parameters are going to be. What my mean is, what
p is going to be. But instead of doing that, what
I'm going to do is do a random survey. I'm going to sample this
population, look at that data, and then get an estimate
of what p really is. Because this is what I
really care about. I really care about p. So I'm going to try to estimate
p with a sample, and then we're also going to think
about how good of an estimate that is. So I am going to randomly
survey, or sample, 100 people. And let's say I got the
following results. Let's say that 57 people say
that they are going to vote for person A. Let me write it this way. So 57 people say they're going
to vote for A, or that's equivalent to getting
57 samples of 0. And then the rest of the people,
once again, very decisive population, no one is
undecided, the rest of the people, so 43 people say they're
going to vote for B. Or that's the equivalent
of sampling 43 1's. Now given this sample here, what
is my sample mean and my sample variance? My sample mean right here, well
that's just going to be the average of these 0's and
1's. So I've got 57 0's, so it's going to be 57 times
0, plus my 43 1's, so plus 43 times 1. That's the sum of all of my samples,
and it goes over the total
number of samples I took, over 100. So what does this get me? So 57 times 0 is 0. 43 times 1 divided
by 100 is 0.43. That is my sample mean, the
mean of just the 100 data points that I actually got. Now what is my sample
variance? Sample variance is going to be
equal to the sum of my squared distances to the mean divided
by my number of samples minus 1. Remember, this is a sample
variance, and we want to get the best estimator of the real
variance of this distribution. And to do that you don't divide
by 100, you're going to divide by 100 minus 1. We learned that many,
many videos ago. So I have 57. So I had 57 samples of 0. We'll do it in that same
yellow color-- 57 samples of 0. And so each of those samples
is 0 minus 0.43 away from the mean. Each of those samples is 0. You subtract 0.43-- this
is the difference between 0 and 0.43. And if I want the squared
distance, I square it-- that's how we calculate variance. There's 57 of those. And then there's 43 times that
I sampled a 1 in my sample population-- 43 times I sampled
a 1, and the 1 is 1 minus 0.43 away from the mean
because that is the mean, and I want to square
that distance. And then I don't want to
just divide it by n. I don't want to just divide by
100-- remember, I'm trying to estimate the true
population variance. In order for this to be the best
estimator of that, and I gave you the intuition of why
many, many videos ago, we divide by 100 minus 1 or 99. Let's get the calculator out
to actually figure out our sample variance. So let me get the calculator
out, and we have-- I'll do the numerator first. I have 57 times
0 minus 0.43 squared, plus 43 times 1 minus
0.43 squared. And then all of that divided
by 100 minus 1, or 99-- divided by 99 is equal
to 0.2475. So my variance, my sample
variance, is equal to 0.2475. And if I want to figure out my
sample standard deviation I just take the square
root of that. My sample standard deviation is
just going to be the square root of my sample variance. So I take the square root of
that value that I just had, which is 0.497. So actually let me just
round that up to 0.50. So my sample standard
deviation is 0.50. Now if you just look at this,
you say OK, well your best estimate of the percentage of
people voting for A or B is really what you just saw here. Your best estimate or your best
estimate of the mean is that 43% of people are going
to vote for B and everyone else is going to vote for A. But an interesting question
is how good of a sample is that? Let's take it to
the next level. Let's try to think of an
interval around 43% for which we are 95%, that we're
reasonably confident, roughly 95% sure that the real mean
is in that interval. Let me make it very clear. Let me draw. So when we get our sample mean
we are sampling from the sampling distribution of
the sample mean. Let me draw that. The sampling distribution
of the sample mean. So since we're sampling from a
discrete distribution it's actually going to be a discrete
distribution, but it's going to have 101
possible values. This can take on 101 different
values here, 0, 0.01, 0.02, and so on, really anything between
0 and 1. But I'll draw it kind of
continuous because it would be hard for me to draw 101
different bars. If I did, you'd have a bar
there, you'd have a bar there. The odds that your sample mean
would be 1, it would be a very low probability, and then you
would have one more bar, a bar like that, a bar like
that, but that takes forever to draw. So I'm just going to approximate
it with this normal curve right over there. So the sampling distribution
of the sample mean-- let me write it over here. So this is the sampling
distribution of the sample mean. It has some mean here. It has a mean, and I can denote
it with mu sub x bar-- this tells us this is
the mean of the sampling distribution. But we know from many, many
videos that this is going to be the same thing as the mean of
the population that we are sampling from, that each
sample comes from, each of these 100 samples come from. So this is going to be equal
to mu, which is going to be equal to p. Now this variance over here,
the variance of this distribution-- let me draw it
like this, or even better let's do the standard
deviation of this distribution. The standard deviation of this
distribution, that distance right over here, the standard
deviation of the sampling distribution of the sample
mean-- we've seen it multiple times already-- it's going to
be this standard deviation-- it's going to be the standard
deviation of our population distribution. So that standard deviation
is going to be that distance over there. So there's some standard
deviation associated with this distribution. It's going to be that standard
deviation divided by the square root of our
sample size. And we saw many videos ago why
that, at least experimentally makes sense, or why it
intuitively makes sense. So it's going to be the
square root of 100. So it's going to be this
guy divided by 10. Now we do not know
what this guy is. The only way to figure out
what that guy is is to actually survey 100 million
people, which would have been impossible. So to estimate the standard
deviation of this, we will use our sample standard deviation
as our best estimate for the population standard
deviation. So we could say-- and remember,
this is an estimate. We cannot come up with the exact
number for this just from a sample. But we can estimate it. Because this is our best
estimator for this standard deviation, and if we divide it
by 10, we will have our best estimator for the standard
deviation of the sampling distribution of the
sample mean. So remember, this is
just an estimate. It is just an estimate. So you kind of have to take
everything after this point with a little bit of
a grain of salt. So it's going to be roughly
equal to or an estimate of it is going to be 0.5. And remember, every time we take
a different sample from here this number is
going to change. So this isn't like something
set in stone. This is dependent
on our sample. So it's going to wiggle around a
little bit depending on what numbers we actually
get in our sample. But it's going to be 0.50. This is the s right here, this
0.50 divided by 10, which is equal to 0.05. So our best estimate of this
standard deviation is 0.05, or you could even view it as 5%. Now what I want to do is come up
with an interval around the sample mean where I'm reasonably
confident using all my estimates and all that that
there's a-- let me say I'm really confident that there's
a 95% chance that the true mean is within two standard
deviations-- or let me put it this way, there's a 95%
chance that the true mean is in that interval. So let me write this down. I want to find an interval
such that I am reasonably confident-- and I'm putting
this kind of touchy-feely language over here because it's
all around the fact that I don't know for a fact that
the standard deviation is 0.05, I'm just estimating. But I'm reasonably confident
that there is a 95% chance that the true mean of the
population, which is the same thing as the proportion of the
population who are going to vote for person B, or the
proportion of the population that are going to be a 1. So this is also, we just
have to remember that mu is equal to p. That there's a 95% chance
that the true p is in that interval. And actually, since I've already
gone 14 minutes into this video, I'm going to pause
this video, I'm going to stop this video here, and maybe I'll
even let you think about it just based on everything
we've done so far. We figured out the sample mean--
sorry, we figured out the sample mean right
over here. We've figured out an estimate
for the-- and remember, this is just a sample mean. We don't know the true-- this
is the mean of our sample. We don't know the true mean of
the sampling distribution, and we also don't know the true
standard deviation of the sampling distribution. But we were able to estimate
it with the sample standard deviation. Now everything that we have so
far, and based on what we've seen before on confidence
intervals and all that, how can we find an interval such
that roughly-- and I'm saying roughly because we had to
estimate the standard deviation-- that there's a 95%
chance that the true mean of our population, or the p, the
proportion of the population saying 1, is in that interval? And we're going to do that
in the next video.
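
The arithmetic in this video can be checked with a few lines of Python. This is only a sketch: the video does everything by hand and on a calculator, so the variable names below are illustrative assumptions, and the only data used is the 57 and 43 from the survey.

```python
import math
import statistics

# Survey results from the video: 57 people for A (coded as 0),
# 43 people for B (coded as 1). Variable names are illustrative only.
votes = [0] * 57 + [1] * 43
n = len(votes)

# Sample mean: our estimate of p, the proportion voting for B.
sample_mean = sum(votes) / n

# Sample variance, written out the way the video does it:
# squared distances to the sample mean, divided by n - 1 (not n).
sample_variance = (
    57 * (0 - sample_mean) ** 2 + 43 * (1 - sample_mean) ** 2
) / (n - 1)

# Cross-check against the standard library, which also divides by n - 1.
library_variance = statistics.variance(votes)

# Sample standard deviation: square root of the sample variance.
sample_sd = math.sqrt(sample_variance)

# Estimate of the standard deviation of the sampling distribution
# of the sample mean: sample standard deviation over sqrt(n).
standard_error = sample_sd / math.sqrt(n)

print(f"sample mean:      {sample_mean:.2f}")
print(f"sample variance:  {sample_variance:.4f}")
print(f"library variance: {library_variance:.4f}")
print(f"sample std dev:   {sample_sd:.3f}")
print(f"standard error:   {standard_error:.3f}")
```

The unrounded standard deviation comes out slightly below 0.50, and the estimate for the sampling distribution slightly below 0.05. Those are the values the video rounds to 0.50 and 0.05, and they would change with a different sample of 100 people.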