© 2024 Khan Academy

# Example constructing and interpreting a confidence interval for p

Check conditions, calculate, and interpret a confidence interval to estimate a population proportion.

## Want to join the conversation?

- When calculating the standard error, is it better to calculate the unbiased standard deviation of the sample and divide that by the positive root of the sample size,

here sqrt( (30*(0.4)^2 + 20*(0.6)^2) / 49 / 50) = 0.0699854...,

or use the formula of the standard error of the sample distribution, using p hat as an estimate of p ?

here sqrt(0.4 * 0.6 / 50) = 0.0692820...

The two don't give the same result. Sal uses both methods in different videos without saying why, so some extra explanation would be helpful! (32 votes)
- Actually, your results are **equal** if you round them to the nearest hundredth: both are 0.07.

The similarity between the two formulas is that they are **both** formulas for estimating the **standard deviation of the sampling distribution** (**sigma_x_bar**), which is used directly to calculate the CI (confidence interval).

The difference is that the first one is for a **continuous** random variable (like weight or height) and the second one is for a **binary** random variable (a Bernoulli random variable: success/failure).

So far the two formulas give roughly the same results, but I'd advise using the first formula in problems that deal with a **continuous** random variable and the second formula in problems that deal with a **binary** random variable. (2 votes)
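
To see where the two numbers in the question come from, here is a minimal Python sketch. It assumes the sample is coded as 0/1 values (1 for a song by a female artist); the variable names are illustrative and not from the video.

```python
import math
import statistics

# Della's sample coded as 0/1: 20 songs by a female artist, 30 not.
sample = [1] * 20 + [0] * 30
n = len(sample)
p_hat = sum(sample) / n

# Approach 1: unbiased sample standard deviation (n - 1 divisor) over sqrt(n).
se_from_s = statistics.stdev(sample) / math.sqrt(n)

# Approach 2: standard error of a sample proportion (n divisor).
se_proportion = math.sqrt(p_hat * (1 - p_hat) / n)

print(f"s / sqrt(n)                = {se_from_s:.7f}")
print(f"sqrt(p_hat(1 - p_hat) / n) = {se_proportion:.7f}")
print(f"ratio                      = {se_from_s / se_proportion:.5f}")
```

The two values differ only by a factor of sqrt(n / (n - 1)), which shrinks toward 1 as the sample size grows.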

- How do we decide what method to use to estimate the standard error?

Method 1) Perform correction. So standard error = sqrt(p hat * (1- p hat) / (n-1) )

Method 2) Do not perform correction. Standard error = sqrt(p hat * (1- p hat) / n ) (24 votes)
- It's worth noting where the 'correction' comes from. First find the sample standard deviation:

S = SQRT(n * p-hat * (1- p-hat)/(n-1))

then the corrected estimate of the standard deviation of the sampling distribution of p-hat is:

Std deviation of p-hat = S/SQRT(n) = SQRT(p-hat * (1- p-hat)/(n-1)),

which is the same as Method 1.

As others below point out, whenever population parameters are unknown, it's probably best to use the correction method to avoid bias. I'll happily defer to any experts who can give a more valid explanation. (1 vote)
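
As a numeric check on the formulas in this thread, the sketch below evaluates each one with n = 50 and p-hat = 0.4 from the video. The variable names are illustrative assumptions.

```python
import math

n = 50
p_hat = 0.4

# Sample standard deviation of 0/1 data, using the n - 1 divisor.
s = math.sqrt(n * p_hat * (1 - p_hat) / (n - 1))

se_via_s = s / math.sqrt(n)                              # S / SQRT(n)
se_method_1 = math.sqrt(p_hat * (1 - p_hat) / (n - 1))   # divisor n - 1
se_method_2 = math.sqrt(p_hat * (1 - p_hat) / n)         # divisor n

print(f"S / SQRT(n): {se_via_s:.7f}")
print(f"Method 1:    {se_method_1:.7f}")
print(f"Method 2:    {se_method_2:.7f}")
```

S / SQRT(n) and Method 1 evaluate to the same number because the n inside S cancels against the SQRT(n) divisor. Method 2 is slightly smaller.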

- What about using ((std. dev. of sample parameter)/sqrt(n)) instead of standard error? (6 votes)
- That would be for problems that deal with a continuous random variable (salary, weight, height, ...), although in this problem they give roughly the same numbers. (1 vote)

- What is the difference between s/sqrt(n) and sqrt(p*(1-p)/n)? I believe they are the same formula, but it's not clear to me why that is the case or when to use each. (4 votes)
- The formulas s/sqrt(n) and sqrt(p*(1-p)/n) are used in different contexts:

s/sqrt(n) is the standard error of the sample mean, where s is the sample standard deviation, used in place of the unknown population standard deviation. This formula is typically used in situations where you have quantitative data and you're estimating the population mean.

sqrt(p*(1-p)/n) is used to calculate the standard error of the sample proportion when dealing with categorical data (e.g., proportion of success or failure). This formula is used when estimating the population proportion from a sample. (1 vote)

- Why does it matter whether the observations are independent or not? (3 votes)
- If our sampled trials are not independent, then the successive trials are not equivalent: the chance of a success changes from one draw to the next. The standard error formula assumes independent trials, so without independence our inferences could be off, and our "supposed" probability could overestimate or underestimate the real value.

Hope this helps,

- Convenient Colleague (2 votes)

- What I meant was: why did Sal make it 99.5%, with the 0.5% above and not below the middle 99%? (1 vote)
- Many z-tables show the area under the curve from -inf up to a point, so if we want 99% confidence, we want 0.5% of the area left over on each of the left and right sides. The area below the upper critical value is then 99% + 0.5% = 99.5%. See also 4:33 (2 votes)

- When we find z*, why can't we just find that by multiplying the standard deviation by 3? (99 percent is 3 standard deviations in a normal distribution.) In this case, couldn't we just multiply the standard error by 3? I did this, but I didn't get the right answer.(1 vote)
- 3 standard deviations away from the mean would actually cover **99.7**% of the whole distribution (according to the empirical rule).

Although 0.7% doesn't seem like much, the z-table says otherwise: the middle 99.0% ends at a z-score of about 2.576 (the entry for a cumulative area of 0.995), while going out to a z-score of 3 covers about 99.73%. The difference in z-score here is around 0.42, which is **a lot**.

My bottom line: when the confidence level isn't one of the 3 thresholds in the empirical rule, look up the z-score in a z-table. You can even just look up the z-score in a z-table anyway for better accuracy. (1 vote)
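
To compare how much of the distribution different z-scores cover, here is a small Python sketch using the standard library's `statistics.NormalDist`. The helper name `middle_area` is an illustrative assumption.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def middle_area(z):
    """Area under the standard normal curve between -z and +z."""
    return std_normal.cdf(z) - std_normal.cdf(-z)

# Empirical-rule z-scores alongside the 99% critical value from the video.
for z in (1, 2, 2.576, 3):
    print(f"within +/-{z} standard deviations: {middle_area(z):.4f}")
```

Going out 3 standard errors instead of 2.576 would give an interval at a higher confidence level than the 99% the problem asks for.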

- Since we have to use an estimate of the population standard deviation, rather than the actual population standard deviation, shouldn't we be using the t-statistic rather than the z-statistic?(1 vote)
- For a sample mean, yes: when the population standard deviation is unknown and needs to be estimated from the sample, it's more appropriate to use the t-distribution rather than the z-distribution. The t-distribution takes into account the additional uncertainty introduced by estimating the population standard deviation from the sample, and for large sample sizes it converges to the standard normal distribution. For a sample proportion, though, the usual procedure is the z interval shown in the video: the standard error is computed from p-hat itself, and the large-counts condition (at least 10 successes and 10 failures) is what justifies the normal approximation. (1 vote)

- Can someone demonstrate a problem for me? (1 vote)
- Problem: Della wants to estimate the proportion of songs on her mobile phone that are by a female artist. She takes a simple random sample of 50 songs and finds that 20 of them are by a female artist. Based on this sample, what is a 99% confidence interval for the proportion of songs by a female artist on her phone?

Solution:

Step 1: Check conditions

Random: Della took a simple random sample, so this condition is met.

Normal: We need at least 10 successes and 10 failures. In this case, Della has 20 successes and 30 failures, so this condition is met.

Independence: Della's sample size (50) is no more than 10% of her total songs (she has over 500), so we can treat the observations as roughly independent.

Step 2: Calculate the confidence interval

Sample proportion (p-hat) = 20/50 = 0.4

Standard error of the sample proportion = sqrt((p-hat * (1 - p-hat)) / n) = sqrt((0.4 * 0.6) / 50) ≈ 0.06928

Critical value (z-star) for a 99% confidence level corresponds to leaving 0.5% in each tail of the standard normal distribution, which is approximately 2.576.

Now, we can construct the confidence interval:

Lower bound = p-hat - (z-star * standard error) = 0.4 - (2.576 * 0.06928) ≈ 0.2215

Upper bound = p-hat + (z-star * standard error) = 0.4 + (2.576 * 0.06928) ≈ 0.5785

Therefore, the 99% confidence interval for the proportion of songs by a female artist on Della's phone is approximately [0.2215, 0.5785]. (1 vote)
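
The steps in this problem can be sketched in Python as follows. This is a minimal sketch, not an official method: the function names `check_conditions` and `z_interval` are made up for illustration, and the population size of 500 is taken as the lower bound on Della's library. The random condition depends on how the sample was collected, so it cannot be checked in code.

```python
import math
from statistics import NormalDist

def check_conditions(successes, n, population_size):
    """Return (large_counts_ok, ten_percent_ok) as booleans."""
    failures = n - successes
    large_counts_ok = successes >= 10 and failures >= 10
    ten_percent_ok = 10 * n <= population_size  # sample is at most 10%
    return large_counts_ok, ten_percent_ok

def z_interval(successes, n, confidence):
    """One-sample z interval for a population proportion."""
    p_hat = successes / n
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    # Leave half of the leftover area in each tail.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * standard_error
    return p_hat - margin, p_hat + margin

print(check_conditions(successes=20, n=50, population_size=500))
low, high = z_interval(successes=20, n=50, confidence=0.99)
print(f"99% confidence interval: ({low:.4f}, {high:.4f})")
```

Computing z-star with `inv_cdf` instead of reading 2.576 from a table changes the bounds only beyond the fourth decimal place.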

- Why didn't he use the value of .9901 from the table if he wants 99% confidence? Why did he choose what he did from the table?(1 vote)
- A z-table gives the area below (to the left of) a z-score. For 99% confidence we want the middle 99% of the curve. Because normal distributions are symmetrical, the 1% that is not included is split evenly between the two sides, 0.5% in each tail. So the critical z-score is the one with 99.5% of the area below it (0.9950 in the table), not 99%. If you looked up 0.9901 instead, you would leave 1% in the upper tail alone, and by symmetry the middle region would only be about 98%.

I hope this answered your question (0 votes)
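
The two table lookups can be compared with a short Python sketch. It uses `statistics.NormalDist` from the standard library in place of a printed z-table; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# z-score with 99% of the area to its left (the 0.99 table entry).
z_for_099 = std_normal.inv_cdf(0.99)
# z-score with 99.5% of the area to its left (the 0.995 table entry).
z_for_0995 = std_normal.inv_cdf(0.995)

for z in (z_for_099, z_for_0995):
    # By symmetry, the middle area is the left area minus the lower tail.
    middle = std_normal.cdf(z) - std_normal.cdf(-z)
    print(f"z = {z:.3f} captures a middle area of {middle:.3f}")
```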

## Video transcript

- [Instructor] We're told that Della has over 500 songs on her mobile phone, and she wants to estimate what proportion of the songs are by a female artist. She takes a simple random sample, that's what SRS stands for, of 50 songs on her phone and finds that 20 of the songs sampled are by a female artist. Based on this sample, which of the following is a 99% confidence interval for the proportion of songs on her phone that are by a female artist? So like always pause this video and see if you can figure it out on your own.

Della has a library of 500 songs right over here. And she's trying to figure out the proportion that are sung by a female artist. She doesn't have the time to go through all 500 songs to figure out the true population proportion, p. So instead she takes a sample of 50 songs, n is equal to 50, and from that she calculates a sample proportion, which we could denote with p hat. And she finds that 20 out of the 50 are sung by a female, 20 out of the 50 which is the same thing as 0.4. And then she wants to construct a 99% confidence interval.

So before we even go about constructing the confidence interval, you wanna check to make sure that we're making some valid assumptions or using a valid technique. So before we actually calculate the confidence interval, let's just make sure that our sampling distribution is not distorted in some way, and so that we can with confidence make a confidence interval. So the first condition is to make sure that your sample is truly random. And they tell us that it's a simple random sample, so we'll take their word for it.

The next condition is to assume that your sampling distribution of the sample proportions is approximately normal. And there you wanna be confident or you wanna see that in your sample you have at least 10 successes and at least 10 failures. Well here we have 20 successes which means, well, 50 minus 20, we have 30 failures. So both of those are more than 10, and so it meets that condition.

And then the last condition is sometimes called the independence test or the independence rule or the 10% rule. If you were doing this sample with replacement, so if she were to look at one song, test whether it's a female or not and then put it back in her pile and then look at another song, then each of those observations would truly be independent. But we don't know that. In fact we'll assume that she didn't do it with replacement. And so if you don't do it with replacement, you can assume rough independence for each observation of a song if this is no more than 10% of the population. And so it looks like it is exactly 10% of the population, so Della just squeezes through on our independence test right over there.

So with that out of the way, let's just think about what the confidence interval's going to be. Well it's going to be her sample proportion plus or minus, there's going to be some critical value, and this critical value is going to be dictated by the confidence level we wanna have, and then that critical value times the standard deviation of the sampling distribution of the sample proportions, which we don't know. And so instead of having that, we use the standard error of the sample proportion. And in this case it would be the square root of p hat times one minus p hat, all of that over n, our sample size, all of that over 50.

So what's this going to be? We're gonna get p hat, our sample proportion here, is 0.4 plus or minus, I'll save the z star here, our critical value, for a little bit. We're gonna use a z-table for that. And so we're gonna have 0.4 right over there, times one minus 0.4, which is 0.6, all of that over 50.

So we can already look at some choices that look interesting here. This choice and this choice both look interesting, and the main thing we have to reason through is which one has a correct critical value. Do we wanna go 1.96 standard errors above and below our sample proportion? Or do we wanna go 2.576 standard errors above and below our sample proportion? And the key is the 99% confidence level.

Now if we have a 99% confidence level, one way to think about it is, so let me just do my best shot at drawing a normal distribution here. And so if you want a 99% confidence level, that means you wanna contain the 99%, the middle 99%, under the curve right over here, that area. And so if this is 99%, then this right over here is going to be 0.5%, and this right over here is 0.5%. We want the z value that's going to leave 0.5% above it. And so that's actually going to be 99.5% is what we wanna look up on the table. And that's because many z-tables, including the one that you might see on something like an AP Stats exam, they will have the area up to and including a certain value. And so they're not going to leave this free right over here.

So let's just look up 99.5% on our z-table. All right, so let me move this down so you can see it. All right, that's our z-table. Let's see, we're at 99... okay, it's gonna be right in this area right over here. And so that is 2.5, looks like 2.57, or 2.58, around that. And so this right over here is about 2.57, it's between 2.57 and 2.58, which gives us enough information to answer this question. It's definitely not going to be this one right over here. We have 2.576, which is indeed between 2.57 and 2.58.

So let's remind ourselves, we've been able to construct our confidence interval right over here. But what does that actually mean? That means that if we were to repeatedly take samples of size 50 and repeatedly use this technique to construct confidence intervals, roughly 99% of the intervals constructed this way are going to contain our true population parameter.
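
The repeated-sampling interpretation at the end of the transcript can be illustrated with a small simulation. This is a sketch under stated assumptions: the library is taken to have exactly 500 songs, and a true proportion of 0.4 is chosen purely for illustration, since Della's actual proportion is unknown.

```python
import math
import random
from statistics import NormalDist

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical library: 500 songs, 200 by a female artist (true p = 0.4).
population = [1] * 200 + [0] * 300
true_p = sum(population) / len(population)

n = 50
trials = 10_000
z_star = NormalDist().inv_cdf(0.995)  # critical value for 99% confidence

hits = 0
for _ in range(trials):
    sample = random.sample(population, n)  # SRS without replacement
    p_hat = sum(sample) / n
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    if p_hat - margin <= true_p <= p_hat + margin:
        hits += 1

coverage = hits / trials
print(f"{coverage:.3f} of the intervals contain the true proportion")
```

The observed fraction should land near 99% but not exactly on it, because the normal approximation is only approximate for samples of size 50.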