Statistics and probability
Comparing population proportions 1
Sal uses an election example to compare population proportions. Created by Sal Khan.
- I have a question about how Sal derived the variance for the men and women groups, starting at 5:31 or so of this video. It seems that he takes the formula for variance in a Bernoulli distribution for populations and uses that to calculate the sampling distribution variance. Since you have data from the sample itself, why wouldn't you calculate the sample's variance first (i.e., s^2), then use s^2? It seems like this was done in an earlier video for Bernoulli distributions? I thought it would work something like this for the men's sample: s^2=(358x(0-.358)^2)+(642x(.642-0)^2)/1000-1. ... I think that the video showing this was entitled "Estimating Population Proportion." Any thoughts on this greatly appreciated!(7 votes)
- At about 7:15 Sal put an n in the sampling distribution for the women, but we already know that it is 1000, correct? We put that number in on the men's side. Help, I'm trying not to get confused!!!!! lol...(5 votes)
- Yes, n=1,000 for the women as well; it's corrected in the next video when it comes time to calculate it.(5 votes)
- Say you had a distribution where p1 = .30, so (1-p1) = .70, would the mean still be p1 aka .3? or is it whatever the larger number is?(2 votes)
- It would still be p1(4 votes)
- A bit confused as to why variance of p1 - p2 is the sum of variance(p1) and variance(p2)?(2 votes)
- Why are you using p-bar and not p-hat? Please help, I am very confused by this. Also why are we not using x-bar?(3 votes)
- Are the male and female voters populations or samples of populations? Note, we do not use the unbiased formula for the standard deviation. And in later videos in this series, we do not use the standard error of the mean formula.(2 votes)
- When we take 2.5% of alpha, if our z value is more than 1.96 (but less than 5%), then we reject the null hypothesis(1 vote)
- using 1.96, you put the 95% in the middle section of the normal distribution, and you have 5% left over, 2.5% in each tail. For 95% confidence, alpha is 0.05, but alpha/2 = 0.025. To use some of the typical normal dist tables, you have to look up the probability from a z-value down to negative infinity, so you are looking up the probability 0.95 + 0.025 (the lower tail probability), and that will give you z=1.96. It is still the 95% confidence level, but you have to work with how the z-table is constructed.(2 votes)
- Must the sample sizes be the same? Could N.Men=2000 and N.Women=1000? Can we still compare them?(1 vote)
- If the mean of a Bernoulli distribution is p, then why is the mean that Sal takes at 4:16 equal to 642/1000?(1 vote)
- In maths, what we say we write in formulae. In this video you put the sigma of the two proportions with a minus sign in the subscript. This difference can be shown only in the paired condition, not with independent samples. Many books have this problem, but as per my understanding this is wrong. You can put only a + sign there. If I am wrong, kindly email me. I may get more insight into the behaviour of variance. Thanks(1 vote)
Let's say there's an election coming up and I want to figure out if there's a meaningful difference between the proportion of men and the proportion of women that are going to vote for a candidate. So let's look at the population distributions here. So we have the men, some proportion are going to vote for the candidate. We'll call that P1. So this is the proportion that will vote for the candidate. And the rest of the men will not vote for the candidate. So 1 minus P1 will not vote for the candidate. And then for the women, you're going to see something similar. So this is the women right over here. And some proportion will vote for the candidate. We don't know if it's the same as P1, we don't know if it's same as the men, so we'll call it P2. And then the rest of the women will not vote for the candidate. 1 minus P2. So the not voting are zeroes, the ones that are voting are ones. And these are both Bernoulli distributions and we know, just because this'll be useful later on, that the means of this distribution are the same as the proportion that will vote for it. So the mean of the men, or the proportion of the men that will vote, so we'll call that mean one, is equal to P1. I should do everything in yellow. So the mean of this distribution is P1. The variance of this distribution, we'll call that variance one, is just these two proportions multiplied by each other. So it's P1 times 1 minus P1. And we saw this many many videos ago when we learned about Bernoulli distributions. And we're going to see the exact same thing with the women. The mean of this Bernoulli distribution is going to be P2. And then the variance of this Bernoulli distribution is going to be these two proportions multiplied. So P2 times 1 minus P2. Now, what I want to do, and I think I said this at the beginning of the video, is I want to figure out if there's a meaningful difference between the way that the men will vote and the women will vote. 
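The two Bernoulli facts used above (mean equal to p, variance equal to p times 1 minus p) can be checked directly from the definitions of mean and variance. The sketch below is only an illustration: the function name `bernoulli_mean_and_variance` and the value `p1 = 0.6` are assumptions made for the example, since the true proportion P1 is unknown in the video.

```python
def bernoulli_mean_and_variance(p):
    """Mean and variance of a variable that is 1 with probability p, else 0."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    # Mean: weight each outcome (0 or 1) by its probability.
    mean = 0 * (1 - p) + 1 * p
    # Variance: probability-weighted squared distance from the mean.
    variance = (0 - mean) ** 2 * (1 - p) + (1 - mean) ** 2 * p
    return mean, variance


# Hypothetical proportion, chosen only for illustration (not from the video).
p1 = 0.6
mean1, variance1 = bernoulli_mean_and_variance(p1)

# The definition-based variance agrees with the shortcut p * (1 - p).
shortcut = p1 * (1 - p1)
print(mean1, variance1, shortcut)
```

Whichever of p and 1 minus p is larger, the mean is still p, because only the ones contribute to the sum.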
I want to figure out, let me write this, is this meaningful? So is there a meaningful difference here? And what we're going to do in this video is try to come up with a 95% confidence interval for this parameter. This difference of parameters is still a parameter. We don't know what the true difference of these two population parameters is. Or these two population proportions. But we're going to try to come up with a 95% confidence interval for that difference. And the way we do that, we go out and we find 1,000 men likely to vote. And 1,000 women likely to vote. So let's write this down. So we get 1,000 men. When we survey the 1,000 men, let's say 642 say that they will vote for the candidate. So they are ones. And then the remainder, 358, I'll just say the remainder. So the rest are zeros. Then we do the same thing with women. We survey 1,000 women who are likely to vote. But we survey them randomly. And let's say 591 say that they will vote for the candidate. And the rest say that they will not vote for the candidate. So just here based on our sample proportions, or our sample means, it looks like there is a difference. But we still have to come up with our confidence interval. And let's just make sure we understand what we just did. So we could figure out a sample proportion over here for the men. Which is really just the sample mean of this sample right over here. We have 642 ones, the rest are zero. So we have 642 in the numerator. We have 1,000 samples. 642 divided by 1,000 is 0.642. So you could view this as a sample mean or as a sample proportion. If you do the same thing for the women, the sample proportion is going to be 0.591. Or you could even just view this as the sample mean of the sample of 1,000 women. Where the ones voting for it are one, the rest are zero. And just to visualize it properly, let me draw the sampling distribution for the sample proportions. We have a large sample size. 
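The arithmetic for the two sample proportions can be sketched as follows. The survey counts come from the example above; the helper name `sample_proportion` and the variable names are assumptions made for illustration.

```python
def sample_proportion(successes, n):
    """Fraction of respondents coded as 1 (will vote for the candidate)."""
    return successes / n


# Survey counts from the example.
men_yes, men_n = 642, 1000
women_yes, women_n = 591, 1000

p1_bar = sample_proportion(men_yes, men_n)      # men
p2_bar = sample_proportion(women_yes, women_n)  # women

# The same number, viewed as the mean of 642 ones and 358 zeros.
men_responses = [1] * men_yes + [0] * (men_n - men_yes)
p1_bar_as_mean = sum(men_responses) / len(men_responses)

# Observed difference between the two sample proportions.
observed_difference = p1_bar - p2_bar
print(p1_bar, p2_bar, round(observed_difference, 3))
```

This shows why the sample proportion and the sample mean are interchangeable here: averaging zeros and ones just counts the ones and divides by the sample size.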
And especially because the proportions that we're dealing with aren't close to one or zero, and we have a large sample size, the sampling distribution will be approximately normal. Let me write this. So it's going to have some mean over here. So the mean of the sampling distribution of the sample proportion. And we've seen it multiple times. It's going to be the same thing as the mean of the population. And the mean of the population is actually the true population proportion. So this is going to be equal to P1. This is something that we don't know. And then the variance of this, and we've seen this several times already, the variance of this distribution, I have to put a one here, we're dealing with the men. The variance of this distribution by the central limit theorem is going to be the variance of this distribution up here, which is P1 times 1 minus P1 over our sample size, over 1,000. And we can do the exact same thing for the women. So this is the sampling distribution. This is for P2 bar, or this sample mean over here. Let me put a one over here. Remember, this is all for the men. And then this over here is all for the women. Can't forget those twos over there. And so this distribution is going to have some mean. Let me draw it right over here. So mu sub P2 with a bar over it. So the mean of the sampling distribution for this sample proportion, for the women, which is going to be the same thing as the mean of the population, which we already saw is going to be equal to P2. And then the variance for this distribution, for this sampling distribution over here, is going to be this variance over here divided by our sample size. So P2 times 1 minus P2. All of that over n. Now, our whole goal is to get a 95% confidence interval for that. And so what we're going to do is we're going to think about the sampling distribution, not for this, and not the sampling distribution for this. 
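One way to see that the sampling distribution of a sample proportion has mean p and variance p times 1 minus p over n is to simulate many surveys. This is a rough sketch under stated assumptions: the true proportion `p1 = 0.6`, the number of trials, and the function name `simulate_sample_proportions` are all hypothetical and chosen only for illustration.

```python
import random
import statistics


def simulate_sample_proportions(p, n, trials, seed=0):
    """Run `trials` surveys of size n; return each survey's sample proportion."""
    rng = random.Random(seed)
    proportions = []
    for _trial in range(trials):
        # Each respondent is a 1 with probability p, otherwise a 0.
        yes_count = sum(1 for _person in range(n) if rng.random() < p)
        proportions.append(yes_count / n)
    return proportions


p1 = 0.6   # hypothetical true proportion (unknown in the real problem)
n = 1000   # sample size used in the example

proportions = simulate_sample_proportions(p1, n, trials=2000)

simulated_mean = statistics.fmean(proportions)
simulated_variance = statistics.pvariance(proportions)
theoretical_variance = p1 * (1 - p1) / n

print(simulated_mean, simulated_variance, theoretical_variance)
```

With enough trials, the simulated mean should land close to `p1` and the simulated variance close to `theoretical_variance`; the match is approximate, not exact.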
But we're going to think about the sampling distribution for the difference of this sample proportion and this sample proportion. We've seen it already. We're talking about proportions, but it's really the same exact ideas that we did when we just compared sample means generally. So let's look at that. Let's look at this distribution. And just to be clear, when we got this sample mean here, this sample proportion, we just sampled it. You could view it as taking a sample from this distribution over here. When we got this sample proportion, it was like taking a sample from this over here. We took 1,000 samples from this, and we took their mean, which is equivalent to taking a sample from the sampling distribution. Now, this distribution over here is going to be the distribution of all of the differences of the sample proportions. So it will look like this. It will have some mean value. I should do this in a different color. I'll do it in green. Yellow and blue make green. So I'll call this the sampling distribution of this statistic, of P1 minus P2. And so it has some mean over here. That's the mean of the sample proportion of P1 minus the sample mean, or the sample proportion, of P2. And we know, from things that we've done in the last several videos, that this is going to be the exact same thing as this mean minus this mean. Which is the exact same thing as P1 minus P2. So this is going to be equal to P1 minus P2. And the variance of this distribution, P1 minus P2, just like this, is going to be the sum of the variances of these two distributions. So it's going to be this thing over here, I'll just copy and paste it, plus this variance over here. There's no radical sign, because we're not taking the standard deviation. We're focused on the variance right now. So plus this thing right over here. So let me copy and let me paste it. So that's going to be the variance. And if you want the standard deviation, you can literally just get rid of this. 
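The mean, variance, and standard deviation of the difference can be sketched in a few lines. Two assumptions are made here: the two samples are independent (which is what lets the variances add, even though the statistic is a difference), and the sample proportions 0.642 and 0.591 are plugged in as stand-ins for the unknown P1 and P2 purely to get numbers. The function name `difference_distribution` is hypothetical.

```python
import math


def difference_distribution(p1, n1, p2, n2):
    """Mean, variance, and standard deviation of (sample prop. 1 - sample prop. 2).

    Assumes the two samples are independent, so the variances add.
    """
    mean = p1 - p2
    variance = p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2
    standard_deviation = math.sqrt(variance)
    return mean, variance, standard_deviation


# Stand-in values: the sample proportions from the survey, used in place of
# the unknown true proportions (an assumption made only for illustration).
mean_diff, var_diff, sd_diff = difference_distribution(0.642, 1000, 0.591, 1000)
print(mean_diff, var_diff, sd_diff)
```

The function takes `n1` and `n2` separately, so the two sample sizes do not have to be equal for the formula to apply.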
You're taking the square root of both sides. So you take the square root of the variance, you get the standard deviation, that's why I got rid of that to the second power. And you want to take a square root of the right-hand side just like that. Now, all I did right now was just to kind of conceptually set things up in our brain. What we now need to do is actually tackle the confidence interval. We actually need to come up with a 95% confidence interval for P1 minus P2. Or a 95% confidence interval for this mean right over here. And because I'm trying to make my best effort not to make videos too long, I'll do part two in the next video, where we actually solve the confidence interval.