Comparing population proportions 2

Sal continues the election example for population proportions. Created by Sal Khan.

Want to join the conversation?

  • Pietro Mercurio
    I did get almost everything... just the last sentence still sounds obscure to me... Why should we point out, with the results we get, that men would vote for candidate one and not zero?
    (26 votes)
    • cchugbo
      That's not quite the conclusion Sal reached. There's only one candidate of interest in this case, and we conclude that it appears that men are more likely to vote for that person than are women. Recall that p is the proportion who vote for the candidate, and that p(sub 1) minus p(sub 2) is the difference between the proportion of men who vote for the candidate and the proportion of women who do likewise. If p(sub 1) minus p(sub 2) is positive (between 0.008 and 0.094), it suggests more men than women are likely to vote for the candidate.
      (47 votes)
  • Bastiaan Manintveld
    At the end Sal says that men are definitely more likely to vote for the candidate, ok, I get that. But these numbers are so small (0.8% to 9.4%), and you aren't really sure (because it's a confidence interval). Can you really say that it's a significant difference?
    Because the way I see it, it's still just a sample and the difference isn't really that big, so it really doesn't tell us much.
    (7 votes)
  • FatRatSnatch
    If you ended up with a partially negative confidence interval, would it still be statistically significant if simply a larger portion of it was on the positive side (thus showing men favoured the "1" candidate)? E.g. from -0.04 to 0.065.

    At what point does it lose statistical significance?
    (5 votes)
    • chr.paetzold
      Remember what the confidence interval represents here. In this case it tells us that the DIFFERENCE between the percentage of men and the percentage of women voting for candidate X is (with a chance of 95%) between two values (e.g. -0.04 and 0.065, as you suggested). A negative difference just means that the number you subtract is bigger than the number you subtract from. In our case this means that the percentage of women voting for candidate X is bigger than the percentage of men doing that. So it still makes sense (has statistical significance), because it says that the EVENT that a little more men than women (percentage-wise) would vote for candidate X is still in the 95%-confidence-interval, so not that unlikely.
      (4 votes)
  • Sunny Shah
    Why is Sal not taking "corrected standard deviation"? I expected him to multiply variance by (1000/999).
    (5 votes)
  • kozlazos
    Can someone explain to me why he changed from 95% to 97.5% to find z?
    (5 votes)
  • Daniel Yokoyama
    The variance presented in the video for the Bernoulli distribution is the population variance. However, what we have is only a sample, so shouldn't it be, for the men for example, S^2_1 = (642(1-0.642)^2 + (1000-642)(0-0.642)^2)/999?
    (3 votes)
  • Parthiban Rajendran
    is almost wrong. It is not that there is a 95% chance the true population mean difference is within the calculated interval around the sample mean difference. It is that, if we took many more such samples and computed a CI each time, 95% of those CIs would contain the true population mean difference.
    (3 votes)
  • misterkush
    If we are asked to draw a 95% confidence interval, why can we not just use the empirical rule to know that the mean must be within 2 standard deviations? Why use the z table at all?
    (1 vote)
    • Dr C
      The Empirical Rule is an approximation. It's certainly useful, but if we're going to the trouble of making a confidence interval, we may as well be precise.

      Additionally, the Empirical Rule corresponds to the Z distribution. Using this for the confidence interval means that you assume you know the population standard deviation. More often, we cannot assume this, and we need to use the t-distribution, for which there is no Empirical Rule.
      (4 votes)
  • Janelle Cleaves
    Has Sal posted any videos on exactly how to read and use a z or t-table? If not, that would be very helpful.
    (3 votes)
    • Laurel
      For a z-table, you look at how many standard deviations your value is from the mean (its z-score), which should have at least hundredths, look for the row that has the same ones and tenths as your z-score (if your score is 1.56, look for the row starting 1.5) then look for the column with the same hundredths as your score (if your score is 1.56, look for the column with 0.06 at the top). The value in the box in that column and row is the probability of a random score falling in the area below your score. You can also read it the other way, as Sal does in this video, by looking for a percentage and finding the z-score from there.

      I'm not completely sure how to read a t-table, but I think you first look at the top two rows to decide which type of t-distribution you have (one-sided means a wonky distribution, asymmetrical, and two-sided means symmetrical, resembling a normal distribution but, as Sal calls it, with 'fatter tails'). From there, you find the column with the percentage you want and follow it down to the row with the appropriate degrees of freedom (listed on the far left column).
      (0 votes)
  • katoriak
    Just to make the notation clear: shouldn't the sample proportions be written with a hat instead of a bar over them?
    (1 vote)
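
Two of the comments above ask whether each variance should carry the n/(n-1) sample correction. The sketch below is one way to check how much that would matter here. It assumes the sample sizes and proportions quoted in the video; the helper `std_err` and the variable names are illustrative and do not come from the video.

```python
from math import sqrt

n = 1000                       # sample size for each group
p_men, p_women = 0.642, 0.591  # sample proportions quoted in the video


def std_err(var_men, var_women):
    """Standard deviation of the difference of two independent sample proportions."""
    return sqrt(var_men / n + var_women / n)


# Uncorrected variance of a 0/1 variable: p * (1 - p).
plain = std_err(p_men * (1 - p_men), p_women * (1 - p_women))

# Same variances scaled by n / (n - 1), the correction the comments ask about.
k = n / (n - 1)
corrected = std_err(k * p_men * (1 - p_men), k * p_women * (1 - p_women))

print(round(plain, 3), round(corrected, 3))
```

With n = 1,000 the correction factor is 1000/999, so both versions round to the same three decimal places and the resulting interval is effectively unchanged.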

Video transcript

Where we left off in the last video, we were trying to figure out if there's a meaningful difference between the proportion of men voting for a candidate and the proportion of women. We sampled 1,000 men, sampled 1,000 women, and we got a sample proportion for each of them. We got 0.642 for the men and 0.591 for the women. But our goal is to get a 95% confidence interval. So just based on our actual sample, we got-- let me write it over here. We got our sample proportion for the men minus-- let me do this in a neutral color. We got our sample proportion for the men minus our sample proportion for the women being 0.642 minus 0.591, that's 0.051. I just subtracted this from that. So what we want to do when we want a confidence interval, we want to be confident. I'll always have to say that because it's not going to be super precise. We want to be confident that there's a 95% chance that this thing right here-- remember, when we took the two sample proportions and took their difference, it's like taking a sample from the sampling distribution of the statistic. So we want a 95% chance that the true mean or the true value of this, that P1 minus P2 is within some range, let's say is within d, I'll say d for distance, of the actual difference that we got with our samples. Within d of 0.051. And I write this multiple times, but I always write it this way. I don't just give the formula that you normally see in books. It's very easy to memorize if you do, but this way, you actually see why this confidence interval makes sense. If there's a 95% chance that P1 minus P2, the actual true proportions, the difference of the true proportions, is within d of the difference between our sample proportions, this statement right here is the same thing that there's a 95% chance that 0.051 is within d of this actual parameter, P1 minus P2, which is the same thing as the mean. 
So we need to figure out some distance around this mean, where if we take a random sample from this, and this is a random sample from this distribution, it has a 95% chance of being within d of this mean, because if it's within d of the mean, then there's also a 95% chance that the mean is within d of our sample, and then we'll have our confidence interval. Our confidence interval would be this value plus d and this value minus d. So what are these? What is the distance d? Well, in a normalized normal distribution, I got a Z-table right over here, and we can assume everything is normal, especially the sampling distributions because our n is so big and also our proportion is not close to zero or one. It's nice and close to the middle, so we don't end up with all these weird cases near the edges. We say, OK, how do we contain the middle 95%? How many standard deviations in a normal distribution do we need to be away from the mean in order to contain 95% of the probability? Now these Z-tables, and we've done it multiple times, give you cumulative distribution. We're looking for this Z-value right over here. If it's containing 95%, you're going to have 2.5% over here and you're going to have 2.5% over here. So from a Z-table's point of view, this Z-table gives you the cumulative probability up to that Z-value. So what we're looking for is actually 97.5%. We're looking for something that contains all of this over here. If we get the Z-value and then apply it on both sides, then we're going to have something that contains 95%. So let's look up the 97.5. 97.5 is right over there, and that is 1.96 standard deviations. So this is 1.96 for a normalized standard deviation, or a Z-score of 1.96. So if we looked to this normal distribution right over here, this distance that we care about is going to be 1.96 times the standard deviation of this distribution, so it's going to be 1.96 times all of this business. 1.96 times the standard deviation of this distribution. 
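
The table lookup described above can be reproduced numerically. This is a minimal sketch in which Python's standard-library `statistics.NormalDist` stands in for the printed Z-table; the variable names are illustrative and do not come from the video.

```python
from statistics import NormalDist

confidence = 0.95

# A two-sided 95% interval leaves 2.5% in each tail, so the cumulative
# probability to look up is 0.975 rather than 0.95.
cumulative = 1 - (1 - confidence) / 2

# inv_cdf returns the z-value whose cumulative probability equals `cumulative`.
z = NormalDist().inv_cdf(cumulative)

print(round(z, 2))
```

Rounded to two decimals, this matches the 1.96 read from the table.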
And so we just need to calculate this and multiply it by 1.96. Now, we have a problem. We don't know the true parameters P1 and P2. We don't know the true population parameters. We don't know P1 and P2. That's part of the problem. We're trying to figure out if there's a meaningful difference between P1 and P2. But we've seen it multiple times. Since our sample size is large, we can estimate P1 and P2 with our sample proportions. So we could change this to approximately and we can use our sample proportions. And we know what those values are. And actually this n over here was 1,000. So let's figure that out. Let's just get the calculator out. It's just going to be one big calculation here. So what we have is the square root, and then in parentheses, our sample proportion for the men is 0.642, and then we're going to multiply that times 1 minus 0.642, close parentheses. That's that over there divided by 1,000. And then we're going to add to that plus-- do the same thing for the women. Our sample proportion is 0.591 times 1 minus 0.591. So that's this term right over here divided by 1,000. Once again, I need to get the parentheses right. And then we just need to close the parentheses, this original parenthesis, because we're taking the square root of everything. So we get 0.021, or maybe we'll say 0.022. So this value right here is approximately 0.022. So going back to our question, or this distance that we care about, this value is going to be approximately, or our best estimate of it, is 0.022. So let's just multiply that. 0.022 times 1.96 gives 0.043. I'll just round it. So this right here is equal to 0.043. And just like that, we have our confidence interval. We know that there's a 95% chance that the true difference of the proportions is within 0.043 of the actual difference of our sample proportions that we got. Or if we actually want to get an interval, we take this value minus 0.043. So let's do that. 
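
The calculator steps above can be written out as a short script. This sketch uses the sample proportions and sample sizes quoted in the transcript and takes 1.96 as the z-value; the variable names are illustrative.

```python
from math import sqrt

n_men, n_women = 1000, 1000
p_men, p_women = 0.642, 0.591  # sample proportions from the transcript

# Estimated standard deviation of the sampling distribution of
# (p_men - p_women): the variances of the two independent samples add.
std_err = sqrt(p_men * (1 - p_men) / n_men + p_women * (1 - p_women) / n_women)

z = 1.96  # z-value for the middle 95%, as read from the Z-table
margin = z * std_err  # the distance d on either side of the sample difference

print(round(std_err, 3), round(margin, 3))
```

The script keeps full precision until the final rounding, while the video rounds the standard deviation to 0.022 before multiplying. Both routes give a distance of about 0.043.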
So we could have 0.051 minus 0.043 is going to give us 0.008. And then if we add it, so 0.051 plus 0.043, it gives us 0.094. So the 95% confidence interval between the proportions of men and the proportion of women who are going to vote for the candidate for P1 minus P2 is 0.008 to 0.094. I have it right here on the calculator. And we're done. So it does seem we're confident that there's a 95% chance that men are more likely to vote for the candidate than women.
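
The whole procedure can be collected into one function. This is a sketch under the same normal approximation used in the video; the function name `diff_proportion_ci` and its parameters are hypothetical, chosen only for illustration.

```python
def diff_proportion_ci(p1, n1, p2, n2, z=1.96):
    """Return (low, high) for a normal-approximation interval on p1 - p2.

    p1, p2 are sample proportions, n1, n2 the sample sizes, and z the
    z-value for the desired confidence level (1.96 for 95%).
    """
    diff = p1 - p2
    std_err = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
    margin = z * std_err
    return diff - margin, diff + margin


# Men: 0.642 of 1,000 sampled; women: 0.591 of 1,000 sampled.
low, high = diff_proportion_ci(0.642, 1000, 0.591, 1000)
print(round(low, 3), round(high, 3))
```

Both endpoints come out positive, so the interval lies entirely above zero, which is what the closing remark about men being more likely to vote for the candidate relies on.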