Estimating a population proportion
Margin of error 1
Say I live in a country of 100 million people and there's a presidential election coming up. In that election there are two candidates, candidate A and candidate B. Let's say I live in a very decisive country: everyone participates in the election, and everyone is going to vote for either candidate A or candidate B. So there's some reality there. Some proportion p of the people will vote for B, and I could switch them around if I wanted. So a proportion p are going to vote for B, and the rest of the people, a proportion 1 minus p, are going to vote for A.
You might already recognize that this is a Bernoulli distribution. There's one of two values for a sample I can get. And right here, the values are that you're either voting for candidate A or you're voting for candidate B. It's very hard to deal with those values. You can't calculate a mean between A and B; those are letters, not numbers. So to make it mathematically manageable, we're going to say that sampling someone who's going to vote for A is equivalent to sampling a 0, and sampling someone who's going to vote for B is equivalent to sampling a 1. And if you do that with a Bernoulli distribution, we learned in the video on Bernoulli distributions that the mean of this distribution is going to be equal to p. It's a pretty straightforward proof for how we got that. So the mean of this distribution, which is not actually a value that the distribution can take on, is going to be some place over here, and it is going to be equal to p.
Now, my country has 100 million people. It is practically impossible for me to go and ask all 100 million people who they are going to vote for.
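As a reminder of why the Bernoulli mean mentioned above equals p, here is a short worked derivation. The symbol X for a single sampled vote is notation introduced here for illustration; it is not used in the video.

```latex
% X = 0 for a vote for A (probability 1 - p), X = 1 for a vote for B (probability p)
\[
  \mu = E[X] = 0 \cdot (1 - p) + 1 \cdot p = p
\]
```

Since p lies strictly between 0 and 1 whenever both candidates get votes, this mean is never one of the two values (0 or 1) that a single sample can take.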
So I won't be able to figure out exactly what these parameters are: what my mean is, what p is. Instead, I'm going to do a random survey. I'm going to sample this population, look at that data, and then get an estimate of what p really is, because p is what I really care about. So I'm going to try to estimate p with a sample, and then we're also going to think about how good an estimate that is.
So I am going to randomly survey, or sample, 100 people. And let's say I got the following results. Let's say that 57 people say they're going to vote for A, which is equivalent to getting 57 samples of 0. And then the rest of the people (once again, very decisive population, no one is undecided), so 43 people, say they're going to vote for B. That's the equivalent of sampling 43 1's.
Now, given this sample, what is my sample mean and my sample variance? My sample mean is just going to be the average of these 0's and 1's. I've got 57 0's, so it's going to be 57 times 0, plus my 43 1's, so plus 43 times 1. That's the sum of all of my samples, over the total number of samples I took, over 100. So what does this get me? 57 times 0 is 0, and 43 times 1 divided by 100 is 0.43. That is my sample mean, the mean of just the 100 data points that I actually got.
Now what is my sample variance? The sample variance is going to be equal to the sum of my squared distances to the mean, divided by my number of samples minus 1. Remember, this is a sample variance, and we want to get the best estimator of the real variance of this distribution. To do that you don't divide by 100, you divide by 100 minus 1. We learned that many, many videos ago. So I had 57 samples of 0; we'll do it in that same yellow color.
And each of those samples is 0 minus 0.43 away from the mean. Each of those samples is 0, and you subtract 0.43; this is the difference between 0 and 0.43. And if I want the squared distance, I square it. That's how we calculate variance. There are 57 of those. And then there are 43 times that I sampled a 1, and the 1 is 1 minus 0.43 away from the mean, and I want to square that distance. And then I don't want to just divide by n. I don't want to just divide by 100. Remember, I'm trying to estimate the true population variance. In order for this to be the best estimator of that, and I gave you the intuition of why many, many videos ago, we divide by 100 minus 1, or 99.
Let's get the calculator out to actually figure out our sample variance. I'll do the numerator first. I have 57 times (0 minus 0.43) squared, plus 43 times (1 minus 0.43) squared. And then all of that divided by 100 minus 1, or 99, is equal to 0.2475. So my sample variance is equal to 0.2475. And if I want to figure out my sample standard deviation, I just take the square root of that. I take the square root of that value that I just had, which is 0.497. So let me just round that to 0.50. So my sample standard deviation is 0.50.
Now if you just look at this, you say OK, well, your best estimate of the percentage of people voting for A or B is really what you just saw here. Your best estimate of the mean is that 43% of people are going to vote for B and everyone else is going to vote for A. But an interesting question is how good a sample is that? Let's take it to the next level.
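The calculator steps above can be reproduced with a minimal Python sketch. The variable names are chosen here for illustration. The digits 0.2475 and 0.497 quoted in the transcript are the same results truncated, so the code's values differ slightly in the last decimal place.

```python
import math

# Survey results from the transcript: 57 votes for A (coded 0), 43 for B (coded 1).
sample = [0] * 57 + [1] * 43
n = len(sample)

# Sample mean: sum of the 0's and 1's divided by the number of samples.
sample_mean = sum(sample) / n

# Sample variance: squared distances to the mean, divided by n - 1 (not n).
sample_variance = sum((x - sample_mean) ** 2 for x in sample) / (n - 1)

# Sample standard deviation: square root of the sample variance.
sample_std = math.sqrt(sample_variance)

print(sample_mean, sample_variance, sample_std)
```

Rounded to two decimal places, the standard deviation is 0.50, which is the figure the rest of the video works with.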
Let's try to think of an interval around 43% for which we are reasonably confident, roughly 95% sure, that the real mean is in that interval. Let me make it very clear. Let me draw. When we get our sample mean, we are sampling from the sampling distribution of the sample mean. Let me draw that. Since we're sampling from a discrete distribution, it's actually going to be a discrete distribution, but it's going to have 101 possible values: with 100 people, the sample mean can be 0, 0.01, 0.02, and so on, all the way up to 1. But I'll draw it kind of continuous because it would be hard for me to draw that many bars. If I did, you'd have a bar there, you'd have a bar there. The probability that your sample mean would be 1 would be very low, and then you would have one more bar, a bar like that, a bar like that, but that takes forever to draw. So I'm just going to approximate it with this normal curve right over there.
So this is the sampling distribution of the sample mean. It has some mean here, and I can denote it with mu sub x bar; this tells us this is the mean of the sampling distribution. But we know from many, many videos that this is going to be the same thing as the mean of the population that we are sampling from, that each of these 100 samples comes from. So this is going to be equal to mu, which is going to be equal to p. Now, the variance of this distribution... or even better, let's do the standard deviation of this distribution.
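To make the discrete picture concrete, here is a rough sketch that builds the exact sampling distribution of the sample mean for 100 draws. It assumes the 100 draws are independent (reasonable when sampling 100 people out of 100 million) and uses a hypothetical true proportion `p_true = 0.5`, since the real p is unknown. The function name is made up for this example.

```python
from math import comb

def sample_mean_distribution(n, p):
    """Map each possible sample mean k/n to its binomial probability."""
    return {k / n: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

p_true = 0.5  # hypothetical value; the survey does not tell us the true p
dist = sample_mean_distribution(100, p_true)

# Possible sample means are 0/100, 1/100, ..., 100/100.
num_values = len(dist)

# The mean of the sampling distribution matches the population mean p.
mean_of_means = sum(x * prob for x, prob in dist.items())

# Probability that every one of the 100 people sampled is a 1.
prob_all_ones = dist[1.0]

print(num_values, mean_of_means, prob_all_ones)
```

The bars near the middle carry almost all of the probability, and the bar at a sample mean of 1 is vanishingly small, which is why a smooth bell-shaped curve is a workable stand-in for the drawing.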
The standard deviation of this distribution, that distance right over here, the standard deviation of the sampling distribution of the sample mean, we've seen multiple times already. It's going to be the standard deviation of our population distribution (so there's some standard deviation associated with this distribution, that distance over there) divided by the square root of our sample size. And we saw many videos ago why that, at least experimentally, makes sense, or why it intuitively makes sense. Our sample size is 100, so it's going to be the square root of 100. So it's going to be this guy divided by 10.
Now, we do not know what this guy is. The only way to figure out what that guy is is to actually survey 100 million people, which would have been impossible. So to estimate it, we will use our sample standard deviation as our best estimate for the population standard deviation. And remember, this is an estimate. We cannot come up with the exact number for this just from a sample. But we can estimate it. Because this is our best estimator for the population standard deviation, if we divide it by 10, we will have our best estimator for the standard deviation of the sampling distribution of the sample mean. So remember, this is just an estimate. So you kind of have to take everything after this point with a little bit of a grain of salt.
So an estimate of it is going to be 0.50 divided by 10. And remember, every time we take a different sample from here, this number is going to change. So this isn't set in stone. It's dependent on our sample, so it's going to wiggle around a little bit depending on what numbers we actually get in our sample. But here it's going to be the s right here, this 0.50, divided by 10, which is equal to 0.05.
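The estimate above, the sample standard deviation divided by the square root of the sample size, can be sketched as follows. The helper name `estimated_standard_error` and the second, alternative survey are illustrative assumptions, added only to show how the estimate shifts from sample to sample.

```python
import math

def estimated_standard_error(sample):
    """Estimate the std dev of the sample mean as s / sqrt(n)."""
    n = len(sample)
    mean = sum(sample) / n
    # Sample standard deviation, dividing by n - 1.
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return s / math.sqrt(n)

# The survey from the transcript: 57 zeros (A) and 43 ones (B).
survey = [0] * 57 + [1] * 43
se = estimated_standard_error(survey)

# A hypothetical different sample of 100 people gives a slightly different estimate.
other_survey = [0] * 60 + [1] * 40
se_other = estimated_standard_error(other_survey)

print(se, se_other)
```

Both values come out close to 0.05. The transcript's figure of exactly 0.05 comes from rounding s to 0.50 before dividing by 10.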
So our best estimate of this standard deviation is 0.05, or you could even view it as 5%. Now what I want to do is come up with an interval around the sample mean where I'm reasonably confident, using all my estimates, that there's a 95% chance that the true mean is in that interval. So let me write this down. I want to find an interval such that I am reasonably confident (and I'm putting this kind of touchy-feely language here because I don't know for a fact that the standard deviation is 0.05, I'm just estimating it) that there is a 95% chance that the true mean of the population, which is the same thing as the proportion of the population who are going to vote for person B, or the proportion of the population that are going to be a 1, is in that interval. We just have to remember that mu is equal to p. So: that there's a 95% chance that the true p is in that interval.
And actually, since I've already gone 14 minutes into this video, I'm going to stop this video here, and maybe I'll even let you think about it based on everything we've done so far. We figured out the sample mean right over here. And remember, this is just a sample mean, the mean of our sample. We don't know the true mean of the sampling distribution, and we also don't know the true standard deviation of the sampling distribution. But we were able to estimate that standard deviation using the sample standard deviation.
Now, given everything that we have so far, and based on what we've seen before on confidence intervals and all that, how can we find an interval such that roughly (and I'm saying roughly because we had to estimate the standard deviation) there's a 95% chance that the true mean of our population, or p, the proportion of the population saying 1, is in that interval? And we're going to do that in the next video.