Statistics and probability
- Example: Probability of sample mean exceeding a value
Estimating the probability that the sample mean exceeds a given value in the sampling distribution of the sample mean. Created by Sal Khan.
- It seems to me that there is some kind of an underlying assumption here that makes the result suspect (and I'm talking mathematics, not about sweating and all those pesky real-life problems :-) ).
What I'm wondering about is this: by increasing the sample size (and maybe the number of trials as well), it seems possible to make the standard deviation as small as desired. That way we could convince ourselves that the 2.2 L we have reserved for each man is enough, because we can push the z-score of the 0.2 L margin arbitrarily high and the probability of running out arbitrarily low. I'm wondering what the correct sample size would be to give the best approximation in the real world.
I realize that the example is simplified for mathematical convenience (which is quite understandable) but it bothers me that just increasing the sample size makes us more certain. I think there is an assumption here that doesn't quite work in real life, but I can't see what it is for now. If anyone can clarify this point, I'd be grateful!
Of course, it's possible that my doubts will be reconciled in a video I haven't yet seen, so maybe this question will become moot. Meanwhile, I'd appreciate it if someone would tell me if I'm mistaken in assuming that increasing the sample size will reduce the SD and this in turn will increase our probability as much as we desire. Also, it seems somehow inappropriate to have a sample size that is larger than or equal to the number of men (50 men, sample size 50), because the whole idea of a sample is that it is smaller than the whole population and tries to represent it.
Anyway, thanks in general to Sal for showing the intuition behind many mathematical concepts that are often just stated in mathematics books.(7 votes)
- You're right to think about the things you're assuming, when approaching a statistical problem. For this example to work, you have to assume that:
- your sample is random
- each camper's fluid intake is independent of all the others
- the given population mean and standard deviation are accurate, and
- your campers are all drawn from that population.
If all of that is true, then we can estimate how likely the water is to run out (or, rather, how likely it is to find 50 campers whose average consumption is higher than 2.2L). The thing we rely on here is that as the sample size increases, the sample mean becomes less likely to differ greatly from the population mean, because the standard deviation of the sample mean is σ/√n. The central limit theorem adds that, regardless of the population distribution (as long as it has a finite standard deviation), the distribution of sample means approaches a normal distribution as the size of the random samples increases, which is what lets us use a z-table.
If the population standard deviation is right, then the SD for samples of 1 camper each is 0.7L. If you're randomly picking two campers, most of the time their consumption will balance out a bit, so σ for samples of 2 campers will be around 0.5L. σ for samples of 4 campers should be around 0.35L. For 50 random campers, Sal's probability estimate is right, if our initial assumptions are true.
You're perfectly right in thinking that you can choose sample sizes to make the standard deviation of the sample mean (the standard error) arbitrarily low. This is useful if you want to know how many campers to monitor to make sure your estimates are right, for example - as you increase your sample size, you decrease the likelihood that your sample mean will be far from the mean of the population you're drawing it from. In this problem, though, n is fixed at 50 by the size of the trip, so it is not something we get to choose.(8 votes)
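The figures quoted in this answer (0.7 L, about 0.5 L, 0.35 L) all come from dividing the population SD by the square root of the sample size. Here is a minimal sketch of that calculation in Python; the function name `standard_error` and the extra sample size 5000 are illustrative choices, not something from the video.

```python
import math

SIGMA = 0.7  # population standard deviation in liters, from the problem statement


def standard_error(sigma: float, n: int) -> float:
    """Standard deviation of the sample mean for n independent draws."""
    return sigma / math.sqrt(n)


# Sample sizes mentioned in the answer, plus one much larger illustrative value.
for n in (1, 2, 4, 50, 5000):
    print(f"n = {n:>4}: standard error = {standard_error(SIGMA, n):.4f} L")
```

The standard error keeps shrinking as n grows, but only in proportion to 1/√n: quadrupling the sample size halves it.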
- I don't understand why we use the sampling distribution of the sample mean to calculate this probability. Why wouldn't we get a more accurate result by just taking the area under the population distribution between 2.2 and infinity?(5 votes)
- The area under the population distribution between 2.2 and infinity will give you the probability of one active individual drinking more than 2.2 Liters of water.
The question is asking for the probability that 50 active guys drink more than 2.2 liters per person on average, which is equivalent to the probability that the 50 guys drink more than 110 liters in total.
Suppose you took a sample of 50 guys from this population (some of them drink more than 2 liters of water, some less), took the mean of the 50 amounts drunk, placed a dot on the x-axis at that sample mean, and then repeated this experiment thousands of times. You would get the sampling distribution of the sample mean for n=50. Why do we want to know this?
Well it turns out
(x1 + x2 + x3 + ... + x50 ) /50 >= 2.2
is equivalent to
(x1 + x2 + x3 + ... + x50 ) >= 50*2.2
(x1 + x2 + x3 + ... + x50 ) >= 110
so if we scale the horizontal axis of the sampling distribution of the sample mean by multiplying by 50, we get the sampling distribution of the total amount of water 50 guys drink.(10 votes)
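The "repeat this experiment thousands of times" idea can be tried directly with a small simulation. This is only a sketch: the problem never says what shape the population has, so the gamma distribution below (chosen to have mean 2 and SD 0.7 and to stay positive) is an assumption, and the names `one_trip`, `ALPHA` and `BETA` are made up for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

MEAN, SD, N = 2.0, 0.7, 50

# ASSUMPTION: a gamma-shaped population with the stated mean and SD.
BETA = SD**2 / MEAN   # scale parameter
ALPHA = MEAN / BETA   # shape parameter, so that ALPHA * BETA == MEAN


def one_trip():
    """Simulate water use for N men; return (sample mean, total)."""
    draws = [random.gammavariate(ALPHA, BETA) for _ in range(N)]
    total = sum(draws)
    return total / N, total


TRIALS = 10_000
trips = [one_trip() for _ in range(TRIALS)]

# Count trips where the mean exceeds 2.2 L and trips where the total exceeds 110 L.
mean_exceeds = sum(m > 2.2 for m, _ in trips)
total_exceeds = sum(t > 110 for _, t in trips)

print("fraction with mean  > 2.2 L:", mean_exceeds / TRIALS)
print("fraction with total > 110 L:", total_exceeds / TRIALS)
```

Both counts describe the same event, so the two fractions should agree, and under this assumed population they should land near 2 percent.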
- Hi everyone,
As far as I understand, in this exercise you can make your calculations over the sampling distribution of the sample mean because you assume normality on it. That is because of the central limit theorem. My question is why can you be sure that for n=50 (as in the example) you can assume normality in your sampling distribution? Why not n=10, or maybe n=20, or even n=30 (as pointed out as reasonable sample sizes in previous videos)?
Thanks in advance,
- As I understand the sampling distribution, you will (in most cases) never reach a perfect normal distribution, but you will get really close to it. The larger the sample size, the closer the sampling distribution is to the normal distribution. Because it is just a sample, you will have some difference from reality, but in a lot of cases it is too complex and/or expensive to use all possible data (like asking every person in the world whether they are male or female), so using a sample (e.g. asking 1000 people) is a really good way to solve this problem.
By the way, as far as I know, you can use confidence intervals to describe how close the approximation is likely to be to reality.(3 votes)
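One way to see that there is no magic cutoff at n = 10, 20, 30 or 50 is to measure how lopsided the sampling distribution still is at each size. The sketch below assumes a deliberately skewed population (an exponential distribution, which is not the water data) and uses skewness, which is 0 for a perfectly symmetric curve such as the normal; the function names are hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable


def skewness(values):
    """Sample skewness: 0 for a symmetric distribution such as the normal."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mu) / sd) ** 3 for v in values) / len(values)


def sample_mean_skewness(n, trials=5_000):
    """Skewness of `trials` sample means, each computed from n exponential draws."""
    means = [
        statistics.fmean([random.expovariate(1.0) for _ in range(n)])
        for _ in range(trials)
    ]
    return skewness(means)


# ASSUMPTION: exponential population, chosen only because it is strongly skewed.
skews = {n: sample_mean_skewness(n) for n in (1, 10, 30, 50)}
for n, s in skews.items():
    print(f"n = {n:>2}: skewness of sample means = {s:.2f}")
```

The skewness shrinks steadily as n grows rather than vanishing at a particular size, so how large n must be depends on how far from normal the population is and how much error you can tolerate.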
- If we knew for certain that the population distribution was normal, could we not just take the std error as 0.7 and then the z score as 0.2/0.7?(4 votes)
- It is all about means. (Sorry for my bad English, I'm not a native speaker.) No, we can't. The mean of 2 L and standard deviation of 0.7 L describe how much water one man drinks. We want to know the average (arithmetic mean) amount drunk per man in a GROUP of 50 such men (they are our sample), and that average has standard deviation 0.7/√50, not 0.7.(4 votes)
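To make the difference concrete, here is a short sketch comparing the two z-scores. It takes the question's hypothetical at face value and assumes the population itself is normal, which the original problem does not state; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

MU, SIGMA, N = 2.0, 0.7, 50
standard_normal = NormalDist()  # mean 0, standard deviation 1

# One man drinking more than 2.2 L (ASSUMES a normal population).
z_one_man = (2.2 - MU) / SIGMA
p_one_man = 1 - standard_normal.cdf(z_one_man)

# The average of 50 men exceeding 2.2 L: divide by the standard error instead.
z_group_mean = (2.2 - MU) / (SIGMA / sqrt(N))
p_group_mean = 1 - standard_normal.cdf(z_group_mean)

print(f"one man:    z = {z_one_man:.2f}, P = {p_one_man:.3f}")
print(f"mean of 50: z = {z_group_mean:.2f}, P = {p_group_mean:.3f}")
```

Using 0.2/0.7 answers a different question: it gives the chance that a single man drinks more than 2.2 L, which is far more likely than the whole group averaging more than 2.2 L.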
- The first examples (videos on sampling distribution of the sample mean 1 and 2) show large SDs because it was just a "sample". Then, with repeated sampling, the spread decreases (the variance becomes SD^2/n).
But in this problem we are told that mean is 2 and SD is 0.7, and that is, supposedly a true representation of the population (i.e. not a small random sample, but a huge sample). Why are we to treat the SD as the distribution of a sample instead?
In other words, I had a hard time wrapping my head around the fact that a sample size of 50 has a smaller SD than a "population SD".
Maybe "the average male" means a "distribution" made on a single person?(2 votes)
- It's not a "sample of size 50" that has a smaller SD.
We have taken a sample of size 50, but that value σ/√n is not the standard deviation of the sample of 50. Rather, it is the SD of the sampling distribution of the sample mean.
Imagine taking a sample of size 50, calculate the sample mean, call it xbar1.
Then take another sample of size 50, calculate the sample mean, call it xbar2.
Then take another sample of size 50, calculate the sample mean, call it xbar3.
And so on.
If we do this repeatedly, we would start to see a distribution of sample means, all calculated from a different sample of size 50. This distribution of sample means has a smaller SD than the population from which the raw data was derived.
Think of an easier example: height. People have heights in some common range, say 4.5 feet tall to 6.5 feet tall. It's possible to be really tall, right? There are people who are 7 feet tall - or even more - but they're kind of few and far between. It's "rare" to see someone that tall, but possible. Now, imagine a collection of 50 people. What would we need in order to see that the average height of these 50 people is 7 feet? Well ... we would need a LOT of REALLY TALL people. Since getting an individual person who is 7 feet tall is pretty rare, getting a lot of people 7 feet tall (or more) is even more rare. Because of this, it's even more unlikely that the sample mean height of 50 people will be 7 feet or more.
This phenomenon manifests itself in Statistics with the SD of the sampling distribution of the sample mean being smaller than the SD of the population.(6 votes)
- Hi Sal,
I did this question a different way and was wondering whether you could tell me why it works this way: basically, I defined a statistic, T = X1 + X2 + ... + X50. And so the statistic's mean is 50mu and its variance is 50(sigma^2). I then just did P(T > 110), and then converted this and used the Z table to get the exact same answer as you...(4 votes)
- Sal mentions the Z-Score table at 12:56. How were the values on this table calculated - where did they come from?(2 votes)
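Z-table entries are values of the standard normal cumulative distribution function, which has no simple closed form but can be written in terms of the error function erf. A minimal sketch that rebuilds one row of such a table (the function name `phi` is a common convention, not something from the video):

```python
import math


def phi(z: float) -> float:
    """Standard normal CDF: area under the bell curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Rebuild the table row for z = 2.00, 2.01, ..., 2.09, rounded to 4 places.
row = [round(phi(2.0 + k / 100), 4) for k in range(10)]
print(row)

print(round(phi(2.02), 4))  # 0.9783, the entry used in the video
```

Printed tables were produced by evaluating this area numerically for each z value; `math.erf` does the numerical work here.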
- What could have been made more clear is that Sal is looking for the probability for X to be between 0 and up to 2+(2.02*0.099)=2.19998 liters of water. Since we know that it is a normal distribution we can write:
X is N(2 , 0.099) and we look for P(0<X<2.19998)
Then, if we don't have a z-table, only a Texas Instruments calculator, we can write:
normalcdf(0 , 2.19998 , 2 , 0.099) = 0.9783..., which is the number that Sal finds in the z-table
Hope it helps :)(4 votes)
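For anyone without that calculator, the same area can be computed with Python's standard library. This is a sketch: the helper `normalcdf` below is written here to mimic the calculator's argument order and is not a built-in Python function.

```python
from statistics import NormalDist


def normalcdf(lower, upper, mu, sigma):
    """Area under a normal(mu, sigma) curve between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)


p_enough = normalcdf(0, 2.19998, 2, 0.099)
print(round(p_enough, 4))      # 0.9783
print(round(1 - p_enough, 4))  # 0.0217, the chance of running out
```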
- Hi Sal,
is the sampling distribution "less tightly packed" or "more tightly packed"?(3 votes)
- More tightly packed, because it's harder to pick a sample far away from the mean than to pick one value (of the original distribution) far away from the mean.(1 vote)
- Hi. I'm a bit confused about what the relationship should be between the sample size and the standard deviation in the calculation for the standard error. In this example we are saying our sample size is 50 so we're imagining taking many samples of 50... but the standard deviation (and mean) we have been given has probably come from a much bigger sample. Doesn't that matter?(2 votes)
- I think it's because of the type of question. 50 samples of 1 man in each is sufficient to give a normal distribution. He could have also done 50 samples of 2 men, in which each sample gives a mean between 2 randomly selected people and plotted those..(1 vote)
- I'm interested in how prob. & stat's applies to very mundane situations - those which aren't created in order to illustrate an application, and which might turn out to be somewhat intractable (ideally I would like to see prob. & stat's at work everywhere I look, in a similar way as is chemistry - which applies to absolutely all material objects, and is involved completely in literally everything one sees).
I know that statistical mechanics exists. I'm aware that quantum physics and prob. & stat's are closely connected. Can anyone shed any light on the connections of probability to the observable world or any aspect of it - so that I might experience this connection more fully? I'm interested in any comments that come to mind, general or quantitative, regardless of how simple or complex, basic or advanced, informed or naive.(2 votes)
The average male drinks 2 liters of water when active outdoors, with a standard deviation of 0.7 liters. You are planning a full day nature trip for 50 men and will bring 110 liters of water. What is the probability that you will run out of water? So let's think about what's happening here. So there's some distribution of how many liters a man needs when he's active outdoors. And let me just draw an example. It might look something like this. So they're all going to need more than 0 liters, so this would be 0 liters over here. The mean of the amount of water a man needs when active outdoors is 2 liters. So 2 liters would be right over here. So the mean is equal to 2 liters. It has a standard deviation of 0.7 liters. So this distribution, once again, we don't know whether it's a normal distribution or not. It could just be some type of crazy distribution. So maybe some people need very, very little water, although everyone needs a little bit. Then you have a lot of people who need about this much, maybe some people who need more, and no one can drink more than maybe 4 liters of water. So maybe this is the actual distribution. And then one standard deviation is going to be 0.7 liters away from the mean. So this would be 1 liter, 2 liters, 3 liters. So one standard deviation is going to be about that far above the mean, and about that far below it. So let me draw. That right there is the standard deviation to the right, that's the standard deviation to the left. And we know that the standard deviation is equal to, I'll write the 0 in front, 0.7 liters. So that's the actual distribution of how much water a man needs when active.
Now what's interesting about this problem: we are planning a full day nature trip for 50 men and will bring 110 liters of water. What is the probability that you will run out? So let me write this down. The probability that you will run out is the same thing as the probability that we use more than 110 liters on our outdoor nature day, whatever we're doing. If we use more than 110 liters, and we have 50 men, then 110 divided by 50 is what? Let me get the calculator out just so we don't make any mistakes here. That's 2.2. So if we have 110 liters that's going to be drunk by 50 men, including ourselves I guess, that means we would run out if on average more than 2.2 liters is used per man. So this is the same thing as the probability that the average water use per man of our 50 men, or maybe we should say the sample mean, is greater than 2.2 liters per man. I'll say greater than, rather than greater than or equal to, because if we're right on the money then we won't run out of water. So let's think about this. We are essentially taking a sample of 50 men out of the whole population. We got this data, who knows where, that the average man drinks 2 liters and that the standard deviation is 0.7 liters. Maybe there was some huge study and this was the best estimate of what the population parameters are: that this is the mean and this is the standard deviation. Now we're sampling 50 men. And what we need to do is figure out the probability that the sample mean is going to be greater than 2.2 liters. And to do that we have to figure out the distribution of the sample mean. And we know what that's called.
It's the sampling distribution of the sample mean. And we know that that is going to be approximately a normal distribution. And we know a few of the properties of that normal distribution. So this up here is a distribution of just all men. So down here I'm going to draw the sampling distribution of the sample mean when n, our sample size, is equal to 50. So this is essentially telling us the likelihood of the different means when we are sampling 50 men from this population and taking their average water use. So let me draw that. So let's say that this is the frequency and then here are the different values. Now the mean of the sampling distribution of the sample mean, this x bar, that's really just the sample mean right over there: if we were to do this millions and millions of times, if we kept taking samples of 50 and plotted all of their means, we would see that the mean of this distribution is actually going to be the mean of our actual population. So it's going to be the same value, I'm going to do it in that same blue, as the mean of this population over here. So that is going to be 2 liters. So we're still centered at 2 liters. But what's neat about this is that the sampling distribution of the sample mean, so you take 50 people, find their mean, plot the frequency, is actually going to be a normal distribution regardless of the shape of the population distribution. The population just has a well-defined mean and standard deviation. It doesn't have to be normal. Even if that one isn't normal, this one over here will be very close to normal, and we've seen it in multiple videos already. So this is going to be a normal distribution. And the standard deviation, we saw this in the last video, and hopefully we've got a little bit of intuition for why this is true.
The standard deviation, actually maybe I'll put it a better way. The variance of the sample mean is going to be the variance of the population divided by n. And if you wanted the standard deviation of this distribution right here, you just take the square root of both sides. If you take the square root of both sides, the standard deviation of the sample mean is going to be equal to the standard deviation of the population divided by the square root of n. And what's this going to be in our case? We know what the standard deviation of the population is. It is 0.7. And what is n? We have 50 men. So 0.7 over the square root of 50. Now let's figure out what that is with the calculator. So we have 0.7 divided by the square root of 50. And we have 0.098-something, well, it's pretty close to 0.099. So I'll just write that down. So this is equal to 0.099. That's going to be the standard deviation of this. It's going to have a lower standard deviation. So the distribution is going to be normal, it's going to look something like this. So this is 3 liters over here, this is 1 liter. The standard deviation is almost a tenth, so it's going to be a much narrower distribution. I'm trying my best to draw it, it's going to look something like this. You get the idea. The standard deviation is almost 0.1, it's 0.099. So one standard deviation away is going to look something like that. So we have our distribution. It's a normal distribution. And now let's go back to our question that we're asking. We want to know the probability that our sample will have an average greater than 2.2. So this is the distribution of the means of all of the possible samples.
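The square-root step described here can be written out as a short derivation. This is a sketch that assumes the 50 amounts are independent draws from a population with variance σ²; the symbols X_i and X̄ are notation introduced for this note, not taken from the video.

```latex
\begin{aligned}
\operatorname{Var}(\bar{X})
  &= \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
   = \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(X_i)
   = \frac{n\sigma^{2}}{n^{2}}
   = \frac{\sigma^{2}}{n}, \\
\sigma_{\bar{X}}
  &= \sqrt{\frac{\sigma^{2}}{n}}
   = \frac{\sigma}{\sqrt{n}}
   = \frac{0.7}{\sqrt{50}}
   \approx 0.099 \text{ liters}.
\end{aligned}
```

The second equality in the first line is where independence is used: the variance of a sum equals the sum of the variances only when the terms are uncorrelated.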
Now to be greater than 2.2, 2.2 is going to be right around here. So we will run out if our sample mean falls into this bucket over here. So we essentially need to figure out what's this area under the curve there. And to figure that out we just have to figure out how many standard deviations above the mean we are, which is going to be our Z-score. And then we could use a Z-table to figure out what this area right over here is. So 2.2 liters is what we care about, we could even do it in our head. That's right over here. Our mean is 2, so we are 0.2 above the mean. And if we want that in terms of standard deviations, we just divide this by the standard deviation of this distribution over here. And we figured out what that is. The standard deviation of this distribution is 0.099. You'll see a formula where you take this value minus the mean and divide it by the standard deviation, and that's all we're doing. We're just figuring out how many standard deviations above the mean we are. So you just take this number right over here divided by the standard deviation, 0.099, and let's get our calculator. And actually we had the exact number over here. So we could just take this 0.2 divided by this value over here. On this calculator when I press second answer it just means the last answer. So I'm taking 0.2 divided by this value over there and I get 2.020. So that means that this probability is the same as the probability of being more than 2.02 standard deviations above the mean. Let me write it down here where I have more space.
So this all boils down to: the probability of running out of water is the probability that the sample mean of just the 50 that we happened to select will be more than 2.020 standard deviations above the mean of this distribution, which is the same as the mean of the population. Remember, if we take a bunch of samples of 50 and plot all of their means we'll get this whole distribution. So what is that going to be? And here we just have to look up our Z-table. Remember, this 2.02 is just this value right here, 0.2 divided by 0.099. I just had to pause the video because there's some type of fighter jet outside or something. So anyway, hopefully they won't come back. But anyway, so we need to figure out the probability that the sample mean will be more than 2.02 standard deviations above the mean. And to figure that out we go to a Z-table, and you could find this pretty much anywhere. Usually it's in any stat book or on the internet, wherever. The Z-table will tell you how much area is below a given value. So if you go to z of 2.02, that was the value that we were dealing with, you go for the first digits in the rows, so we go to 2.0, and then for the next digit you go up here in the columns. So 2.02 is right over there. So this 0.9783, let me write it down over here, I want to be very careful. This 0.9783 on the Z-table, that's not this area over here. It is giving us this whole area over here. It's giving us the probability that we are below that value, that we are less than 2.02 standard deviations above the mean.
So to answer our question, to answer this probability, we just have to subtract this from 1 because these will all add up to 1. So we just take our calculator back out and we just take 1 minus 0.9783 is equal to 0.0217. So this right here is 0.0217. Or another way you could say it, it is a 2.17% probability that we will run out of water. And we are done. Let me make sure I got that number right. So that number it was, yeah, 0.0217, right. So it's a 2.17% chance we run out of water.
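The whole calculation from the video can be retraced in a few lines. This is a sketch using Python's standard library in place of the calculator and the Z-table; the variable names are illustrative, and it relies on the same normal approximation for the sample mean that the video uses.

```python
from math import sqrt
from statistics import NormalDist

MU = 2.0       # population mean, liters per man
SIGMA = 0.7    # population standard deviation, liters
N = 50         # men on the trip
WATER = 110.0  # liters brought

threshold = WATER / N             # we run out if the sample mean exceeds this (2.2)
std_error = SIGMA / sqrt(N)       # standard deviation of the sample mean, about 0.099
z = (threshold - MU) / std_error  # standard deviations above the mean

area_below = NormalDist().cdf(z)  # what the Z-table reports
p_run_out = 1 - area_below        # upper-tail area

print(f"z = {z:.2f}, P(run out) = {p_run_out:.4f}")  # z = 2.02, P(run out) = 0.0217
```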