Sal shows an example of calculating standard deviation and bias. Created by Sal Khan.
Video transcript
Let's say that you're a watermelon farmer, and you want to study how dense the seeds are in your watermelon. Perhaps you want to do this because over time, you're trying to breed watermelons that have fewer seeds, and you should see whether you are actually making progress. And you don't want to cut open every watermelon in your watermelon farm or patch or whatever it might be called, because you want to sell most of them. You just want to sample a few watermelons, and then take samples of those watermelons to figure out how dense the seeds are, and hope that you can calculate statistics on those samples that are decent estimates of the parameters for the population. So let's start doing that. So let's say that you take these little cubic inch chunks out of a random sample of your watermelons. And then you count the number of seeds in them. And you have 8 samples like this. So in one of them, you found 4 seeds. In the next, you found 3, 5, 7, 2, 9, 11, and 7. So this is a sample, just to make sure we're visualizing it right. If this is the population of all of the chunks-- I guess we could view this as a cubic inch-- the cubic inch chunks in my entire watermelon farm, I'm sampling a very small sample of them. Maybe I could have had a million over here. A million chunks of watermelon could have been produced from my farm, but I'm only sampling-- so capital N would be 1 million, lowercase n is equal to 8. And once again, you might want to have more samples, but this'll make our math easy. Now, let's think about what statistics we can measure. Well, the first one that we often do is a measure of central tendency. And that's the arithmetic mean. But here, we're trying to estimate the population mean by coming up with the sample mean. So what is the sample mean going to be? Well, all we have to do is add up these points, add up these measurements, and then divide by the number of measurements we have. So let's get our calculator out for that. 
Actually, maybe I don't need my calculator. Let's see. So 4 plus 3 is 7. 7 plus 5 is 12. 12 plus 7 is 19. 19 plus 2 is 21, plus 9 is 30, plus 11 is 41, plus 7 is 48. So I'm going to get 48 over 8 data points. So this worked out quite well. 48 divided by 8 is equal to 6. So our sample mean is 6. It's our estimate of what the population mean might be. But we also want to think about how much in our population we want to estimate, how much spread is there, or how much do our measurements vary from this mean. So there, we say, well, we can try to estimate the population variance by calculating the sample variance. And we're going to calculate the unbiased sample variance. Hopefully, we're fairly convinced at this point why we divide by n minus 1. So we're going to calculate the unbiased sample variance. And if we do that, what do we get? I'll do this in a different color. It's going to be 4 minus 6 squared plus 3 minus 6 squared plus 5 minus 6 squared plus 7 minus 6 squared plus 2 minus 6 squared plus 9 minus 6 squared plus 11 minus 6 squared plus 7 minus 6 squared, all of that divided by-- not by 8. Remember, we want the unbiased sample variance. We're going to divide it by 8 minus 1. So we're going to divide by 7. Let me give myself a little bit more real estate. The unbiased sample variance-- and I could even denote it by this to make it clear that we're dividing by lowercase n minus 1-- is going to be equal to-- let's see, 4 minus 6 is negative 2. That squared is positive 4. So I did that one. 3 minus 6 is negative 3. That squared is going to be 9. 5 minus 6 squared is 1 squared, which is 1. 7 minus 6 is once again 1 squared, which is 1. 2 minus 6, negative 4 squared is 16. 9 minus 6 squared, well, that's going to be 9. 11 minus 6 squared, that is 25. And then finally, 7 minus 6 squared, that's another 1. And we're going to divide it by 7. Let's see if we can add this up in our heads. 4 plus 9 is 13, plus 1 is 14, 15, 31, 40, 65, 66. 
So this is going to be equal to 66 over 7. We could write that as 9 and 3/7. Or if we want to write that as a decimal, 66 divided by 7 gives us 9 point-- I'll just round it. So it's approximately 9.43. Now, that gave us our unbiased sample variance. Well, how could we calculate a sample standard deviation? We want to somehow get an estimate of what the population standard deviation might be. Well, the logic, I guess, is reasonable to say, well, this is our unbiased sample variance. It's our best estimate of what the true population variance is. When we think about population parameters, to get the population standard deviation, we just take the square root of the population variance. So if we want to get an estimate of the population standard deviation, why don't we just take the square root of the unbiased sample variance? So that's what we'll do. So we'll define it that way. We'll call it the sample standard deviation. We're going to define it to be equal to the square root of the unbiased sample variance. It's going to be the square root of this quantity, and we can take our calculator out. It's going to be the square root of what I just typed in. I can do 2nd answer. It'll be the last entry here. So the square root of that is-- and I'll just round. It's approximately equal to 3.07. Now, I'm going to tell you something very counterintuitive. Or at least initially it's counterintuitive, but hopefully you'll appreciate this over time. This we've already talked about in some depth. People have even created simulations to show that this is an unbiased estimate of population variance when we divide by n minus 1. And that's a good starting point if we're going to take the square root of anything.
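The arithmetic so far can be double-checked with a few lines of Python. This is only a sketch of the calculation from the video; the variable names (`seeds`, `sample_mean`, `unbiased_variance`, `sample_std`) are made up for illustration.

```python
import math
import statistics

# Seed counts from the 8 cubic-inch chunks in the example.
seeds = [4, 3, 5, 7, 2, 9, 11, 7]
n = len(seeds)

# Sample mean: add up the measurements and divide by how many there are.
sample_mean = sum(seeds) / n  # 48 / 8

# Unbiased sample variance: sum of squared deviations divided by n - 1, not n.
squared_deviations = [(x - sample_mean) ** 2 for x in seeds]
unbiased_variance = sum(squared_deviations) / (n - 1)  # 66 / 7

# Sample standard deviation: square root of the unbiased sample variance.
sample_std = math.sqrt(unbiased_variance)

print(f"sample mean        = {sample_mean}")
print(f"unbiased variance  = {unbiased_variance:.2f}")
print(f"sample std dev     = {sample_std:.2f}")

# The standard library uses the same n - 1 convention for these two functions.
print(math.isclose(statistics.variance(seeds), unbiased_variance))
print(math.isclose(statistics.stdev(seeds), sample_std))
```

The printed values should match the ones worked out by hand: a mean of 6, a variance of about 9.43, and a standard deviation of about 3.07.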
But it actually turns out that because the square root function is nonlinear, this sample standard deviation-- and this is how it tends to be defined-- is not an unbiased estimate. The sample standard deviation is the square root of our unbiased sample variance, so the square root of the sum from i equals 1 to n of each data point minus the sample mean, squared, all divided by n minus 1. This is how we literally define our sample standard deviation. Because the square root function is nonlinear, it turns out that this is not an unbiased estimate of the true population standard deviation. And I encourage people to make simulations of that if they're interested. But then you might say, well, we went through great pains to divide by n minus 1 here in order to get an unbiased estimate of the population variance. Why don't we go through similar pains and somehow figure out a formula for an unbiased estimate of the population standard deviation? And the reason why that's difficult is that to unbias the sample variance, we just have to divide by n minus 1 instead of n. And that'd work for any probability distribution for our population. It turns out that to do the same thing for the standard deviation, it's not that easy. It's actually dependent on how that population is actually distributed. So in statistics, we just define the sample standard deviation. And the one that we typically use is based on the square root of the unbiased sample variance. But when you take that square root, it does give you a biased result when you're trying to use this to estimate the population standard deviation. But it's the simplest, best tool we have.
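For anyone who wants to try the simulation mentioned above, here is one minimal sketch. Everything about the setup is an assumption made for illustration, not something from the video: the population is taken to be normal with a standard deviation of 3 (so a variance of 9), the sample size is 8 to match the watermelon example, and the helper names `unbiased_variance` and `simulate` are hypothetical.

```python
import math
import random


def unbiased_variance(sample):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / (n - 1)


def simulate(sigma=3.0, n=8, trials=20000, seed=0):
    """Average the sample variance and sample std dev over many samples.

    Assumes a normal population with mean 0 and standard deviation sigma.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total_var = 0.0
    total_std = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        v = unbiased_variance(sample)
        total_var += v
        total_std += math.sqrt(v)
    return total_var / trials, total_std / trials


avg_var, avg_std = simulate()
print(f"true variance = 9.0, average sample variance = {avg_var:.3f}")
print(f"true std dev  = 3.0, average sample std dev  = {avg_std:.3f}")
```

Under these assumptions, the average sample variance should land close to 9, while the average sample standard deviation should come out somewhat below 3. That gap is the bias introduced by taking the square root, and its size depends on the population's distribution and on the sample size.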