Statistics and probability
Sample standard deviation and bias
Sal shows an example of calculating standard deviation and bias. Created by Sal Khan.
- Are there any other ways to obtain an unbiased standard deviation from our sample population, instead of just accepting the fact that the sample variance gives you a biased standard deviation?(69 votes)
- The short answer is "no"--there is no unbiased estimator of the population standard deviation (even though the sample variance is unbiased). However, for certain distributions there are correction factors that, when multiplied by the sample standard deviation, give you an unbiased estimator. Nevertheless, all of this is definitely beyond the scope of the video and, frankly, not that important in the grand scheme of things (i.e. unless you're a technical mathematician, don't worry about it). But it was a good question!(71 votes)
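For readers curious about the "correction factors" mentioned above: for a normally distributed population, the expected value of the sample standard deviation s is a known multiple of the population standard deviation σ, usually written c4(n). The sketch below computes that factor. It assumes normality, and the function name `c4` is chosen here for illustration.

```python
import math

def c4(n: int) -> float:
    """Factor such that E[s] = c4(n) * sigma for samples of size n
    drawn from a normal population (s uses the n - 1 divisor)."""
    if n < 2:
        raise ValueError("need at least two observations")
    # lgamma keeps the gamma-function ratio stable for large n
    log_ratio = math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)

# Dividing s by c4(n) removes the bias, but only under the normality assumption.
for n in (2, 8, 30):
    print(n, round(c4(n), 4))
```

The factor is below 1 and approaches 1 as n grows, which is one reason the bias is often ignored for large samples.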
- Sal says here "hopefully we're convinced now why we divide by n-1," but the previous video left off with "next time I'll show you further why we divide by n-1." Is there a video in between that I should be watching, or some other information? I can't help feeling quite confused, and this is not the first time in this course I've felt Sal mentioned something that wasn't explained previously.(61 votes)
- Here is a link to the video I think Sal was referencing
- At 8:10, what does 'nonlinear' mean?(23 votes)
- Take a function y = f(x). When you give it an input value x, you get an output value y through some operation. If the function is linear, then whenever you change x by Δx, the change in y (Δy) is in a fixed ratio to Δx.
Graphically, if you plot values of y = f(x) and connect them, you get a straight line.
For a nonlinear function, Δy divided by Δx is not a fixed value when you change x by Δx. Consequently, if you plot values of that function and connect them, you won't get a straight line. You may get a curve.(42 votes)
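The fixed-ratio idea can be checked numerically. This is a small sketch: the linear function 3x + 2 is a made-up example, and the helper name `slope` is chosen here for illustration.

```python
import math

def slope(f, x, dx=1.0):
    """Return Δy / Δx for a step of size dx starting at x."""
    return (f(x + dx) - f(x)) / dx

def linear(x):
    return 3 * x + 2          # hypothetical linear function

def nonlinear(x):
    return math.sqrt(x)       # the square root function from the video

xs = [1.0, 4.0, 9.0]
linear_slopes = [slope(linear, x) for x in xs]
sqrt_slopes = [slope(nonlinear, x) for x in xs]

print(linear_slopes)   # the same ratio at every starting point
print(sqrt_slopes)     # the ratio shrinks as x grows
```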
- I didn't get it. How does taking the square root of the unbiased sample variance lead to a biased standard deviation? Kindly explain.(15 votes)
- I'm not an expert in statistics, but here's my crack at it.
An unbiased process that outputs some value means that the expected value of the process will match some actual value. Basically, as you perform the unbiased process on more and more samples the average value will approach the actual value.
But if you have a set of values whose average is some number, and you then perform a non-linear operation on them (like taking the square root), their new average value is NOT going to match the old average with the same non-linear operation performed on it.
For example, take the following numbers:
2, 2, 2, 2, 12
Their average is 4.
Here are their square roots:
1.41, 1.41, 1.41, 1.41, 3.46
The average value of those square roots is 1.82.
But the square root of the old average value is 2.
They don't match! We've introduced some bias by performing a non-linear operation.
I imagine it's impossible to remove this bias because the magnitude and direction of the bias probably depend heavily on the population data.(25 votes)
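The arithmetic in this example can be reproduced in a few lines. This is just a sketch, with variable names chosen here for illustration.

```python
import math

values = [2, 2, 2, 2, 12]

mean_of_values = sum(values) / len(values)                        # 4.0
mean_of_roots = sum(math.sqrt(v) for v in values) / len(values)   # about 1.82
root_of_mean = math.sqrt(mean_of_values)                          # 2.0

print(round(mean_of_roots, 2), root_of_mean)   # 1.82 2.0
```

For the square root in particular, which is a concave function, the average of the roots can never exceed the root of the average, so this kind of bias always points downward. How large the gap is depends on the data.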
- @4:02, why do we divide by 7 instead of 8? I know he says it is the unbiased sample variance, but what exactly does that mean?(8 votes)
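- That means he is using a better approximation for the variance of the population. Dividing by n-1 instead of n corrects the tendency of a sample to underestimate the population variance, and as Sal notes in the video, this works for any population distribution, not only a normal one.(13 votes)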
- How would I know to divide by n-1 or n? I know this question has been asked before, but I don't really see the reason. Could someone please give me a simple answer?(6 votes)
- The reason is explained on Wikipedia: https://en.wikipedia.org/wiki/Variance#Sample_variance.
The n-1 correction is called Bessel's correction. Even though I couldn't understand the proof, I did understand that you divide by n-1 instead of n when you have a sample and are estimating a value for the whole population.(10 votes)
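If the proof is hard to follow, a simulation can make the effect visible. The sketch below is built on made-up assumptions (a uniform population of 10,000 values, samples of size 5 drawn with replacement, a fixed random seed) and compares the two divisors on average.

```python
import random

def variance(sample, ddof):
    """Sum of squared deviations from the sample mean, divided by n - ddof."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / (n - ddof)

rng = random.Random(0)                                     # fixed seed for repeatability
population = [rng.uniform(0, 10) for _ in range(10_000)]   # made-up population
pop_var = variance(population, ddof=0)                     # true population variance

n, trials = 5, 20_000
# Draw many small samples (with replacement) from the population.
samples = [[rng.choice(population) for _ in range(n)] for _ in range(trials)]

avg_div_n = sum(variance(s, ddof=0) for s in samples) / trials
avg_div_n_minus_1 = sum(variance(s, ddof=1) for s in samples) / trials

print("population variance:     ", pop_var)
print("average, dividing by n:  ", avg_div_n)
print("average, dividing by n-1:", avg_div_n_minus_1)
```

Under these assumptions, the average of the n-divided values should come out noticeably below the population variance, while the average of the (n-1)-divided values should land close to it.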
- Instead of squaring the difference from the mean and taking the square root of the sum, isn't it more reasonable to take the mean of the absolute value of the difference from the mean? This way we won't require squares and square roots.(3 votes)
- That is an alternative method, known as the mean absolute deviation. To understand why the variance is more popular, I'd suggest taking a read through an old answer that I wrote up here:
- My boy started to glitch @2:50(5 votes)
- since when is bias used in math?(5 votes)
- Why does the upper limit of the first class have to overlap with the lower limit of the second class, e.g. 39.5 and 39.5?(4 votes)
Let's say that you're a watermelon farmer, and you want to study how dense the seeds are in your watermelon. Perhaps you want to do this because over time, you're trying to breed watermelons that have fewer seeds, and you should see whether you are actually making progress. And you don't want to cut open every watermelon in your watermelon farm or patch or whatever it might be called, because you want to sell most of them. You just want to sample a few watermelons, and then take samples of those watermelons to figure out how dense the seeds are, and hope that you can calculate statistics on those samples that are decent estimates of the parameters for the population. So let's start doing that. So let's say that you take these little cubic inch chunks out of a random sample of your watermelons. And then you count the number of seeds in them. And you have 8 samples like this. So in one of them, you found 4 seeds. In the next, you found 3, 5, 7, 2, 9, 11, and 7. So this is a sample, just to make sure we're visualizing it right. If this is the population of all of the chunks-- I guess we could view this as a cubic inch-- the cubic inch chunks in my entire watermelon farm, I'm sampling a very small sample of them. Maybe I could have had a million over here. A million chunks of watermelon could have been produced from my farm, but I'm only sampling-- so capital N would be 1 million, lowercase n is equal to 8. And once again, you might want to have more samples, but this'll make our math easy. Now, let's think about what statistics we can measure. Well, the first one that we often do is a measure of central tendency. And that's the arithmetic mean. But here, we're trying to estimate the population mean by coming up with the sample mean. So what is the sample mean going to be? Well, all we have to do is add up these points, add up these measurements, and then divide by the number of measurements we have. So let's get our calculator out for that. 
Actually, maybe I don't need my calculator. Let's see. So 4 plus 3 is 7. 7 plus 5 is 12. 12 plus 7 is 19. 19 plus 2 is 21, plus 9 is 30, plus 11 is 41, plus 7 is 48. So I'm going to get 48 over 8 data points. So this worked out quite well. 48 divided by 8 is equal to 6. So our sample mean is 6. It's our estimate of what the population mean might be. But we also want to estimate how much spread there is in our population, or how much our measurements vary from this mean. So there, we say, well, we can try to estimate the population variance by calculating the sample variance. And we're going to calculate the unbiased sample variance. Hopefully, we're fairly convinced at this point why we divide by n minus 1. So we're going to calculate the unbiased sample variance. And if we do that, what do we get? I'll do this in a different color. It's going to be 4 minus 6 squared plus 3 minus 6 squared plus 5 minus 6 squared plus 7 minus 6 squared plus 2 minus 6 squared plus 9 minus 6 squared plus 11 minus 6 squared plus 7 minus 6 squared, all of that divided by-- not by 8. Remember, we want the unbiased sample variance. We're going to divide it by 8 minus 1. So we're going to divide by 7. Let me give myself a little bit more real estate. The unbiased sample variance-- and I could even denote it by this to make it clear that we're dividing by lowercase n minus 1-- is going to be equal to-- let's see, 4 minus 6 is negative 2. That squared is positive 4. So I did that one. 3 minus 6 is negative 3. That squared is going to be 9. 5 minus 6 is negative 1. That squared is 1. 7 minus 6 is 1, and once again that squared is 1. 2 minus 6 is negative 4. That squared is 16. 9 minus 6 squared, well, that's going to be 9. 11 minus 6 squared, that is 25. And then finally, 7 minus 6 squared, that's another 1. And we're going to divide it by 7. Let's see if we can add this up in our heads. 4 plus 9 is 13, plus 1 is 14, 15, 31, 40, 65, 66.
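The arithmetic in this part of the video can be checked with a few lines of Python. This is only a sketch: the variable names are chosen here for illustration, and exact fractions are used so the sum of squares over 7 shows up without rounding.

```python
from fractions import Fraction
import statistics

seeds = [4, 3, 5, 7, 2, 9, 11, 7]    # seed counts from the eight chunks
n = len(seeds)

sample_mean = Fraction(sum(seeds), n)                    # 48 / 8 = 6
squared_devs = [(x - sample_mean) ** 2 for x in seeds]   # 4, 9, 1, 1, 16, 9, 25, 1
unbiased_var = sum(squared_devs) / (n - 1)               # 66 / 7

print(sample_mean, unbiased_var)
# The standard library's statistics.variance also divides by n - 1.
print(statistics.variance(seeds))
```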
So this is going to be equal to 66 over 7. We could write that as 9 and 3/7. Or if we want to write that as a decimal, I could just take 66 divided by 7, which gives us 9 point-- I'll just round it. So it's approximately 9.43. Now, that gave us our unbiased sample variance. Well, how could we calculate a sample standard deviation? We want to somehow get an estimate of what the population standard deviation might be. Well, the logic, I guess, is reasonable to say, well, this is our unbiased sample variance. It's our best estimate of what the true population variance is. When we think about population parameters, to get the population standard deviation, we just take the square root of the population variance. So if we want to get an estimate of the population standard deviation, why don't we just take the square root of the unbiased sample variance? So that's what we'll do. So we'll define it that way. We'll call it the sample standard deviation. We're going to define it to be equal to the square root of the unbiased sample variance. It's going to be the square root of this quantity, and we can take our calculator out. It's going to be the square root of what I just typed in. I can do 2nd answer. It'll be the last entry here. So the square root of that is-- and I'll just round. It's approximately equal to 3.07. Now, I'm going to tell you something very counterintuitive. Or at least initially it's counterintuitive, but hopefully you'll appreciate this over time. This unbiased sample variance we've already talked about in some depth. People have even created simulations to show that this is an unbiased estimate of population variance when we divide it by n minus 1. And that's a good starting point if we're going to take the square root of anything.
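The calculator steps here can be mirrored in Python. Again this is a sketch with illustrative variable names. The standard library's `statistics.stdev` uses the same definition (the square root of the n - 1 variance).

```python
import math
import statistics

seeds = [4, 3, 5, 7, 2, 9, 11, 7]
n = len(seeds)
mean = sum(seeds) / n

unbiased_var = sum((x - mean) ** 2 for x in seeds) / (n - 1)   # 66 / 7
sample_sd = math.sqrt(unbiased_var)

print(round(unbiased_var, 2), round(sample_sd, 2))   # 9.43 3.07
print(statistics.stdev(seeds))                       # same value, unrounded
```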
But it actually turns out that the square root function is nonlinear, and that matters for this sample standard deviation. This is how it tends to be defined: the sample standard deviation is the square root of our unbiased sample variance, so the square root of the sum, from i equals 1 to n, of each data point minus the sample mean, squared, all divided by n minus 1. This is how we literally define our sample standard deviation. Because the square root function is nonlinear, it turns out that this is not an unbiased estimate of the true population standard deviation. And I encourage people to make simulations of that if they're interested. But then you might say, well, we went through great pains to divide by n minus 1 here in order to get an unbiased estimate of the population variance. Why don't we go through similar pains and somehow figure out a formula for an unbiased estimate of the population standard deviation? And the reason why that's difficult is this: to unbias the sample variance, we just have to divide by n minus 1 instead of n. And that'd work for any probability distribution for our population. It turns out that to do the same thing for the standard deviation, it's not that easy. It's actually dependent on how that population is actually distributed. So in statistics, we just define the sample standard deviation. And the one that we typically use is based on the square root of the unbiased sample variance. But when you take that square root, it does give you a biased result when you're trying to use this to estimate the population standard deviation. But it's the simplest, best tool we have.
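One way to build the kind of simulation suggested in the video is sketched below. Everything specific in it is an assumption made for illustration: a normal population with mean 10 and standard deviation 2, samples of size 8, 20,000 repetitions, and a fixed random seed.

```python
import random
import statistics

rng = random.Random(1)      # fixed seed so the run is repeatable
sigma = 2.0                 # true population standard deviation (assumed)
n, trials = 8, 20_000

variances, sds = [], []
for _ in range(trials):
    sample = [rng.gauss(10.0, sigma) for _ in range(n)]
    variances.append(statistics.variance(sample))   # divides by n - 1
    sds.append(statistics.stdev(sample))            # square root of that variance

avg_var = statistics.fmean(variances)
avg_sd = statistics.fmean(sds)

print("average sample variance:", avg_var, "vs true variance", sigma ** 2)
print("average sample std dev: ", avg_sd, "vs true std dev", sigma)
```

Under these assumptions the average sample variance should sit close to the true variance, while the average sample standard deviation should come out a little below the true standard deviation. Changing the population from normal to some other shape changes the size of that gap, which is the distribution dependence described in the video.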