Learn how to calculate standard deviation, how it relates to variance and mean, and the difference between population and sample standard deviation. Created by Sal Khan.
Video transcript
Let's review a little bit of everything we've learned so far, and hopefully it'll make everything fit together a little bit better. Then we'll do a bunch of calculations with real numbers, and I think it'll really hit the point home. So, first of all, let me make some columns. We could call one the concept, and then we'll note whether we're dealing with a population or a sample. So the first statistical concept we came up with was the notion of the mean, and we learned that it was one way to measure the average or central tendency of a data set. The other ways were the median and the mode. But the mean tends to show up a lot more, especially when we start talking about variances and, as we'll do in this video, the standard deviation. The mean of a population-- we use the Greek letter mu-- is equal to the sum of each of the data points in the population, each x sub i. That's an i. Let me make sure it looks like an i. So you're going to sum up each of those data points. You're going to start with the first one and you're going to go to the Nth one. We're assuming that there are big N data points in the population. And then you divide by the total number that you have. And this is like the average that you're used to taking before you learned any of the statistics stuff. You add up all the data points and you divide by the number there are. The sample is the same thing. We just use slightly different notation. The mean of a sample-- and I'll do it in a different color-- we just write as x with a line on top. And that's equal to the sum of all the data points in the sample. So each of the x sub i in the sample. But we're assuming the sample is something less than the population. So you start with the first one still. And then you go to the lowercase n, where we assume that lowercase n is less than the big N. 
If this was the same thing, then we're actually taking the mean of the entire population. And then you divide by the number of data points you added, which is n. Then we said, OK, this gives us the central tendency. It's one measure of the central tendency. But what if we wanted to know how good of an indicator this is for the population or for the sample? Or, on average, how far are the data points from this mean? And that's where we came up with the concept of variance. And I'll arbitrarily switch colors again. Variance. And in a population the notation for variance is sigma squared. This means variance. And that is equal to-- you take each of the data points. You find the difference between that and the mean that you calculated up there. You square it so you get the squared difference. And then you essentially take the average of all of these. You take the average of all of these squared distances. So you take the sum from i is equal to 1 to big N and you divide it by big N. That's the variance. And then the variance of a sample-- and this was a little bit more interesting, and we talked a little bit about it in the last video. You actually want to estimate the variance of the population when you're taking the variance of a sample. And in order to provide an unbiased estimate you do something very similar to here, but you end up dividing by n minus 1. So let me write that down. So the variance of a population-- I'm sorry, the variance of a sample, or sample variance, or unbiased sample variance, since that's why we're going to divide by n minus 1. That's denoted by s squared. What you do is you take the difference between each of the data points in the sample and the sample mean. We assume that we don't know the population mean. Maybe we did. If we knew the population mean we actually wouldn't have to do the unbiased thing that we're going to do here in the denominator. 
But when you have a sample, the only way to kind of figure out the population mean is to estimate it with the sample mean. So we assume that we only have the sample mean. And you're going to square those differences and then you're going to sum them up from i is equal to 1 to i is equal to n, because you have n data points. And if you want an unbiased estimator you divide by n minus 1. And we talked a little bit before about why you want this to be n minus 1 instead of n. And actually in a couple of videos I'll actually prove this to you. One, I'll show it maybe experimentally using Excel-- which wouldn't be a proof, it'll just give you a little bit of intuition-- and then I'll actually prove it a little bit more formally later on. But you don't have to worry about it right now. The next thing we'll learn is something that you've probably heard a lot of, especially in class, where teachers talk about the standard deviation of a test. It's actually probably one of the most used words in statistics. I think a lot of people unfortunately use it without fully appreciating everything that it involves. But hopefully we'll eventually appreciate all that it involves. But the standard deviation-- once you know variance it's actually quite straightforward. It's the square root of the variance. So the standard deviation of a population is written as sigma, which is equal to the square root of the variance. And now I think you understand why variance is written as sigma squared. And that is equal to just the square root of all that. It's equal to the square root-- I'll probably run out of space-- of all of that. So the sum-- I won't write the top or the bottom limits, that makes it messy-- of x sub i minus mu squared, everything over big N. 
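For reference, the formulas described so far can be written out as follows. This is a summary sketch of what is said above, assuming the notation from the review: big N for the population size, lowercase n for the sample size, x sub i for the i-th data point.

```latex
% Population mean and sample mean
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

% Population variance and (unbiased) sample variance
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2

% Population standard deviation: the square root of the variance
\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}
```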
And then if you wanted the standard deviation of a sample-- and it actually gets a little bit interesting, because the standard deviation of a sample, s, which is equal to the square root of the variance of a sample-- it actually turns out that s is not an unbiased estimator for sigma, the population standard deviation-- and I don't want to get too technical right now-- even though the sample variance, s squared, is actually a very good estimate of the population variance, sigma squared. The expected value of s squared is going to be sigma squared. And I'll go into more depth on expected values in the future. But it turns out that the expected value of s is not quite the same as sigma. But you don't have to worry about it for now. So why even talk about the standard deviation? Well, one, the units work out a little better. Let's say all of our data points were measured in meters, right? If we were taking a bunch of measurements of length, then the units of the variance would be meters squared, right? Because we're taking meters minus meters. This would be in meters. Then you're squaring. You're getting meters squared. And that's kind of a strange concept, if you say the average dispersion from the center is in meters squared. Well, first, when you take the square root of it you get something that's again in meters. So you're kind of saying, oh well, the standard deviation is x or y meters. And then we'll learn a little bit later that if you can actually model your data as a bell curve, or if you assume that your data has the distribution of a bell curve, then this tells you some interesting things about the probability of finding someone within one or two standard deviations of the mean. But anyway, I don't want to get too technical right now. Let's just calculate a bunch. Let's calculate. Let's see, if I had the numbers 1, 2, 3, 8, and 7. And let's say that this is a population. So what would its mean be? So I have 1 plus 2 plus 3. So it's 3 plus 3 is 6. 6 plus 8 is 14. 14 plus 7 is 21. So the mean of this population-- you sum up all the data points. 
You get 21 divided by the total number of data points, 1, 2, 3, 4, 5. 21 divided by 5, which is equal to what? 4.2. Fair enough. Now we want to figure out the variance. And we're assuming that this is the entire population. So the variance of this population is going to be equal to the sum of the squared differences of each of these numbers from 4.2. I'm going to have to get my calculator out. So it's going to be 1 minus 4.2 squared plus 2 minus 4.2 squared plus 3 minus 4.2 squared plus 8 minus 4.2 squared plus 7 minus 4.2 squared. And it's going to be all of that-- I know it looks a little bit funny-- divided by the number of data points we have-- divided by 5. So let me take the calculator out. All right. Here we go. Actually maybe I should have used the graphing calculator that I have. Let me see if I can get this thing-- if I could get this. There you go. Yeah, I think the graphing one will be better because I can see everything that I'm writing. OK, so let me clear this. So I want to take 1 minus 4.2 squared plus 2 minus 4.2 squared plus 3 minus 4.2 squared plus 8 minus 4.2 squared, where I'm just taking the sum of the squared distances from the mean-- one more-- plus 7 minus 4.2 squared. So that's the sum. The sum is 38.8. So the numerator is going to be equal to 38.8, and we divide that by 5. So this is the sum of the squared distances, right? Each of these-- just so you can relate to the formula-- each of these is x sub i minus the mean, squared. And so if we take the sum of all of them-- this numerator is the sum of each of the x sub i minus the mean squared, from i equals 1 to n. And that ended up being 38.8. And I just calculated it like that. I just took each of the data points minus the mean, squared, added them all up, and I got 38.8. And I went and divided by n, which is 5. So this n up here is actually also 5. Right? And so 38.8 divided by 5 is 7.76. So the variance-- let me scroll down a little bit-- the variance is equal to 7.76. 
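The calculator steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic just described; the function name `population_variance` is made up for this example, and results are rounded for display because of floating-point arithmetic.

```python
def population_variance(data):
    """Return (mean, sum of squared differences, variance), dividing by N."""
    n = len(data)
    mean = sum(data) / n
    # Squared distance of each data point from the mean
    squared_diffs = [(x - mean) ** 2 for x in data]
    sum_sq = sum(squared_diffs)
    # Population variance: average of the squared distances
    return mean, sum_sq, sum_sq / n


data = [1, 2, 3, 8, 7]
mean, sum_sq, variance = population_variance(data)
print(round(mean, 2), round(sum_sq, 2), round(variance, 2))  # 4.2 38.8 7.76
```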
Now if this was a sample of a larger distribution-- if the 1, 2, 3, 8, and 7 weren't the population, if it was a sample from a larger population-- instead of dividing by 5 we would have divided by 4. And we would have gotten the variance as 38.8 divided by n minus 1, which is divided by 4. So then we would have gotten the sample variance 9.7, if you divided by n minus 1 instead of n. But anyway, don't worry about that right now. That's just a change of n. But once you have the variance, it's very easy to figure out the standard deviation. You just take the square root of it. The square root of 7.76 is about 2.786. Let's say 2.79 is the standard deviation. So this gives us some measure of, on average, how far the numbers are away from the mean, which was 4.2. And it gives it in kind of the units of the original measurement. Anyway, I'm all out of time. I'll see you in the next video. Or actually, let's figure out-- we said if those numbers were a sample and not the population, then the sample variance was 9.7. And so then the sample standard deviation is just going to be the square root of that. The square root of 9.7, which would be 3.1. 3.11. Anyway, hopefully that makes it a little bit more concrete. We've been dealing with these sigma notation variables and all that so far. So when you actually do it with numbers you see it's hopefully not that difficult. Anyway, see you in the next video.
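As a cross-check on the numbers above, here is a short sketch using Python's standard `statistics` module, which has separate functions for the population (divide by n) and sample (divide by n minus 1) versions. The variable names are just for this example, and values are rounded for display.

```python
import statistics

data = [1, 2, 3, 8, 7]

# Treating the data as the whole population (divide by n)
pop_var = statistics.pvariance(data)
pop_sd = statistics.pstdev(data)

# Treating the data as a sample from a larger population (divide by n - 1)
samp_var = statistics.variance(data)
samp_sd = statistics.stdev(data)

print(round(pop_var, 2), round(pop_sd, 2))    # 7.76 2.79
print(round(samp_var, 2), round(samp_sd, 2))  # 9.7 3.11
```

The only difference between the two pairs is the denominator, which is why the sample figures come out slightly larger.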