Video transcript
This video, here, is a groundbreaking video, for multiple reasons. One, I'm going to introduce you to the variance of a sample, which is interesting in its own right. And I'm attempting to record this video in HD, and hopefully you can see it bigger and clearer than ever before. But we'll see how all of that goes. So this is a bit of an experiment-- bear with me.

Just before we go into the variance of a sample, I think it's instructive to review the variance of a population, so we can compare their formulas. The variance of a population is written with this Greek letter: lowercase sigma, squared. I know it's weird that a variable already has a squared in it-- you're not squaring the variable. This is the variable: sigma squared means variance. Actually, let me write that down: that equals variance. And it is equal to-- you take each data point, and we'll call them x sub i. You take each data point, find out how far it is from the mean of the population, and you square that distance. Then you take the average of all of those squared distances: you sum them up, from i is equal to 1-- the first point-- all the way to the N-th point, and then you divide by N. So the variance is the average of the squared distances of each point from the mean. And just to give you the intuition again, it essentially says: on average, roughly how far away is each of the points from the middle? That's the best way to think about the variance.

Now, this was for a population, right? And we said, if we wanted to figure out the variance of men's heights in the country, it would be very hard to do for the whole population. You would have to go and, essentially, measure everyone's height-- 250 million people. Or what if it's some population where it's just completely impossible to have the data, or some random variable? We'll go more into that later. So a lot of times, you actually want to estimate this variance by taking the variance of a sample-- the same way that you could never get the mean of a population, but you might estimate it by getting the mean of a sample. We learned that in the first video. If that's the whole population-- millions of data points, or even data points in the future that you'll never be able to get, because it's a random variable-- you might just want to estimate things by looking at a sample.

And this is actually what most of inferential statistics is all about: figuring out descriptive statistics for the sample, and making inferences about the population. Let me try this drug on 100 people, and if it seems to have statistically significant results, this drug will probably work on the population as a whole. So that's what it's all about. It's really important to understand this notion of a sample versus a population, and to be able to find statistics on a sample that, for the most part, can describe the population, or help us estimate what are called parameters for the population.

So let me rewrite these definitions. What's the mean of a population? I'll do it in that purple-- purple for population. For the mean of a population, you take each of the data points in the population-- x sub i. You sum them up, starting with the first data point and going all the way to the N-th data point, and you divide by N. That's the mean.
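Written out as formulas, here's what's on the board so far-- a population of N data points, with mean mu:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$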
So then you plug that mean into the variance formula, and you can see how far each point is from that central point-- from that mean-- and you get the variance.

Now what happens if we do it for a sample? Well, if we want to estimate the mean of a population by calculating a mean for a sample, the best thing I can think of-- and really, these are engineered formulas; these are human beings saying, well, what is the best way to estimate it?-- is just to take the average of our sample. And that's the sample mean. We learned in the first video that the formula is almost identical; it's just the notation that's different. Instead of writing mu, you write x with a line over it. The sample mean is equal to-- once again, you take each of the data points, now in the sample rather than in the whole population. You sum them up, from the first one to the n-th one-- we're saying that there are lowercase-n data points in this sample-- and then you divide by the number of data points you have. Fair enough. It's really the same formula: the way I took the mean of a population, I said, well, if I just have a sample, let me take the mean the same way, and it's probably a good estimate of the mean of the population.

Now, it gets interesting when we talk about variance. Your natural reaction is: OK, I have this sample. If I want to estimate the variance of the population, why don't I just apply this same formula, essentially, to the sample? So I could say-- and this is actually a sample variance, denoted s squared. Sigma is kind of the Greek-letter equivalent of s, so now that we're dealing with a sample, we just write an s there. So this is sample variance-- let me write that down: sample variance.

So we might just say, well, maybe a good way to take the sample variance is to do it the same way. Let's take the distance of each of the points in the sample from our sample mean. Up here, we used the population mean, but now we'll use the sample mean, because that's all we have-- we don't know what the population mean is without looking at the whole population. Take the square of each distance-- that makes it positive, and it has other properties which we'll go over later-- and then take the average of all of these squared distances. You sum them all up-- there are n of them, right, lowercase n-- and you divide by lowercase n. And you say, well, whatever this variance is, it might be a good estimate for the population as a whole. This is actually what some people refer to when they talk about sample variance, and sometimes it'll be written with a little lowercase n as a subscript-- because we divided by n.
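In the same written-out notation, the sample mean and this divide-by-n version of the sample variance, for a sample of n data points, are:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s_n^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$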
So you say, Sal, what's the problem here? Well, there is a problem, and I'll give you the intuition, because this is actually something that used to boggle my mind-- and frankly, I'm still struggling with it a bit. I have the intuition, but rigorously proving to myself that this is definitely the case is another matter. But think about this. Let me draw a number line, and let's say I have a bunch of numbers in my population. I'm just going to randomly put a bunch of numbers on the line, where the ones to the right are bigger than the ones to the left.

Now, if I were to take a sample of them-- and the sample should be random; you don't want it to be skewed in any way-- maybe I take this one, this one, this one, and that one. If I take the mean of that number, that number, that number, and that number, it'll be someplace in the middle-- maybe over there. And then, if I wanted to figure out the sample variance using this formula, I'd say: this distance squared, plus this distance squared, plus this distance squared, plus that distance squared, averaged out, gives me this number. And that probably would be a pretty good approximation of the variance of this entire population. The mean of the population is probably going to be pretty close to the sample mean-- if we actually took all of the data points and averaged them, maybe it's here someplace-- and if you figured out the population's variance, it probably would be pretty close to the average of all those squared distances we measured in the sample. Fair enough.

So you say, hey, Sal, this looks pretty good. But there's one little catch. There's always some probability that, instead of picking these fairly well-distributed numbers for my sample, I happen to pick this number, this number, that number, and, let's say, that number. Well, whatever your sample is, your sample mean is always going to be in the middle of it, right? So in this case, your sample mean might be right here. And all of these numbers-- you might say, OK, this number's not too far from that mean, that number's not too far, and that number's not too far. So your sample variance, when you do it this way, might turn out a little bit low, because all of these numbers are, almost by definition, going to be pretty close to their own mean. But in this case, your sample is skewed, and the actual mean of the population is out here someplace. So if you had actually known the population mean-- I know this is all a little confusing-- you would have measured these distances, which would have been a lot larger.

The whole point of what I'm saying is this: when you take a sample, there's some chance that your sample mean is pretty close to the population mean-- maybe your sample mean is here and your population mean is here-- and then this formula would work out pretty well, at least given your sample data points. But your sample mean is always going to be within your data sample-- it's always the center of your data sample-- while it's completely possible that the population mean is outside of it. You might just have happened to pick points that don't surround the actual population mean. And then the sample variance, calculated this way, will underestimate the actual population variance, because the points are always going to be closer to their own mean than they are to the population mean. If you're understanding, frankly, even 10% of this, you are a very advanced statistics student. But I'm saying all of this to hopefully give you some intuition for why this divide-by-n formula will often underestimate the actual population variance.
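And in fact-- though I won't prove it here-- the sample mean is exactly the point that minimizes the sum of squared distances to the sample points, so the squared deviations measured from x-bar can never add up to more than the squared deviations measured from the true mean mu:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 \;\le\; \sum_{i=1}^{n} (x_i - \mu)^2$$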
And there's a formula-- one that can be proven more rigorously than I'll do here-- that is considered a better, or, as they call it, unbiased, estimate of the population variance: the unbiased sample variance. Sometimes it's just denoted by s squared again; sometimes it's denoted s sub n minus 1, squared-- and I'll show you why. It's almost the same thing: you take each of the data points, figure out how far they are from the sample mean, square those distances, and take the average, summing from i equals 1 to i equals n. Except for one slight difference: instead of dividing by n, you divide by a slightly smaller number, n minus 1. So when you divide by n minus 1 instead of n, you're going to get a slightly larger number here, and it turns out that this is actually a much better estimate. One day, I'm going to write a computer program to prove to myself, experimentally, that this is a better estimate of the population variance-- something like the sketch below. And you calculate it the same way; you just divide by n minus 1.

The other way to think about it-- and actually, no, no, I'm all out of time. I'll leave you there for now, and in the next video we'll do a couple of calculations, just so you don't get too overwhelmed with these ideas, because we're getting a little bit abstract. See you in the next video.
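Here is one possible sketch of that experiment (the population, sample size, and trial count below are arbitrary choices for the demonstration, not anything shown in the video): draw many small samples from a population whose variance we know, and compare the average of the divide-by-n estimates with the average of the divide-by-(n-1) estimates.

import random

# Sketch of the experiment described above: sample repeatedly from a
# population whose variance we know, then compare the two formulas.
# The population (normal, sigma = 10) and sample size are arbitrary.

random.seed(0)
population = [random.gauss(0, 10) for _ in range(100_000)]

# True population variance: average squared distance from the population mean.
mu = sum(population) / len(population)
sigma_sq = sum((x - mu) ** 2 for x in population) / len(population)

n = 5               # small samples make the bias easy to see
trials = 50_000
biased_sum = 0.0
unbiased_sum = 0.0

for _ in range(trials):
    sample = random.sample(population, n)
    x_bar = sum(sample) / n
    sq_devs = sum((x - x_bar) ** 2 for x in sample)
    biased_sum += sq_devs / n          # divide by n
    unbiased_sum += sq_devs / (n - 1)  # divide by n - 1

print(f"population variance:          {sigma_sq:8.2f}")
print(f"average divide-by-n estimate: {biased_sum / trials:8.2f}")   # comes out low
print(f"average divide-by-(n-1):      {unbiased_sum / trials:8.2f}") # much closer

With a normal population of variance 100 and samples of size 5, the divide-by-n estimate averages out to roughly 80-- low by the factor (n - 1)/n = 4/5-- while the divide-by-(n-1) estimate averages out close to the true 100.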