Sal's old statistics videos
Statistics: Sample Variance
Using the variance of a sample to estimate the variance of a population
- This video here is a groundbreaking video
- for multiple reasons.
- One, I'm going to introduce you to the variance of a sample,
- which is interesting in its own right.
- And I'm attempting to record this video in HD.
- And hopefully you can see it bigger and clearer
- than ever before.
- But we'll see how all of that goes.
- So this is a bit of an experiment, so bear with me.
- But so, just before we go into the variance of a sample, I
- think it's instructive to review the variance
- of a population.
- And we can compare their formulas.
- The variance of a population-- And it's this Greek
- letter sigma.
- Lowercase sigma squared.
- That means variance.
- I know it's weird that a variable already
- has a square in it.
- You're not squaring the variable.
- This is the variable.
- Sigma squared means variance.
- Actually, let me write that down.
- That equals variance.
- And that is equal to-- You take each data point-- And
- we'll call them x sub i.
- You take each data point, find out how far it is from the
- mean of the population, you square it, and then you take
- the average of all of those.
- So you take the average, you sum them all up.
- You go from i is equal to 1.
- So from the first point, all the way to the nth point.
- And then, to average, you sum them all up and
- then you divide by n.
- So the variance is the average of these squared distances
- of each point from the mean.
- And just to give you the intuition again, it essentially
- says, on average, roughly how far away are each of the
- points from the middle.
- That's the best way to think about the variance.
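The formula described above can be written out as follows. This is a sketch in LaTeX notation; writing the population size as capital N is an assumption made here to keep it distinct from the sample size n that comes up later.

```latex
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
```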
- Now what if we're dealing-- This was for
- a population, right?
- And we said if we wanted to figure out the variance of
- men's heights in the country, it'd be very hard to
- figure out the variance for the population.
- You would have to go and, essentially, measure
- everyone's height.
- 250 million people.
- Or what if it's for some population where it's just
- completely impossible to have the data or some
- random variable.
- And we'll go more into that later.
- So a lot of times you actually want to estimate this variance
- by taking the variance of a sample.
- Same way that you could never get the mean of a population,
- but maybe you want to estimate it by getting the
- mean of a sample.
- And we learned that in that first video.
- If this is-- If that's the whole population.
- That's millions of data points, or even data points in the
- future that you'll never be able to get because it's
- a random variable.
- So this is the population.
- You might just want to estimate things by looking at a sample.
- And this is actually what most of inferential
- statistics is all about.
- Figuring out descriptive statistics about the sample
- and making inferences about the population.
- Let me try this drug on 100 people and if it seems to have
- statistically significant results, this drug will
- probably work on the population as a whole.
- So that's what it's all about.
- So it's really important to understand this notion of a
- sample versus a population.
- And being able to find statistics on a sample that,
- for the most part, can describe the population or help us
- estimate, they call it, parameters for the population.
- So what's the mean of a-- Let me rewrite these definitions.
- What's the mean of a population?
- I'll do that in purple.
- Purple for population.
- The mean of a population.
- You just take each of the data points in the population, x i.
- You sum them up.
- You start with the first data point and you go all the
- way to the nth data point.
- And you divide by n.
- You sum them all up and divide by n.
- That's the mean.
- So then you plug it into this formula.
- And you can see how far each point is from that central
- point, from that mean.
- And you get the variance.
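As a rough illustration of the two steps just described (take the mean, then plug it into the variance formula), here is a minimal Python sketch. The toy population and the function names are made up for this example and are not from the video.

```python
# Hypothetical toy population; the values are made up for illustration.
population = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


def population_mean(xs):
    """Sum every data point and divide by how many there are."""
    return sum(xs) / len(xs)


def population_variance(xs):
    """Average of the squared distances of each point from the mean."""
    mu = population_mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


mu = population_mean(population)
sigma_squared = population_variance(population)
print(mu, sigma_squared)  # prints 5.0 4.0
```

This only works when every data point in the population is available, which is exactly the situation the rest of the video says you usually do not have.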
- Now what happens if we do it for a sample?
- Well, if we want to estimate the mean of a population by
- somehow calculating a mean for a sample, the best thing I can
- think of-- And really these are kind of engineered formulas.
- These are human beings saying, well what is the best
- way to sample it?
- Well all we can do is really take an average of our sample.
- And that's the sample mean.
- And we learned in the first video that that notation--
- The formula is almost identical to this.
- It's just the notation's different.
- Instead of writing mu, you write x with a line over it.
- Sample mean is equal to-- Once again, you take each of the
- data points now in the sample, not in the whole population.
- You sum them up from the first one and then to
- the nth one, right?
- They're saying that there are n data points in this sample.
- And then you divide it by the number of data points you have.
- Fair enough.
- It's really the same formula.
- The way I took the mean of a population, I said, well, if I
- just have a sample, let me just take the mean the same way.
- And it's probably a good estimate of the mean
- of the population.
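Side by side, the two means described above look like this. It is a sketch in LaTeX notation; capital N for the population size and lowercase n for the sample size are labeling assumptions made here so the two are easy to tell apart.

```latex
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```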
- Now it gets interesting when we talk about variance.
- So your natural reaction is OK, I have this sample.
- If I want to estimate the variance of the population, why
- don't I just apply this same formula essentially
- to the sample?
- So I could say-- And this is actually a sample variance.
- They use the notation s squared.
- So sigma is kind of the Greek letter equivalent of s.
- So now when we're dealing with the sample, we
- just write the s there.
- So this is sample variance.
- Let me write that down.
- Sample variance.
- This is-- So we might just say, well maybe a good way to take
- the sample variance is do it the same way.
- Let's take the distance of each of the points in the sample.
- Find out how far it is from our sample mean.
- Here we used the population mean, but now we'll just use
- the sample mean because that's all we can have.
- We don't know what the population mean is
- without looking at the whole population.
- Take the square of that.
- That makes it positive and it has other properties,
- which we'll go over later.
- And then take the average of all of these squared distances.
- So you take it from-- You sum them all up.
- And there's n of them to sum up, right?
- Lowercase n.
- And you divide by lowercase n.
- And you say, well this is a good estimate.
- Well whatever this variance is, that might be a good estimate
- for the population as a whole.
- Actually this is what some people often refer to when they
- talk about sample variance.
- And sometimes it'll actually be referred to as this.
- They'll put a little lowercase n there.
- And the reason why they do that is because we divided by n.
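A minimal Python sketch of this "divide by n" version of the sample variance might look like the following. The sample values and the name `sample_variance_n` are hypothetical, chosen only to mirror the lowercase-n label mentioned above.

```python
# Hypothetical sample; the values are made up for illustration.
sample = [1.0, 2.0, 3.0, 6.0]


def sample_mean(xs):
    """x-bar: the average of the data points in the sample."""
    return sum(xs) / len(xs)


def sample_variance_n(xs):
    """Average squared distance from the sample mean, dividing by n."""
    x_bar = sample_mean(xs)
    return sum((x - x_bar) ** 2 for x in xs) / len(xs)


s_n_squared = sample_variance_n(sample)
print(sample_mean(sample), s_n_squared)  # prints 3.0 3.5
```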
- And you say, Sal what's the problem here?
- And the problem-- And I'll give you the intuition because this
- is actually something that used to boggle my mind.
- And I'm still frankly struggling with the
- intuition behind it.
- Well I have the intuition, but more of kind of rigorously
- proving it to myself that this is definitely the case.
- But think about this.
- If I have a bunch of numbers, and I'll draw
- a number line here.
- If I draw a number line here-- So let's say you know that--
- And let's say I have a bunch of numbers in my population.
- So let's say-- I'm just going to randomly put a bunch of
- numbers in my population.
- And the ones to the right are bigger than the
- ones to the left.
- And if I were to take a sample of them, maybe I take--
- The sample, it's random.
- You actually want to take a random sample.
- You don't want to be skewed in any way.
- So maybe I take this one, this one, this one,
- and that one, right?
- And then if I were to take the mean of that number, that
- number, that number, that number.
- It will be someplace in the middle.
- It might be someplace over there.
- And then if I wanted to figure out the sample variance using
- this formula, I'd say OK this distance squared plus this
- distance squared plus this distance squared plus that
- distance squared and average them all out.
- And then I would get this number.
- And that probably would be a pretty good approximation for
- the variance of this entire population.
- The mean of the population is probably going to
- be-- I don't know.
- It might be pretty close to this.
- If we actually took all of the data points and averaged them,
- maybe they're like here someplace.
- And then if you figure out the variance, it probably would be
- pretty close to the average of all of these lines, right?
- All of the sample variance distances, right?
- Fair enough.
- So you say, hey Sal.
- This looks pretty good now.
- But there's one little catch.
- What if-- There's always a probability that instead of
- picking these kind of fairly well-distributed numbers in my
- sample, what if I happen to pick this number, this number,
- and that number as my-- and let's say that number
- as my sample?
- Well whatever your sample is, your sample mean is
- always going to be in the middle of it, right?
- So in this case, your sample mean might be right here.
- So all of these numbers, you might say OK this number is not
- too far from that number, that number's not too far, and then
- that number's not too far.
- So your sample variance, when you do it this way, might
- turn out a little bit low.
- Because all of these numbers, they're pretty-- they're,
- almost by definition, going to be pretty close to the
- mean of each other.
- But in this case, your sample is kind of skewed and the
- actual mean of the population is out here someplace.
- So the actual variance of the sample, if you had actually
- known the mean-- I know this is all a little confusing.
- If you had actually known the mean, you would
- have said oh wow.
- You would have found these distances, which would
- have been a lot more.
- The whole point of what I'm saying is, when you take a
- sample, there's some chance that your sample mean is pretty
- close to the population mean, right?
- Maybe your sample mean is here and your population
- mean is here.
- And then this formula will probably work out pretty well,
- at least given your sample data points and figuring out
- what the variance is.
- But there's a reasonable chance that your sample mean-- Your
- sample mean is always going to be within your data sample, right?
- It's always going to be the center of your data sample.
- But it's completely possible that the population mean is
- outside of your data sample.
- It might have just been you just happen to pick ones
- that don't contain the actual population mean.
- And then this sample variance calculated this way will
- actually underestimate the actual population
- variance, right?
- Because they're always going to be closer to their own mean
- than they are to the population mean.
- And if you're understanding, frankly, even like 10%
- of this, you are a very advanced statistics student.
- But I'm saying all of this to just give you, hopefully, some
- intuition to realize that this will often underestimate.
- This formula will often underestimate the actual
- population variance.
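The number-line argument above can be checked on a small made-up case. In this Python sketch the population, the deliberately skewed sample, and the helper names are all assumptions for illustration. It measures the same sample points once from their own mean and once from the population mean.

```python
def mean(xs):
    return sum(xs) / len(xs)


def average_squared_distance(xs, center):
    """Average squared distance of each point in xs from a chosen center."""
    return sum((x - center) ** 2 for x in xs) / len(xs)


# Hypothetical population and an unlucky sample bunched at the low end.
population = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
skewed_sample = [1, 2, 3, 4]

sample_mean = mean(skewed_sample)    # 2.5, sits inside the sample
population_mean = mean(population)   # 5.5, sits outside the sample

around_sample_mean = average_squared_distance(skewed_sample, sample_mean)
around_population_mean = average_squared_distance(skewed_sample, population_mean)
population_variance = average_squared_distance(population, population_mean)

print(around_sample_mean)      # prints 1.25
print(around_population_mean)  # prints 10.25
print(population_variance)     # prints 8.25
```

The points are much closer to their own mean (1.25) than to the population mean (10.25), so dividing by n around the sample mean lands well below the population variance of 8.25 for this particular sample.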
- And there's a formula, and this is actually proven more
- rigorously than I will do it, that is considered to be a
- better, or they'll call it an unbiased, estimate of the
- population variance.
- Or the unbiased sample variance.
- And sometimes it's just denoted by the s squared again.
- Sometimes it's denoted by this s n minus 1 squared.
- And I'll show you why.
- It's almost the same thing.
- You take each of the data points, figure out how far they
- are from the sample mean.
- You square them.
- And then you take the average of those squared, except
- for one slight difference.
- i equals 1 to i equals n.
- Instead of dividing by n, you divide by a slightly
- smaller number.
- You divide by n minus 1.
- So when you divide by n minus 1 instead of dividing by
- n, you're going to get a slightly larger number here.
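In LaTeX notation, the version that divides by n minus 1 can be sketched as below. The second line is simply algebra, assuming the divide-by-n version is written as s_n squared: the two differ only by the factor n over n minus 1.

```latex
s_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

s_{n-1}^2 = \frac{n}{n-1} \, s_n^2
```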
- And it turns out that this is actually a
- much better estimate.
- And one day I'm going to write a computer program to at least
- prove it to myself experimentally that this is a
- better estimate of the population variance.
- And you would calculate it the same way.
- You just divide by n minus 1.
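One way such an experiment could look is sketched below in Python. Everything in it is an assumption for illustration, not something from the video: the toy population, the sample size of 5, the 20,000 trials, the fixed seed, and the choice to sample with replacement so that each draw is independent.

```python
import random


def variance_divide_by(xs, divisor):
    """Sum of squared distances from the mean of xs, divided by divisor."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs) / divisor


def run_experiment(population, sample_size, trials, seed=0):
    """Average both estimates over many random samples (with replacement)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total_n = 0.0
    total_n_minus_1 = 0.0
    for _ in range(trials):
        sample = rng.choices(population, k=sample_size)
        total_n += variance_divide_by(sample, sample_size)
        total_n_minus_1 += variance_divide_by(sample, sample_size - 1)
    return total_n / trials, total_n_minus_1 / trials


# Hypothetical population whose true variance can be computed directly.
population = list(range(1, 11))
true_variance = variance_divide_by(population, len(population))  # 8.25

avg_n, avg_n_minus_1 = run_experiment(population, sample_size=5, trials=20000)
print("true population variance:", true_variance)
print("average when dividing by n:", avg_n)
print("average when dividing by n - 1:", avg_n_minus_1)
```

Over many samples, the divide-by-n average should settle noticeably below the true variance, while the divide-by-(n minus 1) average should settle close to it. Any single sample can still miss in either direction.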
- The other way to think about it-- And actually, no.
- I'm all out of time.
- I'll leave you there now.
- And then in the next video, we'll do a couple of
- calculations just so you don't get too overwhelmed
- with these ideas.
- Because we're getting a little bit abstract.
- See you in the next video.