## Comparing two means

# Difference of sample means distribution

## Video transcript

I want to build on what
we did in the last video a little bit. Let's say we have two
random variables. So I have random variable x. And let me draw its probability
distribution. And actually, it doesn't
have to be normal. But I'll just draw it as
a normal distribution. So this is the distribution
of random variable x. This is the mean. The population mean of
random variable x. And then it has some type
of standard deviation. Actually, let me just focus
on the variance. So it has some variance right
here for random variable x. This is x, the distribution
for x. Let's say we have another
random variable. Random variable y. Let's do the same
thing for it. Let's draw its distribution. And let me draw the parameters
for that distribution. So it has some true mean, some
population mean for the random variable y. And it has some variance
right over here. And I've drawn it
roughly normal. Once again, we don't have to
assume that it's normal. Because we're going to assume,
when we go to the next level, that when we take the samples,
we're taking enough samples that the central limit theorem
will actually apply. But with that said, let's
think about the sampling distributions of each of
these random variables. So let's think about the
sampling distribution of the sample mean of x. Let's say the sample size
over here is going to be equal to n. So what is that going
to look like? Well it's going to be
some distribution. And we're assuming that n is
a fairly large number. So this is going to be a
normal distribution. Or it can be approximated with
a normal distribution. Let me shift it over
a little bit. I'm going to draw it a
little bit narrow. Let me draw the mean. So the mean of the
sampling distribution is going to be denoted with this x bar,
which tells us it's the distribution of the sample means when the
sample size is n. And we know that this is going
to be the same thing as the population mean for that
random variable. And we know from the central
limit theorem that the variance of the sampling
distribution or, often called the standard error of the mean,
is going to be equal to the population variance
divided by this n right over here. And if you wanted the standard
deviation of this, you just take the square root
of both sides. Let's do the same thing
for random variable y. Let's take the sampling
distribution of the sample mean. But here, we're talking about
y, random variable y. And let's just say it has
a different sample size. It doesn't have to be
a different one. But it just shows you that it
doesn't have to be the same. So it has a sample size of m. Let me draw its distribution
right over here. Once again, it'll be a narrower
distribution than the population distribution. And it will be approximately
normal, assuming that we have a large enough sample size. And the mean of the sampling
distribution of the sample mean is going to be the same
thing as the population mean. We've seen that multiple
times. And its variance for the sample
means, or the standard error of the mean. Actually, this isn't
the standard error. Standard error would be the
square root of this. So if I called this standard
error of the mean, that's wrong. The standard error of the mean
is the square root of this. It's the standard deviation. This is the variance
of the mean. Don't want to confuse you. So the variance of the mean here
is going to be the exact same thing. It's going to be the variance
of the population divided by our sample size. And everything we've done so
far is complete review. It's a little different, because
I'm actually doing it with two different
random variables. And I'm doing it with
two different random variables for a reason. Because now I'm going to define
a new random variable. We could just call it z. But z is equal to
the difference of our sample means. It's equal to the x sample mean
minus the y sample mean. So what does that really mean? Well, to get a sample mean,
or at least for this distribution, you're taking
n samples from this population over here. Maybe n is 10. You're taking 10 samples
and finding its mean. That sample mean is
a random variable. Let's say you take 10 samples
from here and you get 9.2 when you find their mean. That 9.2 can be viewed as a
sample from this distribution right over here. Same thing if this
right here is m. Or if m right here is 12. You're taking 12 samples,
taking its mean. And that sample mean, maybe it's
15.2, could be viewed as a sample from this
distribution. As a sample from the sampling
distribution. So what z is, z is a random
variable where you're taking n samples from this distribution
up here, this population distribution, taking its mean. Then you're taking m samples
from this population distribution up here,
taking its mean. And then finding the difference
between that mean and that mean. So it's another random
variable. But what is the distribution
of z? So let's draw it. Well there's a couple
of things we immediately know about z. And we kind of came up with
this in the last video. Instead of writing z, I'm just
going to write the mean of x bar, which is a sample from the
sampling distribution of x, or the sample mean of x,
minus the sample mean of y. We saw this in the last video. In fact, I think I still
have the work up here. Yeah, I still have the
work right up here. The mean of the difference
is going to be the difference of the means. The mean of the difference
is the same thing as the difference of the means. So the mean of this new
distribution right over here is going to be the same thing as
the mean of our sample mean of x minus the mean of our
sample mean of y. And this might seem a little
abstract in this video. In the next video, we're
actually going to do this with concrete numbers. And hopefully it'll make a
little bit more sense. And just so you know where we're
going with this, the whole point of this is so that
we can eventually do some inferential statistics about
differences of means. How likely is it that a difference of
means of two samples is due to random chance? Or what is a confidence
interval of the difference of means? That's what this is all
building up to. So anyway, we know
the mean of this distribution right over here. And what's the variance
of this distribution? We came up with that result
in the last video. If we're taking essentially the
difference of two independent random variables, the variance is going
to be the sum of the variances of those two random variables. And the whole point of that
video is to show that it's not the difference of the
variances, it's the sum of the variances. The variance of this new
distribution-- and I haven't drawn the distribution yet--
The variance of this new distribution, I'll just write x
bar minus y bar, is going to be equal to the sum of the
variances of each of these distributions. The variance of x bar plus
the variance of y bar. Actually, let me just
draw this here. Just so we can visualize
another distribution. Although, all I'm going to
draw is another normal distribution. Let me scroll down
a little bit. So the mean over here, the mean
of x bar minus y bar, is going to be equal to
the difference of these means over here. I don't have to rewrite it. Let me draw the curve. And notice, I'm drawing a fatter
curve than either one. And why am I doing that? Because the variance here is the
sum of the variances here. So we're going to have
a fatter curve. It's going to have a bigger
variance, or a bigger standard deviation than either
of these. So then we have some variance
here, variance of x bar minus y bar. Now what are these, in terms
of the original population distribution? We came up with those results
right over here. We know what each of these
variances is. We know that this thing is the
same thing as the variance of the population distribution
divided by n. We've done this multiple,
multiple times. What's this going
to be equal to? This right here is the same
thing as the variance of our population distribution. And the x just means this is
for random variable x. But there's no bar on top. This is the actual population
distribution, not the sampling distribution of the
sample mean. So that divided by n. And then if we want the variance
of the sampling distribution for y, let me do
that in a different color. I'll use blue, because that was
what we were using for the y random variable. That's going to be equal to
this thing over here. And we've done this
multiple times. Same exact logic as this. The variance of the population distribution
for y divided by m. And so once again, I'll just
write this out front. This is the variance
of the differences of the sample means. And now if you wanted the
standard deviation of the differences of the sample means,
you just have to take the square root of both
sides of this. You take the square root of
this, you get the standard deviation of the difference of
the sample means is equal to the square root of the
variance of the population distribution of x divided by n plus the variance
of the population distribution of y divided by m. And this is just neat, because it kind of looks
a little bit like a distance formula. I'll throw that out there as we
get more sophisticated with our statistics and try to
visualize what all of this kind of stuff means in
more advanced topics. But the whole point of this is,
now we can make inferences about a difference of means. If we have two samples, and we
take the means of both of those samples and we find some
difference, we can make some conclusions about
how likely that difference was just by chance. And we're going to do that
in the next video.
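
The results above can be checked with a small simulation. This is only a sketch under stated assumptions, not something worked in the video: the two populations (a uniform one for x and an exponential one for y), the number of trials, and the helper names `simulate_differences` and `theoretical_sd_of_difference` are all made up for illustration. The sample sizes n = 10 and m = 12 echo the examples mentioned in the transcript. The two samples are drawn independently, which is what the "sum of the variances" result relies on.

```python
import math
import random
import statistics


def theoretical_sd_of_difference(var_x, n, var_y, m):
    """Standard deviation of (x-bar minus y-bar) for independent samples."""
    return math.sqrt(var_x / n + var_y / m)


def simulate_differences(draw_x, n, draw_y, m, trials, rng):
    """Repeat the experiment `trials` times and collect x-bar minus y-bar."""
    diffs = []
    for _ in range(trials):
        x_bar = statistics.fmean(draw_x(rng) for _ in range(n))
        y_bar = statistics.fmean(draw_y(rng) for _ in range(m))
        diffs.append(x_bar - y_bar)
    return diffs


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical populations (assumptions, not from the video), chosen to be
# clearly non-normal:
#   X ~ Uniform(0, 12): mean 6, variance 12**2 / 12 = 12
#   Y ~ Exponential with mean 2: variance 2**2 = 4
n, m = 10, 12
mu_x, var_x = 6.0, 12.0
mu_y, var_y = 2.0, 4.0

diffs = simulate_differences(
    lambda r: r.uniform(0, 12), n,
    lambda r: r.expovariate(1 / 2.0), m,
    trials=20_000, rng=rng,
)

# Compare the simulated values with the formulas from the transcript.
print("mean of differences:", statistics.fmean(diffs),
      "theory:", mu_x - mu_y)
print("variance of differences:", statistics.pvariance(diffs),
      "theory:", var_x / n + var_y / m)
print("standard deviation:", statistics.pstdev(diffs),
      "theory:", theoretical_sd_of_difference(var_x, n, var_y, m))
```

With enough trials, the simulated mean should land close to the difference of the population means, and the simulated variance close to the variance of x divided by n plus the variance of y divided by m, even though neither population is normal.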