# ANOVA 2: Calculating SSW and SSB (total sum of squares within and between)

Analysis of Variance 2 - Calculating SSW and SSB (Total Sum of Squares Within and Between). Created by Sal Khan.

Video transcript

In the last video, we
were able to calculate the total sum of squares
for these nine data points right here. And these nine data
points are grouped into three different groups, or
if we want to speak generally, into m different groups. What I want to do
in this video is to figure out how much of
this total sum of squares is due to variation within
each group versus variation between the actual groups. So first, let's figure
out the total variation within the group. So let's call that the
sum of squares within. So let's calculate the
sum of squares within. I'll do that in yellow. Actually, I already used
yellow, so let me do blue. So the sum of squares within. Let me make it clear. The W stands for within. So we want to see how
much of the variation is due to how far each
of these data points are from their central tendency,
from their respective mean. So this is going to
be equal to-- let's start with these guys. So instead of taking the
distance between each data point and the mean
of means, I'm going to find the distance
between each data point and that group's
mean, because we want the sum of the squared distances
between each data point and its respective mean. So let's do that. So it's 3 minus 2, the mean here
is 2, squared, plus 2 minus 2 squared, plus 1 minus 2 squared.
And I'm going to do this for
all of the groups, but for each group, it's the
distance between each data point and its own mean. So plus 5 minus 4
squared, plus 3 minus 4 squared,
plus 4 minus 4 squared. And then finally, we
have the third group. We're finding the sum of squares from each point to its
central tendency within its own group, and then we're going to
add them all up. For the third group, the
mean is 6, so we have 5 minus 6 squared, plus 6 minus 6 squared,
plus 7 minus 6 squared. And what is this going to equal? Up here, the first group is
1 plus 0 plus 1, so that's 2. The second group is
1 plus 1 plus 0, so another 2.
And the third group is 1 plus 0 plus 1, since 7 minus 6 is 1, and 1 squared is 1. So that's another 2. So our
sum of squares within is 2 plus 2 plus 2, which is 6. So one way to think about it--
our total variation was 30. And based on this
calculation, 6 of that 30 comes from variation
within these samples. Now, the next thing
I want to think about is how many degrees of freedom
do we have in this calculation? How many independent data
points do we actually have? Well, for each of
these groups, we have n data points. In particular, n is 3 here. But if you know n
minus 1 of them, you can always figure
out the nth one if you know the
actual sample mean. So in this case, for
any of these groups, if you know two of
these data points, you can always
figure out the third. If you know these two, you can
always figure out the third if you know the sample mean. So in general, let's figure out
the degrees of freedom here. For each group,
when you did this, you had n minus 1
degrees of freedom. Remember, n is the
number of data points you had in each group. So you have n minus
1 degrees of freedom for each of these groups. So it's n minus 1, n
minus 1, n minus 1. Or let me put it this
way-- you have n minus 1 for each of these groups,
and there are m groups. So there's m times the quantity n minus
1 degrees of freedom. And in this case in particular,
each group-- n minus 1 is 2. Or in each case, you had
2 degrees of freedom, and there's three
groups of that. So there are 6
degrees of freedom. And in the future, we might
do a more detailed discussion of what degrees of
freedom mean, and how to mathematically
think about it. But the simplest
way to think about it is as the number of truly
independent data points, assuming you
already know the central statistic, in this case the sample mean,
that we used to calculate the squared
distances in each group. If you know the sample mean already,
the third data point in a group can actually be calculated
from the other two. So we have 6 degrees
of freedom over here. Now, that was how much
of the total variation is due to variation
within each sample. Now let's think about
how much of the variation is due to variation
between the samples. And to do that, we're
going to calculate a new quantity. Let me get a nice color here. I think I've run
out of colors. We'll call this the sum
of squares between. The B stands for between. So another way to
think about it-- how much of this
total variation is due to the variation
between the means, between the central
tendency-- that's what we're going to
calculate right now-- and how much is due to
variation from each data point to its mean? So let's figure out how
much is due to variation between these guys over here. Actually, let's think about
just this first group. For this first group,
how much variation for each of these guys is due to
the variation between this mean and the mean of means? Well, so for this
first guy up here-- I'll just write it
all out explicitly-- the variation is going
to be its sample mean minus the mean of means, squared. So it's going to be 2 minus
4, the mean of means, squared. And then for this
guy, it's going to be the same thing,
its sample mean, 2, minus the mean
of means, squared. Plus the same thing for this guy, 2
minus the mean of means, squared. Or another way to
think about it, this is equal to, I'll
write it over here, 3
times the quantity 2 minus 4, squared. 2 minus 4 is negative 2, and squared that's 4. So this is equal to 3 times 4, which is 12. And then we could do
it for each of them. And actually, I want
to find the total sum. So let me just write
it all out, actually. I think that might
be an easier thing to do, because I want to find,
for all of these guys combined, the sum of squares
due to the differences between the samples. So that's the contribution
from the first sample. And then from the second sample,
you have this guy over here. We don't use the data point's own value here. For this data point,
the amount of variation due to the difference
between the means is going to be its sample mean minus the mean of means, so 4
minus 4 squared. Same thing for this guy. It's going to be
4 minus 4 squared. We're not taking
the data point itself into consideration. We're only taking its sample
mean into consideration. And then finally, plus
4 minus 4 squared. We're taking the sample mean
minus the mean of means, squared, for each of these data points. And then we'll do
that with the last group. With the last group, the
sample mean is 6. So it's going to be
6 minus 4 squared, plus 6 minus 4 squared, plus
6 minus 4 squared. Now, let's think about how
many degrees of freedom we had in this calculation
right over here. Well, in general, I
guess the easiest way to think about it is, how
much information did we have, assuming that we
knew the mean of means? If we know the mean
of means, how much here is new information? Well, if you know
the mean of means, and you know two of
these sample means, you can always
figure out the third. If you know this
one and this one, you can figure out that one. And if you know that
one and that one, you can figure out that one. And that's because this is the
mean of these means over here. So in general, if you have m
groups, or if you have m means, there are m minus 1
degrees of freedom here. Let me write that. And in this case, m is 3. So we could say there are
2 degrees of freedom for this exact example. Now let's actually calculate
the sum of squares between. So what is this going to be? I'll just scroll down. Running out of space. For the first group,
2 minus 4 is negative 2, and negative 2 squared is 4. And we have
three 4's over here, so that's 3 times 4. For the second group,
4 minus 4 is 0, so that's 3 times 0. And for the third group,
6 minus 4 is 2, and 2 squared is 4, so
that's another 3 times 4. So we get 3 times 4, which is 12,
plus 0, plus 12, which is equal to 24. So the sum of
squares between, or we could say, the variation due
to the difference between the groups,
between the means, is 24. Now, let's put it all together. We said that the
total variation, if you looked at
all 9 data points, is 30. Let me write that over here. So the total sum of
squares is equal to 30. We figured out
the sum of squares between each data point
and its central tendency, its sample mean,
and when you totaled
it all up, we got 6. So the sum of squares
within was equal to 6. And in this case, it had
6 degrees of freedom. Or if we wanted to
write it generally, there were m times the quantity n minus
1 degrees of freedom. And actually, for the
total, we figured out we have mn minus
1 degrees of freedom, where mn is m times n, the total number of data points. Actually, let me just
write degrees of freedom in this column right over here. In this case, the number
turned out to be 8. And then just now we
calculated the sum of squares between the samples. The sum of squares between
the samples is equal to 24. And we figured out that it had
m minus 1 degrees of freedom, which ended up being 2. Now, the interesting
thing here-- and this is why this analysis of variance
all fits nicely together, and in future videos we'll
think about how we can actually test hypotheses using
some of the tools that we're thinking
about right now-- is that the sum
of squares within, plus the sum of
squares between, is equal to the total
sum of squares. So a way to think about it is
that the total variation in this data right
here can be described as the sum of the
variation within each of these groups, when
you take that total, plus the sum of the
variation between the groups. And even the degrees
of freedom work out. The sum of squares between
had 2 degrees of freedom. The sum of squares
within each of the groups had 6 degrees of freedom. 2 plus 6 is 8. That's the total
degrees of freedom we had for all of
the data combined. It even works if you
look at the more general case. So our sum of
squares between had m minus 1 degrees of freedom. Our sum of squares
within had m times the quantity n minus 1 degrees of freedom. So the sum of these is m
minus 1, plus mn minus m. The m and the minus m cancel out. This is equal to mn minus
1 degrees of freedom, which is exactly the total
degrees of freedom we had for the total
sum of squares. So the whole point
of the calculations that we did in the last
video and in this video is just to appreciate that
this total variation over here, this total variation
that we first calculated, can be viewed as the sum
of these two component variations-- how much
variation is there within each of the samples plus
how much variation is there between the means
of the samples? Hopefully that's
not too confusing.
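
As a rough check on the arithmetic in this video, here is a minimal Python sketch that recomputes the three sums of squares and their degrees of freedom for the nine data points used above. The variable and function names (`groups`, `mean`, `ssw`, `ssb`, `sst` and so on) are illustrative choices for this sketch, not something from the video, and it assumes every group has the same number of points, as in this example.

```python
# Data from the video: m = 3 groups with n = 3 data points each.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

m = len(groups)        # number of groups
n = len(groups[0])     # points per group (assumed equal across groups)

all_points = [x for group in groups for x in group]
grand_mean = mean(all_points)  # the "mean of means" when group sizes are equal

# Total sum of squares: each point against the grand mean.
sst = sum((x - grand_mean) ** 2 for x in all_points)

# Sum of squares within: each point against its own group's mean.
ssw = sum((x - mean(group)) ** 2 for group in groups for x in group)

# Sum of squares between: each group's mean against the grand mean,
# counted once per data point in that group.
ssb = sum(len(group) * (mean(group) - grand_mean) ** 2 for group in groups)

# Degrees of freedom, as described in the video.
df_within = m * (n - 1)
df_between = m - 1
df_total = m * n - 1

print(f"SSW = {ssw}, df = {df_within}")
print(f"SSB = {ssb}, df = {df_between}")
print(f"SST = {sst}, df = {df_total}")
```

Running this should report a sum of squares within of 6, a sum of squares between of 24, and a total sum of squares of 30, with 6, 2, and 8 degrees of freedom respectively, matching the decomposition worked out above.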