## Statistics and probability

Lesson 1: Analysis of variance (ANOVA)

# ANOVA 2: Calculating SSW and SSB (total sum of squares within and between)

Analysis of Variance 2 - Calculating SSW and SSB (Total Sum of Squares Within and Between). Created by Sal Khan.

## Want to join the conversation?

- Thanks Sal for the great video.

Would you please elaborate on the degrees of freedom part? I find it really interesting, but unfortunately I don't think I grasped its meaning entirely.

04:46- "In the future we might do a more detailed discussion of what degrees of freedom mean and how to ..."

Thank you.(55 votes)
- Well, the way I explained it to myself:

If we have a defined population we can easily calculate the mean.

5, 4, 3, 2, 5: (5+4+3+2+5)/5 = 3.8

We can always find the "one missing" population member if we know that the mean is 3.8, that there are 5 members, and what the other 4 members are.

(5+4+3+2+x)/5 = 3.8 (multiply both sides by 5)

5+4+3+2+x=19

x=19-5-4-3-2

x=5 (fifth member of our population was indeed 5)

If we tried the same thing with two missing members, it would not have come out the same.

(5+4+x+x)/4=5 (multiply both sides by 4)

5+4+2x=20

2x=20-9

2x=11

x=5.5

Clearly not correct: with two members missing, knowing the mean is no longer enough to recover them.(7 votes)
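
The "one missing member" idea above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the comment; the function name `missing_member` is made up for this example and does not come from the video.

```python
def missing_member(known_values, mean, n):
    """Recover the single unlisted value, given the mean of all n values.

    Works only when exactly n - 1 values are known: the mean supplies
    one equation, so it can resolve one unknown and no more.
    """
    if len(known_values) != n - 1:
        raise ValueError("exactly n - 1 values must be known")
    # mean = (sum of known values + x) / n  =>  x = mean * n - sum of known values
    return mean * n - sum(known_values)


# The comment's example: four known members, mean 3.8, five members in total.
recovered = missing_member([5, 4, 3, 2], mean=3.8, n=5)
print(round(recovered, 6))
```

Passing only three known values raises an error, which mirrors the point of the comment: once the mean is fixed, n - 1 values determine the last one, but fewer than that do not.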

- These videos on ANOVA follow the videos on regression. I was wondering if someone can post a statement that clarifies the distinction between the respective goals of regression and ANOVA analysis, i.e., you conduct regression analysis when your objective is ____, but you need ANOVA when your goal is ____. Thanks in advance.(10 votes)
- Regression: You have two quantitative (numerical) variables, and you want to know the relationship and/or predict values of one of them. For example, if you know a person's height, you may want to predict his/her weight.

ANOVA: You are interested in a numerical variable (the response), and you want to see if there are differences in this variable over several groups. For example, you want to see if gas mileage differs between sedans, minivans, and SUVs.(21 votes)

- For the SSB, I thought it was the mean of each group minus the grand mean, all squared, and so would be (2-4)**2 + (4-4)**2 + (6-4)**2. Why is (2-4) repeated three times?(10 votes)
- One can understand the repetition of the deviation between the group mean and the grand mean by considering ANOVA for groups with different sample sizes. Repeating the deviation once for every data point in the group weights it by that group's sample size. This is important because larger samples give better estimates, with less variance around the true statistic (population mean, proportion, or variance). By taking the number of data points into account, we express how much we are relying on each group's data when estimating the variation in the whole data set. Therefore each group's contribution needs to include its sample size as well.(1 vote)
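
As a rough illustration of the weighting described above, the sketch below computes SSB for the video's nine data points by multiplying each group's squared deviation by its group size. The function name `ss_between` is an assumption made for this example.

```python
def ss_between(groups):
    """Sum of squares between groups, weighting each group by its size."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    total = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        # One squared deviation per data point in the group,
        # which is the same as multiplying by the group size.
        total += len(group) * (group_mean - grand_mean) ** 2
    return total


# The three groups from the video: means 2, 4 and 6, grand mean 4.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]
ssb = ss_between(groups)
print(ssb)
```

Dropping the `len(group)` factor would give 8 instead of 24 here, and the pieces would no longer add up to the total sum of squares of 30.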

- When calculating SSB and SSW do the groups have to have the same degrees of freedom? In your example all the groups have three data points and 2 degrees of freedom, is it ok to have group A with 6 data points group B with 3 and group C with 5?(5 votes)
- Yes, it is possible to have groups with different sample sizes. This would be called an "unbalanced" design (versus a "balanced" design when all groups have the same sample size).

For the unbalanced design, the notation gets a little bit more complicated, but it all still works out. Some things that Sal mentions won't be strictly as they sound. For example, the "mean of means" that he references cannot literally be calculated as the mean of the group means. We would need to calculate a weighted mean of the group means, or directly average all the data points from all the groups.(8 votes)
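
A small sketch of the unbalanced case, using the group sizes from the question (6, 3 and 5). The data values are invented for illustration only; the point is that the plain mean of the group means differs from the grand mean, while the size-weighted mean of the group means matches it.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical unbalanced data: 6, 3 and 5 observations per group.
group_a = [1, 2, 3, 2, 1, 3]   # mean 2
group_b = [5, 3, 4]            # mean 4
group_c = [5, 6, 7, 6, 6]      # mean 6
groups = [group_a, group_b, group_c]

all_values = [x for g in groups for x in g]
grand_mean = mean(all_values)

# Plain "mean of means": treats every group as equally important.
naive_mean_of_means = mean([mean(g) for g in groups])

# Weighted mean of means: each group mean counts once per observation.
weighted_mean_of_means = sum(len(g) * mean(g) for g in groups) / len(all_values)

print(grand_mean, naive_mean_of_means, weighted_mean_of_means)
```

With equal group sizes the three numbers coincide, which is why the video can speak of a "mean of means" without any weighting.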

- Whenever I hear the term "n-1 degrees of freedom" when there are really "n" independent samples, that always sounds strange to me. At 3:40, SK is talking about the "n-1 degrees of freedom" b/c we "know" the mean. But, in calculating the mean itself, "n" independent samples were used. If any of those "n" random variables were different, the mean itself would be different. Changing any one of those "n" variables can change all of the values on SK's blackboard above.**

Now, that being said, when SK talks about "n-1 degrees of freedom", he appears to be talking about the ERROR only FROM the mean. As in, we calculated the variance (a measure of error), and in calculating that "error", we used the mean as an intermediate value. But, GIVEN that value, there are only "n-1" degrees of freedom. However, how can we lose sight of the fact that calculating the mean ITSELF required "n" independent variables?

To summarize, there are really "n" independent variables / degrees of freedom in the WHOLE calculation. Why do we restrict the degrees of freedom in "knowing" the mean, when the mean itself was calculated with a set of values with "n" degrees of freedom - and NOT "n-1" degrees of freedom?

** I challenge anyone to tell me how they can keep all the final values (e.g. SST = 30) by holding only "n-1" of the values the same in the 3x3 matrix without any other constraints (i.e. mean). If I take the last variable (which you didn't hold), and change it, the mean itself and all the other following stats would change.

Thanks.(5 votes)
- I think the (n-1) d.o.f. comes from this:

Hope you can follow my line of thought here:

-If we say the mean HAS to be, e.g., 25

-We have 3 variables (X1, X2 and X3), and we're allowed to change these variables as we wish as long as the mean remains 25

-Then say we change X1 and X2 as we wish. We can move them around freely, but here's the caveat: for the mean to remain at 25 when we change X1 and X2, we have to use X3 to compensate for the movement in X1 and X2. With this in mind, we can't really say that X3 is a free variable, since it's tied to the mean. Thus we get d.o.f. = n-1.

Also note that this bears a strong resemblance to "solving for x"-type equations, and thus it's not only for the mean that this works.

At least that's how I see it. I would love to be corrected if I'm wrong, since then my understanding needs to be updated :)(2 votes)
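
One way to look at the question above: all n data values are indeed free, but a sum of squares is built from the n deviations from the sample mean, and those deviations always add up to zero. The sketch below, using the nine data points from the video, shows that the last deviation can be rebuilt from the other eight. It is an illustration of that constraint, not a full treatment of degrees of freedom.

```python
# The nine data points from the video, flattened into one list.
values = [3, 2, 1, 5, 3, 4, 5, 6, 7]

sample_mean = sum(values) / len(values)
deviations = [x - sample_mean for x in values]

# The deviations from the sample mean always sum to zero ...
deviation_sum = sum(deviations)

# ... so the last deviation is fixed by the other n - 1.
last_from_others = -sum(deviations[:-1])

print(deviation_sum, last_from_others, deviations[-1])
```

The raw data have n freely varying values, but the squared terms that make up SST are functions of deviations that carry only n - 1 independent pieces of information.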

- Is the Sum of Squares Within variables also known as the Sum of Squares Estimated?

Is the Sum of Squares Between variables also known as the Sum of Squares for Residuals?(3 votes)
- Sum of Squares Within = Sum of Squared Residuals(3 votes)

- We are calculating SSB as defined in the video, but it is not intuitive to take 3*(2-4)^2 + 3*(4-4)^2 + 3*(6-4)^2 (according to the example given in the video) to measure the sum of squares between groups.

(2-4)^2 + (4-4)^2 + (6-4)^2 is more intuitive but incorrect. Why?

In measuring variance between groups, why are we going down to the element level with 3*(2-4)^2 and not just computing (2-4)^2 once per group?

The group mean represents a group. So if we want the variation among groups, we could just take the sum of squares of (group_mean - mean_of_group_means).

Can someone please explain?

Instead, if we calculate (2-4)^2+(4-6)^2+(2-6)^2, or (x1bar-x2bar)^2+(x2bar-x3bar)^2+(x3bar-x1bar)^2, then it also measures variation between the groups, and it is more intuitive. I am not sure about this method, but in this case it worked: it gives 24, the same as calculated in the video.

I know that this method is not calculating variation from the centre of the means, and is just calculating the distance between the means of different groups.

Please clarify?(1 vote)
- What if one group had a mean that was a long way away from the grand mean, but it only had a few observations, while another group mean was very close to the grand mean, and had many observations? Groups with more observations *should* have higher weight. That is one intuitive reason why the "point wise" calculations are the way they are.

There are a number of ways to understand this through the mathematics, but that will get into some more complicated formulas.(4 votes)
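
The pairwise calculation proposed in the question can be checked numerically. For equal-sized groups, the sum of squared pairwise differences between the m group means equals m times the sum of squared deviations of the means from their average, while SSB is n times that same sum (n being the group size). The two agree in the video only because m = n = 3. The sketch below assumes made-up data for the second case (same group means, four points per group); the function names are illustrative.

```python
from itertools import combinations


def group_means(groups):
    return [sum(g) / len(g) for g in groups]


def ss_between(groups):
    """SSB: squared deviation of each group mean from the grand mean, per data point."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    return sum(len(g) * (m - grand_mean) ** 2
               for g, m in zip(groups, group_means(groups)))


def pairwise_ss(groups):
    """Sum of squared differences between every pair of group means."""
    return sum((a - b) ** 2 for a, b in combinations(group_means(groups), 2))


video_groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]            # 3 groups of 3
bigger_groups = [[3, 2, 1, 2], [5, 3, 4, 4], [5, 6, 7, 6]]  # hypothetical: 3 groups of 4

print(pairwise_ss(video_groups), ss_between(video_groups))
print(pairwise_ss(bigger_groups), ss_between(bigger_groups))
```

With four points per group the pairwise figure stays at 24 while SSB grows to 32, so the pairwise version does not track how much data stands behind each mean.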

- When do we use the chi-square statistical analysis?(2 votes)
- How would we calculate SSwithin if only variances are given for each group?(1 vote)
- SSwithin can be calculated as: (n-1)*[ s1^2 + s2^2 + ... ]

To get this, look at what we do for SSwithin. For every point, we subtract the *group mean* from each value, square it, and add them all up. For notation, I'm going to use Mi as the group means.

SUM( SUM( (Xij - Mi)^2 ) )

The outside sum is just going across the groups, so let's look at each group separately: SUM( (Xij - Mi)^2 )

This looks very close to being a variance, doesn't it? In fact, all we need to do is divide by (n-1) and we'd have the variance for that group! But we already *have* the variance, and we want the *sum*. So we can just "undo" the division by (n-1) that the variances have. For the ith group, we'd take (n-1)*Si^2 to get this sum. From there, we just have to do the same thing to every group, and add up the results.

If we have equal sample sizes for each group, they are all "n", so the denominator in the variance is (n-1) for each, meaning we can factor that out and just multiply (n-1) by the sum of all the variances. Hence, the formula at the top.(3 votes)
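
The recipe above can be sketched as follows. The helper name `ss_within_from_variances` is an assumption for this example; it multiplies each group's sample variance by its own (n - 1), so it also covers groups of different sizes. The variances are computed here from the video's data with the standard library's `statistics.variance` (the sample variance, dividing by n - 1) purely to have numbers to check against.

```python
from statistics import variance


def ss_within_from_variances(variances, sizes):
    """Rebuild SSW from per-group sample variances and group sizes.

    Each sample variance is (sum of squared deviations) / (n - 1),
    so multiplying by (n - 1) recovers that group's sum of squares.
    """
    return sum((n - 1) * s2 for s2, n in zip(variances, sizes))


groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]
variances = [variance(g) for g in groups]   # sample variances
sizes = [len(g) for g in groups]

ssw = ss_within_from_variances(variances, sizes)

# Direct calculation from the raw data, for comparison.
ssw_direct = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

print(ssw, ssw_direct)
```

Both routes give the same value as the video's sum of squares within.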

- Hi. Could this be expanded into three dimensions, i.e. done with a data cube?

When I watched the video I thought of some hourly data I've worked on before, which has a variation within the week, between different weeks and between different years. It would be interesting to have a good measure of how much of the total variation each part contributes.

I did a quick test in Excel to see if I could figure it out, but I couldn't quite get the sums to work out so that the sum of the parts was equal to the total. Does anyone know if it's possible to do in a similar manner to what the video describes?(1 vote)
- This video is about Analysis of Variance (ANOVA), or more specifically, One-Factor ANOVA. In this setting, we are thinking about two variables:

1. Something that we are measuring, like height, weight, MPG of an automobile, etc. These result in the 9 numbers that Sal was working with in this video.

2. A "factor", which is a variable comprising two or more groups. For example, say we want to compare the average MPG of an automobile, and we are looking at several groups: Sedans, Minivans, and SUVs/Trucks. This would be the three groups / columns that Sal had, group 1, 2, and 3 (or Green, Purple, Pink, if you prefer).

It is absolutely possible to add a second factor (i.e. another set of groups) into the analysis. For instance, say we think that in addition to differences in MPG between car types, we think there may also be differences between Asian, American, and European cars in terms of MPG. We could build that into the model as well, and it would be called "Two-factor ANOVA". We can extend this as much as we needed, though the more factors that we add, the more complicated it will be to actually understand the results.

That being said, the situation in your brief description could not be addressed just by having several factors. Since your factors are overlapping time periods (within week, between weeks, between years), the groups are not what we call independent. One problem is this: A collection of weeks belongs to a certain year. Say year A was during a recession, and year B was not. Then year B will look better, but *so will all of the weeks associated with year B*. The weeks are what we call "nested" within the years. Different weeks "belong" to specific years.(3 votes)

## Video transcript

In the last video, we
were able to calculate the total sum of squares
for these nine data points right here. And these nine data
points are grouped into three different groups, or
if we want to speak generally, into m different groups. What I want to do
in this video is to figure out how much of
this total sum of squares is due to variation within
each group versus variation between the actual groups. So first, let's figure
out the total variation within the group. So let's call that the
sum of squares within. So let's calculate the
sum of squares within. I'll do that in yellow. Actually, I already used
yellow, so let me do blue. So the sum of squares within. Let me make it clear. That stands for within. So we want to see how
much of the variation is due to how far each
of these data points are from their central tendency,
from their respective mean. So this is going to
be equal to-- let's start with these guys. So instead of taking the
distance between each data point and the mean
of means, I'm going to find the distance
between each data point and that group's
mean, because we want the total sum
of squares between each data point and its respective mean. So let's do that. So it's 3 minus-- the mean here
is 2-- squared, plus 2 minus 2 squared, plus 2 minus 2
squared, plus 1 minus 2 squared. 1 minus 2 squared
plus-- I'm going to do this for
all of the groups, but for each group, the
distance between each data point and its mean. So plus 5 minus 4,
plus 5 minus 4 squared, plus 4 minus 4 squared--
sorry, the next point was 3-- plus 3 minus 4 squared,
plus 4 minus 4 squared. And then finally, we
have the third group. But we're finding that
all of the sum of squares from each point to its
central tendency within that, but we're going to
add them all up. And then we find
the third group. So we have 5 minus-- oh, its
mean is 6-- 5 minus 6 squared, plus 6 minus 6 squared,
plus 7 minus 6 squared. And what is this going to equal? So this is going to
be equal to-- up here, it's going to be
1 plus 0 plus 1. So that's going to
be equal to 2 plus. And then this is going to be
equal to 1, 1 plus 1 plus 0-- so another 2--
plus this is going to be equal to 1 plus 0 plus 1. 7 minus 6 is 1 squared is 1. So plus. So that's 2 over here. So this is going to be
equal to our sum of squares within, I should say, is 6. So one way to think about it--
our total variation was 30. And based on this
calculation, 6 of that 30 comes from a variation
within these samples. Now, the next thing
I want to think about is how many degrees of freedom
do we have in this calculation? How many independent data
points do we actually have? Well, for each of
these-- so over here, we have n data points in one. In particular, n is 3 here. But if you know n
minus 1 of them, you can always figure
out the nth one if you know the
actual sample mean. So in this case, for
any of these groups, if you know two of
these data points, you can always
figure out the third. If you know these two, you can
always figure out the third if you know the sample mean. So in general, let's figure out
the degrees of freedom here. For each group,
when you did this, you had n minus 1
degrees of freedom. Remember, n is the
number of data points you had in each group. So you have n minus
1 degrees of freedom for each of these groups. So it's n minus 1, n
minus 1, n minus 1. Or let me put it this
way-- you have n minus 1 for each of these groups,
and there are m groups. So there's m times (n minus
1) degrees of freedom. And in this case in particular,
each group-- n minus 1 is 2. Or in each case, you had
2 degrees of freedom, and there's three
groups of that. So there are 6
degrees of freedom. And in the future, we might
do a more detailed discussion of what degrees of
freedom mean, and how to mathematically
think about it. But the best-- the simplest
way to think about it is really, truly
independent data points, assuming you
knew, in this case, the central statistic
that we used to calculate the squared
distance in each of them. If you know them already,
the third data point could actually be calculated
from the other two. So we have 6 degrees
of freedom over here. Now, that was how much
of the total variation is due to variation
within each sample. Now let's think about
how much of the variation is due to variation
between the samples. And to do that, we're
going to calculate. Let me get a nice color here. I think I've run
out all the colors. We'll call this sum
of squares between. The B stands for between. So another way to
think about it-- how much of this
total variation is due to the variation
between the means, between the central
tendency-- that's what we're going to
calculate right now-- and how much is due to
variation from each data point to its mean? So let's figure out how
much is due to variation between these guys over here. Actually, let's think about
just this first group. For this first group,
how much variation for each of these guys is due to
the variation between this mean and the mean of means? Well, so for this
first guy up here-- I'll just write it
all out explicitly-- the variation is going
to be its sample mean. So it's going to be 2 minus
the mean of means squared. And then for this
guy, it's going to be the same thing--
his sample mean, 2 minus the mean
of mean squared. Plus same thing for this guy, 2
minus the mean of mean squared. Or another way to
think about it-- this is equal to-- I'll
write it over here-- this is equal to 3
times 2 minus 4 squared, which is the same thing as 3. This is equal to 3 times 4. Three times 4 is equal to 12. And then we could do
it for each of them. And actually, I want
to find the total sum. So let me just write
it all out, actually. I think that might
be an easier thing to do, because I want to find,
for all of these guys combined, the sum of squares
due to the differences between the samples. So that's from the contribution
from the first sample. And then from the second sample,
you have this guy over here. Oh, sorry. You don't want to calculate him. For this data point,
the amount of variation due to the difference
between the means is going to be 4
minus 4 squared. Same thing for this guy. It's going to be
4 minus 4 squared. And we're not taking
it into consideration. We're only taking its sample
mean into consideration. And then finally, plus
4 minus 4 squared. We're taking this
minus this squared for each of these data points. And then finally, we'll do
that with the last group. With the last group,
sample mean is 6. So it's going to be
6 minus 4 squared, plus 6 minus 4 squared, plus
6 minus 4, plus 6 minus 4 squared. Now, let's think about how
many degrees of freedom we had in this calculation
right over here. Well, in general, I
guess the easiest way to think about it is, how
much information did we have, assuming that we
knew the mean of means? If we know the mean
of means, how much here is new information? Well, if you know
the mean of the means, and you know two of
these sample means, you can always
figure out the third. If you know this
one and this one, you can figure out that one. And if you know that
one and that one, you can figure out that one. And that's because this is the
mean of these means over here. So in general, if you have m
groups, or if you have m means, there are m minus 1
degrees of freedom here. Let me write that. But with that said, well,
and in this case, m is 3. So we could say there's
two degrees of freedom for this exact example. Let's actually, let's calculate
the sum of squares between. So what is this going to be? I'll just scroll down. Running out of space. This is going to be equal to--
this right here is 2 minus 4 is negative 2 squared is 4. And then we have
three 4's over here. So it's 3 times 4, plus
3 times-- what is this? 3 times 0 plus-- what is this? The difference between
each of these-- 6 minus 4 is 2 squared is 4-- so
that means we have 3 times 4, plus 3 times 4. And we get 3 times 4 is 12,
plus 0, plus 12 is equal to 24. So the sum of
squares, or we could say, the variation due
to what's the difference between the groups,
between the means, is 24. Now, let's put it all together. We said that the
total variation, that if you looked at
all 9 data points, is 30. Let me write that over here. So the total sum of
squares is equal to 30. We figured out
the sum of squares between each data point
and its central tendency, its sample mean--
we figured out, and when you totaled
it all up, we got 6. So the sum of squares
within was equal to 6. And in this case, it was
6 degrees of freedom. Or if we wanted to
write it generally, there were m times (n minus
1) degrees of freedom. And actually, for the
total, we figured out we have (m times n) minus
1 degrees of freedom. Actually, let me just
write degrees of freedom in this column right over here. In this case, the number
turned out to be 8. And then just now we
calculated the sum of squares between the samples. The sum of squares between
the samples is equal to 24. And we figured out that it had
m minus 1 degrees of freedom, which ended up being 2. Now, the interesting
thing here-- and this is why this analysis of variance
all fits nicely together, and in future videos we'll
think about how we can actually test hypotheses using
some of the tools that we're thinking
about right now-- is that the sum
of squares within, plus the sum of
squares between, is equal to the total
sum of squares. So a way to think about it is
that the total variation in this data right
here can be described as the sum of the
variation within each of these groups, when
you take that total, plus the sum of the
variation between the groups. And even the degrees
of freedom work out. The sum of squares between
had 2 degrees of freedom. The sum of squares
within each of the groups had 6 degrees of freedom. 2 plus 6 is 8. That's the total
degrees of freedom we had for all of
the data combined. It even works if you
look at the more general. So our sum of
squares between had m minus 1 degrees of freedom. Our sum of squares
within had m times (n minus 1) degrees of freedom. So this is equal to m
minus 1, plus mn minus m. These guys cancel out. This is equal to mn minus
1 degrees of freedom, which is exactly the total
degrees of freedom we had for the total
sum of squares. So the whole point
of the calculations that we did in the last
video and in this video is just to appreciate that
this total variation over here, this total variation
that we first calculated, can be viewed as the sum
of these two component variations-- how much
variation is there within each of the samples plus
how much variation is there between the means
of the samples? Hopefully that's
not too confusing.
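
To tie the transcript's numbers together, here is a minimal Python sketch that recomputes the three sums of squares and their degrees of freedom for the nine data points. The function name `anova_sums_of_squares` and the dictionary keys are choices made for this example, not notation from the video.

```python
def mean(values):
    return sum(values) / len(values)


def anova_sums_of_squares(groups):
    """Return SST, SSW, SSB and their degrees of freedom for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)

    # Total: every point against the grand mean ("mean of means").
    sst = sum((x - grand_mean) ** 2 for x in all_values)
    # Within: every point against its own group's mean.
    ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)
    # Between: each group mean against the grand mean, once per point.
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

    m = len(groups)
    total_points = len(all_values)
    return {
        "SST": sst, "df_total": total_points - 1,     # mn - 1
        "SSW": ssw, "df_within": total_points - m,    # m(n - 1)
        "SSB": ssb, "df_between": m - 1,              # m - 1
    }


result = anova_sums_of_squares([[3, 2, 1], [5, 3, 4], [5, 6, 7]])
print(result)
```

For the video's data the within and between parts add up to the total, both for the sums of squares and for the degrees of freedom, which is the decomposition the transcript describes.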