## Analysis of variance (ANOVA)

# ANOVA 1: Calculating SST (total sum of squares)

## Video transcript

In this video and
the next few videos, we're just really going to be
doing a bunch of calculations about this data set
right over here. And hopefully, just going
through those calculations will give you an
intuitive sense of what the analysis of
variance is all about. Now, the first thing I
want to do in this video is calculate the
total sum of squares. So I'll call that SST. SS-- sum of squares total. And you could view it
as really the numerator when you calculate variance. So you're just going to take
the distance between each of these data points and the
mean of all of these data points, square each of those distances,
and just take that sum. We're not going to divide by
the degrees of freedom, which you would normally do
if you were calculating sample variance. Now, what is this going to be? Well, the first
thing we need to do, we have to figure out the mean
of all of this stuff over here. And I'm actually going to
call that the grand mean. And I'm going to
show you in a second that, because each of these groups has the same number of data points, it's the same thing as
the mean of the means of each of these groups. So let's calculate
the grand mean. So it's going to be 3 plus 2
plus 1 plus 5 plus 3 plus 4 plus 5 plus 6 plus 7. And then we have
nine data points here so we'll divide by 9. And what is this
going to be equal to? 3 plus 2 plus 1 is 6. 5 plus 3 plus 4 is 12. And then 5 plus 6 plus 7 is 18. And then 6 plus 12 is 18, plus
another 18 is 36, and 36 divided by 9 is equal to 4. And let me show you that
that's the exact same thing as the mean of the means. So the mean of this
group 1 over here-- let me do it in
that same green-- the mean of group 1 over
here is 3 plus 2 plus 1. That's that 6 right over
here, divided by 3 data points so that
will be equal to 2. The mean of group 2,
the sum here is 12. We saw that right over here. 5 plus 3 plus 4 is
12, divided by 3 is 4 because we have
three data points. And then the mean
of group 3, 5 plus 6 plus 7 is 18 divided by 3 is 6. So if you were to take the
mean of the means, which is another way of viewing this
grand mean, you have 2 plus 4 plus 6, which is 12,
divided by 3, because we have three means here. And once again, you would get 4. So you could view
this as the mean of all of the data
in all of the groups or the mean of the means
of each of these groups. But either way, now that
we've calculated it, we can actually figure out
the total sum of squares. So let's do that. So it's going to be
equal to 3 minus 4-- the 4 is this 4 right over
here-- squared plus 2 minus 4 squared plus 1 minus 4 squared. Now, I'll do these guys
over here in purple. Plus 5 minus 4 squared plus 3
minus 4 squared plus 4 minus 4 squared. Let me scroll over a little bit. Now, we only have three
left, plus 5 minus 4 squared plus 6 minus 4 squared
plus 7 minus 4 squared. And what does this give us? So up here, this is going
to be equal to 3 minus 4, which is negative 1,
and when you square it, you get 1. Plus 2 minus 4 is negative 2, and negative 2 squared
is 4. Plus 1 minus 4 is negative 3, and negative 3 squared is 9. And then we have here
in the magenta 5 minus 4 is 1, and 1 squared is still 1. 3 minus 4 is negative 1. You square it,
and you still get 1. And then 4 minus 4 is just 0. So we could-- well, I'll
just write the 0 there just to show you that we
actually calculated that. And then we have these
last three data points. 5 minus 4 squared. That's 1. 6 minus 4 squared. That is 4, right? That's 2 squared. And then plus 7 minus
4 is 3 squared is 9. So what's this going
to be equal to? So I have 1 plus 4
plus 9 right over here. That's 5 plus 9. This right over
here is 14, right? 5 plus-- yup, 14. And then we also have
another 14 right over here because we have a
1 plus 4 plus 9. So that right over
there is also 14. And then the middle group, 1 plus 1 plus 0, gives us 2 over here. So it's going to be
28-- 14 times 2, or 14 plus 14, is 28-- plus 2, which is equal to 30. So our total sum of
squares is 30. And actually, if we wanted the
variance here, we would divide this by
the degrees of freedom. And we've seen degrees of freedom multiple
times before. So let's say
that we have
m groups over here. So let me just write
it as m and I'm not going to prove things
rigorously here, but I want to show
you where some of these strange formulas that
show up in statistics books actually come from without
proving it rigorously. More to give you the intuition. So we have m groups here. And each group
here has n members. So how many total
members do we have here? Well, we have m
times n, or 3 times 3, which is 9 total members. So our degrees of
freedom-- and remember, you have however
many data points you had minus 1
degrees of freedom because if you know
the mean of means, if you assume you knew
that, then only 9 minus 1, only eight of these are going
to give you new information because if you know that, you
could calculate the last one. Or it really doesn't
have to be the last one. If you have the other eight,
you could calculate this one. If you have eight of
them, you could always calculate the ninth one
using the mean of means. So one way to think
about it is that there's only eight independent
measurements here. Or if we want to
talk generally, there are m times n-- so that tells
us the total number of data points-- minus 1 degrees of freedom. And if we were actually
calculating the variance here, we would just divide
30 by the quantity m times n minus 1, which is another way of
saying eight degrees of freedom for this exact example. We would take 30 divided
by 8, which is 3.75, and we would actually have the sample variance for
this entire group, for the group of nine
when you combine them. I'll leave you
here in this video. In the next video, we're
going to try to figure out how much of this total
variation, this total sum of squares,
comes from the variation within
each of these groups versus the variation
between the groups. And I think you get
a sense of where this whole analysis of
variance is coming from. It's the sense
that, look, there's a variance of this
entire sample of nine, but some of that variance--
if these groups are different in some way--
might come from the variation from being in different groups
versus the variation from being within a group. And we're going to
calculate those two things and we're going to
see that they're going to add up to the
total sum of squares.
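
The arithmetic in this video can be checked with a short Python sketch. It is not part of the video: the names `groups`, `grand_mean`, and `total_sum_of_squares` are made up for illustration, and the code assumes, as in this example, that every group has the same number of data points, so that the mean of the group means matches the grand mean.

```python
# Data from the video: m = 3 groups with n = 3 data points each.
groups = [
    [3, 2, 1],  # group 1
    [5, 3, 4],  # group 2
    [5, 6, 7],  # group 3
]


def grand_mean(groups):
    """Mean of every data point across all groups (hypothetical helper)."""
    values = [x for group in groups for x in group]
    return sum(values) / len(values)


def total_sum_of_squares(groups):
    """SST: sum of squared distances from each data point to the grand mean."""
    mu = grand_mean(groups)
    return sum((x - mu) ** 2 for group in groups for x in group)


m = len(groups)      # number of groups
n = len(groups[0])   # members per group (assumed equal across groups)

# Mean of the group means; equals the grand mean only because the
# groups are the same size.
group_means = [sum(group) / len(group) for group in groups]
mean_of_means = sum(group_means) / m

sst = total_sum_of_squares(groups)
degrees_of_freedom = m * n - 1
sample_variance = sst / degrees_of_freedom

print("grand mean:", grand_mean(groups))
print("mean of means:", mean_of_means)
print("SST:", sst)
print("degrees of freedom:", degrees_of_freedom)
print("sample variance:", sample_variance)
```

Running this should reproduce the numbers worked out by hand: a grand mean of 4, an SST of 30, eight degrees of freedom, and 30 divided by 8 for the variance of all nine data points.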