ANOVA 2: Calculating SSW and SSB (total sum of squares within and between)

Video transcript
In the last video, we calculated the total sum of squares for these nine data points right here. These nine data points are grouped into three different groups, or if we want to speak generally, into m different groups. What I want to do in this video is figure out how much of this total sum of squares is due to variation within each group versus variation between the groups. So first, let's figure out the total variation within the groups. Let's call that the sum of squares within, and let's calculate it. I'll do that in yellow. Actually, I already used yellow, so let me do blue. So the sum of squares within; the W stands for within. We want to see how much of the variation is due to how far each of these data points is from its central tendency, from its respective group mean. So let's start with the first group. Instead of taking the distance between each data point and the mean of means, I'm going to find the distance between each data point and that group's mean, because we want the sum of the squared distances between each data point and its respective mean. So let's do that. The mean here is 2, so it's (3 minus 2) squared, plus (2 minus 2) squared, plus (1 minus 2) squared. I'm going to do this for all of the groups, but for each group, it's the distance between each data point and that group's mean. So for the second group, where the mean is 4, it's plus (5 minus 4) squared, plus (3 minus 4) squared, plus (4 minus 4) squared. We're finding the sum of squares from each point to its central tendency within its own group, and then we're going to add them all up. And then finally, we have the third group. Its mean is 6, so we have (5 minus 6) squared, plus (6 minus 6) squared, plus (7 minus 6) squared. And what is this going to equal?
So this is going to be equal to, up here, 1 plus 0 plus 1, so that's 2. Plus, for the second group, 1 plus 1 plus 0, so another 2. Plus, for the third group, 1 plus 0 plus 1; 7 minus 6 is 1, squared is 1. So that's 2 over here as well. So our sum of squares within is 2 plus 2 plus 2, which is 6. So one way to think about it: our total variation was 30, and based on this calculation, 6 of that 30 comes from variation within these samples. Now, the next thing I want to think about is how many degrees of freedom we have in this calculation. How many independent data points do we actually have? Well, over here, we have n data points in each group. In particular, n is 3 here. But if you know n minus 1 of them, you can always figure out the nth one if you know the sample mean. So in this case, for any of these groups, if you know two of the data points and the sample mean, you can always figure out the third. So in general, let's figure out the degrees of freedom here. For each group, when you did this, you had n minus 1 degrees of freedom. Remember, n is the number of data points in each group. So it's n minus 1, n minus 1, n minus 1. Or let me put it this way: you have n minus 1 for each of these groups, and there are m groups, so there are m times (n minus 1) degrees of freedom. In this case in particular, n minus 1 is 2, so each group has 2 degrees of freedom, and there are three groups. So there are 6 degrees of freedom. And in the future, we might do a more detailed discussion of what degrees of freedom mean, and how to think about them mathematically.
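Written out symbolically, the within-group calculation might look something like the following. The notation is an assumption added for illustration, since the video only works with the numbers: x_{ij} stands for the j-th data point in group i, and \bar{x}_i for that group's mean.

```latex
\begin{aligned}
SSW &= \sum_{i=1}^{m} \sum_{j=1}^{n} \left( x_{ij} - \bar{x}_i \right)^2 \\
    &= \left[ (3-2)^2 + (2-2)^2 + (1-2)^2 \right]
     + \left[ (5-4)^2 + (3-4)^2 + (4-4)^2 \right]
     + \left[ (5-6)^2 + (6-6)^2 + (7-6)^2 \right] \\
    &= 2 + 2 + 2 = 6,
\qquad
\text{degrees of freedom} = m(n-1) = 3 \cdot 2 = 6 .
\end{aligned}
```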
But the simplest way to think about degrees of freedom is as the number of truly independent data points, assuming you knew, in this case, the central statistic that we used to calculate the squared distances. If you know the group mean already, the third data point in a group can be calculated from the other two. So we have 6 degrees of freedom over here. Now, that was how much of the total variation is due to variation within each sample. Now let's think about how much of the variation is due to variation between the samples. To do that, we're going to calculate, let me get a nice color here, I think I've run out of colors, what we'll call the sum of squares between. The B stands for between. So another way to think about it: how much of this total variation is due to the variation between the means, between the central tendencies, which is what we're going to calculate right now, and how much is due to variation from each data point to its own mean? So let's figure out how much is due to variation between the groups. Actually, let's think about just this first group. For this first group, how much of the variation for each of these data points is due to the difference between this group's mean and the mean of means? Well, for this first data point up here, I'll just write it all out explicitly, the contribution is its sample mean minus the mean of means, squared. So it's going to be (2 minus 4) squared, since the mean of means is 4. And then for the second data point, it's going to be the same thing: its sample mean, 2, minus the mean of means, squared. Plus the same thing for the third data point: (2 minus 4) squared. Or another way to think about it, I'll write it over here: this is equal to 3 times (2 minus 4) squared, which is the same thing as 3 times (negative 2) squared, and that is 3 times 4, which is equal to 12. And then we could do it for each of the groups. And actually, I want to find the total sum. So let me just write it all out, actually.
I think that might be an easier thing to do, because I want to find, for all of these data points combined, the sum of squares due to the differences between the samples. So that's the contribution from the first sample. And then from the second sample, for this data point, the amount of variation due to the difference between the means is going to be (4 minus 4) squared. We're not using the data point's own value here. We're only taking its sample mean into consideration. Same thing for the next data point: it's going to be (4 minus 4) squared. And then finally, plus (4 minus 4) squared. We're taking the sample mean minus the mean of means, squared, for each of these data points. And then finally, we'll do that with the last group. With the last group, the sample mean is 6. So it's going to be (6 minus 4) squared, plus (6 minus 4) squared, plus (6 minus 4) squared. Now, let's think about how many degrees of freedom we had in this calculation right over here. Well, in general, I guess the easiest way to think about it is, how much information did we have, assuming that we knew the mean of means? If we know the mean of means, how much here is new information? Well, if you know the mean of means, and you know two of these sample means, you can always figure out the third. It doesn't matter which two you know. And that's because this is the mean of these means over here. So in general, if you have m groups, or if you have m means, there are m minus 1 degrees of freedom here. Let me write that. And in this case, m is 3, so we could say there are 2 degrees of freedom for this exact example. Now let's actually calculate the sum of squares between. So what is this going to be? I'll just scroll down. Running out of space. This is going to be equal to: this right here, 2 minus 4, is negative 2, squared is 4.
And then we have three 4's over here. So it's 3 times 4. Plus, for the second group, 3 times 0. Plus, for the third group, 6 minus 4 is 2, squared is 4, so that's another 3 times 4. So we get 3 times 4 is 12, plus 0, plus 12, which is equal to 24. So the sum of squares between, or we could say, the variation due to the differences between the groups, between the means, is 24. Now, let's put it all together. We said that the total variation, if you look at all 9 data points, is 30. Let me write that over here. So the total sum of squares is equal to 30. We figured out the sum of squares between each data point and its central tendency, its sample mean, and when we totaled it all up, we got 6. So the sum of squares within was equal to 6. And in this case, it had 6 degrees of freedom. Or if we wanted to write it generally, there were m times (n minus 1) degrees of freedom. And actually, for the total, we figured out we have mn minus 1 degrees of freedom, that is, m times n, minus 1. Let me just write degrees of freedom in this column right over here. In this case, that number turned out to be 8. And then just now we calculated the sum of squares between the samples. The sum of squares between the samples is equal to 24. And we figured out that it had m minus 1 degrees of freedom, which ended up being 2. Now, the interesting thing here, and this is why this analysis of variance all fits nicely together, and in future videos we'll think about how we can actually test hypotheses using some of the tools that we're thinking about right now, is that the sum of squares within, plus the sum of squares between, is equal to the total sum of squares: 6 plus 24 is 30. So a way to think about it is that the total variation in this data right here can be described as the sum of the variation within each of these groups, when you take that total, plus the variation between the groups. And even the degrees of freedom work out.
The sum of squares between had 2 degrees of freedom. The sum of squares within each of the groups had 6 degrees of freedom. 2 plus 6 is 8. That's the total degrees of freedom we had for all of the data combined. It even works if you look at the more general case. Our sum of squares between had m minus 1 degrees of freedom. Our sum of squares within had m times (n minus 1) degrees of freedom. So the sum is equal to m minus 1, plus mn minus m. The m and the minus m cancel out. So this is equal to mn minus 1 degrees of freedom, which is exactly the total degrees of freedom we had for the total sum of squares. So the whole point of the calculations that we did in the last video and in this video is just to appreciate that this total variation over here, this total variation that we first calculated, can be viewed as the sum of these two component variations: how much variation is there within each of the samples, plus how much variation is there between the means of the samples? Hopefully that's not too confusing.
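If you want to check these numbers yourself, here is a minimal Python sketch of the same calculation. It is not from the video. The variable names (`groups`, `ssw`, `ssb`, `sst`, and so on) are just illustrative choices, and it assumes every group has the same number of data points, as in this example, so that the mean of all the data equals the mean of means.

```python
# Data from the video: m = 3 groups with n = 3 data points each.
groups = [
    [3, 2, 1],  # group mean 2
    [5, 3, 4],  # group mean 4
    [5, 6, 7],  # group mean 6
]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


m = len(groups)     # number of groups
n = len(groups[0])  # data points per group (assumed equal across groups)

group_means = [mean(g) for g in groups]
# With equal group sizes, the mean of the group means is also the
# mean of all nine data points.
mean_of_means = mean(group_means)

# Total: every data point against the mean of means.
sst = sum((x - mean_of_means) ** 2 for g in groups for x in g)

# Within: every data point against its own group's mean.
ssw = sum((x - gm) ** 2 for g, gm in zip(groups, group_means) for x in g)

# Between: each group's mean against the mean of means,
# counted once per data point in that group.
ssb = sum(len(g) * (gm - mean_of_means) ** 2
          for g, gm in zip(groups, group_means))

df_total = m * n - 1      # 8
df_within = m * (n - 1)   # 6
df_between = m - 1        # 2

print(f"SST={sst}, SSW={ssw}, SSB={ssb}")
# prints: SST=30.0, SSW=6.0, SSB=24.0
print(f"df: total={df_total}, within={df_within}, between={df_between}")
# prints: df: total=8, within=6, between=2
```

The two pieces add back up to the total, 6 plus 24 is 30, and the degrees of freedom do the same, 6 plus 2 is 8.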