ANOVA 2: Calculating SSW and SSB (total sum of squares within and between)

Analysis of Variance 2 - Calculating SSW and SSB (Total Sum of Squares Within and Between). Created by Sal Khan.
Video transcript
In the last video we calculated the total sum of squares for these 9 data points. These 9 data points are grouped into three different groups, or, if you want to speak generally, into m different groups. What I want to do in this video is figure out how much of this total sum of squares is due to variation within each group versus variation between the groups. So first let's figure out the total variation within the groups. Let's call that the sum of squares within. I'll do that in yellow. Actually, I've already used yellow, so I'm going to do blue. So the sum of squares within. Let me make that clear, that stands for within. We want to see how much of the variation is due to how far each of these data points is from its central tendency, from its respective mean.
So this is going to be equal to, let's start with these guys. Instead of taking the distance between each data point and the mean of means, I'm going to find the distance between each data point and that group's mean, because we want the sum of the squared distances between each data point and its respective mean. So 3 minus the mean here, which is 2, squared, + 2 minus 2 squared, + 1 minus 2 squared. I'm going to do this for all of the groups, but for each group it's the distance between each data point and that group's mean. So + 5 minus 4 squared, + 3 minus 4 squared, + 4 minus 4 squared. And finally we have the third group. We're finding the sum of squares from each point to its central tendency within that group, and we're going to add them all up. So for the third group we have 5 minus 6 squared, + 6 minus 6 squared, + 7 minus 6 squared. And what is this going to equal? Up here it is going to be 1 + 0 + 1, which is equal to 2. Plus this is going to be equal to 1 + 1 + 0, so another 2. Plus this is going to be equal to 1 + 0 + 1, so that's 2 over here.
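The within-group arithmetic above can be sketched in a few lines of Python. This is only an illustration: the data values are read off from the arithmetic in the transcript (the second group's 5 is inferred from the 1 + 1 + 0 result), and the function name `sum_of_squares_within` is a made-up helper, not something from the video.

```python
# Data as read off from the transcript's arithmetic (assumed): three groups of three.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]

def sum_of_squares_within(groups):
    """Add up squared distances from each data point to its own group's mean."""
    total = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        total += sum((x - group_mean) ** 2 for x in group)
    return total

ssw = sum_of_squares_within(groups)
print(ssw)  # 6.0
```

Each group contributes 2, which is why the total comes out to 6.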
Our total sum of squares within is 6. So one way to think about it: our total variation was 30, and based on that calculation, 6 of that 30 comes from variation within these samples. Now the next thing I want to think about is how many degrees of freedom we have in this calculation. How many, kind of, independent data points do we actually have? Well, for each of these groups we have n data points. In particular, n is 3 here. But if you know n minus 1 of them, you can always find the nth one, if you know the actual sample mean. So in this case, for any of these groups, if you know 2 of the data points, you can always figure out the third, as long as you know the sample mean. So in general let's figure out the degrees of freedom here. For each group, when you did this, you had n minus 1 degrees of freedom. Remember, n is the number of data points you had in each group. So you have n-1 for each of these groups, and there are m groups. So there are m times (n-1) degrees of freedom. In this particular case, n-1 is 2, so each group has 2 degrees of freedom, and there are three groups, so there are 6 degrees of freedom. In the future we may do a more detailed discussion of what degrees of freedom mean and how to think about them mathematically. But the simplest way to think about them is as truly independent data points, assuming you already knew the central statistic that we used to calculate the squared distances, in this case the sample mean. If you know that already, the third data point can be calculated from the other 2. So we have 6 degrees of freedom over here. Now, that was how much of the total variation is due to variation within each sample.
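To make the "if you know 2 of them and the mean, you can find the third" idea concrete, here is a minimal sketch. The helper name `recover_last_point` is hypothetical, and the numbers are the first group's values as they appear in the transcript's arithmetic.

```python
def recover_last_point(known_points, group_mean, n):
    """Return the one missing value in a group of n points with the given mean."""
    # The n points must sum to n * group_mean, so the missing one is the remainder.
    return n * group_mean - sum(known_points)

# First group: two known points (3 and 2), sample mean 2, n = 3 points in total.
third_point = recover_last_point([3, 2], group_mean=2, n=3)
print(third_point)  # 1

# Degrees of freedom within: (n - 1) per group, times m groups.
m, n = 3, 3
df_within = m * (n - 1)
print(df_within)  # 6
```

Because the last point in every group is pinned down this way, only n-1 points per group are free to vary.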
Now let's think about how much of the variation is due to variation between the samples. And to do that, we're going to calculate (let me get a nice color here, I think I've run out of all the colors) what we'll call the sum of squares between. The B stands for between. So another way to think about it: how much of this total variation is due to the variation between the means, between the central tendencies? That's what we're going to calculate right now. And how much is due to variation from each data point to its mean? Let's figure out how much is due to variation between these guys over here. One way to think about it, for each of these data points, let's just think about this first group. For this first group, how much of the variation for each of these guys is due to the variation between this mean and the mean of means? For the first guy up here, and I'll just write it all out explicitly, the variation is going to be its sample mean, 2, minus the mean of means, squared. And then for this guy, it's going to be the same thing: his sample mean, 2, minus the mean of means, squared. Plus the same thing for this guy: his sample mean, 2, minus the mean of means, squared. Or another way to think about it, this is equal to 3 times (2-4) squared, which is the same thing as 3 times 4. It's equal to 12. I can do it for each of them. I actually want to find the total sum, so let me just write it all out. I think that might be an easier thing to do. We want, for all of these guys combined, the sum of squares due to the differences between the samples. So that's the contribution from the first sample. And then from the second sample, you have this guy here, five. Sorry, you don't want to calculate with him. For this data point, the amount of variation due to the difference between the means is going to be (4-4) squared. Same thing for this guy, it would be (4-4) squared. We're not taking the data point itself into consideration. We're only taking its sample mean into consideration. And then finally + (4-4) squared.
We're taking this minus this, squared, for each of these data points. And then finally we'll do that with the last group. The sample mean is 6, so it's going to be (6-4) squared plus (6-4) squared plus (6-4) squared. Now, let's think about how many degrees of freedom we had in this calculation right over here. Well, in general, I guess the easiest way to think about it is: how much information do we have, assuming that we knew the mean of means? If we know the mean of means, how much here is new information? If you know the mean of means and you know 2 of the sample means, you can always figure out the third. If you know this one and this one, you can figure out that one. If you know that one and that one, you can figure out that one. That's because this is the mean of these means over here. So in general, if you have m groups, or if you have m means, there are m-1 degrees of freedom here. With that said, in this case m is 3. So we could say there are 2 degrees of freedom for this exact example.
Let's actually calculate the sum of squares between. So what is this going to be? This right here is 2-4, which is -2, and squared that is 4. And then we have three fours over here, so 3 times 4. Plus 3 times 0, plus 3 times (6-4) squared, which is 3 times 4. So plus 3 times 4. And we get 3 times 4 is 12, + 0, + 12, which is equal to 24. So the sum of squares, or the variation due to the difference between the groups, between the means, is 24.
Now let's put this all together. We said that the total variation, when you look at all 9 data points, is 30. Let me write that over here. So the total sum of squares is equal to 30. We figured out the sum of squares between each data point and its central tendency, its sample mean. We totaled it all up and we got 6 for the sum of squares within. The sum of squares within was equal to 6. In this case, it had 6 degrees of freedom.
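The between-group calculation can be sketched the same way. This is an illustration only, using the data values as read off from the transcript. It takes the mean of all 9 points as the mean of means, which is the same number here because every group has the same size.

```python
# Data as read off from the transcript's arithmetic (assumed): three groups of three.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]

all_points = [x for group in groups for x in group]
grand_mean = sum(all_points) / len(all_points)  # equals the mean of means for equal-size groups

# Every data point contributes (its group's mean - grand mean) squared,
# so each group contributes len(group) copies of the same term.
ssb = sum(
    len(group) * (sum(group) / len(group) - grand_mean) ** 2
    for group in groups
)
df_between = len(groups) - 1

print(grand_mean)  # 4.0
print(ssb)         # 24.0
print(df_between)  # 2
```

The three group contributions are 12, 0, and 12, matching the 3 times 4, 3 times 0, and 3 times 4 worked out above.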
If we wanted to write it generally, there were m times (n-1) degrees of freedom. Actually, for the total, we figured out we had mn-1 degrees of freedom. Let me write the degrees of freedom in this column over here. In this case, the number turned out to be 8. And then just now, we calculated the sum of squares between the samples. The sum of squares between the samples is equal to 24, and we figured out that it had m-1 degrees of freedom, which ended up being 2. Now the interesting thing here, and this is why this analysis of variance all fits nicely together (in future videos we will think about how we can actually test hypotheses using some of the tools that we're thinking about right now), is that the sum of squares within plus the sum of squares between is equal to the total sum of squares. So the way to think about it is that the total variation in this data right here can be described as the sum of the variation within each of these groups, when you take that total, plus the sum of the variation between the groups. And even the degrees of freedom work out. The sum of squares between has 2 degrees of freedom. The sum of squares within each of the groups had 6 degrees of freedom. 2+6 is 8. That's the total degrees of freedom we have for all of the data combined. It even works if you look at the more general case. Our sum of squares between had m-1 degrees of freedom. Our sum of squares within had m(n-1) degrees of freedom. Added together, this is equal to m-1+mn-m. The m's cancel out. This is equal to mn-1 degrees of freedom, which is exactly the total degrees of freedom we have for the total sum of squares. So the whole point of the calculations that we did in the last video and this video is just to appreciate that this total variation over here can be viewed as the sum of these two component variations: how much variation there is within each of the samples, plus how much variation there is between the means of the samples. Hopefully that's not too confusing.
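As a quick check that the pieces add up, here is a minimal sketch that recomputes all three sums of squares and their degrees of freedom. The data values are read off from the transcript's arithmetic, and the variable names are just illustrative choices.

```python
# Data as read off from the transcript's arithmetic (assumed): m = 3 groups, n = 3 points each.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]
m = len(groups)
n = len(groups[0])

all_points = [x for group in groups for x in group]
grand_mean = sum(all_points) / len(all_points)
group_means = [sum(group) / len(group) for group in groups]

# Total: every point against the grand mean.
sst = sum((x - grand_mean) ** 2 for x in all_points)
# Within: every point against its own group's mean.
ssw = sum((x - mu) ** 2 for group, mu in zip(groups, group_means) for x in group)
# Between: every point's group mean against the grand mean.
ssb = sum(len(group) * (mu - grand_mean) ** 2 for group, mu in zip(groups, group_means))

df_total = m * n - 1
df_within = m * (n - 1)
df_between = m - 1

print(sst, ssw, ssb)                    # 30.0 6.0 24.0
print(df_total, df_within, df_between)  # 8 6 2
```

Both identities hold for this data: 6 + 24 = 30 for the sums of squares, and 6 + 2 = 8 for the degrees of freedom.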