Analysis of Variance 1 - Calculating SST (Total Sum of Squares). Created by Sal Khan.
Video transcript
In this video and the next few videos, we're just really going to be doing a bunch of calculations about this data set right over here. And hopefully, just going through those calculations will give you an intuitive sense of what the analysis of variance is all about. Now, the first thing I want to do in this video is calculate the total sum of squares. So I'll call that SST. SS, sum of squares, total. And you could view it as really the numerator when you calculate variance. So you're just going to take the distance between each of these data points and the mean of all of these data points, square them, and just take that sum. We're not going to divide by the degrees of freedom, which you would normally do if you were calculating sample variance. Now, what is this going to be? Well, the first thing we need to do, we have to figure out the mean of all of this stuff over here. And I'm actually going to call that the grand mean. And I'm going to show you in a second that it's the same thing as the mean of the means of each of these data sets, which works here because every group has the same number of data points. So let's calculate the grand mean. So it's going to be 3 plus 2 plus 1 plus 5 plus 3 plus 4 plus 5 plus 6 plus 7. And then we have nine data points here so we'll divide by 9. And what is this going to be equal to? 3 plus 2 plus 1 is 6. 5 plus 3 plus 4 is 12. And then 5 plus 6 plus 7 is 18. And then 6 plus 12 is 18, plus another 18 is 36, divided by 9 is equal to 4. And let me show you that that's the exact same thing as the mean of the means. So the mean of this group 1 over here-- let me do it in that same green-- the mean of group 1 over here is 3 plus 2 plus 1. That's that 6 right over here, divided by 3 data points so that will be equal to 2. The mean of group 2, the sum here is 12. We saw that right over here. 5 plus 3 plus 4 is 12, divided by 3 is 4 because we have three data points. And then the mean of group 3, 5 plus 6 plus 7 is 18 divided by 3 is 6.
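As a quick check of the arithmetic above, here is a minimal Python sketch. The data values are the ones read out in the transcript; the dictionary layout and the names `groups`, `grand_mean`, and `mean_of_means` are just illustrative choices, not anything from the video.

```python
# Data values as read out in the transcript; the group labels are illustrative.
groups = {
    "group 1": [3, 2, 1],
    "group 2": [5, 3, 4],
    "group 3": [5, 6, 7],
}

# Grand mean: pool all nine data points and average them.
all_points = [x for values in groups.values() for x in values]
grand_mean = sum(all_points) / len(all_points)

# Mean of the means: average each group first, then average those averages.
group_means = {name: sum(values) / len(values) for name, values in groups.items()}
mean_of_means = sum(group_means.values()) / len(group_means)

print(grand_mean)     # 4.0
print(group_means)    # {'group 1': 2.0, 'group 2': 4.0, 'group 3': 6.0}
print(mean_of_means)  # 4.0
```

The two results agree here because all three groups have the same size; with unequal group sizes the mean of the means would generally differ from the grand mean.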
So if you were to take the mean of the means, which is another way of viewing this grand mean, you have 2 plus 4 plus 6, which is 12, divided by 3 means here. And once again, you would get 4. So you could view this as the mean of all of the data in all of the groups or the mean of the means of each of these groups. But either way, now that we've calculated it, we can actually figure out the total sum of squares. So let's do that. So it's going to be equal to 3 minus 4-- the 4 is this 4 right over here-- squared plus 2 minus 4 squared plus 1 minus 4 squared. Now, I'll do these guys over here in purple. Plus 5 minus 4 squared plus 3 minus 4 squared plus 4 minus 4 squared. Let me scroll over a little bit. Now, we only have three left, plus 5 minus 4 squared plus 6 minus 4 squared plus 7 minus 4 squared. And what does this give us? So up here, this is going to be equal to 3 minus 4. Difference is 1. It's actually negative 1, but you square it, you get 1, plus you get negative 2 squared is 4, plus negative 3 squared. Negative 3 squared is 9. And then we have here in the magenta 5 minus 4 is 1, squared is still 1. 3 minus 4 is negative 1. You square it, you still get 1. And then 4 minus 4 is just 0. So we could-- well, I'll just write the 0 there just to show you that we actually calculated that. And then we have these last three data points. 5 minus 4 squared. That's 1. 6 minus 4 squared. That is 4, right? That's 2 squared. And then plus 7 minus 4 is 3, squared is 9. So what's this going to be equal to? So I have 1 plus 4 plus 9 right over here. That's 5 plus 9. This right over here is 14, right? 5 plus-- yup, 14. And then we also have another 14 right over here because we have a 1 plus 4 plus 9. So that right over there is also 14. And then we have 2 over here from the middle group, 1 plus 1 plus 0. So it's going to be 28-- 14 times 2, 14 plus 14 is 28-- plus 2 is 30. Is equal to 30.
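The total sum of squares worked out above can be sketched in a few lines of Python. This assumes the nine data values read out in the transcript; the variable names are illustrative.

```python
# The nine data points from the transcript, listed group by group.
data = [3, 2, 1, 5, 3, 4, 5, 6, 7]

# Grand mean of all nine points.
grand_mean = sum(data) / len(data)

# Squared distance of each point from the grand mean.
squared_deviations = [(x - grand_mean) ** 2 for x in data]

# SST is just the sum of those squared distances (no division).
sst = sum(squared_deviations)

print(squared_deviations)  # [1.0, 4.0, 9.0, 1.0, 1.0, 0.0, 1.0, 4.0, 9.0]
print(sst)                 # 30.0
```

The printed list lines up with the three groups of terms in the calculation: 14 from the first group, 2 from the second, and 14 from the third.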
So our total sum of squares is 30. And actually, if we wanted the variance here, we would divide this by the degrees of freedom. And we've seen degrees of freedom multiple times. So let's say that we have m groups over here. So let me just write it as m, and I'm not going to prove things rigorously here, but I want to show you where some of these strange formulas that show up in statistics books actually come from without proving it rigorously. More to give you the intuition. So we have m groups here. And each group here has n members. So how many total members do we have here? Well, we had m times n or 9, right? 3 times 3 total members. So our degrees of freedom-- and remember, you have however many data points you had minus 1 degrees of freedom because if you know the mean of means, if you assume you knew that, then only 9 minus 1, only eight of these are going to give you new information because if you know that, you could calculate the last one. Or it really doesn't have to be the last one. If you have eight of them, you could always calculate the ninth one using the mean of means. So one way to think about it is that there's only eight independent measurements here. Or if we want to talk generally, there are m times n, so that tells us the total number of data points, minus 1 degrees of freedom. And if we were actually calculating the variance here, we would just divide 30 by the quantity m times n minus 1, or this is another way of saying eight degrees of freedom for this exact example. We would take 30 divided by 8 and we would actually have the variance for this entire group, for the group of nine when you combine them. I'll leave you here in this video. In the next video, we're going to try to figure out how much of this total variance, how much of this total squared sum, total variation comes from the variation within each of these groups versus the variation between the groups.
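To make the degrees-of-freedom step concrete, here is a small Python sketch, assuming the same nine data values and m = 3 groups of n = 3 members each. The variable names are illustrative, and the comparison with the standard library's `statistics.variance` is only a cross-check, not something done in the video.

```python
import math
import statistics

# The nine data points from the transcript, listed group by group.
data = [3, 2, 1, 5, 3, 4, 5, 6, 7]

m = 3  # number of groups
n = 3  # members per group

# Total sum of squares around the grand mean.
grand_mean = sum(data) / len(data)
sst = sum((x - grand_mean) ** 2 for x in data)

# Degrees of freedom: total number of data points minus 1.
degrees_of_freedom = m * n - 1

# Sample variance of the combined group of nine.
variance = sst / degrees_of_freedom

print(degrees_of_freedom)  # 8
print(variance)            # 3.75

# Cross-check against the standard library's sample variance.
print(math.isclose(variance, statistics.variance(data)))  # True
```

So 30 divided by 8 gives 3.75 for the combined sample, which matches what a standard sample-variance routine returns for the same nine numbers.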
And I think you get a sense of where this whole analysis of variance is coming from. It's the sense that, look, there's a variance of this entire sample of nine, but some of that variance-- if these groups are different in some way-- might come from the variation from being in different groups versus the variation from being within a group. And we're going to calculate those two things and we're going to see that they're going to add up to the total squared sum variation.