Statistics and probability
Analysis of Variance 2 - Calculating SSW and SSB (Total Sum of Squares Within and Between). Created by Sal Khan.
- Thanks Sal for the great video.
Would you please help in elaborating the degrees of freedom part? I find it really interesting but I think that unfortunately I did not grasp their meaning in its entirety.
04:46- "In the future we might do a more detailed discussion of what degrees of freedom mean and how to ..."
- Well the way I explained it to myself:
If we have a defined population we can easily calculate the mean.
5, 4, 3, 2, 5: (5+4+3+2+5)/5 = 3.8
We can always find the "one missing" population member if we know that the mean is 3.8, that there are 5 members, and what the other 4 members are.
(5+4+3+2+x)/5 = 3.8 (multiply both sides by 5)
x = 5 (the fifth member of our population was indeed 5)
If we tried the same thing with two missing members, it would not come out the same.
(5+4+3+x+y)/5 = 3.8 (multiply both sides by 5)
x + y = 7, which many different pairs satisfy, so the two missing members cannot be recovered.(7 votes)
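A minimal Python sketch of the "one missing member" idea from the comment above. The function name `recover_missing` is made up for this illustration; it only rearranges the definition of the mean.

```python
def recover_missing(known, mean, n):
    """Return the single missing value, given the other n - 1 values and the mean."""
    # Knowing the mean fixes the total: total = mean * n.
    return mean * n - sum(known)

# Four known members, a mean of 3.8, and five members in total.
missing = recover_missing([5, 4, 3, 2], mean=3.8, n=5)
print(round(missing, 6))
```

With two values missing, the same rearrangement only gives their sum, which is why only n - 1 of the values are free once the mean is fixed.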
- These videos on ANOVA follow the videos on regression. I was wondering if someone can post a statement that clarifies the distinction between the respective goals of regression and ANOVA analysis, i.e., you conduct regression analysis when your objective is __, but you need ANOVA when your goal is ____. thanks in advance.(10 votes)
- Regression: You have two quantitative (numerical) variables, and you want to know the relationship and/or predict values of one of them. For example, you know a person's height, you want to predict his/her weight.
ANOVA: You are interested in a numerical variable (the response), and you want to see if there are differences in this variable across several groups. For example, you want to see if gas mileage differs between sedans, minivans, and SUVs.(21 votes)
- For the SSB, I thought it was the mean of each group minus the grand mean, all squared, and so would be (2-4)**2 + (4-4)**2 + (6-4)**2. Why is (2-4) repeated three times?(10 votes)
- One can understand the repetition of the deviations between the group mean and the grand mean by considering ANOVA for groups with different sample sizes. Repeating the deviation for every data point gives each group a weight proportional to its sample size. This is important because larger samples give better estimates, with less variance around the true parameter (population mean, proportion, or variance). Hence, by taking the number of data points into account, we express how much we are relying on each sample when estimating the variation of the whole data set. Therefore, each group's contribution needs to include its sample size as well.(1 vote)
- When calculating SSB and SSW, do the groups have to have the same degrees of freedom? In your example all the groups have three data points and 2 degrees of freedom. Is it OK to have group A with 6 data points, group B with 3, and group C with 5?(5 votes)
- Yes, it is possible to have groups with different sample sizes. This would be called an "unbalanced" design (versus a "balanced" design when all groups have the same sample size).
For the unbalanced design, the notation gets a little bit more complicated, but it all still works out. Some things that Sal mentions won't be strictly as they sound. For example, the "mean of means" that he references cannot literally be calculated as the mean of the group means. We would need to calculate a weighted mean of the group means, or directly average all the data points from all the groups.(8 votes)
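To illustrate the point about the "mean of means" in an unbalanced design, here is a small Python sketch. The group sizes (6, 3 and 5) come from the question above, but the data values are invented purely for illustration.

```python
from statistics import mean

# Hypothetical unbalanced data: group sizes 6, 3 and 5 (values are made up).
groups = {
    "A": [1, 2, 3, 4, 5, 6],
    "B": [10, 11, 12],
    "C": [20, 21, 22, 23, 24],
}

group_means = {name: mean(values) for name, values in groups.items()}
sizes = {name: len(values) for name, values in groups.items()}
total_n = sum(sizes.values())

# Naive "mean of means": treats every group equally, regardless of its size.
naive = mean(group_means.values())

# Weighted mean of the group means, weighted by group size.
weighted = sum(sizes[g] * group_means[g] for g in groups) / total_n

# Direct average over every data point from every group.
direct = mean(x for values in groups.values() for x in values)

print(naive, weighted, direct)
```

The weighted and direct versions should agree with each other, while the naive mean of means generally differs once the group sizes are unequal.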
- Whenever I hear the term "n-1 degrees of freedom" when there are really "n" independent samples, that always sounds strange to me. At 3:40, SK is talking about the "n-1 degrees of freedom" b/c we "know" the mean. But, in calculating the mean itself, "n" independent samples were used. If any of those "n" random variables were different, the mean itself would be different. Any one of those "n" variables can affect all of the values in SK's blackboard above.**
Now, that being said, when SK talks about "n-1 degrees of freedom", he appears to be talking about the ERROR only FROM the mean. As in, we calculated the variance (a measure of error), and in calculating that "error", we used the mean as an intermediate value. But, GIVEN that value, there are only "n-1" degrees of freedom. However, how can we lose sight of the fact that calculating the mean ITSELF required "n" independent variables?
To summarize, there are really "n" independent variables / degrees of freedom in the WHOLE calculation. Why do we restrict the degrees of freedom in "knowing" the mean, when the mean itself was calculated with a set of values with "n" degrees of freedom - and NOT "n-1" degrees of freedom?
** I challenge anyone to tell me how they can keep all the final values (e.g. SST = 30) by holding only "n-1" of the values the same in the 3x3 matrix without any other constraints (i.e. the mean). If I take the last variable (which you didn't hold), and change it, the mean itself and all the other following stats would change.
- I think the (n-1)d.o.f. comes from this:
Hope you can follow my line of thought here:
-If we say the mean HAS to be, e.g., 25
-We have 3 variables (X1, X2 and X3) and we're allowed to change these variables as we wish as long as the mean remains 25
-Then say we change X1 and X2 as we wish, we can move them freely around, but here's the caveat, in order for the mean to remain at 25, when we change X1 and X2, we have to use X3 to adjust for the movement in X1 and X2 such that our mean remains at 25. And with this in mind, we can't really say that X3 is a free variable, since it's tied to the mean. Thus we get (d.o.f. = n-1)
Also to note, this bears a strong resemblance to "solving for x"-type equations, and thus it's not only for the mean that this works.
At least that's how I see it, would love to be corrected if I'm wrong, since then my understanding needs to be updated :)(2 votes)
- Is the Sum of Squares Within also known as the Sum of Squares Estimated?
Is the Sum of Squares Between also known as the Sum of Squares for Residuals?(3 votes)
- We are calculating SSB as defined in the video, but it is not intuitive to take 3*(2-4)^2 + 3*(4-4)^2 + 3*(6-4)^2 (according to the example given in the video) to measure the sum of squares between groups.
(2-4)^2 + (4-4)^2 + (6-4)^2 is more intuitive but incorrect. Why?
In measuring variance between groups, why are we going down to the element level with 3*(2-4)^2 and not just computing (2-4)^2 once per group?
The group mean represents a group. So if we want the variation among groups, just take the sum of squares of (group_mean - mean_of_group_means).
Can someone please explain?
Alternatively, if we calculate (2-4)^2+(4-6)^2+(2-6)^2, or (x1bar-x2bar)^2+(x2bar-x3bar)^2+(x3bar-x1bar)^2, then it also measures variation between the groups, and it is more intuitive. I am not sure about this method, but in this case it worked: it gives 24, the same as calculated in the video.
I know that this method does not calculate variation from the centre of the means and just calculates the distances between the means of different groups.
Please clarify?(1 vote)
- What if one group had a mean that was a long way away from the grand mean, but it only had few observations, while another group mean was very close to the grand mean, and had many observations? Groups with more observations should have higher weight. That is one intuitive reason why the "point wise" calculations are the way they are.
There are a number of ways to understand this through the mathematics, but that will get into some more complicated formulas.(4 votes)
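One way to see why the pairwise formula from the question happened to give 24 is to compute both quantities side by side. The sketch below is only an illustration; the function names `ssb` and `pairwise` and the 5-points-per-group case are assumptions made for this example. The pairwise sum equals the number of groups times the sum of squared deviations of the group means from their plain average, so it matches SSB here only because the video has 3 groups with 3 points each.

```python
from itertools import combinations

def ssb(group_means, sizes):
    """Sum of squares between: each squared deviation counts once per observation."""
    total_n = sum(sizes)
    grand = sum(n * m for n, m in zip(sizes, group_means)) / total_n
    return sum(n * (m - grand) ** 2 for n, m in zip(sizes, group_means))

def pairwise(group_means):
    """Sum of squared differences over every pair of group means."""
    return sum((a - b) ** 2 for a, b in combinations(group_means, 2))

means = [2, 4, 6]
balanced = ssb(means, [3, 3, 3])   # the video's setup: 3 points per group
pairs = pairwise(means)            # (2-4)^2 + (2-6)^2 + (4-6)^2
bigger = ssb(means, [5, 5, 5])     # hypothetical: 5 points per group

print(balanced, pairs, bigger)
```

With 5 points per group the pairwise sum stays at 24 while SSB grows, which shows that the pairwise version ignores how many observations stand behind each mean.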
- How would we calculate SSwithin if only variances are given for each group?(1 vote)
- SSwithin can be calculated as: (n-1)*[ s1^2 + s2^2 + ... ]
To get this, look at what we do for SSwithin. For every point, we subtract the group mean from each value, square it, and add them all up. For notation, I'm going to use Mi as the group means.
SUM( SUM( (Xij - Mi)^2 ) )
The outside sum is just going across the groups, so let's look at each group separately: SUM( (Xij - Mi)^2 )
This looks very close to being a variance, doesn't it? In fact, all we need to do is divide by (n-1) and we'd have the variance for that group! But we already have the variance, and we want the sum. So we can just "undo" the division by (n-1) that the variances have. For the ith group, we'd take (n-1)*Si^2 to get this sum. From there, we just have to do the same thing to every group, and add up the results.
If we have equal sample sizes for each group, they are all "n", so the denominator in the variance is (n-1) for each, meaning we can factor that out and just multiply (n-1) by the sum of all the variances. Hence, the formula at the top.(3 votes)
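The formula above can be checked against the video's nine data points with a short Python sketch. It uses `statistics.variance`, which computes the sample variance (dividing by n - 1); the variable names are chosen just for this example.

```python
from statistics import variance

# The three groups from the video.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]

# Sample variance of each group (statistics.variance divides by n - 1).
variances = [variance(g) for g in groups]

# Equal group sizes: undo the division by (n - 1) once for the whole sum.
n = len(groups[0])
ssw_equal = (n - 1) * sum(variances)

# General form, which also covers unequal group sizes.
ssw_general = sum((len(g) - 1) * s2 for g, s2 in zip(groups, variances))

print(ssw_equal, ssw_general)
```

Both versions should reproduce the sum of squares within of 6 from the video; the general form is the one to use when the groups have different sizes.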
- Hi. Could this be expanded into three dimensions, i.e. done with a data cube?
When I watched the video I thought of some hourly data I've worked on before, which has a variation within the week, between different weeks and between different years. It would be interesting to have a good measure of how much of the total variation each part contributes.
I did a quick test in Excel to see if I could figure it out, but I couldn't quite get the sums to work out so that the sum of the parts was equal to the total. Does anyone know if it's possible to do in a similar manner to what the video describes?(1 vote)
- This video is about Analysis of Variance (ANOVA), or more specifically, One-Factor ANOVA. In this setting, we are thinking about two variables:
1. Something that we are measuring, like height, weight, MPG of an automobile, etc. These result in the 9 numbers that Sal was working with in this video.
2. A "factor", which is a variable comprising two or more groups. For example, say we want to compare the average MPG of an automobile, and we are looking at several groups: Sedans, Minivans, and SUVs/Trucks. This would be the three groups / columns that Sal had, group 1, 2, and 3 (or Green, Purple, Pink, if you prefer).
It is absolutely possible to add a second factor (i.e. another set of groups) into the analysis. For instance, say we think that in addition to differences in MPG between car types, we think there may also be differences between Asian, American, and European cars in terms of MPG. We could build that into the model as well, and it would be called "Two-factor ANOVA". We can extend this as much as we needed, though the more factors that we add, the more complicated it will be to actually understand the results.
That being said, the situation in your brief description could not be addressed just by having several factors. Since your factors are overlapping time periods (within week, between weeks, between years), the groups are not what we call independent. One problem is this: A collection of weeks belongs to a certain year. Say year A was during a recession, and year B was not. Then year B will look better, but so will all of the weeks associated with year B. The weeks are what we call "nested" within the years. Different weeks "belong" to specific years.(3 votes)
In the last video, we were able to calculate the total sum of squares for these nine data points right here. And these nine data points are grouped into three different groups, or if we want to speak generally, into m different groups. What I want to do in this video is to figure out how much of this total sum of squares is due to variation within each group versus variation between the actual groups.
So first, let's figure out the total variation within the groups. Let's call that the sum of squares within. I'll do that in yellow. Actually, I already used yellow, so let me do blue. So the sum of squares within. Let me make it clear: the W stands for within. We want to see how much of the variation is due to how far each of these data points is from its central tendency, from its respective group mean. So instead of taking the distance between each data point and the mean of means, I'm going to find the distance between each data point and that group's mean, because we want the total sum of the squared distances between each data point and its respective mean.
So let's do that. For the first group, the mean here is 2, so it's (3 minus 2) squared, plus (2 minus 2) squared, plus (1 minus 2) squared. I'm going to do this for all of the groups, but for each group, it's the distance between each data point and its own group's mean. So for the second group, whose mean is 4: plus (5 minus 4) squared, plus (3 minus 4) squared, plus (4 minus 4) squared. And then finally, we have the third group. We're finding the sum of squares from each point to the central tendency within its group, but we're going to add them all up. The third group's mean is 6, so: plus (5 minus 6) squared, plus (6 minus 6) squared, plus (7 minus 6) squared. And what is this going to equal?
So this is going to be equal to: up here, for the first group, it's going to be 1 plus 0 plus 1, so that's 2. Then for the second group, 1 plus 1 plus 0, so another 2. And for the third group, 1 plus 0 plus 1 (7 minus 6 is 1, squared is 1), so that's 2 over here as well. So our sum of squares within is 6. One way to think about it: our total variation was 30, and based on this calculation, 6 of that 30 comes from variation within these samples.
Now, the next thing I want to think about is how many degrees of freedom do we have in this calculation? How many independent data points do we actually have? Well, in each of these groups we have n data points. In particular, n is 3 here. But if you know n minus 1 of them, you can always figure out the nth one if you know the actual sample mean. So in this case, for any of these groups, if you know two of these data points, you can always figure out the third if you know the sample mean.
So in general, let's figure out the degrees of freedom here. For each group, when you did this, you had n minus 1 degrees of freedom. Remember, n is the number of data points you had in each group. So it's n minus 1, n minus 1, n minus 1. Or let me put it this way: you have n minus 1 for each of these groups, and there are m groups. So there are m times (n minus 1) degrees of freedom. And in this case in particular, n minus 1 is 2, so each group had 2 degrees of freedom, and there are three groups. So there are 6 degrees of freedom. And in the future, we might do a more detailed discussion of what degrees of freedom mean, and how to mathematically think about it.
But the simplest way to think about it is as truly independent data points, assuming you knew, in this case, the central statistic that we used to calculate the squared distance in each group. If you know it already, the third data point could actually be calculated from the other two. So we have 6 degrees of freedom over here.
Now, that was how much of the total variation is due to variation within each sample. Now let's think about how much of the variation is due to variation between the samples. And to do that, we're going to calculate something new. Let me get a nice color here. I think I've run out of all the colors. We'll call this the sum of squares between. The B stands for between. So another way to think about it: how much of this total variation is due to the variation between the means, between the central tendencies (that's what we're going to calculate right now), and how much is due to variation from each data point to its mean?
So let's figure out how much is due to variation between these guys over here. Actually, let's think about just this first group. For this first group, how much variation for each of these guys is due to the variation between this mean and the mean of means? Well, for this first guy up here (I'll just write it all out explicitly), the variation is going to be its sample mean minus the mean of means, squared. So it's going to be (2 minus 4) squared. And then for this guy, it's going to be the same thing: his sample mean minus the mean of means, (2 minus 4) squared. Plus the same thing for this guy, (2 minus 4) squared. Or another way to think about it, this is equal to 3 times (2 minus 4) squared, which is the same thing as 3 times 4, which is equal to 12. And then we could do it for each of them. And actually, I want to find the total sum. So let me just write it all out, actually.
I think that might be an easier thing to do, because I want to find, for all of these guys combined, the sum of squares due to the differences between the samples. So that's the contribution from the first sample. And then from the second sample, you have this guy over here. Oh, sorry, you don't want to use the data point itself. For this data point, the amount of variation due to the difference between the means is going to be (4 minus 4) squared. Same thing for this guy: (4 minus 4) squared. We're not taking the data point itself into consideration. We're only taking its sample mean into consideration. And then finally, plus (4 minus 4) squared. We're taking the sample mean minus the mean of means, squared, for each of these data points. And then finally, we'll do that with the last group. With the last group, the sample mean is 6. So it's going to be (6 minus 4) squared, plus (6 minus 4) squared, plus (6 minus 4) squared.
Now, let's think about how many degrees of freedom we had in this calculation right over here. Well, in general, I guess the easiest way to think about it is, how much information did we have, assuming that we knew the mean of means? If we know the mean of means, how much here is new information? Well, if you know the mean of means, and you know two of these sample means, you can always figure out the third. If you know this one and this one, you can figure out that one. And if you know that one and that one, you can figure out that one. And that's because this is the mean of these means over here. So in general, if you have m groups, or if you have m means, there are m minus 1 degrees of freedom here. Let me write that. And in this case, m is 3. So we could say there are 2 degrees of freedom for this exact example.
Let's actually calculate the sum of squares between. So what is this going to be? I'll just scroll down. Running out of space. This is going to be equal to: this right here, 2 minus 4 is negative 2, squared is 4.
And then we have three 4's over here. So it's 3 times 4, plus 3 times (what is this?) 0, plus (what is this?) 6 minus 4 is 2, squared is 4, so that means we have another 3 times 4. And we get 3 times 4 is 12, plus 0, plus 12, which is equal to 24. So the sum of squares between, or we could say the variation due to the difference between the groups, between the means, is 24.
Now, let's put it all together. We said that the total variation, if you looked at all 9 data points, is 30. Let me write that over here. So the total sum of squares is equal to 30. We figured out the sum of squares between each data point and its central tendency, its sample mean, and when you totaled it all up, we got 6. So the sum of squares within was equal to 6. And in this case, it had 6 degrees of freedom. Or if we wanted to write it generally, there were m times (n minus 1) degrees of freedom. And actually, for the total, we figured out we have mn minus 1 degrees of freedom. Let me just write degrees of freedom in this column right over here. In this case, the number turned out to be 8. And then just now we calculated the sum of squares between the samples. The sum of squares between the samples is equal to 24. And we figured out that it had m minus 1 degrees of freedom, which ended up being 2.
Now, the interesting thing here (and this is why this analysis of variance all fits nicely together, and in future videos we'll think about how we can actually test hypotheses using some of the tools that we're thinking about right now) is that the sum of squares within, plus the sum of squares between, is equal to the total sum of squares. So a way to think about it is that the total variation in this data right here can be described as the sum of the variation within each of these groups, when you take that total, plus the sum of the variation between the groups. And even the degrees of freedom work out.
The sum of squares between had 2 degrees of freedom. The sum of squares within each of the groups had 6 degrees of freedom. 2 plus 6 is 8. That's the total degrees of freedom we had for all of the data combined. It even works if you look at the more general case. Our sum of squares between had m minus 1 degrees of freedom. Our sum of squares within had m times (n minus 1) degrees of freedom. So the sum is equal to m minus 1, plus mn minus m. The m's cancel out. This is equal to mn minus 1 degrees of freedom, which is exactly the total degrees of freedom we had for the total sum of squares.
So the whole point of the calculations that we did in the last video and in this video is just to appreciate that this total variation over here, this total variation that we first calculated, can be viewed as the sum of these two component variations: how much variation is there within each of the samples, plus how much variation is there between the means of the samples? Hopefully that's not too confusing.
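For readers who want to check the arithmetic, the whole decomposition from the video can be sketched in plain Python. The data are the nine points from the transcript; the variable names are chosen for this example, and the code assumes a balanced design (the same number of points in every group), as in the video.

```python
# The three groups from the video, with means 2, 4 and 6.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]

m = len(groups)       # number of groups
n = len(groups[0])    # data points per group (balanced design assumed)

all_points = [x for g in groups for x in g]
grand_mean = sum(all_points) / len(all_points)
group_means = [sum(g) / len(g) for g in groups]

# Total: every point against the grand mean (the mean of means here).
sst = sum((x - grand_mean) ** 2 for x in all_points)

# Within: every point against its own group's mean.
ssw = sum((x - gm) ** 2 for g, gm in zip(groups, group_means) for x in g)

# Between: each group mean against the grand mean, once per data point.
ssb = sum(len(g) * (gm - grand_mean) ** 2 for g, gm in zip(groups, group_means))

# Degrees of freedom for total, within and between.
df_total = m * n - 1
df_within = m * (n - 1)
df_between = m - 1

print(sst, ssw, ssb)
print(df_total, df_within, df_between)
```

The sums of squares should come out as 30, 6 and 24, and the degrees of freedom as 8, 6 and 2, matching the numbers worked out in the video.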