
Chi-square test for association (independence)


Video transcript

- [Instructor] We're already familiar with the chi-squared statistic. If you're not, I encourage you to review the videos on that. And we've already done some hypothesis testing with the chi-squared statistic, and we've even done some hypothesis testing based on two-way tables. And now we're going to extend that by thinking about a chi-squared test for association between two variables. So let's say that we suspect that someone's foot length is related to their hand length, that these things are not independent. Well, what we can do is set up a hypothesis test. And remember, the null hypothesis in a hypothesis test is always to assume no effect. So what we could say here is that there is no association between foot and hand length. Another way to think about it is that they are independent, and oftentimes what we're doing is called a chi-squared test for independence. And then our alternative hypothesis would be our suspicion: there is an association, so foot and hand length are not independent. So what we can then do is go to a population and randomly sample it. And let's say we randomly sample 100 folks. For all of those 100 folks, we figure out whether their right hand is longer, their left hand is longer, or both hands are the same. And we also do that for the feet, and we tabulate all of the data. And this is the data that we actually get. Now it's worth thinking for a second about how what we just did is different from a chi-squared test for homogeneity. In a chi-squared test for homogeneity, we sample from two different populations, or we look at two different groups, and we see whether the distribution of a certain variable amongst those two different groups is the same. Here we are sampling from just one group, but we're thinking about two different variables for that one group: foot length and hand length.
And so you can see here that 11 folks had both their right hand longer and their right foot longer. Three folks had their right hand longer, but their left foot was longer. And then eight folks had their right hand longer, but both feet were the same. Likewise, we had nine people where their left hand and foot were longer, but you had two people where the left hand was longer, but the right foot was longer. And we can go through all of these. But to do our chi-squared test, we ask: what would be the expected value of each of these cells if we assumed that the null hypothesis was true, that there was no association between foot and hand length? So to help us do that, I'm going to make a total of our columns here, and also a total of our rows here. Let me draw a line here, so we know what's going on. And so, what is the total number of people who had a longer right hand? Well, it's going to be 11 plus three plus eight, which is 22. The total number of people who had a longer left hand is two plus nine plus 14, which is 25. And then the total number of people whose hands had the same length: 12 plus 13 plus 28; 25 plus 28, that is 53. And then if I were to total this column: 22 plus 25 is 47, plus 53, we get 100, right over here. And then if we total the number of people who had a longer right foot: 11 plus two plus 12; 13 plus 12, that is 25. Longer left foot: three plus nine plus 13, that's also 25. And then we can either add these up, and we would get 50, or we could say, hey, 25 plus 25 plus what is 100? Well, that is going to be equal to 50. Now to figure out these expected values, remember, we're going to figure out the expected values assuming that the null hypothesis is true, assuming that these variables are independent, that foot length and hand length are independent variables.
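As a quick check on the arithmetic above, the observed two-way table and its row, column, and grand totals can be tabulated in a few lines of Python. This is a sketch of my own, not something from the video, which does the sums by hand:

```python
# Observed counts from the video's two-way table.
# Rows: right hand longer, left hand longer, both hands same.
# Columns: right foot longer, left foot longer, both feet same.
observed = [
    [11, 3, 8],
    [2, 9, 14],
    [12, 13, 28],
]

row_totals = [sum(row) for row in observed]        # hand-length totals
col_totals = [sum(col) for col in zip(*observed)]  # foot-length totals
grand_total = sum(row_totals)

print(row_totals)   # [22, 25, 53]
print(col_totals)   # [25, 25, 50]
print(grand_total)  # 100
```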
Well, if they are independent, which we are assuming, then our best estimate is that 22% have a longer right hand, and our best estimate is that 25% have a longer right foot. And so out of 100, you would expect 0.22 times 0.25 times 100 to have a longer right hand and a longer right foot. I'm just multiplying the probabilities, which you can do if these are independent variables. And so 0.22 times 0.25, let's see, one fourth of 22 is 5 1/2, so this is going to be equal to 5.5. Now what number would you expect to have a longer right hand, but a longer left foot? That would be 0.22 times 0.25 times 100. Well, we already calculated what that would be: 5.5. And then to figure out the expected number that would have a longer right hand, but both feet the same length, we could multiply 22 out of 100 times 50 out of 100 times 100, which is going to be half of 22, which is equal to 11. And we can keep going. This value right over here would be 0.25 times 0.25 times 100; 25 times 25 is 625, so that would be 6.25. This value right over here would be 0.25 times 0.25 times 100, which is again 6.25. And then this value right over here, there are a couple of ways we can get it. We can multiply 0.25 times 0.50 times 100, which would get us to 12.5, or we could have said this plus this plus this has to equal 25, so this would be 12.5. And this expected value we can figure out because 5.5 plus 6.25 plus this is going to equal 25. So let's see, 5.5 plus 6.25 is 11.75, and 11.75 plus 13.25 is equal to 25, so this is 13.25. Same thing over here. This would be 13.25, 'cause 11.75 plus 13.25 is equal to 25. If we add these two together, we get 26.5, and 26.5 plus what is equal to 53? Well, it'd be equal to another 26.5. Now once you figure out all of your expected values, that's a good time to check your conditions. The first condition is that you took a random sample. So let's assume we had done that.
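All of the expected counts just worked out follow one pattern: under independence, each cell's expected count is (row total / n) times (column total / n) times n, which simplifies to row total times column total divided by n. A short Python sketch of that calculation (my own, not part of the original video):

```python
observed = [
    [11, 3, 8],
    [2, 9, 14],
    [12, 13, 28],
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)  # 100 people sampled

# Expected count per cell under the null hypothesis of independence:
# P(row) * P(column) * n, which equals row_total * col_total / n.
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

for row in expected:
    print(row)
# [5.5, 5.5, 11.0]
# [6.25, 6.25, 12.5]
# [13.25, 13.25, 26.5]
```

These match the values computed by hand above, including the 13.25 entries found by subtraction.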
The second condition is that the expected count for every cell has to be at least five. And we can see that all of our expected values are at least five. The actual observed counts do not have to be at least five. So it's okay that we got a two here, because the expected value here is five or larger. And then the last condition is the independence condition: either we are sampling with replacement, or we have to feel comfortable that our sample size is no more than 10% of the population. So let's assume that that happened as well. So assuming we met all of those conditions, we are ready to calculate our chi-squared statistic. And what we're going to do is, for every cell, take the observed count minus the expected count, square that difference, and divide by the expected count. So that's 11 minus 5.5, squared, over 5.5. I did that one. Now I'll do this one: plus three minus 5.5, squared, over 5.5. Plus, now I'll do this one, eight minus 11, squared, over 11. Then I'll do this one: two minus 6.25, squared, over 6.25. And I'll keep doing it; I'm going to do it for all nine of these cells. And I actually calculated this ahead of time to save some time. And so if you do this for all nine of the cells, you're going to get a chi-squared statistic of 11.942. Now before we calculate the P-value, we're going to have to think about our degrees of freedom. We have a three-by-three table here, so one way to think about it is the number of rows minus one, times the number of columns minus one, and that is two times two, which is equal to four. Another way to think about it is that if you know four of these cells and you know the totals, then you can figure out the other five cells. And so now we are ready to calculate a P-value.
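The nine-cell sum and the degrees-of-freedom rule can both be written out directly. A minimal Python sketch (my own, mirroring the calculation the video says was done ahead of time):

```python
observed = [
    [11, 3, 8],
    [2, 9, 14],
    [12, 13, 28],
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Chi-squared statistic: sum of (observed - expected)^2 / expected
# over all nine cells.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom for an r-by-c table: (r - 1) * (c - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(round(chi2, 3))  # 11.942
print(df)              # 4
```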
And you can do that using a calculator, or you can do that using a chi-squared table, but let's say we did it using a calculator, and we get a P-value of 0.018. And just to remind ourselves what this is: this is the probability of getting a chi-squared statistic this large or larger, assuming the null hypothesis is true. And so next, we do what we always do with hypothesis testing: we compare this to our significance level. And we actually should have set our significance level from the beginning. So let's just assume that when we set up our hypotheses here, we also said that we want a significance level of 0.05. You really should do this before you calculate all of this. But then you compare your P-value to your significance level, and we see that this P-value is a good bit less than our significance level. And so one way to think about it is, we got all these expected values assuming that the null hypothesis was true, but the probability of getting a result this extreme or more extreme is less than 2%, which is lower than our significance level. And so this will lead us to reject our null hypothesis, and it suggests to us that there is an association between hand length and foot length.
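If you'd like a calculator-free check of that P-value: when the degrees of freedom are even, df = 2k, the chi-squared tail probability has a closed form, P(X >= x) = exp(-x/2) times the sum of (x/2)^i / i! for i from 0 to k-1. This Python sketch uses only the standard library and that identity; it is my own addition, where the video simply uses a calculator:

```python
import math

def chi2_tail_even_df(x, df):
    """Tail probability P(X >= x) for a chi-squared variable with
    an even number of degrees of freedom df = 2k (closed form)."""
    k = df // 2
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))

chi2_stat = 11.942  # statistic from the nine-cell sum
df = 4

p_value = chi2_tail_even_df(chi2_stat, df)
alpha = 0.05

print(round(p_value, 3))  # 0.018
print(p_value < alpha)    # True, so we reject the null hypothesis
```

Since 0.018 is less than the 0.05 significance level, the code reaches the same conclusion as the video: reject the null hypothesis.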