## Chi square tests for relationships (homogeneity or independence)

# Chi-square test for association (independence)

AP.STATS: DAT‑3 (EU), DAT‑3.K (LO), DAT‑3.K.1 (EK), DAT‑3.L (LO), DAT‑3.L.1 (EK), DAT‑3.L.2 (EK), VAR‑8 (EU), VAR‑8.H (LO), VAR‑8.H.1 (EK), VAR‑8.I (LO), VAR‑8.I.2 (EK), VAR‑8.J (LO), VAR‑8.J.2 (EK), VAR‑8.K (LO), VAR‑8.K.1 (EK), VAR‑8.L (LO), VAR‑8.L.1 (EK), VAR‑8.M (LO), VAR‑8.M.1 (EK), VAR‑8.M.2 (EK)

## Video transcript

- [Instructor] We're already familiar with the chi-squared statistic. If you're not, I encourage you to review the videos on that. We've already done some hypothesis testing with the chi-squared statistic, and we've even done some hypothesis testing based on two-way tables. Now we're going to extend that by thinking about a chi-squared test for association between two variables.

So let's say that we suspect that someone's foot length is related to their hand length, that these things are not independent. What we can do is set up a hypothesis test. Remember, the null hypothesis in a hypothesis test is to always assume no news. So what we could say here is that there is no association between foot and hand length. Another way to think about it is that they are independent, which is why this is often called a chi-squared test for independence. Our alternative hypothesis would be our suspicion: there is an association. So, foot and hand length
are not independent.

So what we can then do is go to a population and randomly sample it. Let's say we randomly sample 100 folks. For all of those 100 folks, we figure out whether their right hand is longer, their left hand is longer, or both hands are the same. We also do that for the feet, and we tabulate all of the data. And this is the data that we actually get.

Now it's worth thinking for a second about how what we just did is different from a chi-squared test for homogeneity. In a chi-squared test for homogeneity, we sample from two different populations, or we look at two different groups, and we see whether the distribution of a certain variable among those two groups is the same. Here we are sampling from just one group, but we're thinking about two different variables for that one group: foot length and hand length.

And so you can see here,
that 11 folks had both their right hand longer and their right foot longer. Three folks had their right hand longer, but their left foot was longer. And eight folks had their right hand longer, but both feet were the same. Likewise, we had nine people whose left foot and left hand were longer, but two people whose left hand was longer while their right foot was longer. And we can go through all of these.

But to do our chi-squared test, we have to ask: what would be the expected value of each of these data points if we assumed that the null hypothesis was true, that there was no association between foot and hand length? To help us do that, I'm going to make a total of our columns here, and also a total of our rows here. Let me draw a line here,
so we know what's going on.

So what is the total number of people who had a longer right hand? Well, it's going to be 11 plus three plus eight, which is 22. The total number of people who had a longer left hand is two plus nine plus 14, which is 25. And the total number of people whose hands had the same length is 12 plus 13 plus 28, or 25 plus 28, which is 53. If I total this column, 22 plus 25 is 47, plus 53, we get 100, right over here.

Then if we total the number of people who had a longer right foot, 11 plus two plus 12 is 13 plus 12, which is 25. Longer left foot: three plus nine plus 13, that's also 25. For the people whose feet are the same length, we can either add these up, eight plus 14 plus 28, and get 50, or we could say, hey, 25 plus 25 plus what is 100? Well, that is going to be 50.

Now to figure out these expected values, remember, we're going to
figure out the expected values assuming that the null hypothesis is true, that is, assuming that foot length and hand length are independent variables. Well, if they are independent, which we are assuming, then our best estimate is that 22% have a longer right hand, and our best estimate is that 25% have a longer right foot. So out of 100, you would expect 0.22 times 0.25 times 100 to have a longer right hand and a longer right foot. I'm just multiplying the probabilities, which you would do if these were independent variables. And 0.22 times 0.25 times 100, let's see, one fourth of 22 is 5 1/2, so this is going to be equal to 5.5.

Now what number would you expect to have a longer right hand, but a longer left foot? That would be 0.22 times 0.25 times 100. Well, we already calculated what that would be: 5.5. And to figure out the expected number that would have a longer right hand but both feet the same length, we could multiply 22 out of 100 times 50 out of 100 times 100, which is going to be half of 22, which is equal to 11.

And we can keep going. This value right over here would be 0.25 times 0.25 times 100; 25 times 25 is 625, so that would be 6.25. This value right over here would also be 0.25 times 0.25 times 100, which is again 6.25. And this value right over here, there are a couple of ways we can get it. We can multiply 0.25 times 0.50 times 100, which gets us 12.5, or we could say that this plus this plus this has to equal 25, so this would be 12.5.

This next expected value we can figure out because 5.5 plus 6.25 plus this is going to equal 25. Let's see, 5.5 plus 6.25 is 11.75, and 11.75 plus 13.25 is equal to 25. Same thing over here: this would be 13.25, because 11.75 plus 13.25 is equal to 25. If we add these two together, we get 26.5. 26.5 plus what is equal to 53? Well, it'd be another 26.5.

Now once you figure out all
of your expected values, that's a good time to test your conditions. The first condition is that you took a random sample, so let's assume we did that. The second condition is that the expected value for every one of the data points has to be at least five. And we can see that all of our expected values are at least five. The actual data points we got do not have to be at least five. So it's okay that we got a two here, because the expected value here is five or larger. And the last condition is the independence condition: either we are sampling with replacement, or we have to feel comfortable that our sample size is no more than 10% of the population. Let's assume that holds as well.

So assuming we met all of those conditions, we are ready to calculate
our chi-squared statistic. What we're going to do is, for every data point, take the difference between the data point and the expected value, square it, and divide by the expected value. So 11 minus 5.5, squared, over 5.5; I did that one. Now I'll do this one: plus three minus 5.5, squared, over 5.5. Plus, now I'll do this one, eight minus 11, squared, over 11. Then I'll do this one: two minus 6.25, squared, over 6.25. And I'll keep doing it for all nine of these data points. I actually calculated this ahead of time to save some time. If you do this for all nine of the data points, you're going to get a chi-squared statistic of 11.942.

Now before we calculate the P-value, we have to think about what our degrees of freedom are. We have a three-by-three table here, so one way to think about it is the number of rows minus one, times the number of columns minus one. This is two times two, which is equal to four. Another way to think about it is that if you know four of these data points and you know the totals, then you can figure out the other five data points.

And so now we are ready
to calculate a P-value. You can do that using a calculator, or you can do it using a chi-squared table, but let's say we did it using a calculator, and we get a P-value of 0.018. Just to remind ourselves what this is: this is the probability of getting a chi-squared statistic at least this large, assuming the null hypothesis is true.

Next, we do what we always do with hypothesis testing: we compare this to our significance level. We actually should have set our significance level from the beginning, so let's just assume that when we set up our hypotheses, we also said that we want a significance level of 0.05. You really should do this before you calculate all of this. Then you compare your P-value to your significance level, and we see that this P-value is a good bit less than our significance level. One way to think about it is that we got all these expected values assuming that the null hypothesis was true, but the probability of getting a result this extreme or more extreme is less than 2%, which is lower than our significance level. So this leads us to reject our null hypothesis, and it suggests to us that there is an association between hand length and foot length.
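
The arithmetic in this walkthrough can be sketched in a few lines of Python. This is a minimal sketch and not part of the video: the function names (`expected_counts`, `chi_squared_statistic`, `chi_squared_sf_even_df`) are made up for illustration, the table layout is reconstructed from the counts read out above, and the P-value uses a closed-form upper-tail formula that only holds when the degrees of freedom are even (as they are here, with four).

```python
import math

# Observed counts as read out in the transcript (assumed layout).
# Rows: right hand longer, left hand longer, hands the same.
# Columns: right foot longer, left foot longer, feet the same.
observed = [
    [11, 3, 8],
    [2, 9, 14],
    [12, 13, 28],
]


def expected_counts(table):
    """Expected count per cell under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_squared_statistic(table, expected):
    """Sum of (observed - expected)^2 / expected over every cell."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def chi_squared_sf_even_df(x, df):
    """Upper-tail probability P(X >= x); this closed form needs an even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2
    terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * terms


expected = expected_counts(observed)
statistic = chi_squared_statistic(observed, expected)
df = (len(observed) - 1) * (len(observed[0]) - 1)  # (rows - 1) * (columns - 1)
p_value = chi_squared_sf_even_df(statistic, df)
alpha = 0.05  # significance level assumed in the transcript

# Large-counts condition: every expected count should be at least five.
large_counts_ok = all(e >= 5 for row in expected for e in row)

print(expected)
print(round(statistic, 3), df, round(p_value, 3), large_counts_ok)
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Run as written, this should reproduce the expected counts from the walkthrough along with the statistic of about 11.942, four degrees of freedom, and a P-value of about 0.018. If SciPy is available, `scipy.stats.chi2_contingency(observed)` should give the same statistic, P-value, degrees of freedom, and expected counts without the even-df restriction.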

AP® is a registered trademark of the College Board, which has not reviewed this resource.