# Pearson's chi square test (goodness of fit)

Video transcript

I'm thinking about
buying a restaurant, so I go and ask
the current owner, what is the distribution
of the number of customers you get each day? And he says, oh, I've
already figured that out. And he gives me
this distribution over here, which essentially
says 10% of his customers come in on Monday, 10% on
Tuesday, 15% on Wednesday, so forth, and so on. They're closed on Sunday. So this is 100% of the
customers for a week. If you add that
up, you get 100%. I obviously am a
little bit suspicious, so I decide to see how well
this distribution that he's describing actually
fits observed data. So I actually observe the number
of customers, when they come in during the week,
and this is what I get from my observed data. So to figure out whether
I want to accept or reject his hypothesis right
here, I'm going to do a little bit
of a hypothesis test. So I'll make the null hypothesis
that the owner's distribution-- so that's this thing
right here-- is correct. And then the
alternative hypothesis is going to be that
it is not correct, that it is not a
correct distribution, that I should not feel
reasonably OK relying on this. It's not the correct--
I should reject the owner's distribution. And I want to do this with
a significance level of 5%. Or another way of
thinking about it, I'm going to calculate a
statistic based on this data right here. And it's going to be a
chi-square statistic. Or another way to view
it is that the statistic that I'm going to
calculate has approximately a chi-square distribution. And given that it does have
a chi-square distribution with a certain number
of degrees of freedom and we're going to calculate
that, what I want to see is whether the probability of
getting this result, or getting a result like
this or a result more extreme, is less than 5%. If the probability of getting
a result like this or something less likely than
this is less than 5%, then I'm going to reject
the null hypothesis, which is essentially just rejecting
the owner's distribution. If I don't get
that, if I say, hey, the probability of getting
a chi-square statistic that is this extreme or more
is greater than my alpha, than my significance level,
then I'm not going to reject it. I'm going to say,
well, I have no reason to really assume
that he's lying. So let's do that. So to calculate the chi-square
statistic, what I'm going to do is-- so here we're assuming
the owner's distribution is correct. So assuming the
owner's distribution was correct, what would have
been the expected counts? So we have the expected
percentages here, but what would have been
the expected number of customers? So let me write this right here. Expected. I'll add another row, Expected. So we would have expected
10% of the total customers in that week to
come in on Monday, 10% of the total
customers of that week to come in on Tuesday, 15%
to come in on Wednesday. Now to figure out what
the actual number is, we need to figure out the
total number of customers. So let's add up these
numbers right here. So we have-- I'll get
the calculator out. So we have 30 plus 14 plus
34 plus 45 plus 57 plus 20. So there's a total
of 200 customers who came into the
restaurant that week. So let me write this down. So this is equal to-- so I
wrote the total over here. Ignore this right here. I had 200 customers
come in for the week. So what was the expected
number on Monday? Well, on Monday, we would
have expected 10% of the 200 to come in. So this would have been 20
customers, 10% times 200. On Tuesday, another 10%. So we would have
expected 20 customers. Wednesday, 15% of 200,
that's 30 customers. On Thursday, we would have
expected 20% of 200 customers, so that would have
been 40 customers. Then on Friday, 30%, that
would have been 60 customers. And then on Saturday, 15% again. 15% of 200 would have
been 30 customers. So if this distribution
is correct, this is the actual number
that I would have expected. Now to calculate the
chi-square statistic, we essentially just take--
let me just show it to you, and instead of
writing chi, I'm going to write capital X squared. Sometimes someone will write the
actual Greek letter chi here. But I'll write the
x squared here. And let me write it this way. This is our
chi-square statistic, but I'm going to write it with
a capital X instead of a chi because this is going
to have approximately a chi-squared distribution. I can't assume
that it's exactly chi-squared, so this is where we're dealing
with approximations right here. But it's fairly
straightforward to calculate. For each of the days,
we take the difference between the observed
and expected. So it's going to
be 30 minus 20-- I'll do the first one
color coded-- squared divided by the expected. So we're essentially
taking the square of what you could call
the error between what we observed and expected, that is,
the difference between what we observed and expected, and
we're kind of normalizing it by the expected right over here. But we want to take the
sum of all of these. So I'll just do all
of those in yellow. So plus 14 minus 20 squared
over 20 plus 34 minus 30 squared over 30 plus-- I'll continue
over here-- 45 minus 40 squared over 40 plus 57 minus
60 squared over 60, and then finally, plus 20
minus 30 squared over 30. I just took the observed
minus the expected squared over the expected. I took the sum of
it, and this is what gives us our
chi-square statistic. Now let's just calculate what
this number is going to be. So this is going to be equal
to-- I'll do it over here so I don't run out of space. So we'll do this in a new color. We'll do it in orange. This is going to be
equal to 30 minus 20 is 10 squared, which is 100
divided by 20, which is 5. I might not be able to do all
of them in my head like this. Plus, actually, let me
just write it this way just so you can
see what I'm doing. This right here is 100
over 20 plus 14 minus 20 is negative 6 squared
is positive 36. So plus 36 over 20. Plus 34 minus 30 is
4, squared is 16. So plus 16 over 30. Plus 45 minus 40
is 5 squared is 25. So plus 25 over 40. Plus the difference
here is 3 squared is 9, so it's 9 over 60. Plus we have a difference of
10 squared is plus 100 over 30. And this is equal to-- and I'll
just get the calculator out for this-- this is
equal to, we have 100 divided by 20
plus 36 divided by 20 plus 16 divided by 30
plus 25 divided by 40 plus 9 divided by 60 plus 100
divided by 30 gives us 11.44. So let me write that down. So this right here
is going to be 11.44. This is my chi-square
statistic, or we could call it a big
capital X squared. Sometimes you'll have it
written as a chi-square, but this statistic is
going to have approximately a chi-square distribution. Anyway, with that
said, let's figure out, if we assume that it has roughly
a chi-square distribution, what is the probability of getting a
result this extreme or at least this extreme, I guess is another
way of thinking about it. Or another way of asking it: is
this a more extreme result than the critical
chi-square value, the value for which there's a 5% chance of
getting a result at least that extreme? So let's do it that way. Let's figure out the
critical chi-square value. And if this is more
extreme than that, then we will reject
our null hypothesis. So let's figure out our
critical chi-square values. So we have an alpha of 5%. And actually the other
thing we have to figure out is the degrees of freedom. The degrees of freedom, we're
taking one, two, three, four, five, six sums, so
you might be tempted to say the degrees
of freedom are six. But one thing to
realize is that, because the counts have to add up to the total, if you had all of this
information over here, you could actually figure out
this last piece of information, so you actually have
five degrees of freedom. When you have just kind of
n data points like this, and you're measuring kind of
the observed versus expected, your degrees of freedom
are going to be n minus 1, because you could figure
out that nth data point just based on everything
else that you have, all of the other information. So our degrees of freedom
here are going to be 5. It's n minus 1. So our significance level is 5%. And our degrees of freedom is
also going to be equal to 5. So let's look at our
chi-square distribution. We have a degree
of freedom of 5. We have a significance
level of 5%. And so the critical
chi-square value is 11.07. So let's go with this chart. So we have a
chi-squared distribution with a degree of freedom of 5. So that's this distribution
over here in magenta. And we care about a
critical value of 11.07. So this is right here. Oh, you actually
can't even see it on this. So if I were to keep drawing
this magenta thing all the way over here, if the
magenta line just kept going, over here, you'd have 8. Over here you'd have 10. Over here, you'd have 12. 11.07 is maybe some
place right over there. So what it's saying
is the probability of getting a result at least
as extreme as 11.07 is 5%. So we could write it even here. Our critical chi-square value is
equal to-- we just saw-- 11.07. Let me look at the chart again. 11.07. The result we got
for our statistic is even less likely than that. The probability is less
than our significance level. So then we are going to reject. So the probability
of getting that is-- let me put it this
way-- 11.44 is more extreme than our
critical chi-square value. So it's unlikely that we would get data like this if
this distribution were true. So we will reject
what he's telling us. We will reject
this distribution. It's not a good fit based
on this significance level.
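
The whole calculation from this video can be sketched in a few lines of Python. This is only a minimal sketch using the standard library: the variable names are made up for illustration, and the critical value of 11.07 is simply copied from the chi-square table used in the video rather than computed.

```python
# Observed customer counts, Monday through Saturday (from the video).
observed = [30, 14, 34, 45, 57, 20]

# The owner's claimed percentage of weekly customers for each day.
claimed_percent = [10, 10, 15, 20, 30, 15]

# Total customers observed during the week.
total = sum(observed)

# Expected count for each day if the owner's distribution is correct.
expected = [percent * total / 100 for percent in claimed_percent]

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom: number of categories minus 1.
degrees_of_freedom = len(observed) - 1

# Critical value for a 5% significance level and 5 degrees of freedom.
# Assumption: copied from the table in the video, not computed here.
critical_value = 11.07

# Reject the null hypothesis if the statistic is more extreme
# than the critical value.
reject_null = chi_square > critical_value

print(f"total customers: {total}")
print(f"expected counts: {expected}")
print(f"chi-square statistic: {chi_square:.2f}")
print(f"degrees of freedom: {degrees_of_freedom}")
print(f"reject owner's distribution: {reject_null}")
```

Running this gives a statistic of about 11.44 with 5 degrees of freedom, which is above 11.07, so it reaches the same decision as the video: reject the owner's distribution at the 5% significance level.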