© 2023 Khan Academy

# Pearson's chi square test (goodness of fit)

Sal uses the chi-square test to test the hypothesis that the owner's distribution is correct. Created by Sal Khan.

## Want to join the conversation?

- A few things were unclear to me here. First, is chi squared always calculated as a difference between expected and observed divided by expected? Where is the derivation or explanation of this?

Secondly, in what scenarios should we use chi-squared vs. other statistics? Is there a limit on the number of data points (or in other words, degrees of freedom) for this calculation? I think there should be more explanation of the use cases for this statistic and how it's calculated. (14 votes)

- I understand that if the chi-square value exceeds the appropriate minimum value in the chi-square distribution table, taking into account the degrees of freedom, you can reject the null hypothesis. (And that the reverse is also true: if the chi-square value does not exceed the appropriate minimum value in the chi-square distribution, you will accept the null hypothesis.) Can someone explain to me why this is? I do not understand the theory behind the minimum chi-square value well enough to understand **why** we reject when the chi-square value exceeds the value in the distribution table. (7 votes)
- The question you answer with the test can be rephrased like this: "If the shop owner's theory is right (i.e. what percentage of customers come each day), what is the probability of seeing the given observations (30 on Monday, 14 on Tuesday, etc.) or something more unlikely?"

This is the question you answer with the test, and you can calculate that probability exactly (or you can use tables). In this case it is just below 5%.

So, if the hypothesis is right and you make observations for a week, then there is an almost 5% chance that you see what you see or something even less likely.

The 5% significance criterion is a subjective choice. Some use 1%, some 5%, some 0.5%. If I generally trusted the shop owner, and knew that he had kept track of customers for a long period, and was a clever guy, then I would still believe his hypothesis. I mean, after all, 5% corresponds to 1/20 - it is not a veeery rare observation. On the other hand, if I knew the shop owner was sloppy with numbers, and had a tendency to lie, etc., then I would be more likely to reject the hypothesis on the basis of my observations. However, before I confronted him, I think I would observe another week to get more certain knowledge.

A lot of talking, sorry! My point here: you get the probability from the test. That is your result! What significance level to choose depends on the situation. Sometimes it might be life changing - if it was a test for some disease, I would never be satisfied with a 5% risk. Say the doctor tells me: there is only a 5% chance that you have that life-threatening disease, given the test result, so you can go home. Then I would ask for another test! But if I was in the line for a super discount offer on Black Friday, and a clever person had calculated that there was only a 5% chance that I would get the item before it sold out, then I would step out of the line immediately. It depends on consequences, risk, what I already know and many other things!

Long answer, hope it made sense! :-)(6 votes)

- Had we counted legs of visitors instead of visitors, and assuming each has two, our chi-squared would be twice as large for the same effective statistical question. It is thus incorrect to count legs. They indeed do not get odd values. It also seems that values which are not discrete, such as the amount of food eaten each day, will give different results depending on the physical units of mass. Is it thus also incorrect to use continuous random variables for chi-square statistics? (7 votes)
- This is an interesting question. I don't really know.

But I want to say that if, say, John has two legs, and his two legs make independent random choices (say one leg actually comes to the restaurant on Monday, the other on Friday), not always making the same decision, then it would make sense to count legs. If not, I don't think you can count legs, Jo. (6 votes)
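
The scaling effect raised in this thread can be checked numerically. The following is a minimal sketch in Python, using the observed counts and the owner's percentages from the video; the function name `chi_square_statistic` and the "legs" data are assumptions made only for this illustration.

```python
def chi_square_statistic(observed, proportions):
    """Sum of (observed - expected)^2 / expected over all categories."""
    total = sum(observed)
    expected = [p * total for p in proportions]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Observed customers Monday through Saturday and the owner's claimed shares (from the video).
visitors = [30, 14, 34, 45, 57, 20]
shares = [0.10, 0.10, 0.15, 0.20, 0.30, 0.15]

# Counting two legs per visitor doubles every count but adds no new information.
legs = [2 * v for v in visitors]

stat_visitors = chi_square_statistic(visitors, shares)
stat_legs = chi_square_statistic(legs, shares)
print(round(stat_visitors, 2), round(stat_legs, 2))
```

Doubling every count doubles the statistic, which is why the counts fed into the test should be counts of independent observations (visitors) and not of things that always arrive together (legs).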

- How do we get the formula $\chi^2 = \sum \frac{(\text{observed frequency} - \text{expected frequency})^2}{\text{expected frequency}}$? (7 votes)
- what is a brain fitness test?(5 votes)
- Can someone recommend a statistics book that is written in plain English, in a way that a novice like me can understand and apply in real, practical situations? It would be even better if it could give me some insight and an intuitive feeling for why statistical tests are the way they are. Thanks a lot! (4 votes)
- I've never read through it myself, but I've heard a lot of people say great things about "The Cartoon Guide to Statistics", by Gonick and Smith. I don't think it has any actual examples, but it is so strong on the plain English and intuition that you might be able to make better sense of your statistics textbook afterwards. (3 votes)

- Are the charts mentioned at the 9:00 mark something that is given, or would I have to fill them in myself while solving the problem? If so, how would I go about doing that? (2 votes)
- You do not memorize the values. If you are doing homework/classwork, the tables are provided at the end of the book. On a test, the table should be attached to the test or the teacher should allow you to look up the critical value for a specified level of significance and degrees of freedom using the text. I hope this helps.(5 votes)

- Just food for thought.

Using a calculator and using chi2_Cdf[11.44, infinity, 5] (the chance of the statistic being 11.44 or higher) results in a 4.33% chance. And if you think about it, if 11.07 gives 0.05, a higher number (namely 11.44) results in a lower p-value.

So at what point should we start approximating? Because in this example that fact results in a different answer. (1 vote)
- I'm not sure what the problem is, if in fact you're trying to point out a problem. It is natural that a higher chi-square value (in your case, 11.44) has a lower p-value than, say, 11.07. This is because the p-value is the probability, assuming the null hypothesis is true, of getting a statistic at least as extreme as the one you calculated purely by chance. In other words, you are taking the area to the right of the value, and since the density is never negative, that area can only shrink as you move the left bound further to the right. (4 votes)
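
The two tail areas discussed in this thread can be reproduced without a table. The sketch below is one way to do it in Python with only the standard library, assuming the closed-form expression for the chi-square right tail that holds for odd degrees of freedom; the function name `chi_square_sf_df5` is hypothetical and the function only covers the 5-degrees-of-freedom case from the video.

```python
import math

def chi_square_sf_df5(x):
    """Right-tail probability P(X >= x) for a chi-square variable with 5 degrees of freedom.

    For odd degrees of freedom k the tail has the closed form
    erfc(sqrt(x/2)) + exp(-x/2) * sum_{j=1}^{(k-1)/2} (x/2)^(j - 1/2) / Gamma(j + 1/2).
    Here k = 5, so j runs over 1 and 2.
    """
    half = x / 2.0
    tail = math.erfc(math.sqrt(half))
    series = sum(half ** (j - 0.5) / math.gamma(j + 0.5) for j in (1, 2))
    return tail + math.exp(-half) * series

p_statistic = chi_square_sf_df5(11.44)  # statistic computed in the video
p_critical = chi_square_sf_df5(11.07)   # critical value read from the table
print(round(p_statistic, 4), round(p_critical, 4))
```

The first value should come out at roughly 4.3% and the second at roughly 5%, which matches the numbers quoted above: moving the cutoff from 11.07 to 11.44 leaves less area to its right.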

- I want to know more.

I'm interested in goodness-of-fit tests for the Poisson & Normal distributions.

I hope there is a video about this, thanks.

:D (1 vote)
- That's a great question! Think about the Poisson distribution for a bit: we have only non-negative integer values, right? So we'd hypothesize a mean, and then use that hypothesized value to calculate the probability of 0 events, 1 event, 2 events, 3 events, etc. Then we'd sort out how many times we observed a 0, or a 1, or a 2, and so on. In this way, we'd have our hypothesized probabilities and the observed values. From there we can pick up pretty much where the video starts, or at about 02:28 if you want to skip some of the initial explanation.

For the Normal distribution, the process is largely the same. You first hypothesize the distribution (normal with specified mean and standard deviation). But since the Normal distribution is continuous, you need to define bins for your random variable, such as 0-1, 1-2, 2-3, 3-4, etc., and then calculate the probability of those bins using the hypothesized mean and standard deviation. (3 votes)
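
The Poisson procedure described in this reply could be sketched as follows. This is a minimal Python illustration, not a full treatment: the observed counts and the hypothesized mean of 1.6 are made-up values, the helper `poisson_pmf` is defined here for the example, and pooling everything from 4 events upward into one last bin is an assumption made to keep the list of categories finite.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k events under a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

# Hypothetical data: number of intervals in which 0, 1, 2, 3, or 4-or-more events occurred.
observed = [18, 33, 27, 14, 8]
hypothesized_mean = 1.6  # assumed value, chosen only for illustration

# Probabilities for 0..3 events; the last bin pools 4 and above so the probabilities sum to 1.
probs = [poisson_pmf(k, hypothesized_mean) for k in range(4)]
probs.append(1.0 - sum(probs))

total = sum(observed)
expected = [p * total for p in probs]

# Same statistic as in the video: sum of (observed - expected)^2 / expected.
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# The mean was specified in advance, not estimated from the data,
# so the degrees of freedom are the number of bins minus 1.
degrees_of_freedom = len(observed) - 1
print([round(e, 1) for e in expected], round(statistic, 2), degrees_of_freedom)
```

For the Normal case the structure would be the same, with the bin probabilities taken from the hypothesized normal distribution over each interval instead of from `poisson_pmf`.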

- OK. But my problem is that I don't believe the answer: When you compare the observed and the expected, they seem to be "pretty close". For example, the order of the days is the same (Friday being the heaviest, Tuesday the Lightest, and matching closely in between). Admittedly, my eyeballed "pretty close" is hardly a scientific test, and there is more to fit than biggest-to-smallest arrangement. But still, something seems to be off, because typically what happens is the other way around: the numbers look crappy (they do not appear to fit the expected distribution), but come to find out, they do. In this case... if I were to see the provided data and the expected... It seems so fitting, I wouldn't even test it!(1 vote)
- Here's the thing about hypothesis tests, including the chi-square test: they don't always lend themselves to 'eyeballing.' Sometimes, sure, you can look at the data (or the observed vs expected) and see that the null hypothesis will be rejected. The same thing is true of a lot of procedures. Sometimes the descriptive statistics are clear enough that we can anticipate the results.

But it's not always so. And the formal hypothesis test provides an objective approach based on probabilities to make the decision.

In this case, while there's not one really big mis-fit, there are multiple smaller ones. Those smaller ones, taken together, are enough to make us think that the owner's hypothesized distribution is wrong. It might not be extremely wrong, but there's enough evidence to make us not believe it.

The owner over-estimated Tuesday and Saturday, while underestimating Monday, Wednesday, and Thursday. So he gave us this distribution:

10 10 15 20 30 15

Maybe in reality it should have been:

15 8 15 22 29 11

Or maybe like this:

12 8 17 23 29 11

The point is, the hypothesized distribution doesn't have to be *radically* different from the true one, it just has to be different *enough*. (3 votes)
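
To see where the misfit comes from, it can help to look at each day's share of the statistic separately. The sketch below, in Python, uses the observed and expected counts from the video; the variable names are assumptions made for this example.

```python
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
observed = [30, 14, 34, 45, 57, 20]
expected = [20, 20, 30, 40, 60, 30]  # owner's percentages applied to 200 customers

# Each day's contribution is (observed - expected)^2 / expected.
contributions = {
    day: (o - e) ** 2 / e
    for day, o, e in zip(days, observed, expected)
}

for day, value in contributions.items():
    print(f"{day}: {value:.2f}")
print(f"total: {sum(contributions.values()):.2f}")
```

On these numbers Monday and Saturday account for most of the total, with the remaining days each adding a smaller amount; it is the sum that gets compared with the critical value.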

## Video transcript

I'm thinking about
buying a restaurant, so I go and ask
the current owner, what is the distribution
of the number of customers you get each day? And he says, oh, I've
already figured that out. And he gives me
this distribution over here, which essentially
says 10% of his customers come in on Monday, 10% on
Tuesday, 15% on Wednesday, so forth, and so on. They're closed on Sunday. So this is 100% of the
customers for a week. If you add that
up, you get 100%. I obviously am a
little bit suspicious, so I decide to see how good
this distribution that he's describing actually
fits observed data. So I actually observe the number
of customers, when they come in during the week,
and this is what I get from my observed data. So to figure out whether
I want to accept or reject his hypothesis right
here, I'm going to do a little bit
of a hypothesis test. So I'll make the null hypothesis
that the owner's distribution-- so that's this thing
right here-- is correct. And then the
alternative hypothesis is going to be that
it is not correct, that it is not a
correct distribution, that I should not feel
reasonably OK relying on this. It's not the correct--
I should reject the owner's distribution. And I want to do this with
a significance level of 5%. Or another way of
thinking about it, I'm going to calculate a
statistic based on this data right here. And it's going to be a
chi-square statistic. Or another way to view
it is that the statistic that I'm going to
calculate has approximately a chi-square distribution. And given that it does have
a chi-square distribution with a certain number
of degrees of freedom and we're going to calculate
that, what I want to see is whether the probability of
getting this result, or getting a result like
this or a result more extreme, is less than 5%. If the probability of getting
a result like this or something less likely than
this is less than 5%, then I'm going to reject
the null hypothesis, which is essentially just rejecting
the owner's distribution. If I don't get
that, if I say, hey, the probability of getting
a chi-square statistic that is this extreme or more
is greater than my alpha, than my significance level,
then I'm not going to reject it. I'm going to say,
well, I have no reason to really assume
that he's lying. So let's do that. So to calculate the chi-square
statistic, what I'm going to do is-- so here we're assuming
the owner's distribution is correct. So assuming the
owner's distribution was correct, what would have
been the expected observed? So we have expected
percentage here, but what would have been
the expected observed? So let me write this right here. Expected. I'll add another row, Expected. So we would have expected
10% of the total customers in that week to
come in on Monday, 10% of the total
customers of that week to come in on Tuesday, 15%
to come in on Wednesday. Now to figure out what
the actual number is, we need to figure out the
total number of customers. So let's add up these
numbers right here. So we have-- I'll get
the calculator out. So we have 30 plus 14 plus
34 plus 45 plus 57 plus 20. So there's a total
of 200 customers who came into the
restaurant that week. So let me write this down. So this is equal to-- so I
wrote the total over here. Ignore this right here. I had 200 customers
come in for the week. So what was the expected
number on Monday? Well, on Monday, we would
have expected 10% of the 200 to come in. So this would have been 20
customers, 10% times 200. On Tuesday, another 10%. So we would have
expected 20 customers. Wednesday, 15% of 200,
that's 30 customers. On Thursday, we would have
expected 20% of 200 customers, so that would have
been 40 customers. Then on Friday, 30%, that
would have been 60 customers. And then on Saturday, 15% again. 15% of 200 would have
been 30 customers. So if this distribution
is correct, this is the actual number
that I would have expected. Now to calculate
the chi-square statistic, we essentially just take--
let me just show it to you, and instead of
writing chi, I'm going to write capital X squared. Sometimes someone will write the
actual Greek letter chi here. But I'll write the
x squared here. And let me write it this way. This is our
chi-square statistic, but I'm going to write it with
a capital X instead of a chi because this is going
to have approximately a chi-squared distribution. I can't assume
that it's exactly, so this is where we're dealing
with approximations right here. But it's fairly
straightforward to calculate. For each of the days,
we take the difference between the observed
and expected. So it's going to
be 30 minus 20-- I'll do the first one
color coded-- squared divided by the expected. So we're essentially
taking the square of what you could kind of
call the error between what we observed and expected, or
the difference between what we observed and expected, and
we're kind of normalizing it by the expected right over here. But we want to take the
sum of all of these. So I'll just do all
of those in yellow. So plus 14 minus 20 squared
over 20 plus 34 minus 30 squared over 30 plus-- I'll continue
over here-- 45 minus 40 squared over 40 plus 57 minus
60 squared over 60, and then finally, plus 20
minus 30 squared over 30. I just took the observed
minus the expected squared over the expected. I took the sum of
it, and this is what gives us our
chi-square statistic. Now let's just calculate what
this number is going to be. So this is going to be equal
to-- I'll do it over here so I don't run out of space. So we'll do this in a new color. We'll do it in orange. This is going to be
equal to 30 minus 20 is 10 squared, which is 100
divided by 20, which is 5. I might not be able to do all
of them in my head like this. Plus, actually, let me
just write it this way just so you can
see what I'm doing. This right here is 100
over 20 plus 14 minus 20 is negative 6 squared
is positive 36. So plus 36 over 20. Plus 34 minus 30 is
4, squared is 16. So plus 16 over 30. Plus 45 minus 40
is 5 squared is 25. So plus 25 over 40. Plus the difference
here is 3 squared is 9, so it's 9 over 60. Plus we have a difference of
10 squared is plus 100 over 30. And this is equal to-- and I'll
just get the calculator out for this-- this is
equal to, we have 100 divided by 20
plus 36 divided by 20 plus 16 divided by 30
plus 25 divided by 40 plus 9 divided by 60 plus 100
divided by 30 gives us 11.44. So let me write that down. So this right here
is going to be 11.44. This is my chi-square
statistic, or we could call it a big
capital X squared. Sometimes you'll have it
written as a chi-square, but this statistic is
going to have approximately a chi-square distribution. Anyway, with that
said, let's figure out, if we assume that it has roughly
a chi-square distribution, what is the probability of getting a
result this extreme or at least this extreme, I guess is another
way of thinking about it. Or another way of saying, is
this a more extreme result than the critical
chi-square value that there's a 5% chance of
getting a result that extreme? So let's do it that way. Let's figure out the
critical chi-square value. And if this is more
extreme than that, then we will reject
our null hypothesis. So let's figure out our
critical chi-square values. So we have an alpha of 5%. And actually the other
thing we have to figure out is the degrees of freedom. The degrees of freedom, we're
taking one, two, three, four, five, six sums, so
you might be tempted to say the degrees
of freedom are six. But one thing to
realize is that if you had all of this
information over here, you could actually figure out
this last piece of information, so you actually have
five degrees of freedom. When you have just kind of
n data points like this, and you're measuring kind of
the observed versus expected, your degrees of freedom
are going to be n minus 1, because you could figure
out that nth data point just based on everything
else that you have, all of the other information. So our degrees of freedom
here are going to be 5. It's n minus 1. So our significance level is 5%. And our degrees of freedom is
also going to be equal to 5. So let's look at our
chi-square distribution. We have a degree
of freedom of 5. We have a significance
level of 5%. And so the critical
chi-square value is 11.07. So let's go with this chart. So we have a
chi-squared distribution with a degree of freedom of 5. So that's this distribution
over here in magenta. And we care about a
critical value of 11.07. So this is right here. Oh, you actually even
can't see it on this. So if I were to keep drawing
this magenta thing all the way over here, if the
magenta line just kept going, over here, you'd have 8. Over here you'd have 10. Over here, you'd have 12. 11.07 is maybe some
place right over there. So what it's saying
is the probability of getting a result at least
as extreme as 11.07 is 5%. So we could write it even here. Our critical chi-square value is
equal to-- we just saw-- 11.07. Let me look at the chart again. 11.07. The result we got
for our statistic is even less likely than that. The probability is less
than our significance level. So then we are going to reject. So the probability
of getting that is-- let me put it this
way-- 11.44 is more extreme than our
critical chi-square value. So a result this extreme would be very unlikely if
this distribution were true. So we will reject
what he's telling us. We will reject
this distribution. It's not a good fit based
on this significance level.
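
The whole test walked through in the transcript can be sketched in a few lines of Python. This is a minimal illustration under the video's setup: the counts, percentages, and the critical value of 11.07 (5 degrees of freedom, 5% significance level) are taken from the transcript, while the variable names are assumptions made for this example.

```python
# Observed customers per day and the owner's claimed share of weekly customers.
observed = {"Mon": 30, "Tue": 14, "Wed": 34, "Thu": 45, "Fri": 57, "Sat": 20}
claimed = {"Mon": 0.10, "Tue": 0.10, "Wed": 0.15, "Thu": 0.20, "Fri": 0.30, "Sat": 0.15}

# Expected counts if the owner's distribution is correct (the null hypothesis).
total = sum(observed.values())
expected = {day: share * total for day, share in claimed.items()}

# Chi-square statistic: sum over days of (observed - expected)^2 / expected.
statistic = sum(
    (observed[day] - expected[day]) ** 2 / expected[day] for day in observed
)

# Six categories whose counts must add up to the total leave 6 - 1 = 5 degrees of freedom.
degrees_of_freedom = len(observed) - 1

# Critical value as read from the chi-square table in the video (df = 5, alpha = 5%).
CRITICAL_VALUE = 11.07
reject_null = statistic > CRITICAL_VALUE

print(f"total={total}, statistic={statistic:.2f}, "
      f"df={degrees_of_freedom}, reject={reject_null}")
```

Hard-coding the critical value keeps the sketch free of external libraries; for other degrees of freedom or significance levels it would have to be looked up again.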