# Pearson's chi square test (goodness of fit)

## Video transcript

I'm thinking about buying a restaurant, so I go and ask the current owner, "What is the distribution of the number of customers you get each day?" He says, "Oh, I've already figured that out," and he gives me this distribution over here, which essentially says that 10% of his customers come in on Monday, 10% on Tuesday, 15% on Wednesday, and so on. They're closed on Sunday, so this is 100% of their customers for a week; if you add these up, you get 100%.

I'm obviously a little bit suspicious, so I decide to see how well the distribution he's describing actually fits observed data. I observe the number of customers that come in during the week, and this is what I get for my observed data.

To figure out whether I want to accept or reject his hypothesis, I'm going to do a little bit of a hypothesis test. I'll make the null hypothesis that the owner's distribution, this thing right here, is correct. The alternative hypothesis is going to be that it is not correct: that I should not feel reasonably okay relying on it, and that I should reject the owner's distribution. I want to do this with a significance level of 5%.

Another way of thinking about it: I'm going to calculate a statistic based on this data right here, and it's going to be a chi-square statistic. Or, another way to view it, the statistic I'm going to calculate has approximately a chi-square distribution, with a certain number of degrees of freedom that we're going to calculate. What I want to see is whether the probability of getting a result like this, or a result more extreme, is less than 5%. If it is, I'm going to reject the null hypothesis, which is essentially just rejecting the owner's distribution. If the probability of getting a chi-square statistic that is this extreme or more is greater than my alpha, my significance level, then I'm not going to reject it. I'm saying, well, I have no reason to really assume that he's lying.

So let's do that. To calculate the chi-square statistic, we assume the owner's distribution is correct. Assuming it is correct, what would the expected counts have been? We have the expected percentages here, but what would the expected numbers of customers have been? Let me add another row: expected. We would have expected 10% of the total customers in that week to come in on Monday, 10% on Tuesday, and 15% on Wednesday. To figure out what those actual numbers are, we need the total number of customers, so let's add up these numbers right here. We have 30 plus 14 plus 34 plus 45 plus 57 plus 20, so a total of 200 customers came into the restaurant that week. I'll write the total over here.

So what was the expected number on Monday? We would have expected 10% of the 200 to come in, so that would have been 20 customers. On Tuesday, another 10%, so we would have expected 20 customers. Wednesday, 15% of 200, that's 30 customers. On Thursday we would have expected 20% of 200, so that would have been 40 customers. On Friday, 30%, that would have been 60 customers. And then on Saturday, 15% again, and 15% of 200 would have been 30 customers. So if this distribution is correct, these are the numbers I would have expected.

Now, to calculate our chi-square test statistic, let me just show it to you. Instead of writing chi, I'm going to write a capital X squared. Sometimes someone will write the actual Greek letter chi here, but I'll write it with a capital X because this statistic has only approximately a chi-square distribution; I can't assume that it's exactly chi-square. So we're dealing with approximations right here, but it's fairly straightforward to calculate. For each of the days, we take the difference between the observed and the expected, so 30 minus 20 for the first one, squared, divided by the expected. We're essentially taking the square of what you could view as the error between what we observed and what we expected, and we're normalizing it by the expected. Then we take the sum of all of these: 30 minus 20 squared over 20, plus 14 minus 20 squared over 20, plus 34 minus 30 squared over 30, plus 45 minus 40 squared over 40, plus 57 minus 60 squared over 60, and finally plus 20 minus 30 squared over 30. I just took the observed minus the expected, squared, over the expected, and took the sum. That is what gives us our chi-square statistic.

Now let's calculate what this number is going to be. 30 minus 20 is 10, squared is 100, so the first term is 100 over 20. 14 minus 20 is negative 6, squared is positive 36, so plus 36 over 20. 34 minus 30 is 4, squared is 16, so plus 16 over 30. 45 minus 40 is 5, squared is 25, so plus 25 over 40. The next difference is 3, squared is 9, so plus 9 over 60. And then we have a difference of 10, squared is 100, so plus 100 over 30. I'll get the calculator out for this: 100 divided by 20, plus 36 divided by 20, plus 16 divided by 30, plus 25 divided by 40, plus 9 divided by 60, plus 100 divided by 30 gives us 11.44. So this is my chi-square statistic, or we could call it a capital X squared. Sometimes you'll see it written as chi-square, but this statistic has only approximately a chi-square distribution.

With that said, if we assume that it has roughly a chi-square distribution, what is the probability of getting a result at least this extreme? Another way to ask it: is this a more extreme result than the critical chi-square value, the value for which there's a 5% chance of getting a result that extreme? Let's do it that way. Let's figure out the critical chi-square value, and if our statistic is more extreme than that, we will reject our null hypothesis.

We have an alpha of 5%, and the other thing we have to figure out is the degrees of freedom. Here we're taking one, two, three, four, five, six terms in the sum, so you might be tempted to say the degrees of freedom are 6. But one thing to realize is that if you had all of this information over here, you could figure out this last piece of information, so you actually have 5 degrees of freedom. When you have n data points like this and you're comparing the observed with the expected, your degrees of freedom are going to be n minus 1, because you could figure out that nth data point from all of the other information. So our significance level is 5% and our degrees of freedom is 5.

Let's look at our chi-square table. With 5 degrees of freedom and a significance level of 5%, the critical chi-square value is 11.07. Now let's go to this chart. A chi-square distribution with 5 degrees of freedom is this distribution over here in magenta, and we care about a critical value of 11.07. We can't even see it on this chart. If the magenta line just kept going over here, you'd have 8 over here, 10 over here, 12 over here, and 11.07 is maybe someplace right over there. What it's saying is that the probability of getting a result at least as extreme as 11.07 is 5%.

So our critical chi-square value is 11.07, and the result we got for our statistic is even less likely than that. 11.44 is more extreme than our critical chi-square value, so the probability of getting a result like ours is less than our significance level. Data like this would be very unlikely if this distribution were true, so we will reject what he's telling us. We will reject this distribution; it's not a good fit at this significance level.
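
The calculation walked through above can be reproduced in a few lines of Python. This is a minimal sketch rather than anything shown in the video: the variable names and the helper `chi2_sf_df5` are made up for illustration, the percentages and observed counts are the ones read out in the transcript (Monday through Saturday, since the restaurant is closed on Sunday), and the critical value 11.07 is simply copied from the table lookup. The helper relies on the standard closed form for the upper tail of a chi-square distribution with exactly 5 degrees of freedom, so it is an assumption that would not carry over to other degrees of freedom.

```python
import math

# Owner's claimed share of weekly customers, in percent (closed on Sunday).
claimed_percent = {"Mon": 10, "Tue": 10, "Wed": 15, "Thu": 20, "Fri": 30, "Sat": 15}

# Customers actually observed on each day of one week.
observed = {"Mon": 30, "Tue": 14, "Wed": 34, "Thu": 45, "Fri": 57, "Sat": 20}

total = sum(observed.values())

# Expected count per day if the owner's distribution were correct.
expected = {day: pct * total / 100 for day, pct in claimed_percent.items()}

# Chi-square statistic: sum over days of (observed - expected)^2 / expected.
statistic = sum(
    (observed[day] - expected[day]) ** 2 / expected[day] for day in observed
)

# n categories give n - 1 degrees of freedom.
degrees_of_freedom = len(observed) - 1

# Critical value for alpha = 0.05 and 5 degrees of freedom, taken from the table.
critical_value = 11.07


def chi2_sf_df5(x):
    """Upper-tail probability P(X >= x) for a chi-square variable with 5 df.

    Hypothetical helper: this closed form is valid for 5 degrees of freedom only.
    """
    return math.erfc(math.sqrt(x / 2)) + (
        math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * (1 + x / 3)
    )


p_value = chi2_sf_df5(statistic)

# Reject the null hypothesis when the statistic exceeds the critical value.
reject_null = statistic > critical_value

print(f"total customers: {total}")
print(f"expected counts: {expected}")
print(f"chi-square statistic: {statistic:.2f} with {degrees_of_freedom} df")
print(f"p-value: {p_value:.4f}")
print(f"reject owner's distribution at 5%: {reject_null}")
```

Comparing the statistic with the critical value and comparing the p-value with 0.05 are two views of the same decision: the statistic of about 11.44 lies beyond 11.07, which corresponds to a tail probability a little over 0.04, below the 5% significance level.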