
© 2023 Khan Academy

# Chi-square distribution introduction

Chi-Square Distribution Introduction. Created by Sal Khan.

## Want to join the conversation?

- I always love Sal's videos and they help me so much, but this one confused me even further than I already was. I know it says Chi-Square introduction, but could we get an intro for the intro? Or am I the only one lost with this?!(95 votes)
- You may find it helpful to go back and watch the introductory videos for the normal distribution. If they are too advanced it may be because you don't have a good grasp on probability notation, in which case you need to go back further and watch the introductions to probability including expected values and variance.(0 votes)

- I thought degrees of freedom are n-1.(23 votes)
- The degrees of freedom vary depending on the constraints on your data.

In the "Pearson's chi square test" video, the degree of freedom is indeed n-1. This is because the total number of customers at the restaurant is fixed for the observed and the expected values. So if we know n-1 of the data points (in this case customers on a given day of the week) then we can figure out the last data point because we know the fixed total.

In "Contingency table chi-square test", the degree of freedom is completely different because our data is measured along two axes (type of drug and cured/sick). For a contingency table, the degree of freedom is (r - 1) * (c - 1) where r is the number of rows and c is the number of columns.

In this video, a chi-square random variable is constructed out of n independent normal random variables. There are no constraints placed on these variables, so knowing n-1 doesn't mean you know the last one. So in this case, the degree of freedom is just n.(19 votes)

- Is the chi-square test still valid if we are not sure whether the "random variable" is subject to gaussian distribution?(11 votes)
- I'm pretty sure it has to be a normal distribution. You would have to run a separate test (probability plot, etc.) to determine normality before running a chi-square test.(5 votes)

- What is the y axis on the Chi-Square graph?(10 votes)
- The chi-square graph in the video plots the value of the probability density function (y-axis) against the value of the chi-squared variable (x-axis) for several degrees-of-freedom values. It is important to remind ourselves that the y-axis of a probability density function graph does not represent the probability of each individual value. Rather, the area under the curve over a range of values gives the probability of landing in that range.

Hope this helps..(7 votes)
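To make the density-versus-area point concrete, here is a minimal Python sketch. It assumes the standard chi-square density formula and, for 1 degree of freedom, the identity P(Q <= a) = P(|Z| <= sqrt(a)); the function names are illustrative and do not come from the video.

```python
import math

def chi2_pdf(x, k):
    """Chi-square density with k degrees of freedom, evaluated at x > 0."""
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))

def chi2_cdf_df1(a):
    """P(Q <= a) for k = 1, using Q = Z^2 so that P(Q <= a) = P(|Z| <= sqrt(a))."""
    return math.erf(math.sqrt(a / 2))

# The height of the curve near 0 is larger than 1, so it cannot be a probability.
density_near_zero = chi2_pdf(0.05, 1)

# The area under the curve between 0 and 1 is a probability (about 0.683).
area_0_to_1 = chi2_cdf_df1(1.0)

print(density_near_zero, area_0_to_1)
```

The first number is a curve height and the second is an area, which is why only the second can be read as a probability.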

- My intuition for understanding the chi-square distribution is that while the sampling distribution of the sample means can be described with a normal distribution, the sampling distribution of sample variances can be described as a chi-square distribution (provided the population is normally distributed). Then I can say that the ratio of two chi-squares can be described with the F-distribution, the ratio of a normal and chi-square is "t" distributed, etc. I realize these statements may be overgeneralized, but how far off the mark am I in framing my understanding this way?(4 votes)
- Not very far off the mark at all. It depends on how precise you want to get, but if you're shooting for a general idea, you're right on the bulls-eye.

To get more technical:

- An F distribution is the ratio of two independent chi-square variables, each of which is divided by its respective degrees of freedom. So (C1/c1) / (C2/c2), where the capital letters are the random variables (RVs), and the lowercase letters are the degrees of freedom.

- A t-distribution is the ratio of a standard normal to the square root of an independent chi-square divided by its degrees of freedom, e.g., Z / sqrt( C/c ), where Z is a standard normal RV, C is the chi-square RV, and c is the degrees of freedom.

- The sample variance isn't *directly* chi-square distributed, but an interesting ratio of it is: (n-1)*s^2 / sigma^2, which has n-1 degrees of freedom. So, this ratio uses the *real* variance inside of it, but that's not really a problem, because one of the main uses of this is to calculate a t test statistic, and the sigma term will actually cancel out.(7 votes)
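For anyone who wants to check the last point numerically, here is a rough simulation sketch in Python. The sample size, mean, sigma, and number of trials are arbitrary assumptions chosen for illustration. A chi-square variable with k degrees of freedom has mean k and variance 2k, so with n = 5 the scaled sample variance should average about 4 with variance about 8.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n, mu, sigma = 5, 10.0, 2.0  # illustrative values, not from the video
trials = 20000

scaled = []
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    s2 = statistics.variance(sample)  # sample variance, divides by n - 1
    scaled.append((n - 1) * s2 / sigma ** 2)

mean_scaled = statistics.mean(scaled)      # expected to be near n - 1 = 4
var_scaled = statistics.pvariance(scaled)  # expected to be near 2 * (n - 1) = 8

print(mean_scaled, var_scaled)
```

Matching the first two moments does not prove the distribution is chi-square, but it is a quick sanity check on the n-1 degrees of freedom.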

- Which videos explain what degrees of freedom is ?(5 votes)
- Degrees of freedom indicate how many values are free to vary. Formally: the number of observations minus the number of estimated parameters (restrictions).

This is likely to not get you very far and there are no videos on it (on this site). I recommend searching the internet for more info(5 votes)

- Is it possible for this topic to be explained in... a simpler way?(3 votes)
- How does one come up with the "degrees of freedom" in order to confirm or deny the observed vs. expected samples? Are the "degrees of freedom" charts something that can be expected with a typical problem that needs to be solved, and how does one determine the degrees of freedom in the lab or field, in a real-world setting where they may be unknown?(1 vote)
- Degrees of freedom are the number of values that are "free" to vary depending on the parameter you are trying to estimate. Using sample variance to estimate population variance is a typical example used to illustrate the concept (and possibly the most appropriate given that you seem to be studying the chi-square distribution). Because all the residuals (distance of each data point from the sample mean) in a sample must add up to 0, you could figure out what the last data point must be if you know all the other ones. In that sense, that last data point isn't "free" to vary because it must be THE value that makes the residuals add to zero. This is where the "n - 1" degrees of freedom come from for sample variance and its corresponding chi-square distribution.

Finding the degrees of freedom is simply a matter of understanding the math and constraints underlying the parameters you are estimating. To my understanding, they aren't ever "unknown" in the field and are at most a simple calculation away. Some statistics have more than one degrees-of-freedom value (an example is the F-stat, which is a fraction whose numerator and denominator have separate degrees of freedom).(5 votes)
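The "residuals must add up to 0" constraint described above can be seen in a tiny Python sketch. The four data values are made up for illustration.

```python
data = [4.0, 7.0, 1.0, 8.0]               # hypothetical sample, n = 4
mean = sum(data) / len(data)              # 5.0
residuals = [x - mean for x in data]      # [-1.0, 2.0, -4.0, 3.0]

# Pretend we only know the first n - 1 residuals.
known = residuals[:-1]

# The last residual is forced by the constraint that all residuals sum to 0.
implied_last = -sum(known)                # 3.0

print(sum(residuals), implied_last, residuals[-1])
```

Only n - 1 = 3 of the residuals can be chosen freely; the fourth is determined, which is the sense in which the sample variance has n - 1 degrees of freedom.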

- At 1:50 Sal says that "we are essentially sampling from this standard normal distribution and then squaring whatever number you got". This is where I get confused. If we are sampling from this normal distribution, does that sample have a sample size n associated with it, and what is the relationship between that sample size n and the chi-square degrees of freedom referred to in the remainder of the video?(1 vote)
- If X is Normally distributed with mean=0 and sd=1, then X^2 is Chi-square distributed with df=1. Furthermore, if several Chi-square variables are independent, then if we add them together, we just need to add up their degrees of freedom.

So if X1, X2, and X3 are *all* Normally distributed with mean 0 and sd 1, and all three are independent, then X1^2 + X2^2 + X3^2 is Chi-square distributed with df=3.(3 votes)
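Here is a short simulation sketch of that construction in Python. The helper name `sample_chi2` and the number of draws are assumptions for illustration. Each draw squares three independent standard normal samples and adds them; the result is never negative and should average about 3, the degrees of freedom.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

def sample_chi2(k):
    """One draw of X1^2 + ... + Xk^2, with each Xi an independent standard normal."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))

draws = [sample_chi2(3) for _ in range(50000)]

mean_q3 = statistics.mean(draws)      # a chi-square with df = 3 has mean 3
var_q3 = statistics.pvariance(draws)  # and variance 2 * 3 = 6

print(min(draws), mean_q3, var_q3)
```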

- Right at the start, why do you square the variable X? The purpose of doing so should be explained.(1 vote)
- The sum or difference of two independent normally distributed variables is another normally distributed variable.

But by squaring and then adding, we remove the negative values, which makes the resulting variable not normally distributed (since it is squared now). This chi-squared distribution has some nice properties, such as describing the variances (always non-negative) of samples from normal distributions.(2 votes)

## Video transcript

In this video, we'll
just talk a little bit about what the chi-square
distribution is, sometimes called the chi-squared
distribution. And then in the next
few videos, we'll actually use it to
really test how well theoretical distributions
explain observed ones, or how good a fit
observed results are for theoretical
distributions. So let's just think
about it a little bit. So let's say I have
some random variables. And each of them
is an independent, standard normal
random variable. So let me just remind
you what that means. So let's say I have the random
variable X. If X is normally distributed, we
could write that X is a normal random
variable with a mean of 0 and a variance of 1. Or you could say that
the expected value of X is equal to 0, and
that the variance of our random variable
X is equal to 1. Or, just to visualize
it: when we take an instantiation
of this random variable, we're sampling from a
normal distribution, a standardized normal
distribution that looks like this. Mean of 0 and then a variance
of 1, which would also mean, of course, a standard
deviation of 1. So the variance and
the standard deviation
are both equal to 1. So a chi-square
distribution, if you just take one of these
random variables-- and let me define it this way. Let me define a new
random variable. Let me define a
new random variable Q that is equal to--
you're essentially sampling from this
standard normal distribution and then squaring
whatever number you got. So it is equal to this
random variable X squared. The distribution for this
random variable right here is going to be an example
of the chi-square distribution. Actually what we're going
to see in this video is that the chi-square, or the
chi-squared distribution is actually a set
of distributions depending on how
many squared variables you sum. Right now, we only have
one random variable that we're squaring. So this is just one
of the examples. And we'll talk more
about them in a second. So this right
here, this we could write that Q is a chi-squared
distributed random variable. Or that we could use
this notation right here. Q is-- we could
write it like this. So this isn't an X anymore. This is the Greek
letter chi, although it looks a lot like a curvy X. So
it's a member of chi-squared. And since we only
have one term over here-- we're only squaring
one independent, standard
normally distributed variable-- we say that this
only has 1 degree of freedom. And we write that over here. So this right here is
our degree of freedom. We have 1 degree of
freedom right over there. So let's call this Q1. Let's say I have
another random variable. Let's call this Q-- let me
do it in a different color. Let me do Q2 in blue. Let's say I have another
random variable, Q2, that is defined as-- let's say I
have one independent, standard, normally distributed variable. I'll call that X1. And I square it. And then I have another
independent, standard, normally distributed
variable, X2. And I square it. So you could imagine
both of these guys have distributions like this. And they're independent. So to sample Q2,
you essentially sample X1 from this distribution,
square that value, sample X2 from the same distribution,
essentially, square that value, and then add the two. And you're going to get Q2. This over here-- here we
would write-- so this is Q1. Q2 here, Q2 we would
write is a chi-squared distributed random variable
with 2 degrees of freedom. Right here. 2 degrees of freedom. And just to visualize
kind of the set of chi-squared distributions,
let's look at this over here. So this, I got this
off of Wikipedia. This shows us some of the
probability density functions for some of the
chi-square distributions. This first one over here,
for k equal to 1, where k is the degrees of freedom. So this is essentially our Q1. This is our probability
density function for Q1. And notice it really
spikes close to 0. And that makes sense. Because if you are
sampling just once from this standard
normal distribution, there's a very high
likelihood that you're going to get something
pretty close to 0. And then if you square
something close to 0-- remember, these are decimals, they're
going to be less than 1, pretty close to 0-- it's
going to become even smaller. So you have a high probability
of getting a very small value. You have a high probability
of getting values less than some threshold,
this right here-- I guess this is 1 right here, so less than 1/2. And you have a very
low probability of getting a large number. I mean, to get a 4, you
would have to sample a 2 or a negative 2 from this distribution. And we know that 2 is
2 standard deviations
from the mean. So it's less likely. And that's just to get a 4. So even
larger numbers are going to be even less likely. So that's why you see
this shape over here. Now when you have 2
degrees of freedom, it moderates a little bit. This is the shape,
this blue line right here is the shape of Q2. And notice you're a little
bit less likely to get values close to 0 and a little bit
more likely to get numbers further out. But it still is kind of
shifted or heavily weighted towards small numbers. And then if we had
another random variable, another chi-squared
distributed random variable-- so then we have, let's say, Q3. And let's define
it as the sum of the squares of 3 of these independent
variables, each of which has a standard
normal distribution. So X1 squared plus X2 squared
plus X3 squared. Then all of a sudden, our
Q3-- this is Q2 right here-- has a chi-squared distribution
with 3 degrees of freedom. And so this guy
right over here-- that will be this green line. Maybe I should have
done this in green. This will be this
green line over here. And then notice,
now it's starting to become a little
bit more likely that you'd get values
in this range over here because you're taking the sum. Each of these are going
to be pretty small values, but you're taking the sum. So it starts to shift it a
little over to the right. And so the more degrees
of freedom you have, the further this lump
starts to move to the right and, to some degree, the
more symmetric it gets. And what's interesting
about this, I guess it's different than
almost every other distribution we've looked at, although
we've looked at others that have this property as well,
is that you can't have a value below 0 because
we're always just squaring these values. Each of these guys can
have values below 0. They're normally distributed. They could have negative values. But since we're squaring and
taking the sum of squares, this is always going
to be positive. And the place that this
is going to be useful-- and we're going to see
in the next few videos-- is in measuring essentially
error from an expected value. And if you take
this total error, you can figure out
the probability of getting that total error
if you assume some parameters. And we'll talk more about
it in the next video. Now with that said, I
just want to show you how to read a chi-squared
distribution table. So if I were to ask you, if
this is our distribution-- let me pick this
blue one right here. So over here, we have
2 degrees of freedom because we're adding 2
of these guys right here. If I were to ask you, what is
the probability of Q2 being greater than-- or, let
me put it this way. What is the probability of
Q2 being greater than 2.41? And I'm picking that
value for a reason. So I want the probability of
Q2 being greater than 2.41. What I want to do is I'll
look at a chi-square table like this. Q2 is a version of chi-squared
with 2 degrees of freedom. So I look at this row right
here under 2 degrees of freedom. And I want the probability of
getting a value above 2.41. And I picked 2.41 because
it's actually in this table. And so most of these
chi-squared tables-- the reason why we have these weird
numbers like this instead of whole numbers or
easy-to-read fractions is that the table is actually
driven by the p value. It's driven by the
probability of getting something larger
than that value. So normally you would
look at it the other way. You'd say, OK, for
what chi-squared value, with 2 degrees of
freedom, is there a 30% chance of getting
something larger than that? Then I would look up 2.41. But I'm doing it
the other way just for the sake of this video. So if I want the probability
of this random variable right here being greater
than 2.41, or its p value, we read it right here. It is 30%. And just to visualize
it on this chart, this chi-square distribution--
this was Q2, the blue one, over here-- 2.41 is
going to sit-- let's see. This is 3. This is 2.5. So 2.41 is going to be
someplace right around here. So essentially,
what that table is telling us is, this entire
area under this blue line to the right of 2.41, what is that? And that right there is
going to be 30% of-- well, it's going to be 0.3. Or you could view it as
30% of the entire area under this curve, because
obviously all the probabilities have to add up to 1. So that's our intro to the
chi-square distribution. In the next video,
we're actually going to use it to make some,
or to test some, inferences.
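
The table lookup described in the transcript can be reproduced numerically. The following is a minimal Python sketch, not the method used in the video. It assumes the closed form P(Q2 > x) = exp(-x/2), which holds only for 2 degrees of freedom; other degrees of freedom need a table or a statistics library. The function names are illustrative.

```python
import math

def chi2_sf_df2(x):
    """P(Q2 > x) for a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2)

def chi2_critical_df2(p):
    """The value x such that P(Q2 > x) = p, for 2 degrees of freedom."""
    return -2 * math.log(p)

# Reading the table the way the video does: value -> probability.
p_value = chi2_sf_df2(2.41)         # about 0.30

# Reading the table the usual way: probability -> value.
critical = chi2_critical_df2(0.30)  # about 2.41

print(round(p_value, 4), round(critical, 4))
```

Both directions agree with the 2-degrees-of-freedom row of the table: a 30% chance of a value larger than roughly 2.41.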