Chi-Square Distribution Introduction. Created by Sal Khan.
- I always love Sal's videos and they help me so much, but this one confused me even further than I already was. I know it says Chi-Square introduction, but could we get an intro for the intro? Or am I the only one lost with this?!(95 votes)
- You may find it helpful to go back and watch the introductory videos for the normal distribution. If they are too advanced it may be because you don't have a good grasp on probability notation, in which case you need to go back further and watch the introductions to probability including expected values and variance.(0 votes)
- I thought degrees of freedom are n-1.(23 votes)
- The degrees of freedom vary depending on the constraints on your data.
In the "Pearson's chi square test" video, the degree of freedom is indeed n-1. This is because the total number of customers at the restaurant is fixed for the observed and the expected values. So if we know n-1 of the data points (in this case customers on a given day of the week) then we can figure out the last data point because we know the fixed total.
In "Contingency table chi-square test", the degrees of freedom are completely different because our data is measured along two axes (type of drug and cured/sick). For a contingency table, the degrees of freedom are (r - 1) * (c - 1), where r is the number of rows and c is the number of columns.
In this video, a chi-square random variable is constructed out of n independent normal random variables. There are no constraints placed on these variables, so knowing n-1 doesn't mean you know the last one. So in this case, the degree of freedom is just n.(19 votes)
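The three cases in this answer can be sketched as small helper functions. The function names and the example sizes below are hypothetical, chosen only for illustration; the formulas are the ones stated in the answer.

```python
def df_sum_of_squares(n):
    """Sum of n independent squared standard normals: no constraints, so df = n."""
    return n


def df_goodness_of_fit(categories):
    """Goodness-of-fit test with a fixed total: the last count is determined, so df = n - 1."""
    return categories - 1


def df_contingency(rows, cols):
    """Contingency table with r rows and c columns: df = (r - 1) * (c - 1)."""
    return (rows - 1) * (cols - 1)


# Hypothetical sizes, for illustration only.
print(df_sum_of_squares(3))    # 3
print(df_goodness_of_fit(6))   # 5
print(df_contingency(2, 3))    # 2
```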
- Is the chi-square test still valid if we are not sure whether the "random variable" is subject to gaussian distribution?(11 votes)
- I'm pretty sure it has to be a normal distribution. You would have to run a separate test (probability plot, etc.) to determine normality before running a chi-square test.(5 votes)
- What is the y axis on the Chi-Square graph?(10 votes)
- The chi-square graph in the video plots the value of the probability density function (y-axis) against the value of the chi-squared variable (x-axis) for different degrees of freedom. It is important to remember that on a probability density graph, the y-axis does not give the probability of each individual value. Rather, the area under the curve over a range of values gives the probability of landing in that range.
Hope this helps..(7 votes)
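The difference between a density height and a probability can be checked numerically. The sketch below uses the standard chi-square density formula, which is not given in the video; the function names, the evaluation point 0.01, and the upper limit of 60 are assumptions made for illustration.

```python
import math


def chi2_pdf(x, k):
    """Density of a chi-square variable with k degrees of freedom at x > 0."""
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))


def area(a, b, k, steps=100_000):
    """Midpoint-rule approximation of the area under the density between a and b."""
    width = (b - a) / steps
    return sum(chi2_pdf(a + (i + 0.5) * width, k) for i in range(steps)) * width


# A density value is a height, not a probability: for k = 1 it exceeds 1 near zero.
height = chi2_pdf(0.01, 1)
print(height)

# Probabilities are areas. For k = 3, essentially all of the area lies between 0 and 60.
total = area(0.0, 60.0, 3)
print(total)
```

A height greater than 1 would be impossible for a probability, which is why only areas under the curve should be read as probabilities.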
- My intuition for understanding the chi-square distribution is that while the sampling distribution of the sample means can be described with a normal distribution, the sampling distribution of sample variances can be described as a chi-square distribution (provided the population is normally distributed). Then I can say that the ratio of two chi-squares can be described with the F-distribution, the ratio of a normal and chi-square is "t" distributed, etc. I realize these statements may be overgeneralized, but how far off the mark am I in framing my understanding this way?(4 votes)
- Not very far off the mark at all. It depends on how precise you want to get, but if you're shooting for a general idea, you're right on the bulls-eye.
To get more technical:
- An F distribution is the ratio of two Chi-square variables, each of which is divided by its respective degrees of freedom. So (C1/c1) / (C2/c2), where the capital letters are the random variables (RVs), and the lowercase letters are the degrees of freedom.
- A t-distribution is the ratio of a Standard Normal to the square root of a Chi-square divided by its degrees of freedom, i.e., Z / sqrt( C/c ), where Z is a standard normal RV, C is the Chi-square RV, and c is the degrees of freedom.
- The sample variance isn't directly Chi-square distributed, but a scaled version of it is: (n-1)*s^2 / sigma^2. This ratio uses the real variance inside of it, but that's not really a problem, because one of the main uses of this is to calculate a t test statistic, and the sigma term will actually cancel out.(7 votes)
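In symbols, the three relationships in the list above can be written as follows. This is a sketch of the standard statements; the independence assumptions noted in the comments are not spelled out in the answer, and the sample is assumed to come from a normal population with variance sigma^2.

```latex
% C_1, C_2, C: chi-square variables with c_1, c_2, c degrees of freedom.
% Z: standard normal. Assumed: C_1 independent of C_2, and Z independent of C.
\begin{align*}
\frac{C_1 / c_1}{C_2 / c_2} &\sim F(c_1, c_2) \\
\frac{Z}{\sqrt{C / c}} &\sim t_c \\
\frac{(n-1)\, s^2}{\sigma^2} &\sim \chi^2_{n-1}
\end{align*}
```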
- Which videos explain what degrees of freedom is ?(5 votes)
- Degrees of freedom indicate how many variables can vary. Formally: the number of observations minus the number of parameters (restrictions).
This is likely to not get you very far and there are no videos on it (on this site). I recommend searching the internet for more info.(5 votes)
- How does one come up with the "degrees of freedom" data in order to confirm or deny the observed vs expected samples? Are the "degrees of freedom" charts something that can be expected with a typical problem that may need to be solved, and how does one determine the "degrees of freedom" information in the lab/field in a real world setting where the degrees of freedom may be unknown?(1 vote)
- Degrees of freedom are the number of values that are "free" to vary depending on the parameter you are trying to estimate. Using sample variance to estimate population variance is a typical example used to illustrate the concept (and possibly the most appropriate given that you seem to be studying the chi-square distribution). Because all the residuals (the distance of each data point from the sample mean) in a sample must add up to 0, you could figure out what the last data point must be if you know all the other ones. In that sense, that last data point isn't "free" to vary because it must be THE value that makes the residuals add to zero. This is where the "n - 1" degrees of freedom come from for sample variance and its corresponding chi-square distribution.
Finding the degrees of freedom is simply a matter of understanding the math and constraints underlying the parameters you are estimating. To my understanding, they aren't ever "unknown" in the field and are at most a simple calculation away. Some statistics have more than one degrees-of-freedom value (an example is the F-stat, which is a fraction, and its numerator and denominator have separate degrees of freedom).(5 votes)
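The residual argument in this answer can be sketched with a tiny example. The four data values are hypothetical; the point is only that once the sample mean and the first n - 1 residuals are known, the last residual (and so the last data point) is forced.

```python
# Hypothetical sample, chosen only for illustration.
data = [4.0, 7.0, 1.0, 8.0]

mean = sum(data) / len(data)
residuals = [x - mean for x in data]

# Residuals from the sample mean always add up to zero.
residual_sum = sum(residuals)
print(residual_sum)

# So the last residual is determined by the others: it must cancel their sum.
last_residual = -sum(residuals[:-1])
recovered_last_point = mean + last_residual
print(recovered_last_point, data[-1])
```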
- At 1:50 Sal says that "we are essentially sampling from this standard normal distribution and then squaring whatever number you got". This is where I get confused. If we are sampling from this normal distribution, does that sample have a sample size n associated with it, and what is the relationship between that sample size n and the chi-square degrees of freedom referred to in the remainder of the video?(1 vote)
- If X is Normally distributed with mean=0 and sd=1, then X^2 is Chi-square distributed with df=1. Furthermore, if several Chi-square variables are independent, then when we add them together, we just add up their degrees of freedom.
So if X1, X2, and X3 are all Normally distributed with mean 0 and sd 1, and all three are independent, then X1^2 + X2^2 + X3^2 is Chi-square distributed with df=3.(3 votes)
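This construction can be checked by simulation. The sketch below draws three independent standard normal values, squares them, and adds them up, many times over. The function name, the seed, and the number of draws are assumptions for illustration; the check relies on the fact that a chi-square variable's mean equals its degrees of freedom.

```python
import random


def sample_chi_square(df, rng):
    """One draw: the sum of df independent squared standard normal values."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))


rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_chi_square(3, rng) for _ in range(100_000)]

# The mean of a chi-square variable equals its degrees of freedom,
# so the sample mean should land close to 3.
sample_mean = sum(draws) / len(draws)
print(sample_mean)

# A sum of squares can never be negative.
print(min(draws) >= 0.0)
```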
- Right at the start, why do you square the variable X? The purpose of doing so should be explained.(1 vote)
- The sum or difference of two normally distributed variables is another normally distributed variable.
But by squaring and then adding, we remove the negative values, which makes the variable no longer normally distributed (since it is squared now). This chi-squared distribution has some nice properties, such as describing the variances (always non-negative) of normal distributions.(2 votes)
In this video, we'll talk a little bit about what the chi-square distribution is, sometimes called the chi-squared distribution. Then in the next few videos, we'll use it to test how well theoretical distributions explain observed ones, or how good a fit observed results are for theoretical distributions. So let's say I have some random variables, and each of them is an independent, standard normally distributed random variable. Let me remind you what that means. Say I have the random variable X. If X is standard normally distributed, we could write that X is a normal random variable with a mean of 0 and a variance of 1. Or you could say that the expected value of X is equal to 0 and the variance of X is equal to 1. To visualize it: when we take an instantiation of this random variable, we're sampling from a standardized normal distribution that looks like this, with a mean of 0 and a variance of 1, which of course also means a standard deviation of 1. Now let me define a new random variable Q. To get Q, you're essentially sampling from this standard normal distribution and then squaring whatever number you got. So Q is equal to the random variable X squared. The distribution of this random variable is an example of the chi-square distribution. What we're going to see in this video is that the chi-square, or chi-squared, distribution is actually a set of distributions, depending on how many squared variables you sum. Right now, we only have one random variable that we're squaring, so this is just one of the examples.
And we'll talk more about the others in a second. So we could write that Q is a chi-squared distributed random variable, or we could use this notation right here. This isn't an X anymore. This is the Greek letter chi, although it looks a lot like a curvy X. So Q is a member of chi-squared. And since we're only taking the sum of one independent, standard normally distributed variable, squared, we say that this has only 1 degree of freedom, and we write that over here. So this right here is our degrees of freedom: 1 degree of freedom. Let's call this one Q1. Now let's say I have another random variable; let me do it in a different color, Q2 in blue. To define Q2, say I have one independent, standard normally distributed variable. I'll call that X1, and I square it. Then I have another independent, standard normally distributed variable, X2, and I square it. You could imagine both of these have distributions like this, and they're independent. So to sample Q2, you sample X1 from this distribution and square that value, sample X2 from the same distribution and square that value, and then add the two. That gives you Q2. So that was Q1, and here we would write that Q2 is a chi-squared distributed random variable with 2 degrees of freedom. And just to visualize the set of chi-squared distributions, let's look at this chart, which I got off of Wikipedia. It shows the probability density functions for some of the chi-square distributions. This first one over here is for k equal to 1, where k is the degrees of freedom. So this is essentially our Q1: the probability density function for Q1.
And notice it really spikes close to 0. And that makes sense, because if you are sampling just once from this standard normal distribution, there's a very high likelihood that you're going to get something pretty close to 0. And if you square something close to 0 (remember, these are decimals less than 1), it becomes even smaller. So you have a high probability of getting a very small value. You have a high probability of getting values less than some threshold, like this value right here, which I guess is 1, or less than 1/2. And you have a very low probability of getting a large number. To get a 4, you would have to sample a 2 (or a negative 2) from this distribution. And we know that 2 is 2 standard deviations from the mean, so it's less likely. And that's just to get a 4. Even larger numbers are going to be even less likely. So that's why you see this shape over here. Now when you have 2 degrees of freedom, it moderates a little bit. This blue line right here is the shape of Q2. And notice you're a little bit less likely to get values close to 0 and a little bit more likely to get numbers further out. But it is still heavily weighted towards small numbers. And then say we had another chi-squared distributed random variable, Q3. Let's define it as the sum of the squares of 3 of these independent variables, each of which has a standard normal distribution. So X1 squared plus X2 squared plus X3 squared. Then our Q3 (this was Q2 right here) has a chi-squared distribution with 3 degrees of freedom. And so this one will be this green line over here. Maybe I should have done this in green.
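The claims about Q1 earlier in this passage (small values are very likely, values above 4 are rare) can be made concrete. Since Q1 is X squared, Q1 is below q exactly when X lies between minus sqrt(q) and sqrt(q), and that probability can be computed from the standard normal distribution using the error function. The function name below is an assumption for illustration.

```python
import math


def prob_q1_below(q):
    """P(Q1 < q) for Q1 = X**2 with X standard normal, i.e. P(|X| < sqrt(q))."""
    return math.erf(math.sqrt(q) / math.sqrt(2.0))


# Probability that Q1 is below 1: X within 1 standard deviation of 0.
p_below_1 = prob_q1_below(1.0)
print(p_below_1)   # about 0.683

# Probability that Q1 is above 4: X more than 2 standard deviations from 0.
p_above_4 = 1.0 - prob_q1_below(4.0)
print(p_above_4)   # about 0.046
```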
And notice, now it's starting to become a little bit more likely that you'd get values in this range over here, because you're taking the sum. Each of these squared values is likely to be pretty small, but you're adding them up, so it starts to shift a little over to the right. And so the more degrees of freedom you have, the further this lump moves to the right and, to some degree, the more symmetric it gets. What's interesting about this, and different from most of the other distributions we've looked at, although we've looked at others that have this property as well, is that you can't have a value below 0, because we're always squaring these values. Each of these X's can have values below 0. They're normally distributed, so they could be negative. But since we're squaring and taking the sum of squares, the result is never negative. And the place where this is going to be useful, as we're going to see in the next few videos, is in measuring error from an expected value. If you take this total error, you can figure out the probability of getting that total error if you assume some parameters. We'll talk more about it in the next video. Now with that said, I just want to show you how to read a chi-squared distribution table. Let me pick this blue distribution right here, where we have 2 degrees of freedom because we're adding 2 of these squared variables. What is the probability of Q2 being greater than 2.41? I'm picking that value for a reason. To find it, I'll look at a chi-square table like this. Q2 is a version of chi-squared with 2 degrees of freedom, so I look at the row for 2 degrees of freedom. And I want the probability of getting a value above 2.41.
And I picked 2.41 because it's actually in this table. The reason why we have these odd numbers instead of whole numbers or easy-to-read fractions is that the table is driven by the p-value, the probability of getting something larger than that value. So normally you would read it the other way. You'd say, OK, for 2 degrees of freedom, what chi-squared value has a 30% chance of getting something larger than it? Then I would look up 2.41. But I'm doing it the other way just for the sake of this video. So if I want the probability of this random variable being greater than 2.41, or its p-value, we read it right here. It is 30%. And just to visualize it on this chart, with Q2 being the blue curve: this is 3, this is 2.5, so 2.41 is going to be someplace right around here. So essentially, what that table is telling us is the area under this blue line to the right of 2.41. And that area is going to be 0.3. Or you could view it as 30% of the entire area under this curve, because obviously all the probabilities have to add up to 1. So that's our intro to the chi-square distribution. In the next video, we're going to use it to test some inferences.
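The table lookup can be reproduced without a table for the 2-degrees-of-freedom case, because there the upper-tail probability has a simple closed form: P(Q2 > x) = exp(-x / 2). This formula is a standard fact that is not stated in the video, and the function names below are assumptions for illustration. The sketch covers only 2 degrees of freedom; other rows of the table need a general chi-square routine.

```python
import math


def chi2_upper_tail_df2(x):
    """P(Q2 > x) for a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


def chi2_critical_value_df2(p):
    """The value x with P(Q2 > x) = p, for 2 degrees of freedom."""
    return -2.0 * math.log(p)


# Reading the table "backwards", as in the video: value in, probability out.
p_value = chi2_upper_tail_df2(2.41)
print(p_value)    # about 0.30

# Reading the table the usual way: probability in, value out.
critical = chi2_critical_value_df2(0.30)
print(critical)   # about 2.41
```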