## Chi-square goodness-of-fit tests

# Chi-square distribution introduction

## Video transcript

In this video, we'll
just talk a little bit about what the chi-square
distribution is, sometimes called the chi-squared
distribution. And then in the next
few videos, we'll actually use it to
really test how well theoretical distributions
explain observed ones, or how good a fit
observed results are for theoretical
distributions. So let's just think
about it a little bit. So let's say I have
some random variables. And each of them
is an independent, standard normally
distributed random variable. So let me just remind
you what that means. So let's say I have the random
variable X. If X has a standard normal distribution, we
could write that X is a normal random
variable with a mean of 0 and a variance of 1. Or you could say that
the expected value of X is equal to 0, and
that the variance of our random variable
X is equal to 1. Or, just to visualize
it, when we take an instantiation
of this random variable, we're sampling from a
normal distribution, a standardized normal
distribution that looks like this. Mean of 0 and then a variance
of 1, which would also mean, of course, a standard
deviation of 1. So whether you look at the variance or the standard
deviation, it would be equal to 1. So a chi-square
distribution, if you just take one of these
random variables-- and let me define it this way. Let me define a
new random variable Q that is equal to--
you're essentially sampling from this
standard normal distribution and then squaring
whatever number you got. So it is equal to this
random variable X squared. The distribution for this
random variable right here is going to be an example
of the chi-square distribution. Actually what we're going
to see in this video is that the chi-square, or the
chi-squared distribution is actually a set
of distributions depending on how
many squared variables you sum. Right now, we only have
one random variable that we're squaring. So this is just one
of the examples. And we'll talk more
about them in a second. So this right
here, we could write that Q is a chi-squared
distributed random variable. Or we could use
this notation right here. Q is-- we could
write it like this. So this isn't an X anymore. This is the Greek
letter chi, although it looks a lot like a curvy X. So
it's a member of chi-squared. And since we're only
summing one term over here-- we're only taking the
sum of one independent, standard
normally distributed variable, squared, we say that this
only has 1 degree of freedom. And we write that over here. So this right here is
our degree of freedom. We have 1 degree of
freedom right over there. So let's call this Q1. Let's say I have
another random variable. Let's call this Q-- let me
do it in a different color. Let me do Q2 in blue. Let's say I have another
random variable, Q2, that is defined as-- let's say I
have one independent, standard, normally distributed variable. I'll call that X1. And I square it. And then I have another
independent, standard, normally distributed
variable, X2. And I square it. So you could imagine
both of these guys have distributions like this. And they're independent. So to sample Q2,
you essentially sample X1 from this distribution,
square that value, sample X2 from the same distribution,
essentially, square that value, and then add the two. And you're going to get Q2. This over here-- here we
would write-- so this is Q1. Q2 here, we would
write is a chi-squared distributed random variable
with 2 degrees of freedom. Right here. 2 degrees of freedom. And just to visualize
kind of the set of chi-squared distributions,
let's look at this over here. So this, I got this
off of Wikipedia. This shows us some of the
probability density functions for some of the
chi-square distributions. This first one over here,
for k equal to 1, where k is the degrees of freedom. So this is essentially our Q1. This is our probability
density function for Q1. And notice it really
spikes close to 0. And that makes sense. Because if you are
sampling just once from this standard
normal distribution, there's a very high
likelihood that you're going to get something
pretty close to 0. And then if you square
something close to 0-- remember, these are decimals, they're
going to be less than 1, pretty close to 0-- it's
going to become even smaller. So you have a high probability
of getting a very small value. You have a high probability
of getting values less than some threshold,
this right here. I guess this is 1 right here, so less than 1/2. And you have a very
low probability of getting a large number. I mean, to get a 4, you
would have to sample a 2 or a negative 2 from this distribution. And we know that 2 is
2 standard deviations
from the mean. So it's less likely. And that's just to get a 4. So
even larger numbers are going to be even less likely. So that's why you see
this shape over here. Now when you have 2
degrees of freedom, it moderates a little bit. This is the shape,
this blue line right here is the shape of Q2. And notice you're a little
bit less likely to get values close to 0 and a little bit
more likely to get numbers further out. But it still is kind of
shifted or heavily weighted towards small numbers. And then if we had
another random variable, another chi-squared
distributed random variable-- so then we have, let's say, Q3. And let's define
it as the sum of the squares of 3 of these independent
variables, each of which has a standard
normal distribution. So X1 squared plus X2 squared
plus X3 squared. Then all of a sudden, our
Q3-- this is Q2 right here-- has a chi-squared distribution
with 3 degrees of freedom. And so this guy
right over here-- that will be this green line. Maybe I should have
done this in green. This will be this
green line over here. And then notice,
now it's starting to become a little
bit more likely that you'd get values
in this range over here because you're taking the sum. Each of these are going
to be pretty small values, but you're taking the sum. So it starts to shift it a
little over to the right. And so the more degrees
of freedom you have, the further this lump
starts to move to the right and, to some degree, the
more symmetric it gets. And what's interesting
about this, I guess it's different than
almost every other distribution we've looked at, although
we've looked at others that have this property as well,
is that you can't have a value below 0 because
we're always just squaring these values. Each of these guys can
have values below 0. They're normally distributed. They could have negative values. But since we're squaring and
taking the sum of squares, this is never going
to be negative. And the place that this
is going to be useful-- and we're going to see
in the next few videos-- is in measuring essentially
error from an expected value. And if you take
this total error, you can figure out
the probability of getting that total error
if you assume some parameters. And we'll talk more about
it in the next video. Now with that said, I
just want to show you how to read a chi-squared
distribution table. So if I were to ask you, if
this is our distribution-- let me pick this
blue one right here. So over here, we have
2 degrees of freedom because we're adding 2
of these guys right here. If I were to ask you, what is
the probability of Q2 being greater than-- or, let
me put it this way. What is the probability of
Q2 being greater than 2.41? And I'm picking that
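
The question just posed can also be checked numerically. What follows is a minimal sketch, not something shown in the video. It relies on a standard fact the video does not state: for 2 degrees of freedom, the probability of exceeding x is exp(-x/2). The function names, sample count, and seed are illustrative assumptions, and only the Python standard library is used.

```python
import math
import random


def chi2_tail_df2(x):
    """Upper-tail probability P(Q2 > x) for 2 degrees of freedom.

    Uses the closed form exp(-x / 2), which holds only for 2 degrees
    of freedom (an assumption brought in from outside the video).
    """
    return math.exp(-x / 2.0)


def simulated_tail_df2(x, n_samples=200_000, seed=0):
    """Estimate P(Q2 > x) by sampling Q2 = X1^2 + X2^2 directly."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # Two independent standard normal draws, each squared, then added.
        q2 = rng.gauss(0.0, 1.0) ** 2 + rng.gauss(0.0, 1.0) ** 2
        if q2 > x:
            hits += 1
    return hits / n_samples


exact = chi2_tail_df2(2.41)
approx = simulated_tail_df2(2.41)
print(f"closed form: {exact:.4f}")
print(f"simulation:  {approx:.4f}")
```

Both numbers should come out close to 0.30, which is the value the table lookup in the rest of the video arrives at.
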
value for a reason. So I want the probability of
Q2 being greater than 2.41. What I want to do is I'll
look at a chi-square table like this. Q2 is a version of chi-squared
with 2 degrees of freedom. So I look at this row right
here under 2 degrees of freedom. And I want the probability of
getting a value above 2.41. And I picked 2.41 because
it's actually in this table. And so most of these
chi-squared tables-- the reason why we have these weird
numbers like this instead of whole numbers or
easy-to-read fractions is that they are actually
driven by the p value. It's driven by the
probability of getting something larger
than that value. So normally you would
look at it the other way. You'd say, OK, for what
chi-squared value with 2 degrees of
freedom is there a 30% chance of getting
something larger than that? Then I would look up 2.41. But I'm doing it
the other way just for the sake of this video. So if I want the probability
of this random variable right here being greater
than 2.41, or its p value, we read it right here. It is 30%. And just to visualize
it on this chart, this chi-square distribution--
this was Q2, the blue one, over here-- 2.41 is
going to sit-- let's see. This is 3. This is 2.5. So 2.41 is going to be
someplace right around here. So essentially,
what that table is telling us is, this entire
area under this blue line to the right of 2.41, what is that? And that right there is
going to be 0.3. Or you could view it as
30% of the entire area under this curve, because
obviously all the probabilities have to add up to 1. So that's our intro to the
chi-square distribution. In the next video,
we're actually going to use it to make some,
or to test some, inferences.
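
The construction described in this video (sample independent standard normal variables, square each one, and add them up) can be sketched as a small simulation. This is an illustrative sketch rather than anything shown in the video. The function name `sample_chi_square`, the seed, and the sample count are assumptions, and only the Python standard library is used.

```python
import random


def sample_chi_square(k, rng):
    """Draw one value as the sum of k squared standard normal samples."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))


rng = random.Random(42)   # fixed seed so repeated runs agree
n_samples = 100_000       # illustrative sample count

# Q1, Q2 and Q3 from the video correspond to k = 1, 2 and 3.
samples = {
    k: [sample_chi_square(k, rng) for _ in range(n_samples)]
    for k in (1, 2, 3)
}

# Average value and share of values below 1 for each k.
means = {k: sum(v) / n_samples for k, v in samples.items()}
shares = {k: sum(x < 1.0 for x in v) / n_samples for k, v in samples.items()}

for k in samples:
    print(
        f"k={k}: min={min(samples[k]):.4f}, "
        f"mean={means[k]:.3f}, share below 1={shares[k]:.3f}"
    )
```

If the construction is right, no sampled value is negative, and the share of values below 1 shrinks as k grows, which matches the lump moving to the right in the plot. The averages should also land near k, a standard property of the chi-square distribution that the video does not state.
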