## AP®︎/College Statistics

# Example: Correlation coefficient intuition

AP.STATS: DAT‑1 (EU), DAT‑1.B (LO), DAT‑1.B.1 (EK), DAT‑1.C (LO), DAT‑1.C.1 (EK)

Sal explains the intuition behind correlation coefficients and does a problem where he matches correlation coefficients to scatter plots.

## Video transcript

- I took some screen captures from the Khan Academy exercise on
correlation coefficient intuition. They've given us some
correlation coefficients and we have to match them to the various scatterplots on that exercise. There's a little interface
where we can drag these around in a table to match them to the different scatterplots. The point isn't to figure out how exactly to calculate these, we'll
do that in the future, but really to get an intuition
of what we are trying to measure. The main idea is that
correlation coefficients are trying to measure
how well a linear model can describe the relationship
between two variables. For example, let me do some coordinate axes here. Let's say that's one variable. Say that's my y variable and let's say that is my x variable. Let's say when x is low, y is low. When x is a little higher,
y is a little higher. When x is a little bit
higher, y is higher. When x is really high, y is even higher. A linear model would
describe it very, very well. It's quite easy to draw a line that essentially goes
through those points. So something like this
would have an r of 1, r is equal to one. A linear model perfectly describes it and it's a positive correlation. When one increases, when
one variable gets larger, then the other variable is larger. When one variable is
smaller, the other variable is smaller, and vice versa. Now what would an r of
negative one look like? Well, that would once again be a situation where a linear model works really well but when one variable moves up, the other one moves down and vice versa. Let me draw my coordinates, my coordinate axes again. I'm gonna try to draw a dataset where the r would be negative one. Maybe when y is high, x is very low. When y becomes lower, x becomes higher. When y becomes a good bit lower, x becomes a good bit higher. Once again, when y decreases, x increases or as x
increases, y decreases. They're moving in opposite directions but you can fit a line
very easily to this. The line would look something like this. This would have an r of negative one. And an r of zero, r is equal to zero, would be a dataset that a line doesn't really fit very well at all. I'll do that one really small, since I don't have much space here. An r of zero might look something like this. Maybe I'll have a data point here, maybe I have a data point here, maybe I have one there. There, there. And it wouldn't necessarily
be this well organized but this gives you a sense of things. How would you actually
try to fit a line here? You could equally justify
a line that looks like that or a line that looks like that, or a line that looks like that. A linear model really does not describe the relationship between
the two variables that well, right over here. So with that as a primer, let's see if we can
tackle these scatterplots. The way I'm gonna do it is I'm just gonna try to eyeball what a linear
model might look like. There are different methods of
trying to fit a linear model to a dataset, an imperfect dataset. I drew very perfect ones, at least for the r equals negative one and r equals one but these are what the real
world actually looks like. Very few times will things
perfectly sit on a line. For scatterplot A, if I
were to try to fit a line, it would look something like that. If I were to try to
minimize distances from the points to the line,
I do see a general trend. If we look at these data points over here, when y is high, x is low. When x is larger, y is smaller. Looks like r is going
to be less than zero, and a reasonable bit less than zero. It's going to approach this thing here. If we look at our choices, it wouldn't be r equals 0.65. These are positive so I wouldn't
use that one or that one. And this one is almost no correlation. R equals negative 0.02, this
is pretty close to zero. I feel good with r is
equal to negative 0.72. I wanna be clear, if I didn't
have these choices here, I wouldn't just be able to say, just looking at these data points without being able to do a calculation, that r is equal to negative 0.72. I'm just basing it on
the intuition that it is a negative correlation,
it seems pretty strong. The pattern kind of jumps out at you, that when y is large, x is small. When x is large, y is small. So I like something that's approaching r equals negative one. I've used this one up already. Now scatterplot B, if I were
to just try to eyeball it, once again this is gonna be imperfect. But the trend, if I were
to try to fit a line, it looks something like that. It looks like a line
fits in reasonably well. There's some points that
would still be hard to fit. They're still pretty far from the line. It looks like it's a positive correlation. When y is small, x is
relatively small and vice versa. As x grows, y grows and
when y grows, x grows. This one's going to be positive and it looks like it would
be reasonably positive. I have two choices here. I don't know which of
these it's going to be. It's either going to be r is equal to 0.65 or r is equal to 0.84. I also got scatterplot C,
this one's all over the place. It kinda looks like what we did over here. What does a line look like? You could almost imagine anything. Does it look like that? Does a line look like that? There's not a direction
that you could say, "Well, as x increases, maybe
y increases or decreases." There's no rhyme or reason here, so this looks very non-correlated. So this one is pretty close to zero. I feel pretty good that this is the r equal to negative 0.02. In fact, if we tried,
probably the best line that could be fit would be one with a slight negative slope. It might look something like this. And notice, even when
we try to fit a line, there's all sorts of points
that are way off the line. So the linear model did
not fit it that well. R is equal to negative 0.02, so we'll use that one. Now we have scatterplot D. That's gonna use one of the
other positive correlations and it does look like there
is a positive correlation. When y is low, x is low. When x is high, y is high and vice versa. We could try to fit something that looks something like that. But it's still not as good as that one. You can see the points
that we're trying to fit, there's several points that
are still pretty far away from our model. The model is not fitting it that well, so I would say scatterplot
B is a better fit. A linear model works
better for scatterplot B than it works for scatterplot D. I would give the higher r to scatterplot B and the lower r, r equals
0.65, to scatterplot D. R is equal to 0.65. Once again that's because
with a linear model it looks like there's a
trend, but there are several more data points that are way off the line in scatterplot D than in
the case of scatterplot B. In B there are a few that are
still way off the line, but the points are even more
off of the line in D.
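
The video leaves the calculation of r for later, but a small sketch may help tie the intuition to numbers. The Python below computes the standard Pearson correlation coefficient for a few tiny datasets. The function name `pearson_r` and all of the data points are made up for illustration; they are not the points from the exercise's scatterplots, and this is only one straightforward way to write the calculation.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical x values shared by every example below.
xs = [1, 2, 3, 4, 5]

# Points exactly on a rising line: a linear model fits perfectly.
r_up = pearson_r(xs, [2, 4, 6, 8, 10])      # 1.0
# Points exactly on a falling line: perfect fit, opposite direction.
r_down = pearson_r(xs, [10, 8, 6, 4, 2])    # -1.0
# Points with no linear trend: no line is clearly better than another.
r_none = pearson_r(xs, [1, 3, 2, 3, 1])     # 0.0
# Rising trend with a little scatter around the line.
r_tight = pearson_r(xs, [1, 3, 2, 5, 4])    # 0.8
# Rising trend with more scatter around the line.
r_loose = pearson_r(xs, [3, 1, 2, 5, 4])    # 0.6

print(r_up, r_down, r_none, r_tight, r_loose)
```

The last two datasets mirror the comparison between scatterplots B and D: both trend upward, but the one whose points sit closer to a line gets the higher r.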