## Statistics and probability

# Example: Correlation coefficient intuition

AP.STATS: DAT‑1 (EU), DAT‑1.B (LO), DAT‑1.B.1 (EK), DAT‑1.C (LO), DAT‑1.C.1 (EK)

Sal explains the intuition behind correlation coefficients and does a problem where he matches correlation coefficients to scatter plots.

## Want to join the conversation?

- What is "r", in the correlation coefficient r= 0.65?(13 votes)
- "r" is the correlation coefficient. It is always between -1 and 1, with -1 meaning the points are on a perfect straight line with negative slope, and r = 1 meaning the points are on a perfect straight line with positive slope.

If you want to calculate it from data, this is the procedure:

1) Find the mean (average) of all the x-values. Call this xbar.

2) Find the mean (average) of all the y-values. Call this ybar.

3) For every x-value, subtract xbar. Call these Δxi (i is an index. i = 1, 2, 3, ...)

4) For every y-value, subtract ybar. Call these Δyi (i is an index. i = 1, 2, 3, ...)

These Δxi's and Δyi's are called the "deviations". They will be approximately half positive and half negative, since (usually) about half the values are above the mean and half are below. To calculate r,

r = Σ(Δxi*Δyi) / [ sqrt( Σ(Δxi²) ) * sqrt( Σ(Δyi²) ) ]

So you can see that the bottom is the square root of the sum of the squared deviations for x, times the same for y. Because the deviations are squared, every term is positive (except that a few may be zero when Δxi = 0 or Δyi = 0, i.e. for any values exactly equal to the mean).

The key is the top, where nothing is squared. The top is the sum of Δxi*Δyi, so a term will be **positive** when Δxi and Δyi are **BOTH positive or BOTH negative**. This pushes r towards being positive (positive correlation). But when Δxi and Δyi have **opposite signs**, then Δxi*Δyi will be **negative**, and that pushes r towards being negative (negative correlation).

Make up a simple example and try it, with, say, four points. Here are four points to try it with that make the calculation not too bad:

(1, 1), (2, 3), (6, 5), (7, 11)

You should find xbar = 4 and ybar = 5

Thus, Δxi's are -3, -2, 2, 3, and Δyi's are -4, -2, 0, 6

Put these in the formula and you should get r = 0.891, a quite high correlation.

Conversely, pick any four points that make a horizontal rectangle, for example (2, 2), (8, 2), (2, 6), (8, 6). If you calculate r for these points, it will be 0.(32 votes)
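For anyone who would rather check the two examples above by machine, here is a minimal Python sketch of the same deviation procedure. The function name `pearson_r` and the variable names are made up for illustration and are not part of the original answer.

```python
from math import sqrt

def pearson_r(points):
    """Correlation coefficient computed from the deviations, as in steps 1-4."""
    n = len(points)
    xbar = sum(x for x, _ in points) / n          # step 1: mean of the x-values
    ybar = sum(y for _, y in points) / n          # step 2: mean of the y-values
    dx = [x - xbar for x, _ in points]            # step 3: x deviations
    dy = [y - ybar for _, y in points]            # step 4: y deviations
    top = sum(a * b for a, b in zip(dx, dy))      # nothing squared here
    bottom = sqrt(sum(a * a for a in dx)) * sqrt(sum(b * b for b in dy))
    return top / bottom

example = [(1, 1), (2, 3), (6, 5), (7, 11)]
rectangle = [(2, 2), (8, 2), (2, 6), (8, 6)]

print(round(pearson_r(example), 3))   # 0.891
print(pearson_r(rectangle))           # 0.0
```

In the rectangle case the four products Δxi*Δyi are 6, -6, -6 and 6, so the top cancels to exactly zero.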

- I think the answer is no, but does the slope of the line matter in regards to the r-value?

Will it always be -1 even if the line is just slightly tilted "downwards"?(4 votes)
- Yes and no. There are two particular situations where the slope (or lack thereof) does matter:

1. When there is no variation in the x-variable (ie: all of the points are on a vertical line).

2. When there is no variation in the y-variable (all the points are on a horizontal line).

In both of these cases, the correlation is undefined (and in the vertical case, so is the slope). But outside of these special cases, the answer is no: the magnitude of the slope doesn't matter, only the sign. If all the points lie on a straight line, then the slope could be -1 or -1000, and the correlation coefficient would still be -1.(13 votes)
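A small Python sketch can illustrate both points: the steepness of a perfect downward line does not change r, and with no variation in x or in y the formula has a zero denominator. The helper `pearson_r` and the sample values are invented for this illustration, and returning `None` for the undefined case is just one possible convention.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Return r, or None when x or y has no variation (r is undefined)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # denominator would be zero
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

xs = [0, 1, 2, 3, 4]
gentle = pearson_r(xs, [-1 * x for x in xs])          # slope -1
steep = pearson_r(xs, [-1000 * x + 7 for x in xs])    # slope -1000
flat = pearson_r(xs, [5] * len(xs))                   # horizontal line y = 5
vertical = pearson_r([5] * len(xs), xs)               # vertical line x = 5

print(gentle, steep)      # both -1, up to floating-point rounding
print(flat, vertical)     # None None
```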

- Can a line be greater than 1 or less than -1?(5 votes)
- Not in this context, no. 1 means a perfect positive correlation here while -1 means a perfect negative correlation. Any deviation from this perfect correlation would reduce the correlation coefficient. It is important to note that the correlation coefficient is NOT the incline / slope of the line that depicts the given data, but rather the degree to which all of the data can be described by that line, or how far the data deviates from it. Hence the term linear correlation. If the data results in a perfect line, it is an r = 1 (the more, the more) or an r = -1 (the more, the less).(2 votes)

- @2:36, Sal says that a correlation coefficient of 0 means that a line would not fit well at all. Do we define lines as y=mx+b (algebra) or a set of points that extend infinitely in both/opposite directions(geometry)? Because x=0 geometrically is a line, but algebraically is not. So if the line of best fit is x=0, then what would the correlation coefficient be? (Sorry if this is a dumb question.)(5 votes)
- For the last specific case you mentioned (x=0), the correlation coefficient r would actually be undefined rather than 0.

Visually, the line is exactly on the y axis. This means x never varies, so every deviation of x from its mean is 0 and the formula for r works out to 0/0. Even when you "choose" 0 as x, it can't give you a definite answer, as it could spit out any value as y, so there's no trend between the x and y variables here at all.

I think your question isn't dumb, rather thought-provoking.

Keep going.

P.S. If you meant y=0x+b by saying x=0, the same logic applies. y=b means a line of 0 slope, so whatever you choose as x, it has no impact on y, as y is always b. Now every deviation of y from its mean is 0, so r is 0/0 once again: no trend, and no defined r.(1 vote)

- How do you determine if its a strong or weak correlation(3 votes)
- Visually, if there is a strong correlation, you can see that by how close the points are to the line. For example, scatterplot B more closely fits the line than scatterplot D. More technically, you can calculate the standard deviation of the residuals (the vertical distances from the points to the line). The smaller that is relative to the overall spread of the y-values, the stronger the correlation. As far as when something tips from being a weak correlation to a strong correlation, I'm afraid I don't know that yet.(5 votes)
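One way to make "how close the points are to the line" concrete is to compare the leftover scatter around a fitted line with the total scatter in y. The sketch below assumes the line is the ordinary least-squares line (the thread does not say how the line is chosen) and borrows four sample points from an earlier answer. All variable names are illustrative.

```python
from math import sqrt

points = [(1, 1), (2, 3), (6, 5), (7, 11)]
n = len(points)
xbar = sum(x for x, _ in points) / n
ybar = sum(y for _, y in points) / n
sxx = sum((x - xbar) ** 2 for x, _ in points)
syy = sum((y - ybar) ** 2 for _, y in points)   # total scatter in y
sxy = sum((x - xbar) * (y - ybar) for x, y in points)

# Least-squares line (an assumption about how the line is fit)
slope = sxy / sxx
intercept = ybar - slope * xbar

# Leftover scatter: squared vertical distances from the points to the line
residuals = [y - (slope * x + intercept) for x, y in points]
sse = sum(e * e for e in residuals)

r = sxy / sqrt(sxx * syy)
unexplained = sse / syy   # fraction of the y scatter the line misses

print(round(r ** 2, 3), round(1 - unexplained, 3))   # 0.794 0.794
```

The two printed numbers agree: for a least-squares line, r squared equals one minus the fraction of scatter left over, so smaller residuals relative to the spread in y mean a stronger correlation.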

- Is this the same as Pearson correlation coefficient?(3 votes)
- What would you say if the line went straight through the graph? Would the r value = 0 because it’s not positive or negative?(3 votes)
- Yes. It'd be about r=0 because there really isn't a relationship between x and y (that is, if you and I are thinking of the same example). For example, take points scattered around a horizontal line. Y stays about the same no matter what x is, so knowing x tells you nothing about y. There is no relationship there.(1 vote)

- Does the correlation coefficient show how much are data points scattered on the plane?

If I have data points very near to each other but I can't form a specific line, does this mean that the scatter plot will have a correlation coefficient other than zero?(2 votes)
- Not necessarily. The correlation coefficient measures how closely the points follow a straight line, not how close they are to each other. If the points are tightly clustered but you can't fit a line through them, r will still be near zero. As the points get closer to a common line, the correlation coefficient goes toward 1 or -1. As they scatter farther from any line, it goes toward zero. One of the graphs in Sal's video had lots of points scattered in different directions. This graph had a correlation coefficient of -0.02. Hope this helped!(2 votes)
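A quick way to check whether the distance between points matters is to squeeze a set of points together and recompute r. The Python sketch below reuses sample points from an earlier answer in this thread. The helper `pearson_r` and the shrink factor of 0.01 are invented for illustration.

```python
from math import sqrt

def pearson_r(points):
    n = len(points)
    xbar = sum(x for x, _ in points) / n
    ybar = sum(y for _, y in points) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in points)
    sxx = sum((x - xbar) ** 2 for x, _ in points)
    syy = sum((y - ybar) ** 2 for _, y in points)
    return sxy / sqrt(sxx * syy)

# Points with a clear upward trend, then the same shape 100 times smaller
spread_out = [(1, 1), (2, 3), (6, 5), (7, 11)]
squeezed = [(x * 0.01, y * 0.01) for x, y in spread_out]

# Rectangle corners (no trend), then the same corners pulled toward the center
rectangle = [(2, 2), (8, 2), (2, 6), (8, 6)]
tight_rectangle = [(5 + (x - 5) * 0.01, 4 + (y - 4) * 0.01) for x, y in rectangle]

r_spread, r_squeezed = pearson_r(spread_out), pearson_r(squeezed)
r_rect, r_tight = pearson_r(rectangle), pearson_r(tight_rectangle)
print(r_spread, r_squeezed)   # same value, up to rounding
print(r_rect, r_tight)        # both essentially zero
```

Shrinking the whole picture leaves r unchanged, because the scale factor appears equally on the top and bottom of the formula and cancels.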

- I don't quite understand the correlation. Can I say that correlation is based on slope concept?

Correlation varies between -1 and 1. Does this mean that the line with a slope larger than 1 or smaller than -1 (e.g. 1000, -320) will have correlation of 1 or -1?

What if I have a line y=5 (slope of which is zero) or x=5 (with undefined slope)?(2 votes)
- Jotaro, only the sign of the slope matters to the correlation coefficient, not its size. The slope is the measure of how steep a specific line is. However, the correlation coefficient is the measure of how close the points are to a line. If a line fits the data well, r will be close to 1 or -1, so points lying exactly on a line with slope 1000 or -320 give exactly 1 or -1. However, if the line does not fit the data well, r will be closer to zero. Hope this helped!(2 votes)

- At 3:43 you said "I do see a general trend". What do you mean by a general trend? What does that mean?(2 votes)
- It means he sees a pattern ("when 𝑦 is high 𝑥 is low, when 𝑥 is larger 𝑦 is smaller").(3 votes)

## Video transcript

- I took some screen captures from the Khan Academy exercise on
correlation coefficient intuition. They've given us some
correlation coefficients and we have to match them to the various scatterplots on that exercise. There's a little interface
where we can drag these around in a table to match them to the different scatterplots. The point isn't to figure out how exactly to calculate these, we'll
do that in the future, but really to get an intuition
of what we are trying to measure. The main idea is that
correlation coefficients are trying to measure
how well a linear model can describe the relationship
between two variables. For example, let me do some coordinate axes here. Let's say that's one variable. Say that's my y variable and let's say that is my x variable. Let's say when x is low, y is low. When x is a little higher,
y is a little higher. When x is a little bit
higher, y is higher. When x is really high, y is even higher. A linear model would
describe it very, very well. It's quite easy to draw a line that essentially goes
through those points. So something like this
would have an r of 1, r is equal to one. A linear model perfectly describes it and it's a positive correlation. When one increases, when
one variable gets larger, then the other variable is larger. When one variable is
smaller, then the other variable is smaller and vice versa. Now what would an r of
negative one look like? Well, that would once again be a situation where a linear model works really well but when one variable moves up, the other one moves down and vice versa. Let me draw my coordinates, my coordinate axes again. I'm gonna try to draw a dataset where the r would be negative one. Maybe when y is high, x is very low. When y becomes lower, x becomes higher. When y becomes a good bit lower, x becomes a good bit higher. Once again, when y decreases, x increases or as x
increases, y decreases. They're moving in opposite directions but you can fit a line
very easily to this. The line would look something like this. This would have an r of negative one. And an r of zero, r is equal to zero, would be a dataset which a line doesn't really fit very well at all. I'll do that one really small, since I don't have much space here. An r of zero might look something like this. Maybe I'll have a data point here, maybe have a data point here, maybe I have one there. There, there. And it wouldn't necessarily
be this well organized but this gives you a sense of things. How would you actually
try to fit a line here? You could equally justify
a line that looks like that or a line that looks like that, or a line that looks like that. A linear model really does not describe the relationship between
the two variables that well, right over here. So with that as a primer, let's see if we can
tackle these scatterplots. The way I'm gonna do it is I'm just gonna try to eyeball what a linear
model might look like. There's different methods of
trying to fit a linear model to a dataset, an imperfect dataset. I drew very perfect ones, at least for the r equals negative one and r equals one but these are what the real
world actually looks like. Very few times will things
perfectly sit on a line. For scatterplot A, if I
were to try to fit a line, it would look something like that. If I were to try to
minimize distances from the points to the line,
I do see a general trend if we look at these data points over here, when y is high, x is low. When x is larger, y is smaller. Looks like r is going
to be less than zero, and a reasonable bit less than zero. It's going to approach this thing here. If we look at our choices, it wouldn't be r equals 0.65. These are positive so I wouldn't
use that one or that one. And this one is almost no correlation. R equals negative 0.02, this
is pretty close to zero. I feel good with r is
equal to negative 0.72. I wanna be clear, if I didn't
have these choices here, I wouldn't just be able to say, just looking at these data points without being able to do a calculation, that r is equal to negative 0.72. I'm just basing it on
the intuition that it is a negative correlation,
it seems pretty strong. The pattern kind of jumps out at you, that when y is large, x is small. When x is large, y is small. So I like something that's approaching r equals negative one. I've used this one up already. Now scatterplot B, if I were
to just try to eyeball it, once again this is gonna be imperfect. But the trend, if I were
to try to fit a line, it looks something like that. It looks like a line
fits in reasonably well. There's some points that
would still be hard to fit. They're still pretty far from the line. It looks like it's a positive correlation. When y is small, x is
relatively small and vice versa. As x grows, y grows and
when y grows, x grows. This one's going to be positive and it looks like it would
be reasonably positive. I have two choices here. I don't know which of
these it's going to be. It's either going to be r is equal to 0.65 or r is equal to 0.84. I also got scatterplot C,
this one's all over the place. It kinda looks like what we did over here. What does a line look like? You could almost imagine anything. Does it look like that? Does a line look like that? There's not a direction
that you could say, "Well, as x increases, maybe
y increases or decreases." There's no rhyme or reason here, so this looks very non-correlated. So this one is pretty close to zero. I feel pretty good that this is the r equal to negative 0.02. In fact, if we tried,
probably the best line that could be fit would be one with a slight negative slope. It might look something like this. And notice, even when
we try to fit a line, there's all sorts of points
that are way off the line. So the linear model did
not fit it that well. R is equal to negative 0.02. So we'll use that one. Now we have scatterplot D. That's gonna use one of the
other positive correlations and it does look like there
is a positive correlation. When y is low, x is low. When x is high, y is high and vice versa. We could try to fit something that looks something like that. But it's still not as good as that one. You can see the points
that we're trying to fit, there's several points that
are still pretty far away from our model. The model is not fitting it that well, so I would say scatterplot
B is a better fit. A linear model works
better for scatterplot B than it works for scatterplot D. I would give the higher r to scatterplot B and the lower r, r equals
0.65, to scatterplot D. R is equal to 0.65. Once again that's because
with a linear model it looks like there's a
trend but there are several more data points that are way off the line in scatterplot D than in
the case of scatterplot B. There's a few that are
still way off the line but these are even more
off of the line in D.
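The matching exercise in the video can be imitated with made-up data. The sketch below is not the data behind scatterplots A through D. It generates synthetic points around a line with different amounts of random noise (the seed, sample size, slopes and noise levels are all arbitrary choices) and computes r for each, to show that tighter points give an r nearer to 1 or -1 and patternless points give an r near zero.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [rng.uniform(0, 10) for _ in range(200)]

def noisy_line(slope, noise_sd):
    """y-values on a line through the origin, plus Gaussian noise."""
    return [slope * x + rng.gauss(0, noise_sd) for x in xs]

r_tight = pearson_r(xs, noisy_line(1, 1))    # upward trend, little noise
r_loose = pearson_r(xs, noisy_line(1, 5))    # upward trend, lots of noise
r_down = pearson_r(xs, noisy_line(-1, 3))    # downward trend
r_none = pearson_r(xs, [rng.gauss(0, 3) for _ in xs])  # y ignores x

print(r_tight, r_loose, r_down, r_none)
```

The ordering mirrors the reasoning in the video: the tighter upward cloud gets the higher positive r, the downward cloud gets a negative r, and the cloud with no trend lands close to zero.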