Example: Correlation coefficient intuition

AP.STATS: DAT‑1 (EU), DAT‑1.B (LO), DAT‑1.B.1 (EK), DAT‑1.C (LO), DAT‑1.C.1 (EK)
Sal explains the intuition behind correlation coefficients and does a problem where he matches correlation coefficients to scatter plots.

Want to join the conversation?

  • duskpin tree style avatar for user Michelle
    What is "r", in the correlation coefficient r= 0.65?
    (13 votes)
    • blobby green style avatar for user robshowsides
      "r" is the correlation coefficient. It is always between -1 and 1, with -1 meaning the points are on a perfect straight line with negative slope, and r = 1 meaning the points are on a perfect straight line with positive slope.
      If you want to calculate it from data, this is the procedure:
      1) Find the mean (average) of all the x-values. Call this xbar.
      2) Find the mean (average) of all the y-values. Call this ybar.
      3) For every x-value, subtract xbar. Call these Δxi (i is an index. i = 1, 2, 3, ...)
      4) For every y-value, subtract ybar. Call these Δyi (i is an index. i = 1, 2, 3, ...)

      These Δxi's and Δyi's are called the "deviations". They will be approximately half positive and half negative, since (usually) about half the values are above the mean and half are below. To calculate r,
      r = Σ(Δxi*Δyi) / [ sqrt( Σ(Δxi²) ) * sqrt( Σ(Δyi²) ) ]
      So you can see that the bottom is the square root of the sum of the squared deviations for x, times the same for y. Because the deviations are squared, every term is positive (except maybe a few that are zero when Δxi = 0 or Δyi = 0, i.e. for any values exactly equal to the mean).

      The key is the top, where nothing is squared. The top is the sum of Δxi *Δyi, so it will be positive when Δx and Δy are BOTH positive or BOTH negative. This pushes r towards being positive (positive correlation). But when Δx and Δy have opposite signs, then Δxi *Δyi will be negative, and that pushes r towards being negative (negative correlation).

      Make up a simple example and try it, with, say, four points. Here are four points to try it with that make the calculation not too bad:
      (1, 1), (2, 3), (6, 5), (7, 11)
      You should find xbar = 4 and ybar = 5
      Thus, Δxi's are -3, -2, 2, 3, and Δyi's are -4, -2, 0, 6
      Put these in the formula and you should get r = 0.891, a quite high correlation.
      Conversely, pick any four points that make a horizontal rectangle, for example (2, 2), (8, 2), (2, 6), (8, 6). If you calculate r for these points, it will be 0.
      (32 votes)
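
The procedure in the reply above can be sketched in a few lines of Python. This is only an illustration of those steps, not code from the video or the exercise; the function name `pearson_r` and the variable names are assumptions chosen here. It uses the two sets of four points suggested in the reply.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient computed from deviations about the means."""
    n = len(xs)
    xbar = sum(xs) / n                    # step 1: mean of the x-values
    ybar = sum(ys) / n                    # step 2: mean of the y-values
    dx = [x - xbar for x in xs]           # step 3: x deviations
    dy = [y - ybar for y in ys]           # step 4: y deviations
    top = sum(a * b for a, b in zip(dx, dy))
    bottom = math.sqrt(sum(a * a for a in dx)) * math.sqrt(sum(b * b for b in dy))
    return top / bottom

# The four points from the reply: (1, 1), (2, 3), (6, 5), (7, 11)
r1 = pearson_r([1, 2, 6, 7], [1, 3, 5, 11])
# The horizontal rectangle: (2, 2), (8, 2), (2, 6), (8, 6)
r2 = pearson_r([2, 8, 2, 8], [2, 2, 6, 6])

print(round(r1, 3))  # 0.891
print(round(r2, 3))  # 0.0
```

For the rectangle, the four products Δxi*Δyi are 6, -6, -6 and 6, so the top of the fraction cancels to zero.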
  • leaf yellow style avatar for user Johannes
    I think the answer is no, but does the slope of the line matter in regards to the r-value?

    Will it always be -1 even if the line is just slightly tilted "downwards"?
    (4 votes)
    • leaf blue style avatar for user Dr C
      Yes and no. There are two particular situations where the slope (or lack thereof) does matter:

      1. When there is no variation in the x-variable (ie: all of the points are on a vertical line).
      2. When there is no variation in the y-variable (all the points are on a horizontal line).

      In both of these cases, the correlation is undefined (and for the vertical line, so is the slope). But outside of these special cases, the answer is no, the magnitude of the slope doesn't matter, only the sign. If all the points lie on a straight line, then the slope could be -1 or -1000, and the correlation coefficient would still be -1.
      (13 votes)
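
A small Python sketch of the point made in the reply above: for points exactly on a line, the steepness does not change r, and r cannot be computed when one variable has no variation. The helper `pearson_r` and its choice to return `None` for the undefined case are assumptions made for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Return r, or None when either variable has no variation (r undefined)."""
    xbar = sum(xs) / len(xs)
    ybar = sum(ys) / len(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return None  # the denominator would be zero
    return sxy / math.sqrt(sxx * syy)

xs = [0, 1, 2, 3, 4]
gentle = pearson_r(xs, [-1 * x for x in xs])       # points on y = -x
steep = pearson_r(xs, [-1000 * x for x in xs])     # points on y = -1000x
horizontal = pearson_r(xs, [5, 5, 5, 5, 5])        # no variation in y
vertical = pearson_r([5, 5, 5, 5, 5], xs)          # no variation in x

print(gentle, steep)         # both are -1.0
print(horizontal, vertical)  # both are None
```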
  • male robot hal style avatar for user Bizzo
    Can a line be greater than 1 or less than -1?
    (5 votes)
    • male robot johnny style avatar for user kingadj123
      Not in this context, no. 1 means a perfect positive correlation here while -1 means a perfect negative correlation. Any deviation from this perfect correlation would reduce the magnitude of the correlation coefficient. It is important to note that the correlation coefficient is NOT the incline / slope of the line that depicts the given data but rather the degree to which all of the data can be described by that line, or how far the data deviates from it. Hence the term linear correlation. If the data results in a perfect line, it is an r = 1 (the more, the more) or an r = -1 (the more, the less).
      (2 votes)
  • piceratops ultimate style avatar for user abhi.devata
    Sal says that a correlation coefficient of 0 means that a line would not fit well at all. Do we define lines as y=mx+b (algebra) or a set of points that extend infinitely in both/opposite directions (geometry)? Because x=0 geometrically is a line, but algebraically is not. So if the line of best fit is x=0, then what would the correlation coefficient be? (Sorry if this is a dumb question.)
    (5 votes)
    • starky seedling style avatar for user deka
      for the last specific case you mentioned (x=0), the correlation coefficient r would be 0 too.

      visually, the line is exactly on the y axis. this means you have no choice on x variable and even when you "choose" 0 as x, it can't give you a definite answer as it could spit out any values as y, thus there's no trend between x and y variables here at all

      i think your question isn't dumb, rather thought-provoking

      keep going

      #p.s. if you meant y=0m+b by saying x=0, the same logic can be applied more clearly. y=b means a line of 0 slope. thus whatever you choose as x, it has no impact on y as y is always b. so no trend, thus r=0 once again.
      (1 vote)
  • aqualine seed style avatar for user 418375
    How do you determine if it's a strong or weak correlation?
    (3 votes)
    • purple pi teal style avatar for user Jay Mitchell
      Visually, if there is a strong correlation, you can see that by how close the points are to the line. For example, scatterplot B more closely fits the line than scatterplot D. More technically, you can calculate the standard deviation. A lower standard deviation would indicate a stronger correlation. As far as when something tips from being a weak correlation to a strong correlation, I'm afraid I don't know that yet.
      (5 votes)
  • blobby green style avatar for user Mian Shah
    Is this the same as Pearson correlation coefficient?
    (3 votes)
  • blobby green style avatar for user matthewhicks2
    What would you say if the line went straight through the graph? Would the r value be 0 because it’s not positive or negative?
    (3 votes)
    • starky tree style avatar for user Selin Dik
      Yes. It'd just be r=0 because there really isn't a relationship between x and y (that is, if you and I are thinking of the same example). For example, take a horizontal line. If y is always something, x is always different (or not, it depends where the line is). There is no relationship there.
      (1 vote)
  • leaf blue style avatar for user Jotaro
    Does the correlation coefficient show how much the data points are scattered on the plane?
    If I have data points very near to each other but I can't form a specific line, does this mean that the scatter plot will have a correlation coefficient other than zero?
    (2 votes)
    • hopper cool style avatar for user Avi Mahajan
      If you have points very close to each other, but you can't create a specific line, it will be closer to either one or negative one. As the point gets near to other points, the correlation coefficient will go towards 1 or -1. As the points get far away from other points, the correlation coefficient goes toward zero. One of the graphs in Sal's video had lots of points scattered in different directions. This graph had a correlation coefficient of -0.02. Hope this helped!
      (2 votes)
  • leaf blue style avatar for user Jotaro
    I don't quite understand the correlation. Can I say that correlation is based on slope concept?
    Correlation varies between -1 and 1. Does this mean that the line with a slope larger than 1 or smaller than -1 (e.g. 1000, -320) will have correlation of 1 or -1?
    What if I have a line y=5 (slope of which is zero) or x=5 (with undefined slope)?
    (2 votes)
    • hopper cool style avatar for user Avi Mahajan
      Jotaro, the slope has nothing to do with the correlation coefficient. The slope is the measure of how steep a specific line is. However, the correlation coefficient is a measure of how close the points are to a line. If a line fits the data well, it will be close to either 1 or -1. However, if the line does not fit the data well, it will be closer to zero. Hope this helped!
      (2 votes)
  • leaf red style avatar for user Elizabeth Lopez
    You said, "I do see a general trend." What do you mean by a general trend?
    (2 votes)

Video transcript

- I took some screen captures from the Khan Academy exercise on correlation coefficient intuition. They've given us some correlation coefficients and we have to match them to the various scatterplots on that exercise. There's a little interface where we can drag these around in a table to match them to the different scatterplots. The point isn't to figure out how exactly to calculate these, we'll do that in the future, but really to get an intuition of what we are trying to measure. The main idea is that correlation coefficients are trying to measure how well a linear model can describe the relationship between two variables.
For example, let me do some coordinate axes here. Let's say that's one variable. Say that's my y variable and let's say that is my x variable. Let's say when x is low, y is low. When x is a little higher, y is a little higher. When x is a little bit higher, y is higher. When x is really high, y is even higher. A linear model would describe it very, very well. It's quite easy to draw a line that essentially goes through those points. So something like this would have an r of 1, r is equal to one. A linear model perfectly describes it and it's a positive correlation. When one increases, when one variable gets larger, then the other variable is larger. When one variable is smaller, then the other variable is smaller, and vice versa.
Now what would an r of negative one look like? Well, that would once again be a situation where a linear model works really well but when one variable moves up, the other one moves down and vice versa. Let me draw my coordinates, my coordinate axes again. I'm gonna try to draw a dataset where the r would be negative one. Maybe when y is high, x is very low. When y becomes lower, x becomes higher. When y becomes a good bit lower, x becomes a good bit higher. Once again, when y decreases, x increases or as x increases, y decreases. They're moving in opposite directions but you can fit a line very easily to this.
The line would look something like this. This would have an r of negative one. An r of zero, r is equal to zero, would be a dataset that a line doesn't really fit very well at all. I'll do that one really small, since I don't have much space here. An r of zero might look something like this. Maybe I'll have a data point here, maybe have a data point here, maybe I have one there. There, there. And it wouldn't necessarily be this well organized but this gives you a sense of things. How would you actually try to fit a line here? You could equally justify a line that looks like that or a line that looks like that, or a line that looks like that. A linear model really does not describe the relationship between the two variables that well, right over here.
So with that as a primer, let's see if we can tackle these scatterplots. The way I'm gonna do it is I'm just gonna try to eyeball what a linear model might look like. There are different methods of trying to fit a linear model to a dataset, an imperfect dataset. I drew very perfect ones, at least for the r equals negative one and r equals one, but these are what the real world actually looks like. Very few times will things perfectly sit on a line.
For scatterplot A, if I were to try to fit a line, it would look something like that. If I were to try to minimize distances from the points to the line, I do see a general trend if we look at these data points over here, when y is high, x is low. When x is larger, y is smaller. Looks like r is going to be less than zero, and a reasonable bit less than zero. It's going to approach this thing here. If we look at our choices, it wouldn't be r equals 0.65. These are positive so I wouldn't use that one or that one. And this one is almost no correlation. R equals negative 0.02, this is pretty close to zero. I feel good with r is equal to negative 0.72.
I wanna be clear, if I didn't have these choices here, I wouldn't just be able to say, just looking at these data points without being able to do a calculation, that r is equal to negative 0.72. I'm just basing it on the intuition that it is a negative correlation, it seems pretty strong. The pattern kind of jumps out at you, that when y is large, x is small. When x is large, y is small. So I like something that's approaching r equals negative one. I've used this one up already.
Now scatterplot B, if I were to just try to eyeball it, once again this is gonna be imperfect. But the trend, if I were to try to fit a line, it looks something like that. It looks like a line fits in reasonably well. There's some points that would still be hard to fit. They're still pretty far from the line. It looks like it's a positive correlation. When y is small, x is relatively small and vice versa. As x grows, y grows and when y grows, x grows. This one's going to be positive and it looks like it would be reasonably positive. I have two choices here. I don't know which of these it's going to be. It's either going to be r is equal to 0.65 or r is equal to 0.84.
I also got scatterplot C, this one's all over the place. It kinda looks like what we did over here. What does a line look like? You could almost imagine anything. Does it look like that? Does a line look like that? There's not a direction that you could say, "Well, as x increases, maybe y increases or decreases." There's no rhyme or reason here, so this looks very non-correlated. So this one is pretty close to zero. I feel pretty good that this is the r is equal to negative 0.02. In fact, if we tried, probably the best line that could be fit would be one with a slight negative slope. It might look something like this. And notice, even when we try to fit a line, there's all sorts of points that are way off the line. So the linear model did not fit it that well. R is equal to negative 0.02, so we'll use that one.
Now we have scatterplot D. That's gonna use one of the other positive correlations and it does look like there is a positive correlation. When y is low, x is low. When x is high, y is high and vice versa. We could try to fit something that looks something like that. But it's still not as good as that one. You can see the points that we're trying to fit, there's several points that are still pretty far away from our model. The model is not fitting it that well, so I would say scatterplot B is a better fit. A linear model works better for scatterplot B than it works for scatterplot D. I would give the higher r to scatterplot B and the lower r, r equals 0.65, to scatterplot D. R is equal to 0.65. Once again that's because with a linear model it looks like there's a trend but there are several more data points that are way off the line in scatterplot D than in the case of scatterplot B. There's a few that are still way off the line but these are even more off of the line in D.
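
The comparison between scatterplots B and D can be illustrated with a short Python sketch. The data here are made up for the illustration and are not the points from the exercise: the same upward trend is combined with a small and then a large back-and-forth offset from the line, and r drops as the points move farther from it. The names `pearson_r`, `wiggle`, `tight` and `loose` are assumptions chosen for this example.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient computed from deviations about the means."""
    xbar = sum(xs) / len(xs)
    ybar = sum(ys) / len(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

xs = list(range(10))
# Alternating +1 / -1 offsets push points above and below the line y = x.
wiggle = [1 if i % 2 == 0 else -1 for i in range(10)]

perfect = [float(x) for x in xs]                  # exactly on the line
tight = [x + 0.5 * w for x, w in zip(xs, wiggle)]  # small offsets from the line
loose = [x + 3.0 * w for x, w in zip(xs, wiggle)]  # large offsets from the line

r_perfect = pearson_r(xs, perfect)
r_tight = pearson_r(xs, tight)
r_loose = pearson_r(xs, loose)

# All three are positive, and r shrinks as the points sit farther from the line.
print(r_perfect, r_tight, r_loose)
```

All three datasets share the same underlying trend, so the difference in r comes only from how far the points sit from the line, which is the reasoning used to give scatterplot B the higher r.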