## AP®︎/College Statistics

# Calculating correlation coefficient r

The most common way to calculate the correlation coefficient (r) is by using technology, but using the formula can help us understand how r measures the direction and strength of the linear association between two quantitative variables.
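For reference, the formula walked through in the video transcript can be written as below. The symbols are assumptions chosen to match the transcript's wording: $\bar{x}$ and $\bar{y}$ are the sample means, $s_x$ and $s_y$ are the sample standard deviations, and $n$ is the number of $(x, y)$ pairs.

$$
r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
$$

Each factor in parentheses is a z-score, so r is the sum of the products of paired z-scores, divided by $n-1$.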

## Want to join the conversation?

- Why would you not divide by 4 when getting the SD for x? I don't understand how we got three.(33 votes)
- For calculating SD for a sample (not a population), you divide by N-1 instead of N.(33 votes)
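As a quick check of where the three comes from, here is a minimal Python sketch that computes the SD of the video's x-values both ways. The variable names are illustrative only.

```python
import math
import statistics

x = [1, 2, 2, 3]  # x-values of the four data points in the video
n = len(x)
mean_x = sum(x) / n
squared_deviations = [(xi - mean_x) ** 2 for xi in x]  # 1, 0, 0, 1

sample_sd = math.sqrt(sum(squared_deviations) / (n - 1))  # divide by 3
population_sd = math.sqrt(sum(squared_deviations) / n)    # divide by 4

print(round(sample_sd, 3))      # 0.816
print(round(population_sd, 3))  # 0.707

# Python's standard library offers both versions under separate names.
assert math.isclose(sample_sd, statistics.stdev(x))
assert math.isclose(population_sd, statistics.pstdev(x))
```

The 0.816 used in the video is the sample version, which is why the divisor is 3 rather than 4.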

- Why is r always between -1 and 1?

I know that this question has been asked before, but the answers are either too technical or too naive. Could someone please provide an answer that is mathematical in nature but can be understood by someone who has an OK but not strong mathematical foundation?

Thanks for your help.(13 votes)
- Here is a good explanation: https://sebastiansauer.github.io/why-abs-correlation-is-max-1/

And if the Cauchy–Schwarz inequality is unfamiliar to you, here is a good explanation of it: https://brilliant.org/wiki/cauchy-schwarz-inequality/ (7 votes)
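For readers who would rather not rely on the Cauchy–Schwarz inequality, here is a sketch of a more elementary argument. It uses only the fact that a sum of squares can never be negative. The notation is an assumption made for this sketch: $z_{x_i} = (x_i - \bar{x})/s_x$ and $z_{y_i} = (y_i - \bar{y})/s_y$ are the z-scores, and both standard deviations are assumed to be nonzero.

$$
\begin{aligned}
\sum_{i=1}^{n} z_{x_i}^2 &= n-1, \qquad \sum_{i=1}^{n} z_{y_i}^2 = n-1, \qquad \sum_{i=1}^{n} z_{x_i} z_{y_i} = (n-1)\,r \\
0 \le \sum_{i=1}^{n} \left( z_{x_i} - z_{y_i} \right)^2 &= 2(n-1)(1 - r) \;\Longrightarrow\; r \le 1 \\
0 \le \sum_{i=1}^{n} \left( z_{x_i} + z_{y_i} \right)^2 &= 2(n-1)(1 + r) \;\Longrightarrow\; r \ge -1
\end{aligned}
$$

The first line follows from the definitions: the sample standard deviation divides the sum of squared deviations by $n-1$, so the squared z-scores of each variable add up to exactly $n-1$. Expanding the squares in the next two lines and substituting those three sums gives the two bounds.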

- How was the formula for correlation derived?(7 votes)
- Is the correlation coefficient also called the Pearson correlation coefficient?(6 votes)
- Yes. The Pearson correlation coefficient (also known as the Pearson product-moment correlation coefficient) is the same quantity Sal calculates in this video. When it is computed from a sample, as it is here, it is usually written r and called the sample correlation coefficient.(3 votes)

- Why is the denominator n-1 instead of n?

Thanks.(4 votes)
- When the instructor calculated the standard deviation (SD), he used the sample SD formula, which has n-1 in the denominator. If you have the whole population (or almost all of it), there is another way to calculate the correlation: use the population SD, which has n in the denominator, and divide by n rather than n-1 in the overall formula. It does not matter which way you choose, as long as you use the same divisor throughout. The result will be the same.(3 votes)
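The claim that both divisors lead to the same r can be checked numerically. The sketch below uses the four data points from the video. The function name `correlation` and its `divisor_offset` parameter are hypothetical, introduced only for this illustration.

```python
import math

xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]

def correlation(xs, ys, divisor_offset):
    """Correlation using n - divisor_offset as the divisor everywhere.

    divisor_offset=1 is the sample (n-1) version from the video;
    divisor_offset=0 is the population (n) version.
    """
    n = len(xs)
    d = n - divisor_offset
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / d)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / d)
    z_products = [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / d

r_sample = correlation(xs, ys, divisor_offset=1)
r_population = correlation(xs, ys, divisor_offset=0)
print(r_sample, r_population)

# The divisor cancels out, so the two versions agree.
assert math.isclose(r_sample, r_population)
```

The divisor appears once in each standard deviation (under a square root) and once in the final average, so it cancels. Mixing the two, for example sample SDs with a final division by n, would not give the same value.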

- What does the little i stand for? Like in xi or yi in the equation. Also, the sideways m means sum right?(2 votes)
- This is a bit of math lingo related to doing the sum function, "Σ". The "i" tells us which x or y value we want. Imagine we're going through the data points in order: (1,1) then (2,2) then (2,3) then (3,6). Remembering that these stand for (x,y), if we went through all the "x"s, we would get "1" then "2" then "2" again then "3". The "i" indicates which index of that list we're on. So if "i" is 1, then "Xi" is "1"; if "i" is 2, then "Xi" is "2"; if "i" is 3, then "Xi" is "2" again; and when "i" is 4, "Xi" is "3".(5 votes)
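The same idea can be sketched as a loop in Python, assuming the x-values from the video. One difference to keep in mind: math notation usually counts from i = 1, while Python lists count from 0.

```python
xs = [1, 2, 2, 3]  # x-values of (1,1), (2,2), (2,3), (3,6)

# enumerate(..., start=1) makes the loop counter match the x_i labels.
for i, x_i in enumerate(xs, start=1):
    print(f"x_{i} = {x_i}")

# Sigma means "add up every term": here, the sum of x_i for i = 1 to 4.
total = sum(xs)
print(total)  # 8
```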

- Would the correlation coefficient be undefined if one of the z-scores in the calculation has 0 in the denominator? I thought it was possible for the standard deviation to equal 0 when all of the data points are equal to the mean.(2 votes)
- Yes. Assume that the following data points describe two variables (1,4); (1,7); (1,9); and (1,10). The mean for the x-values is 1, and the standard deviation is 0 (since they are all the same value). Given this scenario, the correlation coefficient would be undefined.

Another question to ask is whether it would ever make sense to calculate a correlation coefficient when one has the exact same data for one of the variables. Recall that the correlation coefficient is supposed to describe how well the two variables can be described by a linear relationship. Given that one variable (x in this case) is constant, I don't see how a line would ever describe the relationship between the two variables.(3 votes)
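A minimal sketch of this case in Python, using the data points from the answer above. The helper `sample_correlation` is hypothetical, and returning `None` for the undefined case is just one possible design choice. Raising an error would be equally reasonable.

```python
import math

def sample_correlation(xs, ys):
    """Return r, or None when either variable has zero spread."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    if sd_x == 0 or sd_y == 0:
        return None  # the z-scores would require dividing by zero
    return sum(((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
               for x, y in zip(xs, ys)) / (n - 1)

constant_x = sample_correlation([1, 1, 1, 1], [4, 7, 9, 10])
print(constant_x)  # None
```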

- What calculator is Sal using? (6:58)(3 votes)
- He is using a TI-84(1 vote)

- Why does calculating the SD for a sample (not a population) use N-1 instead of N?(2 votes)
- This video goes into those specifics: https://www.khanacademy.org/math/ap-statistics/summarizing-quantitative-data-ap/more-standard-deviation/v/review-and-intuition-why-we-divide-by-n-1-for-the-unbiased-sample-variance

I'd also recommend the next 3 videos after it, which show some simulations that really give you a sense of why dividing by n-1 gives an unbiased estimate. The gist is that when we take a sample and use the mean of that **sample**, since we don't know the mean of the whole **population**, we will be underestimating the distance of each data point from the population mean. We try to make up for this by dividing by a smaller number (n-1). There is a specific reason for choosing that value, which the videos explain better than I can :)(2 votes)
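A small simulation in the spirit of those videos can be sketched in Python. Everything here is an assumption made for illustration: the population is the six faces of a fair die, samples of size 4 are drawn with replacement, and the seed is fixed so the run is repeatable. The simulation tracks the variance (the SD squared), since that is the quantity the n-1 correction is designed for.

```python
import random

rng = random.Random(0)            # fixed seed so the run is repeatable
population = [1, 2, 3, 4, 5, 6]   # hypothetical population: a fair die
pop_mean = sum(population) / len(population)
true_variance = sum((v - pop_mean) ** 2 for v in population) / len(population)

def variance(sample, divisor):
    """Sum of squared deviations from the sample mean, over a chosen divisor."""
    m = sum(sample) / len(sample)
    return sum((v - m) ** 2 for v in sample) / divisor

n, trials = 4, 20000
avg_div_n = 0.0
avg_div_n_minus_1 = 0.0
for _ in range(trials):
    sample = [rng.choice(population) for _ in range(n)]
    avg_div_n += variance(sample, n) / trials
    avg_div_n_minus_1 += variance(sample, n - 1) / trials

print(true_variance, avg_div_n, avg_div_n_minus_1)
```

Averaged over many samples, dividing by n should come out noticeably below the true variance, while dividing by n-1 should land close to it.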

## Video transcript

- [Instructor] What we're going to do in this video is calculate by hand the correlation coefficient for a set of bivariate data. Now, when I say bivariate it's just a fancy way of saying for each X data point, there's a corresponding Y data point.

Now, before I calculate the correlation coefficient, let's just make sure we understand some of these other statistics that they've given us. So, we assume that these are samples of the X and the corresponding Y from our broader population. And so, we have the sample mean for X and the sample standard deviation for X.

The sample mean for X is quite straightforward to calculate, it would just be one plus two plus two plus three over four and this is eight over four which is indeed equal to two.

The sample standard deviation for X, we've also seen this before, this should be a little bit of review, it's gonna be the square root of the sum of the squared distances from each of these points to the sample mean. So, one minus two squared plus two minus two squared plus two minus two squared plus three minus two squared, all of that over, since we're talking about sample standard deviation, we have four data points, so one less than four is all of that over three. Now, this actually simplifies quite nicely because this is zero, this is zero, this is one, this is one and so you essentially get the square root of 2/3 which is, if you approximate, 0.816. So, that's that.

And the same thing is true for Y. The sample mean for Y, if you just add up one plus two plus three plus six over four, four data points, this is 12 over four which is indeed equal to three and then the sample standard deviation for Y you would calculate the exact same way we did it for X and you would get 2.160.

Now, with all of that out of the way, let's think about how we calculate the correlation coefficient. Now, right over here is a representation for the formula for the
correlation coefficient and at first it might seem a little intimidating until you realize a few things. All this is saying is for each corresponding X and Y, find the Z score for X, so we could call this Z sub X for that particular X, so Z sub X sub I and we could say this is the Z score for that particular Y. Z sub Y sub I is one way that you could think about it.

Look, this is just saying for each data point, find the difference between it and its mean and then divide by the sample standard deviation. And so, that's how many sample standard deviations is it away from its mean, and so that's the Z score for that X data point and this is the Z score for the corresponding Y data point. How many sample standard deviations is it away from the sample mean?

In the real world you won't have only four pairs and it'll be very hard to do it by hand and we typically use software computer tools to do it but it's really valuable to do it by hand to get an intuitive understanding
of what's going on here.

So, in this particular situation, R is going to be equal to one over N minus one. We have four pairs, so it's gonna be 1/3 and it's gonna be times a sum of the products of the Z scores. So, this first pair right over here, so the Z score for this one is going to be one minus two, how far it is away from the X sample mean, divided by the X sample standard deviation, 0.816, that times, now we're looking at the Y variable, the Y Z score, so it's one minus three, one minus three over the Y sample standard deviation, 2.160 and we're just going to keep doing that. I'll do it like this.

So, the next one it's going to be two minus two over 0.816, this is where I got the two from and I'm subtracting from that the sample mean right over here, times, now we're looking at this two, two minus three over 2.160 plus, I'm happy there's only four pairs here, two minus two again, two minus two over 0.816 times, now we're gonna have three minus three, three minus three over 2.160 and then the last pair you're going to have three minus two, three minus two over 0.816 times six minus three, six minus three over 2.160.

So, before I get a calculator out, let's see if there's some simplifications I can do. Two minus two, that's gonna be zero, zero times anything is zero, so this whole thing is zero, two minus two is zero, three minus three is zero, this is actually gonna be zero times zero, so that whole thing is zero. Let's see this is going to be one minus two which is negative one, one minus three is negative two, so this is going to be R is equal to 1/3 times, negative times negative is positive, and so this is going to be two over 0.816 times 2.160 and then plus, three minus two is one, six minus three is three, so plus three over 0.816 times 2.160.

Well, these are the same denominator, so actually I could rewrite if I have two over this thing plus three over this thing, that's gonna be five over this thing, so I could rewrite this whole thing, five over 0.816 times 2.160 and now I can just get a calculator out to actually calculate this, so we have one divided by three times five divided by 0.816 times 2.16, the zero won't make a difference but I'll just write it down, and then I will close that parentheses and let's see what we get. We get an R of, and since everything else goes to the thousandths place, I'll just round to the thousandths place, an R of 0.946. So, R is approximately 0.946.

So, what does this tell us? The correlation coefficient is a measure of how well a line can
describe the relationship between X and Y. R is always going to be greater than or equal to negative one and less than or equal to one. If R is positive one, it means that an upwards sloping line can completely describe the relationship. If R is negative one, it means a downwards sloping line can completely describe the relationship. R anywhere in between says well, it won't be as good. If R is zero that means that a line isn't describing the relationship well at all.

Now in our situation here, not to use a pun, in our situation here, our R is pretty close to one which means that a line can get pretty close to describing the relationship between our Xs and our Ys. So, for example, I'm just going to try to hand draw a line here and it does turn out that our least squares line will always go through the mean of the X and the Y, so the mean of the X is two, mean of the Y is three, we'll study that in more depth in future videos but let's see, this actually does look like a pretty good line. So, let me just draw it right over there. You see that I actually can draw a line that gets pretty close to describing it. It isn't perfect. If it went through every point then I would have an R of one but it gets pretty close to describing what is going on.

Now, the next thing I wanna do is focus on the intuition. What was actually going on
here with these Z scores and how does taking products of corresponding Z scores get us this property that I just talked about where an R of one will be strong, positive correlation, R of negative one would be strong, negative correlation?

Well, let's draw the sample means here. So, the X sample mean is two, this is our X axis here, this is X equals two and our Y sample mean is three. This is the line Y is equal to three. Now, we can also draw the standard deviations. This is, let's see, the standard deviation for X is 0.816 so I'll be approximating it, so if I go .816 less than our mean it'll get us at some place around there, so that's one standard deviation below the mean, one standard deviation above the mean would put us some place right over here, and if I do the same thing in Y, one standard deviation above the mean, 2.160 so that'll be 5.160 so it would put us some place around there and one standard deviation below the mean, so let's see we're gonna go, if we took away two, we would go to one and then we're gonna go take another .160, so it's gonna be some place right around here.

So, for example, for this first pair, one comma one. What were we doing? Well, we said alright, how many standard deviations is this below the mean? And that turned out to be negative one over 0.816, that's what we have right over here, that's what this would have calculated, and then how many standard deviations in the Y direction, and that is our negative two over 2.160 but notice, since both of them were negative it contributed to the R, this would become a positive value and so, one way to think about it, it might be helping us get closer to the one. If both of them have a negative Z score that means that there's a positive correlation between the variables. When one is below the mean, the other is you could say, similarly below the mean.

Now, if we go to the next data point, two comma two right over here, what happened? Well, the X variable was right on the mean and because of that that entire term became zero. The X Z score was zero. And so, that would have taken away a little bit from our correlation coefficient. The reason why it would take away even though it's not negative, you're not contributing to the sum but you're going to be dividing by a slightly higher value by including that extra pair.

If you had a data point where let's say X was below the mean and Y was above the mean, something like this, if this was one of the points, this term would have been negative because the Y Z score would have been positive and the X Z score would have been negative and so, when you put it in the sum it would have actually taken away from the sum and so, it would have made the R score even lower. Similarly something like this would have made the R score even lower because you would have a positive Z score for X and a negative Z score for Y and so a product of a positive and a negative would be a negative.
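The hand calculation in the transcript can be reproduced with a short Python sketch that prints each point's z-scores and their product, so the sign of every contribution is visible. The variable names are illustrative. The last two lines compare the result from unrounded standard deviations with the transcript's arithmetic, which plugs in the rounded values 0.816 and 2.160.

```python
import math

points = [(1, 1), (2, 2), (2, 3), (3, 6)]
n = len(points)
mean_x = sum(x for x, _ in points) / n   # 2.0
mean_y = sum(y for _, y in points) / n   # 3.0
sd_x = math.sqrt(sum((x - mean_x) ** 2 for x, _ in points) / (n - 1))
sd_y = math.sqrt(sum((y - mean_y) ** 2 for _, y in points) / (n - 1))

products = []
for x, y in points:
    z_x = (x - mean_x) / sd_x
    z_y = (y - mean_y) / sd_y
    products.append(z_x * z_y)
    print(f"({x}, {y}): z_x={z_x:+.3f}, z_y={z_y:+.3f}, product={z_x * z_y:+.3f}")

# r from unrounded standard deviations.
r_exact = sum(products) / (n - 1)
# Same arithmetic as the transcript, with the SDs rounded to 0.816 and 2.160.
r_rounded_inputs = (1 / 3) * (5 / (0.816 * 2.160))

print(round(r_exact, 3))           # 0.945
print(round(r_rounded_inputs, 3))  # 0.946
```

Only the first and last points contribute to the sum, and both contributions are positive, which matches the intuition described above. The small gap between 0.945 and 0.946 comes from rounding the standard deviations before dividing.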