# Calculating correlation coefficient r

## Video transcript

- [Instructor] What we're
going to do in this video is calculate by hand the correlation coefficient
for a set of bivariate data. Now, when I say bivariate, it's just a fancy way of
saying for each X data point, there's a corresponding Y data point. Now, before I calculate the
correlation coefficient, let's just make sure we understand some of these other statistics
that they've given us. So, we assume that these are samples of the X and the corresponding Y from our broader population. And so, we have the sample mean for X and the sample standard deviation for X. The sample mean for X
is quite straightforward to calculate, it would
just be one plus two plus two plus three over four and this is eight over four which is indeed equal to two. The sample standard deviation for X, we've also seen this before, this should be a little bit of review, it's gonna be the square root of the sum of the squared distances from each of these points to the sample mean. So, one minus two squared plus two minus two squared plus two minus two squared plus three minus two squared, all of that over, since
we're talking about sample standard deviation, we have four data points, so one less than four is
all of that over three. Now, this actually simplifies quite nicely because this is zero, this is zero, this is one, this is one and so you essentially get the square root of 2/3 which is approximately 0.816. So, that's that. And the same thing is true for Y. The sample mean for Y, if you just add up one plus two plus three plus six over four, four data points, this is 12 over four which
is indeed equal to three and then the sample standard deviation for Y you would calculate
the exact same way we did it for X and you would get 2.160. Now, with all of that out of the way, let's think about how we calculate the correlation coefficient. Now, right over here is a representation for the formula for the
correlation coefficient and at first it might
seem a little intimidating until you realize a few things. All this is saying is for
each corresponding X and Y, find the Z score for X, so we could call this Z sub X for that particular X, so Z sub X sub I and we could say this is the Z score for that particular Y. Z sub Y sub I is one way that
you could think about it. Look, this is just saying
for each data point, find the difference
between it and its mean and then divide by the
sample standard deviation. And so, that's how many
sample standard deviations is it away from its mean, and so that's the Z score
for that X data point and this is the Z score for
the corresponding Y data point. How many sample standard
deviations is it away from the sample mean? In the real world you
won't have only four pairs and it'll be very hard to do it by hand and we typically use software
or computer tools to do it but it's really valuable to do it by hand to get an intuitive understanding
of what's going on here. So, in this particular situation, R is going to be equal
to one over N minus one. We have four pairs, so it's gonna be 1/3 and it's gonna be times
a sum of the products of the Z scores. So, this first pair right over here, so the Z score for this one is going to be one
minus two, which is the X sample mean, divided by the X sample
standard deviation, 0.816, that times one, now we're looking at the Y variable, the Y Z score, so it's one minus three, one minus three over the Y
sample standard deviation, 2.160 and we're just going to keep doing that. I'll do it like this. So, the next one it's
going to be two minus two over 0.816, this is
where I got the two from and I'm subtracting from
that the sample mean right over here, times, now
we're looking at this two, two minus three over 2.160 plus I'm happy there's
only four pairs here, two minus two again, two minus two over 0.816 times now we're
gonna have three minus three, three minus three over 2.160 and then the last pair you're
going to have three minus two, three minus two over 0.816 times six minus three, six minus three over 2.160. So, before I get a calculator out, let's see if there's some
simplifications I can do. Two minus two, that's gonna be zero, zero times anything is zero, so this whole thing is zero, two minus two is zero, three minus three is zero, this is actually gonna be zero times zero, so that whole thing is zero. Let's see this is going
to be one minus two which is negative one, one minus three is negative two, so this is going to be R is equal to 1/3 times negative times negative is positive and so this is going to be two over 0.816 times 2.160 and then plus
three minus two is one, six minus three is three, so plus three over 0.816 times 2.160. Well, these are the same denominator, so actually I could rewrite
if I have two over this thing plus three over this thing, that's gonna be five over this thing, so I could rewrite this whole thing, five over 0.816 times 2.160 and now I can just get a calculator out to actually calculate this, so we have one divided by three times five divided by, in parentheses, 0.816 times 2.16, the zero won't make a difference but I'll just write it down, and then I will close the parentheses and let's see what we get. We get an R of, and since everything else goes to the thousandths place, I'll just round to the thousandths place, an R of 0.946. So, R is approximately 0.946. So, what does this tell us? The correlation coefficient is a measure of how well a line can
describe the relationship between X and Y. R is always going to be greater than or equal to negative one and less than or equal to one. If R is positive one, it means that an upwards sloping line can completely describe the relationship. If R is negative one, it means a downwards sloping line can completely describe the relationship. R anywhere in between says well, it won't be as good. If R is zero that means
that a line isn't describing the relationship well at all. Now in our situation here, not to use a pun, in our situation here, our R is pretty close to one which means that a line
can get pretty close to describing the relationship between our Xs and our Ys. So, for example, I'm just
going to try to hand draw a line here and it does turn out that
our least squares line will always go through the mean of the X and the Y, so the mean of the X is two, mean of the Y is three, we'll study that in more
depth in future videos but let's see, this
actually does look like a pretty good line. So, let me just draw it right over there. You see that I actually can draw a line that gets pretty close to describing it. It isn't perfect. If it went through every point then I would have an R of one but it gets pretty close to describing what is going on. Now, the next thing I wanna do is focus on the intuition. What was actually going on
here with these Z scores and how does taking products
of corresponding Z scores get us this property
that I just talked about where an R of one will be
strong, positive correlation, R of negative one would be strong, negative correlation? Well, let's draw the sample means here. So, the X sample mean is two, this is our X axis here, this is X equals two and our Y sample mean is three. This is the line Y is equal to three. Now, we can also draw
the standard deviations. This is, let's see, the standard deviation for X is 0.816 so I'll
be approximating it, so if I go .816 less than our mean it'll get us at some place around there, so that's one standard
deviation below the mean, one standard deviation above the mean would put us some place right over here, and if I do the same thing in Y, one standard deviation
above the mean, 2.160 so that'll be 5.160 so it would put us some place around there and one standard deviation below the mean, so let's see we're gonna
go, if we took away two, we would go to one and then we're gonna go take another .160, so it's gonna be some
place right around here. So, for example, for this first pair, one comma one. What were we doing? Well, we said alright, how
many standard deviations is this below the mean? And that turned out to be
negative one over 0.816, that's what we have right over here, that's what this would have calculated, and then how many standard deviations in the Y direction, and that is our negative two over 2.160 but notice, since both
of them were negative it contributed to the R, this would become a positive value and so, one way to think about it, it might be helping us
get closer to the one. If both of them have a negative Z score that means that there's
a positive correlation between the variables. When one is below the mean, the other is you could say, similarly below the mean. Now, if we go to the next data point, two comma two right over
here, what happened? Well, the X variable was right on the mean and because of that that
entire term became zero. The X Z score was zero. And so, that would have taken away a little bit from our
correlation coefficient. The reason why it would take away, even though it's not negative, is that you're not contributing to the sum but you're going to be dividing
by a slightly higher value by including that extra pair. If you had a data point where
let's say X was below the mean and Y was above the mean, something like this, if this was one of the points, this term would have been negative because the Y Z score
would have been positive and the X Z score would have been negative and so, when you put it in the sum it would have actually taken away from the sum and so, it would have made the R score even lower. Similarly something like this would have made the R score even lower because you would have
a positive Z score for X and a negative Z score for Y and so a product of a
positive and a negative would be a negative.
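
The hand calculation above can be checked with a short Python sketch. It is not part of the video: the helper names (`sample_mean`, `sample_std`, `correlation`) are made up for illustration, and the four pairs (1, 1), (2, 2), (2, 3), (3, 6) are read off from the arithmetic in the transcript. It follows the same recipe: find each Z score using the sample standard deviation, multiply corresponding Z scores, add them up, and divide by N minus one.

```python
from math import sqrt

# The four (X, Y) pairs used in the video, as read off from the transcript.
xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]

def sample_mean(values):
    """Arithmetic mean of the sample."""
    return sum(values) / len(values)

def sample_std(values):
    """Sample standard deviation (divides by n - 1, as in the video)."""
    m = sample_mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """r = 1/(n - 1) times the sum of products of corresponding Z scores."""
    n = len(xs)
    x_bar, y_bar = sample_mean(xs), sample_mean(ys)
    s_x, s_y = sample_std(xs), sample_std(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

r = correlation(xs, ys)
print(round(r, 3))  # 0.945

# Using the standard deviations rounded to three decimals, as in the video:
r_rounded = (1 / 3) * 5 / (0.816 * 2.160)
print(round(r_rounded, 3))  # 0.946
```

At full precision the sketch gives about 0.945. The 0.946 in the video comes from rounding the two standard deviations to 0.816 and 2.160 before dividing, so the small difference is a rounding effect and not a different method.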