AP.STATS: DAT‑1 (EU), DAT‑1.B (LO), DAT‑1.B.1 (EK), DAT‑1.B.2 (EK), DAT‑1.C (LO), DAT‑1.C.1 (EK)

Video transcript

What we're going to do in this video is calculate by hand the correlation coefficient for a set of bivariate data. When I say bivariate, it's just a fancy way of saying that for each x data point there is a corresponding y data point.

Now, before I calculate the correlation coefficient, let's just make sure we understand some of these other statistics that they've given us. We assume that these are samples of x and the corresponding y from a broader population, and so we have the sample mean for x and the sample standard deviation for x. The sample mean for x is quite straightforward to calculate: it would just be 1 plus 2 plus 2 plus 3, over 4, and this is 8 over 4, which is indeed equal to 2.

The sample standard deviation for x we've also seen before, so this should be a little bit of review. It's going to be the square root of the distance from each of these points to the sample mean, squared: so (1 minus 2) squared plus (2 minus 2) squared plus (2 minus 2) squared plus (3 minus 2) squared, all of that over, since we're talking about the sample standard deviation and we have four data points, one less than four, so all of that over 3. Now, this actually simplifies quite nicely, because this is 0, this is 0, this is 1, this is 1, and so you essentially get the square root of 2/3, which is approximately 0.816.

So that's that, and the same thing is true for y. The sample mean for y, if you just add up 1 plus 2 plus 3 plus 6, over 4 for the four data points, is 12 over 4, which is indeed equal to 3. And the sample standard deviation for y you would calculate the exact same way we did it for x, and you get 2.160.

Now, with all of that out of the way, let's think about how we calculate the correlation coefficient. Right over here is a representation of the formula for the correlation coefficient. At first it might seem a little intimidating, until you realize a few things. All this is saying is: for each corresponding x and y, find the z-score for x, so we could call this z sub x for that
particular x, so z sub x sub i, and we could say this is the z-score for that particular y, z sub y sub i, is one way that you could think about it. Look, this is just saying: for each data point, find the difference between it and its mean, and then divide by the sample standard deviation. That's how many sample standard deviations it is away from its mean, and so that's the z-score for that x data point, and this is the z-score for the corresponding y data point: how many sample standard deviations it is away from the sample mean.

In the real world you won't have only four pairs, it will be very hard to do by hand, and we typically use software tools to do it. But it's really valuable to do it by hand to get an intuitive understanding of what's going on here.

So in this particular situation, r is going to be equal to 1 over n minus 1. We have four pairs, so it's going to be 1 over 3, times the sum of the products of the z-scores. For this first pair right over here, the z-score for x is going to be 1 minus 2, how far it is away from the x sample mean, divided by the x sample standard deviation, 0.816. That times, now we're looking at the y variable, the y z-score, so it's 1 minus 3 over the y sample standard deviation, 2.160. And we're just going to keep doing that. I'll do it like this. So the next one is going to be 2 minus 2 over 0.816; this is where I got the 2 from, and I'm subtracting from that the sample mean right over here. Times, now we're looking at this 2, 2 minus 3 over 2.160. Plus, and I'm happy there are only four pairs here, 2 minus 2 again over 0.816, times, now we're going to have 3 minus 3 over 2.160. And then for the last pair you're going to have 3 minus 2 over 0.816, times 6 minus 3 over 2.160.

So before I get a calculator out, let's see if there are some
simplifications I can do. 2 minus 2, that's going to be 0, and 0 times anything is 0, so this whole thing is 0. 2 minus 2 is 0, 3 minus 3 is 0, so this is actually going to be 0 times 0, and that whole thing is 0. Let's see, this is going to be 1 minus 2, which is negative 1, and 1 minus 3 is negative 2. So r is equal to 1/3 times, negative times negative is positive, so this is going to be 2 over (0.816 times 2.160). And then plus, 3 minus 2 is 1 and 6 minus 3 is 3, so plus 3 over (0.816 times 2.160). Well, these have the same denominator, so if I have 2 over this thing plus 3 over this thing, that's going to be 5 over this thing. So I could rewrite this whole thing as 5 over (0.816 times 2.160).

Now I can just get a calculator out to actually calculate this. So we have 1 divided by 3, times 5, divided by, in parentheses, 0.816 times 2.160 (the trailing zero won't make a difference, but I'll just write it down), and then I will close that parenthesis. Let's see what we get. We get an r of, and since everything else goes to the thousandths place I'll just round to the thousandths place, 0.946. So r is approximately 0.946.

So what does this tell us? The correlation coefficient is a measure of how well a line can describe the relationship between x and y. r is always going to be greater than or equal to negative 1 and less than or equal to 1. If r is positive 1, it means that an upward-sloping line can completely describe the relationship. If r is negative 1, it means a downward-sloping line can completely describe the relationship. r anywhere in between says, well, it won't be as good. If r is 0, that means that a line isn't describing the relationship well at all.

Now in our situation here, not to use a pun, our r is pretty close to 1, which means that a line can get pretty close to describing the relationship between
our x's and our y's. So for example, I'm just going to try to hand-draw a line here. It does turn out that our least-squares line will always go through the mean of the x and the y, so the mean of x is 2 and the mean of y is 3. We'll study that in more depth in future videos. But let's see, this actually does look like a pretty good line, so let me just draw it right over there. You see that I actually can draw a line that gets pretty close to describing it. It isn't perfect; if it went through every point, then I would have an r of 1. But it gets pretty close to describing what is going on.

Now the next thing I want to do is focus on the intuition. What was actually going on here with these z-scores, and how does taking products of corresponding z-scores get this property that I just talked about, where an r of 1 would be a strong positive correlation and an r of negative 1 would be a strong negative correlation?

Well, let's draw the sample means here. The x sample mean is 2; this is our x-axis here, and this is x equals 2. And our y sample mean is 3; this is the line y is equal to 3. Now we can also draw the standard deviations. Let's see, the standard deviation for x is 0.816, so I'll be approximating it. If I go 0.816 less than our mean, it'll get us someplace around there, so that's one standard deviation below the mean. One standard deviation above the mean would put us someplace right over here. And if I do the same thing in y, one standard deviation above the mean, 2.160 above, would be 5.160, so it would put us someplace around there. And one standard deviation below the mean: if we took away 2 we would go to 1, and then we take away another 0.160, so it's going to be someplace right around here.

So for example, for this first pair, (1, 1), what were we doing? Well, we said, all right, how many standard deviations is this below the mean? And that turns out
to be negative 1 over 0.816. That's what we have right over here; that's what this would have calculated. And then how many standard deviations in the y direction? That is our negative 2 over 2.160. But notice, since both of them were negative, what it contributed to r would become a positive value, and so one way to think about it is that it's helping us get closer to 1. If both of them have a negative z-score, that means there is a positive correlation between the variables: when one is below the mean, the other is, you could say, similarly below the mean.

Now if we go to the next data point, (2, 2) right over here, what happened? Well, the x variable was right on the mean, and because of that, that entire term became zero. The x z-score was zero, and so that would have taken away a little bit from our correlation coefficient. The reason it would take away, even though it's not negative, is that you're not contributing to the sum, but you're going to be dividing by a slightly higher value by including that extra pair.

If you had a data point where, let's say, x was below the mean and y was above the mean, something like this, then if this was one of the points, this term would have been negative, because the y z-score would have been positive and the x z-score would have been negative. And so when you put it in the sum, it would have actually taken away from the sum, and so it would have made the r score even lower. Similarly, something like this would have made the r score even lower, because you would have a positive z-score for x and a negative z-score for y, and a product of a positive and a negative would be a negative.
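
The hand calculation in this video can be checked with a few lines of code. What follows is a minimal sketch, not something shown in the video: the function names (sample_mean, sample_std, correlation) are made up for illustration, and the data are the four (x, y) pairs used above. It follows the same recipe: take the z-score of each x and each y using the sample standard deviation, multiply them pair by pair, add up the products, and divide by n minus 1.

```python
from math import sqrt

def sample_mean(values):
    """Arithmetic mean of the sample."""
    return sum(values) / len(values)

def sample_std(values):
    """Sample standard deviation (divides by n - 1, as in the video)."""
    m = sample_mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """Return r and the list of per-pair z-score products."""
    n = len(xs)
    x_bar, y_bar = sample_mean(xs), sample_mean(ys)
    s_x, s_y = sample_std(xs), sample_std(ys)
    # One product per pair: (z-score of x) times (z-score of y).
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1), products

# The four pairs from the video: (1, 1), (2, 2), (2, 3), (3, 6).
xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]

r, products = correlation(xs, ys)
print(round(sample_std(xs), 3), round(sample_std(ys), 3))
print(round(r, 3))
```

Because the code keeps the standard deviations unrounded, it gives r of about 0.945. The 0.946 in the video comes from plugging in the rounded values 0.816 and 2.160. The products list also shows the intuition from the end of the video: the two middle pairs contribute zero, while the first and last pairs contribute positive products.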