AP.STATS: DAT‑1 (EU), DAT‑1.G (LO), DAT‑1.G.4 (EK)

In the last few videos we saw that if we have n points, each with x and y coordinates, there is a line that minimizes the squared distance to those points. Let me draw n of those points. Call this point 1; it has the coordinates (x1, y1). The second point over here has the coordinates (x2, y2), and we keep putting points up here until we get to the nth point, which has the coordinates (xn, yn). What we saw is that we can find a line, which I'll call y = mx + b, that minimizes the squared distance to the points.

Let me review what those squared distances are. Sometimes this is called the squared error. This is the error between the line and point 1, so I'll call that error 1. This is the error between the line and point 2, so we'll call this error 2. And this is the error between the line and point n. If you want the total squared error between the points and the line, and this is actually how we started off this whole discussion, you take the y-value at each point, for example y1, minus the y-value at that point on the line. That y-value on the line is what you get when you substitute x1 into this equation, so it is m·x1 + b. (I don't want to get my graph too cluttered, so I'll just delete that.) That difference is error 1, and we want the squared errors between each of the points and the line. Then you do the same thing for the second point, and dot dot dot, to show that there are a bunch of these, all the way to the nth point:

(y1 - (m·x1 + b))^2 + (y2 - (m·x2 + b))^2 + ... + (yn - (m·xn + b))^2

Now that we know how to find the m and the b (I showed you the formula, and in fact we proved it), we can find this line. And if we want to ask how good it is, how much error there is, we can calculate that for a given set of data, because we now know m and b.

What I want to do now is come up with a more meaningful estimate of how well this line fits the data points that we have. To do that, we're going to ask ourselves the question: how much, or we could even say what percentage, of the variation in y is described by the variation in x?

Let's think about this. There's obviously variation in y: this point's y-value is over here, this point's y-value is over here. But how much of that is described by the variation in x, or described by the line? First, let's figure out what the total variation in y is. When we think about variation, and this was true even when we thought about variance, which is the mean variation in y, we think about the squared distance from some central tendency, and the best measure of central tendency we have for y is the arithmetic mean. So the total variation in y is just the sum of the squared distances of each of the y's from their mean:

(y1 - mean of y)^2 + (y2 - mean of y)^2 + ... + (yn - mean of y)^2

This gives you the total variation in y. You take all the y-values and find their mean; it'll be some value, maybe right over here. You can visualize it the same way we visualized the squared error from the line. Imagine a horizontal line at y = mean of y. What we're measuring here is the square of the vertical distance between the first point and this line; the second term is the squared distance from the second point up to the line; the nth one is the squared distance from there all the way to the line; and then there are the other points in between. If you divide this total by n, you get what we typically think of as the variance of y, which is the average squared distance. Here we have the total squared distance.

So what we want is how much of the total variation in y is described by the variation in x. We want a percentage of the total variation in y, so let's put that as the denominator. I'll call it the squared error from the mean of y.

Now, what is not described by the variation in x? How much of the total variation is not described by the regression line? We already have a measure for that: the squared error of the line. It is the sum of the squared distances from each point to our line, so it is exactly this measure. It tells us how much of the total variation is not described by the regression line. So if you want to know what percentage of the total variation is not described by the regression line, it is just the squared error of the line divided by the total variation:

(squared error of the line) / (squared error from the mean of y)

To answer our original question, what percentage is described by the variation in x, the rest of it has to be described. If this number is, I don't know, 30%, so 30% of the variation in y is not described by the line, then the remaining 70% is described by the line. So we can just subtract this from 1:

1 - (squared error of the line) / (squared error from the mean of y)

This tells us what percentage of the total variation is described by the line, or, if you prefer, by the variation in x. This number is called the coefficient of determination. That's just what statisticians have decided to name it. It's also called r-squared, and you might have heard that term when people talk about regression.

Now let's think about it. If the squared error of the line is really small, it means that these errors right over here are really small, which means that the line is a really good fit. What happens over here? If this number is really small, this is going to be a very small fraction, and 1 minus a very small fraction is going to be a number close to 1. So r-squared will be close to 1, which tells us that a lot of the variation in y is described by the variation in x. That makes sense, because the line is a good fit.

Take the opposite case. If the squared error of the line is huge, that means there's a lot of error between the data points and the line. Then this fraction is going to be close to 1, and 1 minus that is going to be close to 0. So if the squared error of the line is large, r-squared will be close to 0, which tells us that very little of the total variation in y is described by the variation in x, or described by the line.

Anyway, everything I've been dealing with so far has been a little bit in the abstract. In the next video we'll actually look at some data samples, calculate their regression line, calculate the r-squared, and see how good a fit it really is.
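To make the abstract pieces a bit more concrete, here is a minimal sketch in Python of the three quantities discussed above: the squared error of the line, the squared error from the mean of y, and r-squared as 1 minus their ratio. The four data points are made up purely for illustration, and the function names are hypothetical. The slope and intercept use the standard least-squares formulas, written in a form that may look different from the one derived in the earlier videos.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for the line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Standard least-squares formulas (assumed equivalent to the earlier derivation).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def squared_error_of_line(xs, ys, m, b):
    """Sum of (y_i - (m*x_i + b))^2: the variation NOT described by the line."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


def squared_error_from_mean(ys):
    """Sum of (y_i - mean of y)^2: the total variation in y."""
    mean_y = sum(ys) / len(ys)
    return sum((y - mean_y) ** 2 for y in ys)


def r_squared(xs, ys, m, b):
    """Coefficient of determination: 1 - SE(line) / SE(mean of y)."""
    return 1 - squared_error_of_line(xs, ys, m, b) / squared_error_from_mean(ys)


# Made-up sample data, for illustration only.
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]

m, b = fit_line(xs, ys)
print("m =", m, "b =", b)
print("squared error of the line:", squared_error_of_line(xs, ys, m, b))
print("squared error from the mean of y:", squared_error_from_mean(ys))
print("r-squared:", r_squared(xs, ys, m, b))
```

For these four points the line comes out to roughly y = 1.4x + 0.5. The squared error of the line is about 0.2 against a total variation of 10, so r-squared is about 0.98, which is the "small error, r-squared close to 1" case.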

AP® is a registered trademark of the College Board, which has not reviewed this resource.