Introduction to residuals and least-squares regression


Video transcript

Let's say we're trying to understand the relationship between people's height and their weight. So what we do is we go to ten different people, and we measure each of their heights and each of their weights. On this scatterplot here, each dot represents a person. So, for example, this dot over here represents a person whose height was 60 inches, or 5 feet tall, and whose weight, which we have on the y-axis, was 125 pounds. So that's the point (60, 125).

When you look at this scatterplot, your eyes naturally see some type of a trend. It seems like, generally speaking, as height increases, weight increases as well. But I said generally speaking: you definitely have circumstances where there are taller people who might weigh less.

An interesting question is: can we try to fit a line to this data? This idea of trying to fit a line as closely as possible to as many of the points as possible is known as linear regression. Now, the most common technique is to try to fit a line that minimizes the sum of the squared distances to those points, and we're going to talk more about that in future videos. But for now, we want to get an intuitive feel for that.

If you were to just eyeball it and look at a line like that, you wouldn't think that it would be a particularly good fit. It looks like most of the data sits above the line. Similarly, something like this also doesn't look that great; here, most of our data points are sitting below the line. But something like this actually looks very good. It looks like it's getting as close as possible to as many of the points as possible. It seems like it's describing this general trend. And so this is the actual regression line.

The equation here we would write with a y with a little hat over it, and that means that we're trying to estimate a y for a given x. It's not always going to be the actual y for a given x, because, as we see, sometimes the points aren't sitting on the line. But we say ŷ (y hat) is equal to our y-intercept, which for this particular regression line is negative 140, plus the slope, 14 over 3, times x:

ŷ = -140 + (14/3)x

Now, as we can see, for most of these points, given the x-value of those points, the estimate that our regression line gives is different from the actual value. That difference between the actual value and the estimate from the regression line is known as the residual. So let me write that down. The residual at a point is going to be equal to, for a given x, the actual y-value minus the estimated y-value from the regression line for that same x:

residual = y - ŷ

Or another way to think about it: when x is equal to 60, and we're talking about the residual just at that point, it's going to be the actual y-value minus our estimate of what the y-value is from this regression line for that x-value. So pause this video and see if you can calculate this residual. You could visually imagine it as being this vertical distance right over here.

Well, to actually calculate the residual, you would take our actual value, which is 125 for that x-value (remember, we're calculating the residual for a point, so it's the actual y there), minus what would be the estimated y there for that x-value. We could just go to this equation and ask: what would ŷ be when x is equal to 60? Well, it's going to be equal to negative 140 plus 14 over 3 times 60. Let's see, 60 divided by 3 is 20, and 20 times 14 is 280, so negative 140 plus 280 is 140. And so our residual for this point is going to be 125 minus 140, which is negative 15.

Residuals can indeed be negative. If your residual is negative, it means that, for that x-value, your data point, your actual y-value, is below the estimate. If we were to calculate the residual here, or if we were to calculate the residual here, our actual value for that x-value is above our estimate, so we would get positive residuals. And as you will see later in your statistics career, the way that we calculate these regression lines is all about minimizing the sum of the squares of these residuals.
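
The residual calculation described in the video can be sketched in a few lines of Python. This is only an illustrative sketch: the intercept (negative 140), the slope (14 over 3), and the point (60, 125) come from the transcript, while the names `predict` and `residual` are made up for this example. Exact fractions are used so that 14/3 times 60 comes out to exactly 280.

```python
from fractions import Fraction

# Coefficients of the regression line from the video: y-hat = -140 + (14/3) x
INTERCEPT = Fraction(-140)
SLOPE = Fraction(14, 3)


def predict(x):
    """Estimated weight y-hat (pounds) for a height x (inches). Hypothetical helper."""
    return INTERCEPT + SLOPE * x


def residual(x, actual_y):
    """Actual y minus the regression line's estimate at the same x. Hypothetical helper."""
    return actual_y - predict(x)


# The one data point the transcript states explicitly: height 60 in, weight 125 lb.
estimate = predict(60)
r = residual(60, 125)
print(f"estimate at x = 60: {estimate}")    # 140
print(f"residual at (60, 125): {r}")        # -15
```

A negative result means the actual point lies below the line at that x-value; a point above the line would give a positive residual.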
AP® is a registered trademark of the College Board, which has not reviewed this resource.