Squared error of regression line

Introduction to the idea that one can find a line that minimizes the squared distances to the points. Created by Sal Khan.

Want to join the conversation?

  • skanga asks:
    Why is the error measured VERTICALLY? The shortest distance from a point to a line will be on a perpendicular to that line. If we want to find the "best" fitting line to a set of points then this distance is the one that should be minimized.
    (154 votes)
    • laura.cate.stewart answers:
      Don't think about it as a problem of geometry, but as one of minimizing error.

      In this kind of line fitting, the x-values are assumed to have no error - if your independent variable has been measured with errors then you need a different fitting method.

      Since the error for each point is only in its y-value (the dependent measurements), we only examine the vertical distance from each point to the line.
      (235 votes)
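
      To make the vertical-distance idea concrete, here is a minimal Python sketch (the data points are made up for illustration, and numpy is assumed to be available) that computes the vertical error of each point against a candidate line y = mx + b:

        import numpy as np

        # made-up sample points: x is assumed error-free, y is the noisy measurement
        x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
        y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

        # a candidate line y = m*x + b
        m, b = 2.0, 0.0

        # vertical error of each point: observed y minus the line's y at the same x
        errors = y - (m * x + b)
        print(errors)                  # roughly [ 0.1 -0.1  0.2  0.1 -0.2]
        print(np.sum(errors ** 2))     # total squared error, about 0.11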
  • Ernest Gu asks:
    Why do we square the error? I know that it exaggerates/differentiates between large and small errors, but if this is so, why not cube the errors? Or better yet, exponentiate the error to its thirteenth power? Is it because of the Gauss-Markov Theorem that Wikipedia mentions?
    (23 votes)
    • Cameron answers:
      The reason we choose squared error instead of the 3rd, 4th, or 26th power of the error is the nice shape the squared error makes when we graph it against m and b. That graph is a bowl-shaped surface (a paraboloid) whose lowest point sits at the optimal choice of m and b. Because the surface has exactly one minimum, we can always find it, and it is unique. With higher exponents it would be harder to find the minimum value(s), and we could end up with non-unique minimums or only local minimums (values that look good compared to their neighbouring values but are not the absolute best). So, in summary, we use squared error because it gives us a minimum that is easy to find and is guaranteed to be the only minimum (which guarantees it is the best).
      (59 votes)
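
      As a rough illustration of that bowl shape, here is a minimal Python sketch (made-up data; numpy assumed) that evaluates the squared error on a grid of candidate m and b values. The smallest value on the grid sits at the bottom of the bowl rather than at several competing local minima:

        import numpy as np

        x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # made-up data points
        y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

        ms = np.linspace(0, 4, 201)                  # candidate slopes
        bs = np.linspace(-2, 2, 201)                 # candidate intercepts
        M, B = np.meshgrid(ms, bs)

        # squared error for every (m, b) pair on the grid
        sse = ((y - (M[..., None] * x + B[..., None])) ** 2).sum(axis=-1)

        # locate the bottom of the bowl on the grid
        i, j = np.unravel_index(sse.argmin(), sse.shape)
        print("grid minimum near m =", M[i, j], ", b =", B[i, j])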
  • Ashi asks:
    Is error the same thing as residual?
    (10 votes)
    • Spencer answers:
      Not exactly. The difference is that there is (in theory) a TRUE regression line, which we will never know, and then there is the one we estimate to be the regression line. The difference between a point and the TRUE regression line is its error. The difference between a point and the ESTIMATED regression line is its residual. When we fit a regression line, we make the sum of our residuals equal to 0, but that does not necessarily mean the sum of our errors is 0 (by its nature, there will always be some error in statistics).

      Here is an example that hopefully won't confuse; you can see this if you simulate some data in a spreadsheet. Suppose Y = 5 + 10X is your TRUE regression line and we have 5 possible inputs for x (1, 2, 3, 4, 5). Then for x = 1, y = 15; for x = 2, y = 25; and so on. Now, if each observation has some normally distributed error and we fit a line to that data, we might get a line that is really close to Y = 5 + 10X, but we probably would not get it exactly; we might actually get something like Y = 5.3 + 9.5X. Say one of our randomly scattered observations is (3, 30). The predicted value of Y, according to our fitted line, is 5.3 + 9.5(3) = 33.8, but the TRUE regression line (which we won't actually have) would have predicted 35. So our ERROR is 30 - 35 = -5, but our RESIDUAL is 30 - 33.8 = -3.8.
      (15 votes)
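
      A minimal Python sketch of that kind of simulation (numpy assumed; the noise level is arbitrary), comparing errors measured against the TRUE line with residuals measured against the FITTED line:

        import numpy as np

        rng = np.random.default_rng(0)
        x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
        y_true = 5 + 10 * x                          # the TRUE regression line
        y_obs = y_true + rng.normal(0, 2, x.size)    # observations with noise

        # fit a line to the noisy observations (slope and intercept estimates)
        m_hat, b_hat = np.polyfit(x, y_obs, 1)

        errors = y_obs - y_true                      # deviations from the TRUE line
        residuals = y_obs - (m_hat * x + b_hat)      # deviations from the FITTED line

        print(residuals.sum())   # essentially 0, by construction of least squares
        print(errors.sum())      # generally not 0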
  • Justin Tyme asks:
    Is it just me or does this process look a lot like taking the "variance" of the errors, with the line of best fit being where the errors vary the least?
    (7 votes)
  • Lauren Gilbert asks:
    In the video, why is the error at the different points the vertical distance, and not the horizontal distance between the point and the line? Or, for that matter, why not the vertical AND horizontal distances summed up?
    (4 votes)
    • kurisu answers:
      A linear regression model assumes that the relationship between the variables y and x is linear (the measured variable y depends linearly on the input variable x).
      Basically, y = mx + b.
      A disturbance term (noise) is added (error variable "e").
      So we have y = mx + b + e, and the error is e = y - (mx + b).
      We then try to find the m and b (for the line of best fit) that minimize the error, that is, the sum of the squared vertical distances: Sum(||e||^2) = Sum(||y - (mx + b)||^2).

      There are different ways of trying to find a line of best fit; it depends on what x and y represent.
      For example, if both x and y are observations with errors, and x and y have equal variances, then what is called "Deming regression" measures the deviations perpendicularly to the line of best fit.
      (6 votes)
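
      For a side-by-side comparison, here is a minimal Python sketch (values made up) of the vertical distance used in these videos versus the perpendicular distance a purely geometric fit, such as Deming regression with equal variances, would use:

        import math

        m, b = 2.0, 1.0      # a candidate line y = 2x + 1
        px, py = 3.0, 9.0    # a sample point

        vertical = py - (m * px + b)                       # 9 - 7 = 2
        perpendicular = vertical / math.sqrt(1 + m ** 2)   # about 0.894
        print(vertical, perpendicular)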
  • kuli.baro asks:
    Is there a lesson on nonlinear regressions?
    (3 votes)
  • Stephen Cranwell asks:
    Why are you interested in the squared error of the line as opposed to just the error of the line?
    (5 votes)
    • eggie5 answers:
      The reason we choose squared error instead of the 3rd, 4th, or 26th power of the error is the nice shape the squared error makes when we graph it against m and b. That graph is a bowl-shaped surface (a paraboloid) whose lowest point sits at the optimal choice of m and b. Because the surface has exactly one minimum, we can always find it, and it is unique. With higher exponents it would be harder to find the minimum value(s), and we could end up with non-unique minimums or only local minimums (values that look good compared to their neighbouring values but are not the absolute best). So, in summary, we use squared error because it gives us a minimum that is easy to find and is guaranteed to be the only minimum (which guarantees it is the best).
      (2 votes)
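
      One more thing worth illustrating with a minimal Python sketch (made-up data; numpy assumed): if we just summed the raw, signed errors, positive and negative errors could cancel, so even a poorly fitting line could score well. Squaring prevents that cancellation:

        import numpy as np

        x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
        y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

        # a flat line through the mean of y: clearly a poor fit to this data
        m, b = 0.0, y.mean()
        errors = y - (m * x + b)

        print(errors.sum())          # ~0: positive and negative errors cancel
        print((errors ** 2).sum())   # large: squaring exposes the poor fit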
  • Alison Tracz asks:
    Why is this called linear "regression"? What does regression mean?
    (4 votes)
    • Addison Honeycutt answers:
      The term "regression" was used by Francis Galton in his 1886 paper "Regression towards mediocrity in hereditary stature". To my knowledge he only used the term in the context of regression toward the mean. The term was then adopted by others to get more or less the meaning it has today as a general statistical method.
      (5 votes)
  • GFauxPas asks:
    When I did least square regression in lab sciences, the book usually recommended adding the point (0,0) to the data, even if that wasn't one of the observations. Anyone have an idea why one would do that?
    (2 votes)
    • Matthew Daly answers:
      It would depend on the experiment. For instance, if you wanted to know how far a ball fell t seconds after you dropped it off a table, the one point you can be sure has no experimental error is (0,0). Heck, you can make a case to add it several times to the data so that the curve of best fit will pass closer to the one point you're certain of.
      (5 votes)
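
      A minimal Python sketch of that idea (made-up measurements; numpy assumed). For a dropped ball, the distance fallen is roughly 4.9*t**2, so fitting distance against t**2 should give a line through the origin; repeating the one certain point (0, 0) pulls the fitted intercept toward zero:

        import numpy as np

        t = np.array([1.0, 2.0, 3.0, 4.0])        # made-up measurement times (s)
        d = np.array([5.2, 19.4, 44.8, 78.0])     # noisy distances fallen (m)

        # fit distance against t**2 using only the measured points
        m1, b1 = np.polyfit(t ** 2, d, 1)

        # add the certain point (0, 0) a few times to anchor the fit
        m2, b2 = np.polyfit(np.concatenate(([0.0, 0.0, 0.0], t ** 2)),
                            np.concatenate(([0.0, 0.0, 0.0], d)), 1)

        print(b1, b2)    # for this data the second intercept is closer to 0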
  • Daniel Morrow asks:
    The formula we are attempting to derive defines the line, right? So why does it assume we already know the distance from each data point to the line?
    (2 votes)
    • Dr C answers:
      We don't know the line, but we know the form of the line we want: y = mx + b.

      We don't know the values of m and b, but we can write the errors / distances in terms of m and b. Once we have that expression, we apply some clever math to find out which values of m and b are the best ones to use.
      (3 votes)
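
      A minimal Python sketch of that idea (made-up data; numpy assumed): even before we know the best line, we can write the squared error as a function of the unknown m and b and evaluate it for any candidate pair; the clever math in the coming videos then picks out the pair that makes it smallest:

        import numpy as np

        x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
        y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

        def squared_error(m, b):
            """Sum of squared vertical distances to the candidate line y = m*x + b."""
            return np.sum((y - (m * x + b)) ** 2)

        # we can compare candidate lines without knowing the best one yet
        print(squared_error(2.0, 0.0))   # about 0.11
        print(squared_error(1.0, 3.0))   # a worse candidate, larger squared error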

Video transcript

In the next few videos I'm going to embark on something that will just result in a formula that's pretty straightforward to apply. And in most statistics classes, you'll just see that end product. But I actually want to show how to get there. But I just want to warn you right now. It's going to be a lot of hairy math, most of it hairy algebra. And then we're actually going to have to do a little bit of calculus near the end. We're going to have to do a few partial derivatives. So if any of that sounds daunting, or sounds like something that will discourage you in some way, you don't have to watch it. You could skip to the end and just get the formula that we're going to derive. But I, at least, find it pretty satisfying to actually derive it.

So what we're going to think about here is, let's say we have n points on a coordinate plane. And they all don't have to be in the first quadrant. But just for simplicity of visualization, I'll draw them all in the first quadrant. So let's say I have this point right over here. Let me do them in different colors. And that coordinate is x1, y1. And then let's say I have another point over here. The coordinates there are x2, y2. And then I can keep adding points. And I could keep drawing them. We'd just have a ton of points. There and there and there. And we go all the way to the nth point. Maybe it's over here. And we're just going to call that xn, yn. So we have n points here. I haven't drawn all of the actual points.

But what I want to do is find a line that minimizes the squared distances to these different points. So let's think about it. Let's visualize that line for a second. So there's going to be some line. And I'm going to try to draw a line that kind of approximates what these points are doing. So let me draw this line here. So maybe the line might look something like this. I'm going to try my best to approximate it. Actually, let me draw it a little bit different. Maybe it looks something like that. I don't even know what it looks like right now. And what we want to do is minimize this squared error from each of these points to the line.

So let's think about what that means. So if the equation of this line right here is y is equal to mx plus b. And this just comes straight out of Algebra 1. This is the slope of the line, and this is the y-intercept. This is actually the point 0, b. What I want to do, and that's what the topic of the next few videos is going to be, is find an m and a b. So I want to find these two things that define this line, so that it minimizes the squared error.

So let me define what the error even is. So for each of these points, the error between it and the line is the vertical distance. So this right here we can call error one. And then this right here would be error two. It would be the vertical distance between that point and the line. Or you can think of it as the y value of this point and the y value of the line. And you just keep going all the way to the endpoint, between the y value of this point and the y value of the line. So this error right here, error one, if you think about it, it is this value right here, this y value. It's equal to y1 minus this y value. Well, what's this y value going to be? Well, over here we have x is equal to x1. And this point is the point m x1 plus b. You take x1 into this equation of the line and you're going to get this point right over here. So that's literally going to be equal to m x1 plus b. That's that first error. And we can keep doing it with all the points.
This error right over here is going to be y2 minus m x2 plus b. And then this point right here is m x2 plus b. The value when you take x2 into this line. And we keep going all the way to our nth point. This error right here is going to be yn minus m xn plus b.

Now, if we wanted to just take the straight up sum of the errors, we could just sum these things up. But what we want to do is minimize the square of the error between each of these n points and the line. So let me define the squared error against this line as being equal to the sum of these squared errors. So this error right here, or error one we could call it, is y1 minus m x1 plus b. And we're going to square it. So this is error one squared. And we're going to go to error two squared. Error two squared is y2 minus m x2 plus b. And then we're going to square that error. And then we keep going, we're going to go n spaces, or n points I should say. We keep going all the way to this nth error. The nth error is going to be yn minus m xn plus b. And then we're going to square it.

So this is the squared error of the line. And over the next few videos, I want to find the m and b that minimize the squared error of this line right here. So if you view this as the best metric for how good a fit a line is, we're going to try to find the best fitting line for these points. And I'll continue in the next video. Because I find that with these very hairy math problems, it's good to kind of just deliver one concept at a time. And it also minimizes my probability of making a mistake.
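
For readers who want to see the end product in action, here is a minimal Python sketch (made-up points; numpy assumed) that evaluates the squared error of a line and computes the m and b that minimize it, using the standard least-squares formulas this derivation leads to:

    import numpy as np

    # n sample points, made up for illustration
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

    def squared_error_of_line(m, b):
        """SE_line = sum over all points of (y_i - (m*x_i + b))**2."""
        return np.sum((y - (m * x + b)) ** 2)

    # standard least-squares solution (what the later videos derive with calculus)
    m = (np.mean(x * y) - np.mean(x) * np.mean(y)) / (np.mean(x ** 2) - np.mean(x) ** 2)
    b = np.mean(y) - m * np.mean(x)

    print(m, b, squared_error_of_line(m, b))
    print(squared_error_of_line(m + 0.5, b))   # any other line gives a larger squared error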