Proof (part 1) minimizing squared error to regression line

Created by Sal Khan.

Want to join the conversation?

  • Ray
    What is the point or purpose of squaring the error? Why not cube it, take the square root, or even use a dot or cross product? I don't mean in a mathematical sense, but in a practical sense. What information does the squared error give us?
    (38 votes)
    • Michael O'Donnell
      There are a couple of reasons to square the errors. Squaring turns everything positive, effectively putting negative and positive errors on equal footing. In other words, it treats any two deviations from the line of the same absolute size (one above, one below) as the same.

      You can achieve the same result (turning negative numbers into positive ones) by taking the absolute value, or by raising the errors to any even positive exponent (4, 6, etc.). So why was squaring chosen over the absolute value? The simplest answer is that exponents are mathematically and computationally easier to work with than absolute values (for one thing, the squared error is smooth, so calculus applies cleanly); this was particularly true back when people did this work entirely by hand. With the power of modern computers, that computational "problem" is much less of a problem, and some people argue for (and use) the sum of absolute errors instead of the sum of squared errors; those people are the minority, though. I will warn that the general expectation is to use the sum of squared errors as the measure: people have seen it, they understand it, and they know the various tests and statistics built around it. So a person who wanted to use absolute errors instead might have to derive the corresponding results and educate their audience.

      You could also argue that using the squared error instead of the absolute error places a greater emphasis on values that are relatively far from the line: you are punished more for producing a line that is farther from the points, because those errors are squared. A potential problem, however, is that outliers can more easily skew the regression line under this methodology (see the sketch below). That is most likely why you use the smallest even exponent, 2, rather than something like the sum of errors raised to the 4th power, because a higher power would emphasize the outliers (or near-outliers) even more.
      (94 votes)
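      A minimal sketch of that outlier effect in Python (assuming NumPy and SciPy are available; the data and starting guesses are made up for illustration):

        import numpy as np
        from scipy.optimize import minimize

        x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
        y = np.array([2.0, 4.0, 6.0, 8.0, 30.0])  # last point is an outlier

        # Total error of the line y = m*x + b under each convention
        def sse(p):  # sum of squared errors
            return np.sum((y - (p[0] * x + p[1])) ** 2)

        def sae(p):  # sum of absolute errors
            return np.sum(np.abs(y - (p[0] * x + p[1])))

        m_sq, b_sq = minimize(sse, x0=[1.0, 0.0], method="Nelder-Mead").x
        m_ab, b_ab = minimize(sae, x0=[1.0, 0.0], method="Nelder-Mead").x

        print(f"squared-error fit:  y = {m_sq:.2f}x + {b_sq:.2f}")  # pulled toward the outlier
        print(f"absolute-error fit: y = {m_ab:.2f}x + {b_ab:.2f}")  # stays near y = 2x

      The squared-error line is dragged noticeably toward the outlier, while the absolute-error line is not, which is exactly the trade-off described above.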
  • Andre' O'Brien
    Can someone explain to me how he got y^2 - 2y(mx+b) + (mx+b)^2?
    (4 votes)
  • Miguel Vasconcelos
    I don't understand: why y1 - (mx1 + b)?
    Shouldn't it be (mx1 + b) - y1?
    (4 votes)
    • Tom Peterson
      It can be! That is the advantage of using squared error instead of plain 'linear error'.

      Notice that some points end up above the line (where y1 - (mx1 + b) is positive) and some below (where (mx1 + b) - y1 is positive). To resolve this, statisticians square the values so that all of them are positive.

      Overall, you can use either version; they both work.
      (10 votes)
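      In symbols, using the video's notation: since u^2 = (-u)^2 for any u,

        \left(y_1 - (mx_1 + b)\right)^2 = \left((mx_1 + b) - y_1\right)^2,

      so the squared error comes out the same whichever order you subtract in.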
  • Helen Prinold
    What video should I go to when I don't understand why he starts putting 2's in front of things and adding extra brackets' worth of stuff? He calls it algebraic equations. It's sooo much fun doing inferential statistics with only a grade 6 education.
    (2 votes)
    • Dr C
      I assume you mean what he's writing in the video? That's algebra (probably 2-3 years beyond your level). He's expanding the quadratic (the thing in parentheses that is squared). I'm not sure where that is explained on KhanAcademy, but: (a+b)^2 = a^2 + 2ab + b^2. Then we can compare what is "a" and what is "b" in what Sal wrote.

      However, if your math is at a 6th-grade level, then you should probably skip any of the videos that say "Proof." Generally the proofs in Statistics use math that's 5 or more years beyond that level. Once you learn Calculus (mainly, finding a minimum or maximum via derivatives), I imagine the proof will make perfect sense.

      At your level, I would assume that the focus would be on applying statistical methods (e.g., estimating the mean, computing a confidence interval) instead of deriving anything.

      If you're just doing the Stats on KhanAcademy on your own, then if you want to understand the proofs better, I'd suggest going over to the Calculus and Algebra sections, as Statistics makes heavy use of both of them (Calculus mainly for the proofs).
      (8 votes)
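      As a worked equation, writing u = y_1 and v = mx_1 + b (renamed to avoid clashing with the b in the line's equation):

        (u - v)^2 = u^2 - 2uv + v^2
        \Rightarrow \left(y_1 - (mx_1 + b)\right)^2 = y_1^2 - 2y_1(mx_1 + b) + (mx_1 + b)^2,

      which is exactly the expansion Sal writes down in the video.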
  • Tobia
    Okay, so squaring is done in order to have positive values, but what's actually the problem with having both positive and negative errors? I mean, if we need a line that fits the data, the one with an error of 0 (or close to 0) is right in the middle of the data set, right?
    The only case where this method doesn't work seems to me to be the one of aligned data points, but for other cases the "true" (signed) error seems not that bad to me.
    (3 votes)
    • markovcd
      The sum of unsquared errors from the mean is always zero. Check it for yourself: if you have Excel, put some numbers in a column of cells, calculate the mean, and in the next column subtract each value from the mean. Then add up those errors.
      (4 votes)
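      The same fact in symbols: with mean \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i,

        \sum_{i=1}^{n} (y_i - \bar{y}) = \sum_{i=1}^{n} y_i - n\bar{y} = n\bar{y} - n\bar{y} = 0,

      so without squaring (or taking absolute values), the positive and negative errors always cancel exactly.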
  • sunshinecoast
    Why are all the terms (y1, y2, ..., yn, etc.) being added? How will that help us find the minimized squared error to the line?
    (3 votes)
    • matt
      Eventually we will take the derivative of the whole thing (the function that gives the slope, if you aren't familiar with calculus) and set it to zero, allowing us to solve for the constants that give the minimum possible error.
      (5 votes)
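      As a preview of that calculus step (it is carried out in the later parts of this proof), the minimum is found by setting both partial derivatives of the squared error SE to zero:

        \frac{\partial \text{SE}}{\partial m} = 0 \quad \text{and} \quad \frac{\partial \text{SE}}{\partial b} = 0,

      which gives two equations in the two unknowns m and b.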
  • Dylan G
    Why would I need to do this? (Real life example)
    (1 vote)
  • jblheadphone
    Anybody know how he squares the expression so fast? What method did he use? Is there a video on it on Khan Academy?
    (2 votes)
  • Janis Edwards
    What is the coefficient of non-determination?
    (2 votes)
    • daniella
      The term "coefficient of non-determination" doesn't have a standard meaning in statistics. It seems to be a misspelling or misunderstanding of the term "coefficient of determination," which is often denoted as R^2 (R-squared). The coefficient of determination represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s) in a regression model.
      (1 vote)
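      For reference, the coefficient of determination of a regression with predictions \hat{y}_i is

        R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},

      and the "coefficient of non-determination," where the term is used at all, is simply 1 - R^2.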
  • abhijithcvijayaraghavan
    The correlation coefficient, means, standard deviations, and sample size of the variables can be used to construct the regression equation, so why use the minimizing-squared-error approach?
    (2 votes)
    • daniella
      While the correlation coefficient, means, standard deviations, and sample size provide valuable insights into the relationship between variables, the regression-line formulas built from them are not a separate method: they are themselves derived by minimizing the squared error. The minimizing-squared-error approach is used because it finds the line that best fits the data by minimizing the discrepancy between the observed data points and the values predicted by the regression line.

      The regression equation obtained through this approach provides explicit estimates of the slope (m) and y-intercept (b) of the line, which allows for predicting the value of the dependent variable based on the value of the independent variable(s).

      Additionally, the regression equation obtained through minimizing the squared error is based on a systematic mathematical optimization process that ensures the line provides the best possible fit to the data in terms of minimizing the overall error.
      (1 vote)
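      In fact, the two approaches agree: the m and b that minimize the squared error can be written in terms of exactly those summary statistics,

        m = r \frac{s_y}{s_x}, \qquad b = \bar{y} - m\bar{x},

      where r is the correlation coefficient, s_x and s_y are the standard deviations, and \bar{x}, \bar{y} are the means. The correlation-based formulas are not an alternative to least squares; they are its solution.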

Video transcript

In the last video, we showed that the squared error between some line, y equals mx plus b, and each of these n data points is this expression right over here. In this video, I'm really just going to algebraically manipulate this expression so that it's ready for the calculus stage, so we can actually optimize, we can actually find the m and b values that minimize this value right over here. So this is just going to be a ton of algebraic manipulation. But I'll try to color code it well so we don't get lost in the math. So let me just rewrite this expression over here. So this whole video is just going to be rewriting this over and over again, just simplifying it a bit with algebra.

So this first term right over here, y1 minus (mx1 plus b), squared, this is all going to be the squared error of the line. So this first term over here, I'll keep it in blue, is going to be, if we just expand it, y1 squared minus 2 times y1 times (mx1 plus b), plus (mx1 plus b) squared. All I did is I just squared this binomial right here. You can imagine, if this was (a minus b) squared, it would be a squared minus 2ab plus b squared. That's all I did. Now I just have to do that for each of the terms. And each term is only different by the x and the y coordinates right over here. And I'll go down so that we can kind of combine like terms. So this term over here, squared, is going to be y2 squared minus 2 times y2 times (mx2 plus b), plus (mx2 plus b) squared. Same exact thing as up here, except now it's with x2 and y2, as opposed to x1 and y1. And then we're just going to keep doing that n times. We're going to do it for the third, x3, y3, keep going, keep going, all the way until we get to this nth term over here. And this nth term over here, when we square it, is going to be yn squared minus 2yn times (mxn plus b), plus (mxn plus b) squared.

Now, the next thing I want to do is actually expand these out a little bit more. So let's actually scroll down. So this whole expression, I'm just going to rewrite it, is the same thing as-- and remember, this is just the squared error of the line. So let me rewrite this top line over here. This top line over here is y1 squared. And then I'm going to distribute this 2y1. So this is going to be minus 2y1mx1, that's just that times that, minus 2y1b. And then plus, and now let's expand (mx1 plus b) squared. So that's going to be m squared x1 squared, plus 2 times mx1 times b, plus b squared. All I did: if this was (a plus b) squared, this would be a squared plus 2ab plus b squared. And we're going to do that for each of these terms, or for each of these colors, I guess you could say.

So now let's move to the second term. It's going to be the same thing, but instead of y1's and x1's, it's going to be y2's and x2's. So it is y2 squared minus 2y2mx2 minus 2y2b plus m squared x2 squared, plus 2 times mx2b, plus b squared. And we're going to keep doing this all the way to get the nth term-- the nth color, I guess we should say. So this is going to be yn squared minus 2ynmxn. And you don't even have to think, you just have to kind of substitute these with n's now. We could actually look at this, but it's going to be the exact same thing: minus 2ynb, plus m squared xn squared, plus 2mxnb, plus b squared.

So once again, this is just the squared error of that line with n points, between those n points and the line y equals mx plus b. So let's see if we can simplify this somehow. And to do that, what I'm going to do is kind of try to add up a bunch of these terms here.
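In compact notation, the expansion carried out for each data point (x_i, y_i) in this part of the video is

  \left(y_i - (mx_i + b)\right)^2 = y_i^2 - 2m x_i y_i - 2b y_i + m^2 x_i^2 + 2mb x_i + b^2.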
So if I were to add up all of these terms right here, if I were to add up this column right over there, what do I get? It's going to be y1 squared plus y2 squared, all the way to yn squared. That's those terms right over there. So I'm going to have that. And then I have this common 2m amongst all of these terms over here. So let me write that down. So then you have this 2m here, 2m here, 2m here. Let me put parentheses around here. So you have these terms all added up. Then you have minus 2m times all of these terms. Actually, let me color code it so you see what we're doing. I want to be very careful with this math so nothing seems too confusing, although this is really just algebraic manipulation.

If I add all of these up, I get y1 squared plus y2 squared, all the way to yn squared. I'll put some parentheses around that. And then to that, we have this common term, we have this minus 2m, minus 2m, minus 2m. And so we can distribute those out. And so I should actually write it like this. So we have a minus 2m, and once we distribute it out up here, we're just going to be left with a y1x1, or maybe I can call it an x1y1. That's that over there, with the 2m factored out. Let me do that in another color, I want to make this easy to read. Plus x2y2, and we're going to keep adding up-- we're going to do this n times-- all the way to plus xnyn. This last term over here, ynxn, same thing. So that's the sum. So this stuff over here, the sum of all of this stuff right over here, is the same thing as this term right over here.

And then we have to sum this right over here. And you see again, we can factor out here a minus 2b out of all of these terms. So we have minus 2b times y1 plus y2 plus, all the way to, yn. So this business, these terms right over here, when you add them up, give you these terms, or this term, right over there. And let's just keep going. And in the next video (we're probably going to run out of time in this one), I'll simplify this more and clean up the algebra a good bit.

So then the next term, what is this going to be? Same drill. We can factor out an m squared. So we have m squared times x1 squared plus x2 squared-- actually, I want to color code them, I forgot to color code these over here-- plus all the way to xn squared. Let me color code these. This was a yn squared. And this over here was a y2 squared. So this is exactly this. So in this last step we just did, this thing over here is this thing right over here. And of course we have to add it, so I'll put a plus out front.

We're almost done with this stage of the simplification. So over here, we have a common 2mb, so let's put a plus 2mb times, once again, x1 plus x2 plus, all the way to, xn. So this term right over here is the exact same thing as this term over here. And then finally, we have a b squared in each of these. And how many of these b squareds do we have? Well, we have n of these lines, right? This is the first line, the second line, and then a bunch more, all the way to the nth line. So we have b squared added to itself n times. So this right over here is just b squared n times. So we'll just write that as plus n times b squared.

Let me remind ourselves what this is all about. This is all just algebraic manipulation of the squared error between those n points and the line y equals mx plus b. It doesn't look like I've simplified it much, and I'm going to stop the video right now. In the next video, we're just going to take off right here and try to simplify this thing.
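Collecting the columns as described, the squared error of the line can be written compactly as

  \text{SE}_{\text{line}} = \sum_{i=1}^{n} y_i^2 - 2m \sum_{i=1}^{n} x_i y_i - 2b \sum_{i=1}^{n} y_i + m^2 \sum_{i=1}^{n} x_i^2 + 2mb \sum_{i=1}^{n} x_i + nb^2,

which is the form the next video picks up from.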