Proof (part 4) minimizing squared error to regression line

Proof (Part 4) Minimizing Squared Error to Regression Line. Created by Sal Khan.

Want to join the conversation?

  • Tom Peterson
    At , why does Sal decide to subtract one equation from the other? And how is this okay?
    (6 votes)
    • samueljeshurun
      For any equation, there are two sides with " = " in between so let's call that
      Left Hand Side (L.H.S) = Right Hand Side (R.H.S)
      Consider two equations
      L.H.S.1 = R.H.S.1 ........................... say equation1
      L.H.S.2 = R.H.S.2 ........................... say equation2
      We know that we can perform an add/subtract operation on both sides of any equation and the equation still stands valid i.e.
      L.H.S.1 + x = R.H.S.1 + x or L.H.S.1 - x = R.H.S.1 - x
      Now let's subtract L.H.S.2 (say x = L.H.S.2) in equation1 above
      L.H.S.1 - L.H.S.2 = R.H.S.1 - L.H.S.2 ......................... say equation3
      But from equation2, L.H.S.2 = R.H.S.2
      So we can substitute L.H.S.2 from equation 3 with R.H.S.2 to get
      L.H.S.1 - L.H.S.2 = R.H.S.1 - R.H.S.2 ........................ equation4
      Now if we observe equation1,equation2 and equation4 we find that
      equation4 is just another inference from (equation1 - equation2) So subtraction of equations is quite okay :)
      (9 votes)
  • Josh
    I calculated m by subtracting the first formula [(mx^2)+bx=xy] from the second (y=mx+b), which is the opposite of what Sal does @. I get a different formula for m; is this OK? I can't equate the two formulas. Here is the formula I get (all the x and y should have the mean sign): m = [(xy/x)-y]/[(x^2/x)-x]
    (7 votes)
    • Michael
      Multiply both top and bottom by -1, and the result is
      [(-xy/x)+y]/[(-x^2/x)+x]
      where all the x, x^2, and y should have the mean sign. This is equivalent to
      [y-(xy/x)]/[x-(x^2/x)]
      simply by switching the order of the terms in the numerator and denominator, respectively. This is Sal's answer.
      (7 votes)
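The two subtraction orders can be compared side by side. What follows is a sketch in LaTeX, not part of the original thread. It assumes the notation $\bar{x}$, $\bar{y}$, $\overline{xy}$, and $\overline{x^2}$ for the means of x, y, xy, and x^2, and it assumes $\bar{x} \neq 0$ so that dividing by the mean of x is allowed. The labels (A) and (B) are introduced here only for illustration.

```latex
\begin{align*}
\text{(A)}\quad & m\,\frac{\overline{x^2}}{\bar{x}} + b = \frac{\overline{xy}}{\bar{x}},
\qquad
\text{(B)}\quad m\,\bar{x} + b = \bar{y} \\[4pt]
\text{(A)} - \text{(B)}:\quad &
m = \frac{\dfrac{\overline{xy}}{\bar{x}} - \bar{y}}{\dfrac{\overline{x^2}}{\bar{x}} - \bar{x}} \\[4pt]
\text{(B)} - \text{(A)}:\quad &
m = \frac{\bar{y} - \dfrac{\overline{xy}}{\bar{x}}}{\bar{x} - \dfrac{\overline{x^2}}{\bar{x}}} \\[4pt]
\text{Same value:}\quad &
\frac{\dfrac{\overline{xy}}{\bar{x}} - \bar{y}}{\dfrac{\overline{x^2}}{\bar{x}} - \bar{x}}
= \frac{(-1)\left(\bar{y} - \dfrac{\overline{xy}}{\bar{x}}\right)}{(-1)\left(\bar{x} - \dfrac{\overline{x^2}}{\bar{x}}\right)}
= \frac{\bar{y} - \dfrac{\overline{xy}}{\bar{x}}}{\bar{x} - \dfrac{\overline{x^2}}{\bar{x}}}
\end{align*}
```

Either order cancels b, and the two results differ only by a factor of -1 in both numerator and denominator, so they describe the same slope.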
  • Barbara
    Two questions please.

    1. I understand that we 'square' the distances to the best fitting line because that will eliminate the negatives. I'm wondering however whether squaring skews the results somehow, so that the points that are furthest from the best fitting line exert more of a force in their direction?

    2. The formula sought the minimum vertical distances between points and the best fitting line. Would the same result be achieved if, instead of minimizing the vertical distances, we minimized the absolute distance between the points and the line?

    Thank you!
    (7 votes)
    • Dr C
      Those are some astute questions.

      [1.] Yes and no. The more extreme points will exert a larger influence on the line, but there are some caveats. We have two variables, X and Y, and so points can be out of whack in either the x-direction or the y-direction. Points that are further out in the x-direction will exert a strong pull on the line. There is actually a statistic to measure this called "leverage." Outliers in the y-direction don't impact the regression nearly so much.

      [2.] I'm not sure that you stated your question properly. The formula we used (called "Simple Linear Regression") minimizes the squared vertical distances between the points and the line. We could use the absolute value instead, though that would still be looking at the vertical distance.

      There is also a type of regression which does not measure vertical distance, it's called Deming Regression. In one special case of this type of regression, instead of vertical distances, we look at distances orthogonal / perpendicular to the regression line.
      (8 votes)
  • 𝜏 Is Better Than 𝝅
    A few videos back Sal presented a formula for the least squares regression line where the slope m=r(Sy/Sx), that is, the correlation coefficient times the sample standard deviation in Y divided by the sample standard deviation in X. Is this formula equivalent to the one presented in this video, and if so how does one establish their equivalence?
    (7 votes)
    • daniella
      Yes, the formula m = r * (STD of y / STD of x) is equivalent to the formula derived in the video for linear regression using the method of least squares. The correlation coefficient r captures the linear relationship between x and y, and multiplying it by the ratio of the standard deviations standardizes the relationship in terms of variability in both x and y. Thus, both formulas aim to find the slope of the regression line that minimizes the squared errors.
      (1 vote)
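One way to establish the equivalence is sketched below in LaTeX. It is not from the video. It assumes that r is the sample correlation coefficient and that $S_x$ and $S_y$ are the sample standard deviations (both using n - 1), and it writes $\overline{xy}$ and $\overline{x^2}$ for the means of xy and x^2.

```latex
\begin{align*}
r &= \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,S_x S_y},
\qquad
S_x^2 = \frac{\sum_i (x_i - \bar{x})^2}{n-1} \\[4pt]
r\,\frac{S_y}{S_x}
 &= \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,S_x^2}
  = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \\[4pt]
 &= \frac{n\left(\overline{xy} - \bar{x}\,\bar{y}\right)}{n\left(\overline{x^2} - (\bar{x})^2\right)}
  = \frac{\bar{x}\,\bar{y} - \overline{xy}}{(\bar{x})^2 - \overline{x^2}}
\end{align*}
```

The last step multiplies numerator and denominator by -1, which gives the form of m derived in this video.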
  • iam
    We have worked out m = (x bar * y bar - xy bar) / ((x bar)^2 - x^2 bar). But what if I somehow get the denominator to be zero? Does it mean there is a limitation of this 'analytical solution'? Thanks.
    (1 vote)
    • Dr C
      Think about what the denominator represents: variation in the X-variable.

      If there is no variation in the X-variable (i.e. all the X's are the same value), then there is absolutely no point in doing regression in the first place.
      (7 votes)
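Dr C's point can be checked numerically. Here is a minimal Python sketch, assuming the slope formula from this video. The function name `slope_denominator` and the sample data are made up for illustration.

```python
def slope_denominator(xs):
    """Return (mean of x)^2 - (mean of x^2), the denominator of the slope formula."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_x2 = sum(x * x for x in xs) / n
    return mean_x ** 2 - mean_x2


varied = [1.0, 2.0, 3.0]    # hypothetical x-values that differ
constant = [3.0, 3.0, 3.0]  # hypothetical x-values that are all identical

print(slope_denominator(varied))    # negative: minus the population variance of x
print(slope_denominator(constant))  # zero: the slope formula would divide by zero
```

The denominator is the negative of the population variance of x, so it is zero exactly when every x-value is the same.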
  • InnocentRealist
    @ :
    I'm having trouble seeing that both P1 = (x bar, y bar) and P2 = ( (x^2 bar / x bar ), (xy bar / x bar) ) are both on the best fit line.
    I imagine that any (u, v) such that v = mu + b is on the line y = mx + b; and therefore so are P1 and P2.
    The 2 derivatives tell us that (1) y bar = m * x bar + b, and (2) xy bar = m * x^2 bar + b * x bar (or, dividing through by x bar, (3) xy bar / x bar = m * (x^2 bar / x bar) + b). Solving (1) and (2) for m and b gives m1 and b1 (in terms of x and y). So (1) and (3) are both true using m1 and b1. Since (1) and (3) are both of the form v = m1 * u + b1, P1 and P2 are both on the best fit line.
    (2 votes)
  • tanmaygk97
    What if the mean is 0? The second point will cease to exist.
    (3 votes)
  • Apple Li
    In a previous video on the equation of Regression Line, m is derived from r*Sy/Sx, r being the Correlation Coefficient. Is this a different way of calculating m?
    (2 votes)
    • daniella
      The formula presented here for m is derived from the method of least squares, which minimizes the sum of the squared errors between the observed and predicted values of y. The formula m = r * (Sy / Sx) does not describe a different line: it is the same slope written in terms of the correlation coefficient and the sample standard deviations, so both calculations give the same value of m.
      (1 vote)
  • Alba Soma
    Why use this formula instead of m = r*(Sy/Sx), and then find b by substituting Xmean and Ymean into y = mx + b? I mean, it would take less time, wouldn't it?
    (2 votes)
    • daniella
      Both formulas are valid for calculating the regression line, but they approach the problem differently. The formula presented in the video derives the coefficients of the regression line directly from the method of least squares, minimizing the sum of the squared errors. The formula m = r * (Sy / Sx) uses the correlation coefficient to scale the standard deviations of x and y, providing a measure of the linear relationship. Depending on the context and available data, either approach can be used. The method in the video may be preferred when focusing on minimizing the squared errors, while the correlation coefficient approach may provide insights into the strength and direction of the linear relationship.
      (1 vote)
  • Mayank Jain
    While teaching about the regression line a few videos ago, Sal showed how to use the value of r to get the best-fitting line. Here we have another equation for the best-fitting least-squares regression line. Are these two the same? If yes, why do we have different formulas for the line? If not, what is the basic difference?
    (2 votes)
    • daniella
      Both methods give the same best-fitting line. The formula involving the correlation coefficient r expresses the slope through r and the standard deviations of x and y, while the formula in this video expresses it through the means of x, y, xy, and x^2. They are two algebraic forms of the same least-squares slope, so they yield the same result for any dataset in which the x-values are not all equal.
      (1 vote)
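The relationship between the two slope formulas can also be checked numerically. The Python sketch below is illustrative only. The function names and the data are made up, and r is assumed to be the sample correlation coefficient computed with the same n - 1 convention as the sample standard deviations.

```python
import math
from statistics import mean, stdev


def slope_from_means(xs, ys):
    """Slope from this video's formula, using the means of x, y, xy, and x^2."""
    mean_x, mean_y = mean(xs), mean(ys)
    mean_xy = mean([x * y for x, y in zip(xs, ys)])
    mean_x2 = mean([x * x for x in xs])
    return (mean_x * mean_y - mean_xy) / (mean_x ** 2 - mean_x2)


def slope_from_correlation(xs, ys):
    """Slope as r * (Sy / Sx), with r the sample correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = covariance_sum / ((n - 1) * s_x * s_y)
    return r * s_y / s_x


xs = [1.0, 2.0, 4.0]  # hypothetical data
ys = [2.0, 1.0, 3.0]

# Compare the two slopes up to floating-point rounding.
print(math.isclose(slope_from_means(xs, ys), slope_from_correlation(xs, ys)))
```

For data in which the x-values and the y-values both vary, the two functions should agree up to rounding, because they compute the same quantity in two algebraic forms.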

Video transcript

So if you've gotten this far, you've been waiting for several videos to get to the optimal line that minimizes the squared distance to all of those points. So let's just get to the punch line. Let's solve for the optimal m and b. And just based on what we did in the last videos, there's two ways to do that. We actually now know two points that lie on that line. So we can literally find the slope of that line and then the y-intercept, the b there. Or, we could just say it's the solution to this system of equations. And they're actually mathematically equivalent.

So let's solve for m first. And if we want to solve for m, we want to cancel out the b's. So let me rewrite this top equation just the way it's written over here. We have m times the mean of the x squareds plus b times the mean of-- Actually, we could even do it better than that. One step better than that is to, based on the work we did in the last video, we can just subtract this bottom equation from this top equation. So let me subtract it. Or let's add the negatives. So if I make this negative, this is negative. This is negative. What do we get? We get m times the mean of the x's minus the mean of the x squareds over the mean of x. The plus b and the negative b cancel out. Is equal to the mean of the y's minus the mean of the xy's over the mean of the x's. And then, we can divide both sides of the equation by this. And so we get m is equal to the mean of the y's minus the mean of the xy's over the mean of the x's over this. The mean of the x's minus the mean of the x squareds over the mean of the x's.

Now notice, this is the exact same thing that you would get if you found the slope between these two points over here. Change in y, so the difference between that y and that y, is that right over there. Over the change in x's. The change in that x minus that x is exactly this over here.

Now, to simplify it, we can multiply both the numerator and the denominator by the mean of the x's. And I do that just so we don't have this in the denominator both places. So if we multiply the numerator by the mean of the x's, we get the mean of the x's times the mean of the y's minus, this and this will cancel out, minus the mean of the xy's. All of that over, mean of the x's times the mean of the x's is just going to be the mean of the x's squared, minus over here you have the mean of the x squareds. And that's what we get for m.

And if we want to solve for b, we literally can just substitute back into either equation, but this equation right here is simpler. And so if we wanted to solve for b there, we can solve for b in terms of m. We just subtract m times the mean of x's from both sides. We get b is equal to the mean of the y's minus m times the mean of the x's. So what you do is you take your data points. You find the mean of the x's, the mean of the y's, the mean of the xy's, the mean of the x squareds. You find your m. Once you find your m, then you can substitute back in here and you find your b. And then you have your actual optimal line. And we're done.

So these are the two big formula takeaways for our optimal line. What I'm going to do in the next video, and this is where if anyone was skipping up to this point, the next video is where they should re-engage, because we're actually going to use these formulas for the best fitting line. At least, when you measure the error by the squared distances from the points. We're going to use these formulas to actually find the best line for some data.
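The procedure described in the transcript (compute four means, then m, then b) can be sketched in Python. This is an illustrative sketch rather than code from the video. The function name `fit_line`, the error messages, and the sample data are assumptions made here.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and the same length")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_x2 = sum(x * x for x in xs) / n

    # Denominator of the slope formula: (mean of x)^2 - (mean of x^2).
    # Exact comparison with zero is a simplification; with floating-point
    # data a small tolerance may be more appropriate.
    denominator = mean_x ** 2 - mean_x2
    if denominator == 0:
        raise ValueError("all x-values are equal, so the slope is undefined")

    m = (mean_x * mean_y - mean_xy) / denominator
    b = mean_y - m * mean_x
    return m, b


# Hypothetical data points (1, 2), (2, 1), (4, 3).
m, b = fit_line([1.0, 2.0, 4.0], [2.0, 1.0, 3.0])
print(m, b)
```

For this made-up data the means are 7/3, 2, 16/3, and 7, so the formulas give m = 3/7 and b = 1, up to floating-point rounding.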