Fitting a line to data

Sal creates a scatter plot and then fits a line to data on the median California family income. Created by Sal Khan.

Want to join the conversation?

  • Aseel
    How do we do this manually and write an equation using point-slope form?
    (13 votes)
  • Cary Wang
    What does "extrapolation" mean? What is linear regression?
    (5 votes)
    • Wrath Of Academy
      "Extrapolation" is using known data that follows a pattern to predict unknown data. A linear regression draws a line on the assumption that the scatter plot points follow a linear trend. He doesn't talk about it in this video, but there are other types of regression, like an "exponential regression," which works with something that grows exponentially, like population, a bank account, or GDP.
      (9 votes)
  • Aditya Roy
    I am confused about the word model.
    Is the line which is fitted in the data called the model?
    (3 votes)
  • Daniel Six
    How do you find it out, paper-pencil style?
    (3 votes)
  • Kevin
    So the meaning of the y-intercept is basically the year at which the study started, or at least where our table started, right?

    But what is the meaning of the slope? I would guess it to be the median income per year, but that doesn't mean much to me. I need a little help :) Thanks!
    (2 votes)
    • Derek Oldfield
      The slope represents the "approximate rate" at which the median income is increasing. Per year, the median income increases x amount of dollars. I say approximate rate, because the rate is not constant, but the line of best fit represents the trend in the data.
      (5 votes)
  • sweetdov
    What are the steps for figuring out y=2x+1 when x=5?

    Here is what I did

    Y=2(5)+1
    Y=10+1
    Y=11
    (2 votes)
  • Mareena
    What happens if Sal put the actual years in the x-axis rather than the number of years from 1995 (so like 1995, 1996, 1997 etc)? Would the equation be any different? And how would you calculate the income for 2010 then?
    (3 votes)
    • Composir
      The constant (52847) in the equation would change to show the point where x (the year) is 0 and the line crosses the y-axis. It would probably be something negative, because year 0 would literally mean going all the way back from 1995, 1994, 1993... to year 0. The slope wouldn't change, since it tells how much y changes if x changes by 1.
      (1 vote)
  • Richard Nyambura
    How could I have found the equation of the line without using Excel?
    (2 votes)
  • John
    What version of Excel is he using? I don't see those options on mine.
    (3 votes)
  • Blert Shabani
    Hey, so I am kind of confused about how the slope is 1882. Such a huge slope would give a really steep line, wouldn't it?
    Thanks!
    (1 vote)
    • Neil
      Great question! How steep a line looks on a graph is determined by two things. First, it is the slope, you're correct. But it is also determined by the scale of each axis. Because Sal's x-axis goes up by 1 per step while the y-axis goes up by thousands of dollars, the line appears less steep.
      Hope that helps!
      (4 votes)
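
A small sketch can check the reply to Mareena's question about putting calendar years on the x-axis. It assumes the rounded coefficients read off the chart in the video (slope 1,882.3, intercept 52,847); the constant and function names are made up for illustration.

```python
# Rounded coefficients from the video's trendline equation.
SLOPE = 1882.3
INTERCEPT_SINCE_1995 = 52847.0  # value of the line at x = 0, i.e. in 1995

# If x is the calendar year instead, the slope is unchanged and the
# intercept becomes the line's value at year 0.
INTERCEPT_CALENDAR = INTERCEPT_SINCE_1995 - SLOPE * 1995


def income_since_1995(years_since_1995):
    """Model with x measured in years since 1995."""
    return SLOPE * years_since_1995 + INTERCEPT_SINCE_1995


def income_calendar(year):
    """Same line, with x measured as the calendar year."""
    return SLOPE * year + INTERCEPT_CALENDAR


print(INTERCEPT_CALENDAR)       # a large negative number
print(income_since_1995(15))    # prediction for 2010
print(income_calendar(2010))    # same prediction, up to rounding error
```

Both versions describe the same line, so they agree on 2010; only the intercept, and what x = 0 means, changes.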

Video transcript

In this video I want to give you an example of what it means to fit a line to data. Instead of doing my traditional video using my little pen tablet, I'm going to do it straight in Excel so you can see how to do this for yourself if you have Excel or some other type of spreadsheet program. We're not going to go into the math of it. I really just want you to get the conceptual understanding of what it means to fit data with a line, or do a linear regression.
So here, let's just read the problem. The following table shows the median California income (remember, median is the middle income) from 1995 to 2002 as reported by the U.S. Census Bureau. Draw a scatter plot and find the equation. What would you expect the median annual income of a California family to be in the year 2010? What are the meanings of the slope and the y-intercept in this problem?
So the first thing you'd want to do (I just copied and pasted this image) is get the data into a form that the spreadsheet can understand. So let's make some tables here. Let's say years since 1995. Let's make that one column. Let me make this a little bit wider. Then let me put median income. This is the median income in California for a family. So we start off with 0 years since 1995: 0, 1, 2, 3, 4. Actually, if you want, it'll figure out the trend if you just keep going down. It'll figure out you're just incrementing by 1. Then the income, I'll just copy in these numbers right there. So that's $53,807, $55,217, $55,209, $55,415, $63,100, $63,206, $63,761, and then we have $65,766.
So I don't need these over here. So I'm going to get rid of them. I can clear them. So let me make sure I have enough entries. This is 1, 2, 3, 4, 5, 6, 7, 8, and I have 1, 2, 3, 4, 5, 6, 7, 8 entries. I want to make sure I got my data right. $53,807, $55,217, $55,209, 415, 100, 206, 761, 766. OK, there we go.
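
For readers following along without a spreadsheet, the table set up above can be sketched in Python. The income figures are the ones read out in the transcript; the variable names are illustrative choices.

```python
# Median California family income for 1995-2002, as read out in the video.
incomes = [53807, 55217, 55209, 55415, 63100, 63206, 63761, 65766]

# Calendar years, and the "years since 1995" column used as x.
years = list(range(1995, 1995 + len(incomes)))
years_since_1995 = [year - 1995 for year in years]

# Print the table: x, calendar year, income.
for x, year, income in zip(years_since_1995, years, incomes):
    print(f"{x:>2}  {year}  ${income:,}")
```

Counting from 0 rather than from 1995 is what lets the y-intercept of the fitted line be read as the model's value for 1995.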
Now you're going to find that in Excel this is incredibly easy if you know what to click on. First, plot this data as a scatter plot, and then, even better, create a regression of that data. So all you have to do is select the data. Then you go to Insert, and I'm going to insert a scatter plot. Then you can pick the different types of scatter plots. I just want to plot the data. There you go. It plotted the data for me. This axis is the actual income, and this one is years since 1995. So this is 1995. It was $53,807. In 1996 it's $55,217. So it plotted all the data.
Now what I want to do is fit a line. So this isn't exactly a line. But let's see, if we assume that a line can model this data well, I'm going to get Excel to fit a line for me. So I have all of these options up here for different ways to fit a line. I'm going to pick this one here. You might not be able to see it. It looks like it has a line between dots. It also has fx, which tells me it's going to give me the equation of the line. So if I click on that, there you go. It replotted that same data on a different graph. Let me make it a little bit bigger. No, I don't want to do that. Let me make it a little bit bigger. We can cover up the data now, just because I think we know what's going on. So let me cover it up right like that.
So not only did it plot the various data points, it actually fit a line to that data and it gave me the equation of that line. Let me see if I can make this a little bit bigger. I'll move it out of the way so you can read it at least. So it tells me right here that the equation for this line is y is equal to 1,882.3x plus 52,847. So if you remember what we know about slope and y-intercept, the y-intercept is 52,847, which is where this line intersects the y-axis at year 0, or in 1995.
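
The video leaves the math to Excel, but the same line can be reproduced with the standard least-squares formulas, which is also how one would do it by hand. The sketch below assumes that Excel's linear trendline is an ordinary least-squares fit; it is not taken from the video, and the variable names are illustrative.

```python
# Data from the video: x is years since 1995, y is median family income.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [53807, 55217, 55209, 55415, 63100, 63206, 63761, 65766]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope: sum of (x - mean_x)(y - mean_y) divided by
# the sum of (x - mean_x)^2.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
slope = s_xy / s_xx

# The fitted line passes through the point (mean_x, mean_y).
intercept = mean_y - slope * mean_x

print(f"y = {slope:.2f}x + {intercept:.2f}")
```

The result is a slope of 1882.25 and an intercept of 52847.25, consistent with the rounded 1,882.3 and 52,847 shown on the chart.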
So if you use this line as a model, in 1995 the line would say that you're going to make $52,847. The actual data was a little bit off of that. It was a little bit higher, $53,807. But we're trying to get a line that gets as close as possible to all of this data. It's actually trying to minimize the sum of the squares of the vertical distances between each of these points and the line. We won't go into the math there. But it gave us this nice equation. Now we can use this nice equation to predict things.
If we say that this is a good model for the data (let me bring this down a little bit), let's try to answer our question. So we drew a scatter plot, or really Excel did it for us. We found the equation right there. They say, what would you expect the median annual income of a California family to be in the year 2010? So here, we can just use the equation Excel gave us. This right here was 2002. So I could write down the year. This was the year 2002. So the year 2010 is 8 more years. Let me make a little column here. So this is the year, 1995, 1996. Then Excel will be able to figure out, if I select those and I go to this little bottom right square and I scroll down, that I want to increment by 1 year every time. If I say years since 1995, once again I can just continue this trend right here. So 2010 would be 15 years.
So we can just apply this equation. We could say it's going to be equal to, according to this line (I'm just going to type it in, hopefully you can read what I'm typing), 1,882.3 times x. x here is the years since 1995. I could just select this cell, or I could type in the number 15. That means times this cell, times 15. Then plus 52,847, plus that right there. Click enter and it predicts $81,081.50. So if you just continue this line for another 8 or so years, it predicts that the median income in California for a family will be about $81,000. Anyway, hopefully you found that interesting.
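
The prediction step is just plugging x = 15 into the fitted equation. A minimal sketch, using the rounded coefficients from the video and a hypothetical helper named `predict_income`:

```python
def predict_income(years_since_1995):
    """Evaluate the video's trendline y = 1882.3x + 52847."""
    return 1882.3 * years_since_1995 + 52847


# At x = 0 (1995) the model gives the y-intercept, which is a bit
# below the actual 1995 figure of $53,807.
model_1995 = predict_income(0)
actual_1995 = 53807
print(model_1995, actual_1995 - model_1995)

# Extrapolate to 2010, which is 15 years after 1995.
prediction_2010 = predict_income(2010 - 1995)
print(f"${prediction_2010:,.2f}")
```

The 2010 value lies 8 years past the last data point, so it is an extrapolation and is only as good as the assumption that the linear trend continues.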
Spreadsheets are very useful tools for manipulating data. Hopefully this gives you a sense of why linear models are interesting, why lines are interesting, and how you can actually use these tools to interpret data and maybe even extrapolate some type of a prediction. This right here is an extrapolation using this linear regression.