Sal creates a scatter plot and then fits a line to data on the median California family income. Created by Sal Khan.
Video transcript
In this video I want to give you an example of what it means to fit data to a line. Instead of doing my traditional video using my little pen tablet, I'm going to do it straight in Excel so you can see how to do this for yourself if you have Excel or some other type of spreadsheet program. We're not going to go into the math of it. I really just want you to get the conceptual understanding of what it means to fit data with a line, or do a linear regression. So here, let's just read the problem. The following table shows the median California income-- remember median is the middle, the middle California income --from 1995 to 2002 as reported by the U.S. Census Bureau. Draw a scatter plot and find the equation. What would you expect the median annual income of a California family to be in the year 2010? What are the meanings of the slope and the y-intercept in this problem? So the first thing you'd want to do-- I just copied and pasted this image --is get the data into a form that the spreadsheet can understand. So let's make some tables here. Let's say years since 1995. Let's make that one column. Let me make this a little bit wider. Then let me put median income. This is the median income in California for a family. So we start off with 0 years since 1995: 0, 1, 2, 3, 4. Actually, if you want, Excel will figure out the trend if you just keep going down. It'll figure out you're just incrementing by 1. Then the income, I'll just copy in these numbers right there. So that's $53,807, $55,217, $55,209, $55,415, $63,100, $63,206, $63,761, and then we have $65,766. So I don't need these over here. So I'm going to get rid of them. I can clear them. So let me make sure I have enough entries. This is 1, 2, 3, 4, 5, 6, 7, 8, and I have 1, 2, 3, 4, 5, 6, 7, 8 entries. I want to make sure I got my data right. $53,807, $55,217, $55,209, $55,415, $63,100, $63,206, $63,761, $65,766. OK, there we go.
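For anyone following along without a spreadsheet, the two columns described above can be sketched in Python. The variable names (`years`, `years_since_1995`, `median_income`) are made up for illustration; the income figures are the ones read out from the table.

```python
# Calendar years covered by the table: 1995 through 2002.
years = list(range(1995, 2003))

# First column: years since 1995 (0 for 1995, 1 for 1996, and so on).
years_since_1995 = [year - 1995 for year in years]

# Second column: median California family income in dollars, as read from the table.
median_income = [53807, 55217, 55209, 55415, 63100, 63206, 63761, 65766]

# Same sanity check as in the video: both columns should have 8 entries.
assert len(years_since_1995) == len(median_income) == 8

# Print the table row by row.
for x, y in zip(years_since_1995, median_income):
    print(x, y)
```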
Now you're going to find that in Excel this is incredibly easy if you know what to click on. You can plot this data as a scatter plot and then, even better, create a regression of that data. So all you have to do is select the data. Then you go to insert, and I'm going to insert a scatter plot. Then you can pick the different types of scatter plots. I just want to plot the data. There you go. It plotted the data for me. This axis is the actual income, and this one is years since 1995. So this is 1995. It was $53,807. In 1996 it's $55,217. So it plotted all the data. Now what I want to do is fit a line. So this isn't exactly a line. But let's see, if we assume that a line can model this data well, I'm going to get Excel to fit a line for me. So what I can do is I have all of these options up here for different ways to fit a line. I'm going to pick this one here. You might not be able to see it. It looks like it has a line between dots. It also has fx, which tells me it's going to give me the equation of the line. So if I click on that, there you go. It replotted that same data on a different graph. Let me make it a little bit bigger. No, I don't want to do that. Let me make it a little bit bigger. We can cover up the data now, just because I think we know what's going on. So let me cover it up right like that. So not only did it plot the various data points, it actually fit a line to that data and it gave me the equation of that line. Let me see if I can make this a little bit bigger. I'll move it out of the way so you can read it at least. So it tells me right here that the equation for this line is y is equal to 1,882.3x plus 52,847. So if you remember what we know about slope and y-intercept, the slope is 1,882.3, which means the line has the median income going up by about $1,882 each year, and the y-intercept is 52,847, which is, if you use this line as your measure, where this line intersects the y-axis at year 0, or in 1995.
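The video skips the math, but a linear trendline of this kind is an ordinary least-squares fit, so the same slope and intercept can be reproduced by hand. The following is a minimal sketch of that calculation, not a description of Excel's internals; the function name `fit_line` is an assumption made for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of (x - mean_x)(y - mean_y) divided by sum of (x - mean_x)^2.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Data from the table: years since 1995 and median income in dollars.
years_since_1995 = [0, 1, 2, 3, 4, 5, 6, 7]
median_income = [53807, 55217, 55209, 55415, 63100, 63206, 63761, 65766]

slope, intercept = fit_line(years_since_1995, median_income)
print(f"y = {slope:.2f}x + {intercept:.2f}")  # y = 1882.25x + 52847.25
```

The unrounded values, 1,882.25 and 52,847.25, agree with the 1,882.3 and 52,847 shown on the chart once they are rounded for display.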
So if you use this line as a model, in 1995 the line would say that you're going to make $52,847. The actual data was a little bit off of that. It was a little bit higher, $53,807. But we're trying to get a line that gets as close as possible to all of this data. It's actually trying to minimize the distance, the square of the distance, between each of these points and the line. We won't go into the math there. But it gave us this nice equation. Now we can use this nice equation to predict things. If we say that this is a good model for the data-- let me bring this down a little bit --let's try to answer our question. So we drew a scatter plot-- really Excel did it for us. We found the equation right there. They say, what would you expect the median annual income of a California family to be in the year 2010? So here, we can just use the equation it gave us. This right here was 2002. So I could write down the year. This was the year 2002. So the year 2010 is 8 more years. Let me make a little column here. So this is the year, 1995, 1996. If I select those, go to this little bottom right square, and drag down, Excel will actually figure out that I want to increment by 1 year every time. If I say years since 1995, once again I can just continue this trend right here. So 2010 would be 15 years. So we can just apply this equation. We could say it's going to be equal to, according to this line-- I'm just going to type it in, hopefully you can read what I'm typing --1,882.3 times x. x here is the years since 1995. I could just select this cell, or I could type in the number 15. That means times this cell, times 15. Then plus 52,847, plus that right there. Press enter and it predicts $81,081.50. So if you just continue this line for another 8 or so years, it predicts that the median income in California for a family will be about $81,000. Anyway, hopefully you found that interesting.
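The prediction step is just the fitted equation evaluated at x = 15. Here is a small sketch using the rounded coefficients from the chart; `predict_income` is a hypothetical helper name, not anything built into Excel.

```python
SLOPE = 1882.3       # dollars per year, as displayed on the chart
INTERCEPT = 52847.0  # the line's income at x = 0, that is, in 1995


def predict_income(year):
    """Median income the fitted line gives for a calendar year."""
    x = year - 1995  # years since 1995
    return SLOPE * x + INTERCEPT


# In 1995 the line sits a bit below the actual data point of 53,807.
print(f"{53807 - predict_income(1995):.2f}")  # 960.00

# Extrapolating to 2010, which is 15 years after 1995.
print(f"{predict_income(2010):.2f}")  # 81081.50
```

Because 2010 lies well outside the 1995 to 2002 data, this figure is only as good as the assumption that the trend stays linear.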
Spreadsheets are very useful tools for manipulating data. This should give you a sense of why linear models are interesting, why lines are interesting, and how you can actually use these tools to interpret data and maybe even extrapolate some type of prediction. This right here is an extrapolation using this linear regression.