Fitting a line to data
Sal creates a scatter plot and then fits a line to data on the median California family income. Created by Sal Khan.
- How do we do this manually and write an equation using point-slope form?(15 votes)
- Find the coordinates of two points on the line (not the data points, since they don't necessarily lie on the line).
The slope can then be calculated as the change in y over the change in x, and then you can use the point-slope form.(12 votes)
- Are there more videos on how to use Excel?(12 votes)
- I can only speak for the eighth grade math course. I don't watch every video Khan Academy makes, but I watch a lot of them, and the ones I have watched don't include examples using Excel. In summary, there may be some, but I don't know of any, and my guess is that there aren't.(0 votes)
- What does "extrapolation" mean? What is linear regression?(6 votes)
- "Extrapolation" is using known data with a pattern to predict unknown data. A linear regression draws a line assuming that the scatter plot points go up linearly. He doesn't talk about it in this video, but there are other types of regression lines, like an "exponential regression," which works with something that grows exponentially, like: population, or a bank account, or GDP.(11 votes)
- I am confused about the word model.
Is the line which is fitted in the data called the model?(4 votes)
- Yes, the line that is fitted to the data (and the equation of this line) is an example of a model.
Have a blessed, wonderful day!(5 votes)
- How do you find it out, paper-pencil style?(4 votes)
- Well, you can ignore the fact that the data is not exactly linear and use the y = mx + b formula. You still get an estimate, which is all they ask for, since you can't predict the future exactly in the real world.(5 votes)
- So the meaning of the y-intercept is basically the year at which the study started, or at least where our table started, right?
But what is the meaning of the slope? I would guess it to be the median income per year, but that doesn't mean much to me. I need a little help :) Thanks!(3 votes)
- The slope represents the "approximate rate" at which the median income is increasing. Each year, the median income increases by roughly the slope, in dollars. I say approximate rate because the rate is not constant, but the line of best fit represents the trend in the data.(6 votes)
- What happens if Sal put the actual years in the x-axis rather than the number of years from 1995 (so like 1995, 1996, 1997 etc)? Would the equation be any different? And how would you calculate the income for 2010 then?(4 votes)
- The constant (52847) in the equation would change, because it would now be the value where x (the year) is 0 and the line crosses the y-axis. It would be a large negative number, since year 0 literally means going all the way back from 1995, 1994, 1993... to year 0. The slope wouldn't change, since it tells how much y changes when x changes by 1. To estimate the income for 2010, you would then plug in x = 2010.(2 votes)
- How could I have found the equation of the line without using Excel?(3 votes)
- You could have used a calculator, too. You could even have done it by hand, but the numbers are so big it would have taken you a while to do it!(4 votes)
- Hey, so I am kind of confused about how the slope is 1882. Such a huge slope would give a really steep line, wouldn't it?
- Great question! How steep a line looks is determined by two things. The first is the slope, as you said. The second is the scale of each axis. Because Sal's x- and y-axes use very different scales (years across, dollars up), the line appears less steep.
Hope that helps!(5 votes)
- What type of Excel is he using? I don't see those options on mine.(4 votes)
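Several questions in the thread above ask how to get an equation by hand in point-slope form, and what changes if calendar years go on the x-axis. The following is a minimal sketch of that arithmetic in Python. The two points are illustrative: they are computed from the equation reported in the transcript (y = 1882.3x + 52847) rather than read off a hand-drawn line, and the function names are made up for this example.

```python
# Illustrative points ON the fitted line (not data points). They are
# computed from the transcript's equation y = 1882.3x + 52847, which is
# an assumption standing in for values read off a hand-drawn line.
x1, y1 = 0, 52847
x2, y2 = 10, 71670

slope = (y2 - y1) / (x2 - x1)  # change in y over change in x


def income_point_slope(x):
    """Point-slope form: y - y1 = slope * (x - x1), solved for y."""
    return y1 + slope * (x - x1)


def income_by_calendar_year(year):
    """Same line with the calendar year on the x-axis (x shifted by 1995)."""
    return y1 + slope * (year - 1995)


# Intercept if the x-axis held calendar years: the line's value at year 0.
intercept_calendar = income_by_calendar_year(0)

print(f"slope = {slope:.1f}")
print(f"intercept with calendar years = {intercept_calendar:,.1f}")
print(f"2010 estimate = {income_by_calendar_year(2010):,.2f}")
```

With calendar years on the x-axis the slope stays the same and only the intercept moves, to roughly negative 3.7 million, which has no real-world meaning as an income. Either version of the line gives the same estimate for 2010.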
In this video I want to give you an example of what it means to fit a line to data. Instead of doing my traditional video using my little pen tablet, I'm going to do it straight in Excel so you can see how to do this for yourself if you have Excel or some other type of spreadsheet program. We're not going to go into the math of it. I really just want you to get the conceptual understanding of what it means to fit data with a line, or do a linear regression. So here, let's just read the problem. The following table shows the median California income (remember, median is the middle, the middle California income) from 1995 to 2002 as reported by the U.S. Census Bureau. Draw a scatter plot and find the equation. What would you expect the median annual income of a California family to be in the year 2010? What are the meanings of the slope and the y-intercept in this problem? So the first thing you'd want to do (I just copied and pasted this image) is get the data into a form that the spreadsheet can understand. So let's make some tables here. Let's say years since 1995. Let's make that one column. Let me make this a little bit wider. Then let me put median income. This is the median income in California for a family. So we start off with 0 years since 1995: 0, 1, 2, 3, 4. Actually, if you want, it'll figure out the trend if you just keep going down. It'll figure out you're just incrementing by 1. Then the income, I'll just copy in these numbers right there. So that's $53,807, $55,217, $55,209, $55,415, $63,100, $63,206, $63,761, and then we have $65,766. So I don't need these over here. So I'm going to get rid of them. I can clear them. So let me make sure I have enough entries. This is 1, 2, 3, 4, 5, 6, 7, 8, and I have 1, 2, 3, 4, 5, 6, 7, 8 entries. I want to make sure I got my data right. $53,807, $55,217, $55,209, 415, 100, 206, 761, 766. OK, there we go.
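For readers without a spreadsheet, the two columns described above could be set up like this in Python. This is only a sketch: the variable names are made up for the example, and the values are the eight incomes read out in the transcript.

```python
# The two spreadsheet columns from the transcript, as plain Python lists.
# Variable names are hypothetical; the values are those read out above.
years_since_1995 = list(range(8))  # 0 stands for 1995, 7 stands for 2002
median_income = [53807, 55217, 55209, 55415, 63100, 63206, 63761, 65766]

# Same sanity check as in the video: both columns need 8 entries.
assert len(years_since_1995) == len(median_income) == 8

# Print the table row by row: calendar year, x value, income.
for x, income in zip(years_since_1995, median_income):
    print(f"{1995 + x}  x={x}  ${income:,}")
```

Using years since 1995 instead of the calendar year keeps the x values small, so the y-intercept of any fitted line can be read as the model's value for 1995.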
Now you're going to find that in Excel this is incredibly easy if you know what to click on. One, plot this data, create a scatter plot, and then, even better, create a regression of that data. So all you have to do is select the data. Then you go to Insert, and I'm going to insert a scatter plot. Then you can pick the different types of scatter plots. I just want to plot the data. There you go. It plotted the data for me. This axis is the actual income, and this one is years since 1995. So this is 1995. It was $53,807. In 1996 it's $55,217. So it plotted all the data. Now what I want to do is fit a line. So this isn't exactly a line. But let's see, if we assume that a line can model this data well, I'm going to get Excel to fit a line for me. So what I can do is I have all of these options up here for different ways to fit a line. I'm going to pick this one here. You might not be able to see it. It looks like it has a line between dots. It also has fx, which tells me it's going to give me the equation of the line. So if I click on that, there you go. It not only fit a line, it replotted that same data on a different graph. Let me make it a little bit bigger. No, I don't want to do that. Let me make it a little bit bigger. We can cover up the data now, just because I think we know what's going on. So let me cover it up right like that. So not only did it plot the various data points, it actually fit a line to that data and it gave me the equation of that line. Let me see if I can make this a little bit bigger. I'll move it out of the way so you can read it at least. So it tells me right here that the equation for this line is y is equal to 1,882.3x plus 52,847. So if you remember what we know about slope and y-intercept, the y-intercept is 52,847, which is, if you use this line as your measure, where this line intersects the y-axis at year 0, or in 1995.
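The video leaves the math to Excel. As a rough sketch of what a least-squares line fit computes, the standard textbook formulas can be applied to the same eight data points in Python. The function name `fit_line` is made up for this example, and this is not a claim about how Excel implements its trendline internally.

```python
# Ordinary least-squares fit of y = slope * x + intercept, written out
# with the standard textbook formulas. fit_line is a hypothetical name.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through the means
    return slope, intercept


years_since_1995 = list(range(8))
median_income = [53807, 55217, 55209, 55415, 63100, 63206, 63761, 65766]

slope, intercept = fit_line(years_since_1995, median_income)
print(f"y = {slope:.2f}x + {intercept:.2f}")  # prints: y = 1882.25x + 52847.25
```

The result agrees with the equation shown in the video once it is rounded the way the chart label displays it (1,882.3 and 52,847).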
So if you use this line as a model, in 1995 the line would say that you're going to make $52,847. The actual data was a little bit off of that. It was a little bit higher, $53,807. But we're trying to get a line that gets as close as possible to all of this data. It's actually trying to minimize the sum of the squares of the vertical distances between each of these points and the line. We won't go into the math there. But it gave us this nice equation. Now we can use this nice equation to predict things. If we say that this is a good model for the data (let me bring this down a little bit), let's try to answer our question. So we drew a scatter plot; really, Excel did it for us. We found the equation right there. They say, what would you expect the median annual income of a California family to be in the year 2010? So here, we can just use the equation Excel gave us. This right here was 2002. So I could write down the year. This was the year 2002. So the year 2010 is 8 more years. Let me make a little column here. So this is the year, 1995, 1996. Then Excel will be able to figure out, if I select those and go to this little bottom right square and drag down, that I want to increment by 1 year every time. If I say years since 1995, once again I can just continue this trend right here. So 2010 would be 15 years. So we can just apply this equation. We could say it's going to be equal to, according to this line (I'm just going to type it in, hopefully you can read what I'm typing), 1,882.3 times x. x here is the years since 1995. I could just select this cell, or I could type in the number 15. That means times this cell, times 15. Then plus 52,847, plus that right there. Press Enter and it predicts $81,081.50. So if you just continue this line for another 8 or so years, it predicts that the median income in California for a family will be about $81,000. Anyway, hopefully you found that interesting.
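The prediction step above can be checked with a few lines of Python. This is a sketch that takes the rounded equation from the video as given; the constant and function names are made up for the example.

```python
# Rounded equation from the video: y = 1882.3x + 52847,
# where x is the number of years since 1995.
SLOPE = 1882.3
INTERCEPT = 52847


def predicted_income(year):
    """Model's estimate of the median income for a calendar year."""
    x = year - 1995  # convert the calendar year to years since 1995
    return SLOPE * x + INTERCEPT


# Extrapolation: 2010 is 15 years after 1995, 8 years past the last data point.
print(f"2010: ${predicted_income(2010):,.2f}")  # prints: 2010: $81,081.50

# In 1995 the model sits below the actual value of $53,807.
gap_1995 = 53807 - predicted_income(1995)
print(f"1995: actual minus model = ${gap_1995:,.0f}")  # prints: 1995: actual minus model = $960
```

The 1995 gap shows that the line is a compromise across all eight points rather than an exact match for any one of them, which is worth keeping in mind when reading the 2010 figure.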
Spreadsheets are very useful tools for manipulating data. It'll give you a sense of why linear models are interesting, why lines are interesting, and how you can actually use these tools to interpret data and maybe even extrapolate some type of a prediction. This right here, is an extrapolation using this linear regression.