© 2023 Khan Academy

# Fitting a line to data

Sal creates a scatter plot and then fits a line to data on the median California family income. Created by Sal Khan.

## Want to join the conversation?

- How do we do this manually and write an equation using point-slope form? (13 votes)
- Find the coordinates of two points on the line (not the data points, since they don't necessarily lie on the line).

The slope can then be calculated as the change in y over the change in x, and then you can use the point-slope form. (9 votes)
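
As a rough illustration of the answer above, the sketch below takes two points read off a trend line and builds the point-slope equation from them. The two points are assumptions chosen for illustration (they lie on the line y = 1882.3x + 52847 from the video), and the helper name `point_slope` is hypothetical.

```python
def point_slope(p1, p2):
    """Return (slope, predict) for the line through two points on the line."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)  # change in y over change in x

    def predict(x):
        # Point-slope form: y - y1 = slope * (x - x1), solved for y
        return y1 + slope * (x - x1)

    return slope, predict


# Two points assumed to be read off the trend line (not raw data points).
slope, predict = point_slope((0, 52847), (10, 71670))
print(round(slope, 1))        # slope of the line
print(round(predict(15), 1))  # value of the line at x = 15
```

Picking points that are far apart on the line tends to make the hand-computed slope less sensitive to small reading errors.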

- What does "extrapolation" mean? What is linear regression? (5 votes)
- "Extrapolation" is using known data with a pattern to predict unknown data. A linear regression draws a line assuming that the scatter plot points follow a roughly linear trend. He doesn't talk about it in this video, but there are other types of regression, like an "exponential regression," which works with something that grows exponentially, like population, a bank account, or GDP. (9 votes)

- I am confused about the word model.

Is the line which is fitted to the data called the model? (3 votes)
- Yes, the line that is fitted to the data (and the equation of this line) is an example of a model.

Have a blessed, wonderful day! (4 votes)

- How do you find it out, paper-pencil style? (3 votes)
- Well, you can ignore the fact that the data is not really linear, use the y=mx+b formula, and still get an estimate, which is all they ask for, since you can't predict the future in the real world. (4 votes)

- So the meaning of the y-intercept is basically the year at which the study started, or at least where our table started, right?

But what is the meaning of the slope? I would guess it to be the median income per year, but that doesn't mean much to me. Need a little help :) Thanks! (2 votes)
- The slope represents the "approximate rate" at which the median income is increasing: per year, the median income increases by about that many dollars (here, roughly $1,882). I say approximate rate because the rate is not constant, but the line of best fit represents the trend in the data. (5 votes)

- What are the steps for figuring out y=2x+1 when x=5?

Here is what I did

Y=2(5)+1

Y=10+1

Y=11 (2 votes)

- What happens if Sal put the actual years on the x-axis rather than the number of years since 1995 (so 1995, 1996, 1997, etc.)? Would the equation be any different? And how would you calculate the income for 2010 then? (3 votes)
- The constant (52847) in the equation would change to show the point where x (the year) is 0 and the line crosses the y-axis. It would be a large negative number, because year 0 means extending the line back from 1995, 1994, 1993... all the way to year 0. The slope wouldn't change, since it tells how much y changes if x changes by 1. (1 vote)
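
One way to check this is to refit the line with calendar years on the x-axis. The sketch below uses the eight income values read out in the video and the standard least-squares formulas (which the video does not derive); the helper name `fit_line` is made up for this example.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Median incomes for 1995 through 2002, as read out in the video.
incomes = [53807, 55217, 55209, 55415, 63100, 63206, 63761, 65766]
years_since_1995 = list(range(8))          # 0, 1, ..., 7
calendar_years = list(range(1995, 2003))   # 1995, ..., 2002

m1, b1 = fit_line(years_since_1995, incomes)
m2, b2 = fit_line(calendar_years, incomes)

print(m1, b1)  # slope and intercept with x = years since 1995
print(m2, b2)  # same slope, but a very large negative intercept
print(m1 * 15 + b1, m2 * 2010 + b2)  # both versions' predictions for 2010
```

With calendar years you would plug in x = 2010 directly instead of x = 15, and the two predictions should agree up to rounding.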

- How could I have found the equation of the line without using Excel? (2 votes)
- You could have used a calculator, too. You could even have done it by hand, but the numbers are so big it would have taken you a while to do it! (3 votes)
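
For anyone curious what "by hand" would involve, here is a sketch of the arithmetic using the standard least-squares formulas, which the video itself does not cover. It assumes x is years since 1995 (0 through 7) and y is the eight incomes read out in the video; the symbols $\bar{x}$, $\bar{y}$, $S_{xx}$, and $S_{xy}$ are notation introduced here.

$$
\begin{aligned}
\bar{x} &= \frac{0 + 1 + \cdots + 7}{8} = 3.5, \qquad \bar{y} = \frac{475{,}481}{8} = 59{,}435.125 \\
S_{xx} &= \sum_i (x_i - \bar{x})^2 = 42 \\
S_{xy} &= \sum_i (x_i - \bar{x})(y_i - \bar{y}) = 79{,}054.5 \\
m &= \frac{S_{xy}}{S_{xx}} = \frac{79{,}054.5}{42} = 1{,}882.25 \\
b &= \bar{y} - m\,\bar{x} = 59{,}435.125 - 1{,}882.25 \times 3.5 = 52{,}847.25
\end{aligned}
$$

Rounded, this agrees with the y = 1,882.3x + 52,847 that Excel reports in the video.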

- What type of Excel is he using? I don't see those options on mine. (3 votes)

- Hey, so I am kinda confused at how the slope is 1882. Such a huge slope would give a really steep line, wouldn't it?

Thanks! (1 vote)
- Great question! The "steepness" of a line is determined by 2 things. First, it is the slope, you're correct. But it is also determined by the scale of each axis, that is, how much each axis increases per unit of length on the graph. Because Sal chose to have the x- and y-axes increase at different rates, the line appears less steep.

Hope that helps! (4 votes)

## Video transcript

In this video I want to give you an example of what it means to fit data to a line. Instead of doing my traditional video using my little pen tablet, I'm going to do it straight in Excel so you can see how to do this for yourself, if you have Excel or some other type of spreadsheet program. We're not going to go into the math of it. I really just want you to get the conceptual understanding of what it means to fit data with a line, or do a linear regression.

So here, let's just
read the problem. The following table shows the median California income-- remember median is the middle, the middle California income --from 1995 to 2002 as reported by the U.S. Census Bureau. Draw a scatter plot and find the equation. What would you expect the median annual income of a California family to be in the year 2010? What are the meanings of the slope and the y-intercept of this problem?

So the first thing you'd want
to do-- I just copied and pasted this image --we have to get the data in a form that the spreadsheet can understand. So let's make some tables here. Let's say years since 1995. Let's make that one column. Let me make this a little bit wider. Then let me put median income. This is the median income in California for a family. So we start off with 1 year, or 0 years since 1995: 0, 1, 2, 3, 4. Actually if you want, it'll figure out the trend if you just keep going down. It'll figure out you're just incrementing by 1.

Then the income, I'll
just copy in these numbers right there. So that's $53,807, $55,217, $55,209, $55,415, $63,100, $63,206, $63,761, and then we have $65,766. So I don't need these over here. So I'm going to get rid of them. I can clear them. So let me make sure I have enough entries. This is 1, 2, 3, 4, 5, 6, 7, 8, and I have 1, 2, 3, 4, 5, 6, 7, 8 entries. I want to make sure I got my data right. $53,807, $55,217, $55,209, 415, 100, 206, 761, 766. OK, there we go.

Now you're going to find that
in Excel this is incredibly easy if you know what to click on. One, plot this data, create a scatter plot, and then even better, create a regression of that data. So all you have to do is you select the data. Then you go to insert, and I'm going to insert a scatter plot. Then you can pick the different types of scatter plots. I just want to plot the data. There you go. It plotted the data for me. There you go. If you go by this, this is the actual income, and this is by years since 1995. So this is 1995. It was $53,807. In 1996 it's $55,217. So it plotted all the data.

Now what I want to
do is fit a line. So this isn't exactly a line. But let's see, if we assume that a line can model this data well, I'm going to get Excel to fit a line for me. So what I can do is I have all of these options up here for different ways to fit a line, all of these different options. I'm going to pick this one here. You might not be able to see it. It looks like it has a line between dots. It also has fx, which tells me it's going to tell me the equation of the line. So if I click on that, there you go. It not only fit, it replotted that same data on a different graph. Let me make it a little bit bigger. No, I don't want to do that. Let me make it a little bit bigger. We can cover up the data now, just because I think we know what's going on. So let me cover it up right like that.

So not only did it plot the
various data points, it actually fit a line to that data and it gave me the equation of that line. Let me see if I can make this a little bit bigger. I'll move it out of the way so you can read it at least. So it tells me right here that the equation for this line is y is equal to 1,882.3x plus 52,847.

So if you remember what we know about slope and y-intercept, the y-intercept is 52,847, which is, if you use this line as your measure, where this line intersects at year 0, or in 1995. So if you use this line as a model, in 1995 the line would say that you're going to make $52,847. The actual data was a little bit off of that. It was a little bit higher, $53,807. So it was a little bit higher. But we're trying to get a line that gets as close as possible to all of this data. It's actually trying to minimize the distance, the square of the distance, between each of these points and the line. We won't go into the math there. But it gave us this nice equation.

Now we can use this nice
equation to predict things. If we say that this is a good model for the data-- let me bring this down a little bit --let's try to answer our question. So we drew a scatter plot-- really Excel did it for us. We found the equation right there. They say, what would you expect the median annual income of a California family to be in the year 2010? So here, we can just use the equation they gave us.

This right here was 2002. So I could write down the year. This was the year of 2002. So the year 2010 is 8 more years. Let me make a little column here. So this is the year, 1995, 1996. Then Excel will be able to figure out if I select those, and I go to this little bottom right square and I scroll down, Excel will actually figure out that I want to increment by 1 year every time. If I say years since 1995, once again I can just continue this trend right here. So 2010 would be 15 years.

So we can just apply
this equation. We could say it's going to be equal to, according to this line-- I'm just going to type it in, hopefully you can read what I'm saying --1,882.3 times x. x here is the years since 1995. I could just select this cell, or I could type in the number 15. That means times this cell, times 15. Then plus 52,847, plus that right there. Click enter and it predicts $81,081.50. So if you just continue this line for another 8 or so years, it predicts that the median income in California for a family will be about $81,000.

Anyway, hopefully you found
that interesting. Spreadsheets are very useful tools for manipulating data. It'll give you a sense of why linear models are interesting, why lines are interesting, and how you can actually use these tools to interpret data and maybe even extrapolate some type of a prediction. This right here is an extrapolation using this linear regression.