# Fitting a line to data

Sal creates a scatter plot and then fits a line to data on the median California family income. Created by Sal Khan.

Video transcript

In this video I want to give
you an example of what it means to fit data to a line. Instead of doing my traditional
video using my little pen tablet, I'm going to
do it straight in Excel so you can see how to do this for
yourself if you have Excel or some other type of
spreadsheet program. We're not going to go
into the math of it. I really just want you to get
the conceptual understanding of what it means to fit data
with a line, or do a linear regression. So here, let's just
read the problem. The following table shows the
median California income-- remember median is the middle,
the middle California income --from 1995 to 2002
as reported by the U.S. Census Bureau. Draw a scatter plot and
find the equation. What would you expect the median
annual income of a California family to be
in the year 2010? What are the meanings of the
slope and the y-intercept of this problem? So the first thing you'd want
to do-- I just copied and pasted this image --we have to
get the data in a form that the spreadsheet can
understand. So let's make some
tables here. Let's say years since 1995. Let's make that one column. Let me make this a
little bit wider. Then let me put median income. This is the median income in
California for a family. So we start off with
0 years since 1995: 0, 1, 2, 3, 4. Actually if you want, it'll
figure out the trend if you just keep going down. It'll figure out you're just
incrementing by 1. Then the income, I'll
just copy in these numbers right there. So that's $53,807, $55,217,
$55,209, $55,415, $63,100, $63,206, $63,761, and then
we have $65,766. So I don't need these
over here. So I'm going to get
rid of them. I can clear them. So let me make sure I
have enough entries. This is 1, 2, 3, 4, 5, 6, 7, 8,
and I have 1, 2, 3, 4, 5, 6, 7, 8 entries. I want to make sure I got my
data right. $53,807, $55,217, $55,209, 415, 100,
206, 761, 766. OK, there we go. Now you're going to find that
in Excel this is incredibly easy if you know what
to click on. One, plot this data, create a
scatter plot, and then even better, create a regression
of that data. So all you have to do is
you select the data. Then you go to insert,
and I'm going to insert a scatter plot. Then you can pick the different types of scatter plots. I just want to plot the data. There you go. It plotted the data for me. The vertical axis is the actual
income, and the horizontal axis is years since 1995. So this is 1995. It was $53,807. In 1996 it's $55,217. So it plotted all the data. Now what I want to
do is fit a line. So this isn't exactly a line. But let's see, if we assume
that a line can model this data well, I'm going to get
Excel to fit a line for me. So what I can do is I have all
of these options up here for different ways to fit
a line, all of these different options. I'm going to pick
this one here. You might not be
able to see it. It looks like it has a
line between dots. It also has fx, which tells me it's
going to give me the equation of the line. So if I click on that,
there you go. It not only fit a line, it replotted
that same data on a different graph. Let me make it a little
bit bigger. No, I don't want to do that. Let me make it a little
bit bigger. We can cover up the data now,
just because I think we know what's going on. So let me cover it up
right like that. So not only did it plot the
various data points, it actually fit a line to that
data and it gave me the equation of that line. Let me see if I can make this
a little bit bigger. I'll move it out of the way so
you can read it at least. So it tells me right here, that
the equation for this line is y is equal to 1,882.3x
plus 52,847. So if you remember what we
know about slope and y-intercept, the y-intercept
is 52,847, which is, if you use this line as your measure,
where this line intersects the vertical axis at year 0, or in 1995. So if you use this line as a
model, in 1995 the line would say that you're going
to make $52,847. The actual data was a little
bit off of that. It was a little bit
higher, $53,807. So it was a little bit higher. But we're trying to get a line
that gets as close as possible to all of this data. It's actually trying to minimize
the distance, or more precisely the sum of the squares of the vertical distances, between
each of these points and the line. We won't go into
the math there. But it gave us this
nice equation. Now we can use this nice
equation to predict things. If we say that this is a good
model for the data-- let me bring this down a little
bit --let's try to answer our question. So we drew a scatter plot--
really Excel did it for us. We found the equation
right there. They say, what would you expect
the median annual income of a California family
to be in the year 2010? So here, we can just use the
equation Excel gave us. This right here was 2002. So I could write
down the year. This was the year of 2002. So the year 2010 is
8 more years. Let me make a little
column here. So this is the year,
1995, 1996. Then Excel will be able to
figure out if I select those, and I go to this little bottom
right square and I scroll down, Excel will actually figure
out that I want to increment by 1 year
every time. If I say years since 1995, once
again I can just continue this trend right here. So 2010 would be 15 years. So we can just apply
this equation. We could say it's going to be
equal to, according to this line-- I'm just going to type it
in, hopefully you can read what I'm saying --1,882.3
times x. x here is the year since 1995. I could just select this
cell, or I could type in the number 15. That means times this
cell, times 15. Then plus 52,847, plus
that right there. Click enter and it predicts
$81,081.50. So if you just continue this
line for another 8 or so years, it predicts that the
median income in California for a family will be about $81,000. Anyway, hopefully you found
that interesting. Spreadsheets are very useful
tools for manipulating data. It'll give you a sense of why
linear models are interesting, why lines are interesting, and
how you can actually use these tools to interpret data and
maybe even extrapolate some type of a prediction. This right here, is an
extrapolation using this linear regression.
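
For anyone following along without a spreadsheet, the fit and the 2010 prediction can be sketched in a few lines of Python. This is only an illustration, not what Excel does internally: it assumes the trendline is an ordinary least-squares fit, and the names `fit_line`, `years_since_1995`, and `median_income` are made up for this example. The data values are the eight incomes read out in the video.

```python
# Years since 1995 (1995 through 2002) and the median incomes from the table.
years_since_1995 = [0, 1, 2, 3, 4, 5, 6, 7]
median_income = [53807, 55217, 55209, 55415, 63100, 63206, 63761, 65766]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(years_since_1995, median_income)

# 2010 is 15 years after 1995, so plug x = 15 into the fitted line.
prediction_2010 = slope * 15 + intercept

print(f"y = {slope:.2f}x + {intercept:.2f}")
print(f"Predicted 2010 median income: {prediction_2010:.2f}")
```

The slope and intercept come out as 1,882.25 and 52,847.25, which match the rounded equation on the chart. Using the unrounded values, the 2010 prediction is $81,081.00, fifty cents below the $81,081.50 you get from the rounded equation.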