# Introduction to inference about slope in linear regression

## Video transcript

- [Instructor] In this
video, we're going to talk about regression lines,
but it's not gonna be the first time we're talking
about regression lines. And so if the idea of a
regression is foreign to you, I encourage you to watch the
introductory videos on it. Here we're gonna think about
how we can make inferences from a regression line. And so if the idea of statistical
inference or hypothesis testing is new to you, once again, watch those videos as well. But let's say we think
there's a positive association between shoe size and height. And so here, on the horizontal
axis, that is shoe size. Our sizes could go size
one, two, three, four, five, six, seven, eight, nine, 10, 11, 12, and it could keep going up from there. And then on this axis, our y-axis, this would be height, so one foot, two feet,
three feet, four feet, five feet, six feet, seven feet. And then, to see if there's an association,
you might take a sample. Let's say you take a random sample of 20 people from the population. And in future videos, we'll
talk about the conditions necessary for making
appropriate inferences. Let's say those 20 people
are these 20 data points. So there's a young child, and
maybe there's a grown adult, with bigger feet and who's taller. And then three, four, five,
six, seven, eight, nine, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, and so you have these 20 data points. And then what you're likely to do is input them into a computer. You could do it by hand, but we have computers now
to do that for us usually. And the computer could try
to fit a regression line. And there are many techniques for doing it, but one typical technique is
to try to minimize the sum of the squared vertical distances between
these points and that line. And this regression line
will have an equation, as any line would have. And we tend to show that as saying y hat, this hat tells us that
this is a regression line, is equal to the y-intercept, a, plus the slope, b, times our x variable. So this right over here would be a. Now, to be clear, if
you took another sample, you might get different results here. In fact, let's call this y hat
sub one for our first sample, a sub one, b sub one,
and this is a sub one. If you were to take
another sample of 20 folks, so let's do that. Maybe you get one, two, three, four, five, six, seven, eight, nine, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, and then you try to fit a line to that. That line might look something like this. It might have a slightly
different y-intercept and a slightly different slope. So we could call that,
for the second sample, y sub two or y hat sub
two is equal to a sub two plus b sub two times x. And so every time you take a sample, you are likely to get different results for these values, which
are essentially statistics. Remember, statistics are things that we can get from samples, and
we use them to estimate true population parameters. Well, what would be the
true population parameters we're trying to estimate? Well, imagine a world where you are able to find out
the true linear relationship, or maybe there is some
true linear relationship between shoe size and height. You could get it if
theoretically you could measure every human being on the planet. And depending what you
define as a population, it could be all living people or all people who will ever live. This isn't practical, but let's just say that
you actually could. And you would have billions
of data points here for the true population. And then if you were to fit
a regression line to that, you could view this as the true
population regression line. And so that would be y hat is equal to, and to make it clear that here the y-intercept and the slope would be the true
population parameters, instead of saying a, we say alpha, and instead of saying
b, we say beta times x. But it's very hard to come up with exactly what alpha and beta are, and so that's why we
estimate them with a's and b's based on a sample. Now, what's interesting,
with this in mind, is we can start to make
inferences based on our sample. So we know that, for example, b sub two is unlikely to be exactly beta. But how confident can we be that there is at least a
positive linear relationship or a nonzero linear relationship? Or can we create a confidence
interval around this statistic in order to have a good sense of where the true parameter might actually be? And the simple answer is yes. And to do so, we'll use
the same exact ideas that we did when we made
inferences based on proportions or based on means. The way that you can make
an inference, for example, for the true population
slope of your regression line, is to say, okay, I took a sample, I got this slope right over here, so I'll just call that b sub two, and then I could create a
confidence interval around that. And so that confidence interval is going to be based on
some critical value times, ideally, the standard deviation of the sampling distribution
of your sample statistic. In this case, it would be the
sample regression line slope. But because we don't know
exactly what this is, and we can't figure out precisely
what it is going to be from a sample, we are going to estimate it with what's known as the
standard error of the statistic. And we'll go into more depth
on this in future videos. And since we're estimating here, we're going to use a
critical t-value here, which we have studied before. And so based on your confidence
level you want to have, let's say it's 95%, based on the degrees of
freedom, which we'll see will come out of how
many data points we have, we can figure out the critical t-value. And from our sample,
we can figure out the slope and its standard error. And then we would have
constructed a confidence interval. We'll also see that you could
do hypothesis testing here. You could say, hey, let's
set up a null hypothesis, and the null hypothesis is going to be that there is no
linear relationship, or that the
slope of the population
regression line is equal to zero, and the alternative hypothesis is that the true
slope is either greater than zero, meaning a
positive linear relationship, or that it's just nonzero. And then what you could
do is, assuming the null hypothesis is true, you could see what's the
probability of getting a statistic that is at least this
extreme? And if that's below some threshold, you might reject the null hypothesis, which would suggest the alternative. So this and this are things
that we have done before where you're creating
a confidence interval around a statistic or you're
doing hypothesis testing, making assumptions about a true parameter. The only difference here
is that the parameters that we're trying to
estimate are going to be the parameters for a theoretical
population regression line, and we're going to do that
using sample statistics for a sample regression line.
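
The ideas in this video can be sketched in a few lines of Python. This is only an illustration, not something shown in the video. The population values (alpha of 3.0, beta of 0.25, noise of 0.3 feet) are made up, and the helper names `fit_line` and `draw_sample` are hypothetical. The sketch also assumes the standard result that the slope has n - 2 degrees of freedom, so 18 for a sample of 20, with a 95% critical t-value of about 2.101 taken from a t-table. It draws two samples of 20, fits a least-squares line to each, and builds a confidence interval and a test statistic for the slope.

```python
import math
import random

# Two-sided 95% critical t-value for 18 degrees of freedom (n - 2 with n = 20),
# read from a standard t-table. Assumption: sample size stays at 20.
T_CRIT_95_DF18 = 2.101


def fit_line(xs, ys):
    """Least-squares fit of y_hat = a + b * x.

    Returns (a, b, se_b), where se_b is the standard error of the slope.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                      # sample slope
    a = y_bar - b * x_bar              # sample y-intercept
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))       # residual standard deviation
    se_b = s / math.sqrt(sxx)          # standard error of the slope
    return a, b, se_b


def draw_sample(rng, n=20, alpha=3.0, beta=0.25, noise_sd=0.3):
    """Draw n (shoe size, height in feet) pairs from a made-up population
    where height = alpha + beta * shoe_size + random noise."""
    xs = [rng.uniform(1, 12) for _ in range(n)]
    ys = [alpha + beta * x + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


rng = random.Random(0)  # fixed seed so the two samples are reproducible
for i in (1, 2):
    xs, ys = draw_sample(rng)
    a, b, se_b = fit_line(xs, ys)
    # Confidence interval: statistic +/- critical t-value * standard error
    lo = b - T_CRIT_95_DF18 * se_b
    hi = b + T_CRIT_95_DF18 * se_b
    # Test statistic for the null hypothesis that the true slope is zero
    t_stat = b / se_b
    print(f"sample {i}: y_hat = {a:.3f} + {b:.3f} x, "
          f"95% CI for slope = ({lo:.3f}, {hi:.3f}), t = {t_stat:.2f}")
```

Each sample gives a slightly different a and b, which is the sampling variability described above. If the interval excludes zero, or equivalently the absolute value of t exceeds the critical value, that is evidence at the 5% level against a zero slope.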