# Introduction to the normal distribution

Video transcript

The normal distribution is
arguably the most important concept in statistics. Everything we do,
or almost everything we do in inferential
statistics, which is essentially making inferences
based on data points, is to some degree based on
the normal distribution. And so what I want to do in this
video and in this spreadsheet is to essentially give you
as deep an understanding of the normal
distribution as possible. And for the rest of your life,
if someone says, oh, we're assuming a normal
distribution, you can say, oh, I know what that is. This is the formula,
and I understand how to use it, et
cetera, et cetera. So this spreadsheet,
just so you know, is downloadable at
www.khanacademy.org/downloads/-- and if you just type that
part in, you'll see everything that's downloadable-- but
then download/normalintro.xls. And then you'll get this
spreadsheet right here. And I think I did this
in the right standard. But anyway, if you
go onto Wikipedia, and if you were to type
in normal distribution, or you were to do a search
for normal distribution-- let me actually get my
Pen tool going-- this is what you would see. I literally copied and pasted
this right here from Wikipedia, and I know it looks daunting. You have all these
Greek letters there. But the sigma right
here, that is just the standard deviation
of the distribution. We'll play with that a little
bit within this chart, and see what that means. I mean, you know what
the standard deviation is in general, but this is
the standard deviation of this distribution, which is
a probability density function. And I encourage you to re-watch
the video on probability density functions,
because it's a little bit of a transition going from the
binomial distribution, which is discrete, right? In the binomial
distribution, you say, oh, what is the probability
of getting a 5? And you just look at that
histogram or that bar chart, and you say, oh,
that's the probability. But in a continuous
probability distribution, or a continuous probability
density function, you can't just say, what is a
probability of me getting a 5? You have to say, what is a
probability of me getting between, let's say,
a 4.5 and a 5.5? You have to give it some range. And then your
probability isn't given by just reading this graph. The probability is given by the
area under that curve, right? It'd be given by this area. And for those of you all
who know calculus, if p of x is our probability
density function-- it doesn't have to be a normal
distribution, although it often is a normal
distribution-- the way you actually figure out the
probability of, let's say, between 4 and 1/2 and 5 and 1/2. What is the probability
this is, whatever-- the odds of me getting between
4 and 1/2 and 5 and 1/2 inches of rain tomorrow? It'll actually be the integral
from 4 and 1/2 to 5 and 1/2 of this probability
density function, p of x,
dx, right? So that's just the
area under the curve. For those of you who
don't know calculus yet, I encourage you to
watch that playlist. But all this is saying is
the area under the curve from here to here. And it actually turns out,
for the normal distribution, this isn't an easy thing
to evaluate analytically. And so you do it numerically. You don't have to feel bad
about doing it numerically, because you're like, oh, how
do I take the integral of this? There's actually
functions for it, and you can even approximate it. I mean, one way you
could approximate it is you could do it the way
you approximate integrals in general, where you could say,
well, what is the area of this? Well, it's roughly the
area of this trapezoid. So you could figure
out the area of that trapezoid, taking the average
of that point and that point, and multiplying it by the base. Let me change colors,
just because I think I'm overdoing
it with the green. Or you could just take the
height of this line right here, and multiply
it by the base, and you'll get the area
of this rectangle, which might be a pretty good
approximation for the area under the curve,
right, because you'll have a little bit
extra over here, but you're going to miss
a little bit over there. So it might be a pretty
good approximation. And that's actually what
I do in the other video, just to approximate the
area under the curve, and give you a good sense that
the normal distribution is what the binomial distribution
becomes, essentially, if you have many, many,
many, many trials. And what's interesting about
the normal distribution, just so you know-- I don't know
if I mentioned this already-- this right here,
this is the graph. And then this is
just another word. People might talk about
the central limit theorem. But this is really one of the
most important or interesting things about our universe--
central limit theorem. And I won't prove it here,
but it essentially tells us-- and you could
understand it by looking at the other video, where we
talk about flipping coins. And if we were to do many,
many, many flips of coins, right, those are independent
trials of each other. And if you take the sum
of all of your flips-- if you were to give yourself
one point if you got a head every time-- and if you were
to take the sum of them, as you approach an
infinite number of flips, you approach the
normal distribution. And what's interesting
about that is, each of those
trials, in the case of flipping a coin-- each trial
is a flip of the coin-- each of those trials doesn't have to
have a normal distribution. So we could be talking about
molecular interactions, and every time compound x
interacts with compound y, what might result doesn't have
to be normally distributed. But what happens is,
if you take a sum of a ton of those interactions,
then, all of a sudden, the end result will be
normally distributed. And this is why this is such
an important distribution. It shows up in nature
all of the time. If you do take data points from
something that is very, very complex, and it is the sum of,
arguably, many, many, almost infinite, individual,
independent trials, it's a pretty good assumption to
assume the normal distribution. We'll do other
videos where we talk about when it is
a good assumption, and when it isn't
a good assumption. But anyway, just to
digest this a little bit-- and let me actually rewrite it. This is what you'll
see on Wikipedia, but this could be rewritten
as 1 over sigma times the square root of 2 pi, times--
exp is just e to that power. So it's just e to this
whole thing over here, minus the quantity x minus the mean
squared, over 2 sigma squared. This is the standard deviation. Standard deviation squared
is just the variance, right? And just so you know how to use
this-- you're like, oh, wow, there's so many
Greek letters here. What do I do? This tells you the height of the
normal distribution function. Let's say that this is the
distribution of, I don't know, of people's, I don't
know, how far north they live from my house,
or something. I don't know. Well, no. That's not a good one. Let's say it's people's
heights above 5' 9". Let's say that this was
5' 9" and not 0, right? What this tells you
is, if you were to say, what percentage of
people, or I guess, if you wanted to
figure out, what is the probability of
finding someone who is roughly 5 inches taller
than the average right here, what you
would do is, you would put in this number
here, this 5, into x. And then you know the
standard deviation, because you've taken
a bunch of samples. You know the variance, which is
a standard deviation squared. You know the mean. And you just put
your x in there, and it'll tell you the
height of the function. And then you have
to give it a range. You can't just say, how many
people are exactly 5 inches taller than average? You would have to say, how many
people are between 5.1 inches and 4.9 inches taller
than the average? You have to give it a little
bit of range, because no one is exactly-- or, it's almost
infinitely impossible, to the atom, to
be exactly 5' 9". Even the definition
of an inch isn't defined that precisely. So that's how you
use this function. I think this is so
heavily used in-- one, it shows up in nature. But in all of
inferential statistics, I think it behooves you
to become as familiar with this formula as possible. And I guess to make
that happen, let me play around a little
bit with this formula, just to give you an intuition
of how everything works out, et cetera, et cetera. So if I were to
take this-- and I'd like to just maybe
help you memorize it-- this could be rewritten
as, if we take the sigma into the
square root sign, if we take the standard
deviation in there, it becomes 1 over the square
root of 2 pi sigma squared. I've never seen it
written this way, but it gives me a
little intuition that sigma squared-- it's
always written as sigma squared, but it's really
just the variance. And the variance is
what you calculate before you calculate
the standard deviation. So that's interesting. And then this top
right here, this could be written as e
to the minus 1/2 times-- both of these things here are
squared, so we could just say, x minus the mean
over sigma, all of that squared. And this clarifies
what's going on here a little bit better,
because what's this? x minus mu is the distance
between whatever point we're at and the mean. Let's say we're here at x. Mu is the
mean, so that's here. So that's this distance. And then this is a
standard deviation, which is this distance. So this in here
tells me how many standard deviations I
am away from the mean. And that's actually called
the standard z-score. I talk about it in
the other video. And then we square that. And then we take this
to the minus 1/2. Well, let me rewrite that. If I were to write,
e to the minus 1/2 times a, that's the
same thing as e to the a to the minus 1/2 power, right? If you take something
to an exponent, and then take that
to an exponent, you can just multiply
these exponents. So likewise, this
could be rewritten as, this is equal to 1 over
the square root of 2 pi sigma squared, which is
just the variance. And I'm just playing
around with the formula, because I really want to see
all the ways that-- maybe you'll get a little intuition. And I encourage you
to email me if you see some insight on why this
exists, and all of that. But once again, I think it
is cool that all of a sudden, we have this other formula
that has pi and e in it, and so many phenomena
are described by this. And once again, pi and e
show up together, right? Just like e to the i pi
is equal to negative 1. Tells you something
about our universe. But anyway, I could
rewrite this as e to the x minus mu over sigma
squared, and all of that to the minus 1/2. Something in the minus
1/2 power-- that's just 1 over the square root, which
is already going on here. So we could just rewrite
this over here as 1 over the square root of 2
pi times the variance, times e to, essentially, our
z-score squared, right? If we say z is this
thing in here-- z is how many standard deviations
we are from the mean-- z-score squared. And all of a sudden, this
becomes a very clean-- we just say 2 pi times
our variance, times e to the number of
standard deviations we are away from the mean. You square that. You take the square root of
that thing, and invert it, and that's the
normal distribution. So anyway, I wanted to do
that, just because I thought it was neat, and it's
interesting to play around with it. And that way, if
you see it in any of these other forms in
the rest of your life, you won't say, what's that? I thought the normal
distribution was this, or it was this. And now you know. But with that said,
let's play around a little bit with this
normal distribution. So in this spreadsheet, I've
plotted a normal distribution, and you can see the
assumptions that are in this kind of
green-blue color. So right now it's plotting
it with a mean of 0 and a standard deviation of 4. And I just write the
variance here, just for your information. The variance is just a
standard deviation squared. And so what happens when
you change the mean? So if the mean goes from 0
to-- let's say it goes to 5. Notice, this graph just shifted
to the right by 5, right? It was centered here. Now it's centered over here. If we make it minus
5, what happens? The whole bell curve just shifts
5 to the left from the center. Now, what happens when you
change the standard deviation, right? The standard
deviation is a measure of-- the variance is the average
squared distance from the mean. The standard deviation is
the square root of that. So it's kind of--
not exactly, but kind of-- the average
distance from the mean. So the smaller the
standard deviation, the closer a lot of the points
are going to be to the mean. So we should get
a narrower graph, and let's see if that happens. So when the standard
deviation is 2, we see that. The graph is narrower: you're more likely
to be really close to the mean than further away. And if you make the standard
deviation-- I don't know, if you make it 10--
all of a sudden, you get a really flat graph. And this thing keeps
going on forever. And that's a key difference. The binomial distribution
is always finite. You can only have a
finite number of values, while the normal
distribution is defined over the entire
real number line. So the probability, if you
have a mean of minus 5, and a standard deviation
of 10, the probability of getting 1,000 here
is very, very low, but there is some probability. There is some
probability that I fall, that all of the atoms in my
body just arrange perfectly, that I fall through the
seat I'm sitting on. It's very unlikely,
and it probably won't happen in the life of the
universe, but it can happen. And that could be described
by a normal distribution, because it says,
anything can happen, although it could be very,
very, very improbable. So the thing I talked about
at the beginning of the video is, when you figure out
a normal distribution, you can't just look at
this point on the graph. Let me get the Pen tool back. You have to figure out the area
under the curve between two points, right? Let's say this was our
distribution, and I said, what is the probability
that I get 0? I don't know what phenomenon
this is describing, but that 0 happened. If I say exactly
0, the probability is 0, because--
I shouldn't use 0 too much-- because the area
under the curve just under 0-- there's no area. It's just a line. You have to say between a range. So you have to say the
probability between, let's say, minus-- and actually, I can
type it in here on our-- I can say, the probability between,
let's say, minus 0.005 and plus 0.05 is-- well, it
rounded, so it says there, close to 0. Let me do it-- between minus
1 and 1, all right? It calculated it at 7%,
and I'll show you how I calculated
this in a second. So let me get the
Screen Draw tool. So what did I just do? This between minus
1 and 1-- and I'll show you the behind
the scenes what Excel is doing-- we're
going from minus 1, which is roughly
right here, to 1. And we're calculating the area
under the curve, all right? We're calculating this area. Or, for those of you
who know calculus, we're calculating the
integral from minus 1 to 1 of this function, where the
standard deviation is right here, is 10, and
the mean is minus 5. And actually, let
me put that in. So we're calculating
for this example, the way it's drawn right
here, the integral of the normal distribution function. Let's see. It's 1 over our standard
deviation, which is 10, times the square root of 2 pi,
times e to the minus 1/2 times x minus our mean, squared. Our mean is negative
right now, right? Our mean is minus 5. So it's x plus 5, squared, over the
standard deviation squared, which is the variance. So that's 100. And then dx. This is what this
number is right here. This 7%, or actually 0.07, is
the area right under there. Now unfortunately,
for us in the world, this isn't an easy integral
to evaluate analytically, even for those of us
who know our calculus. So this tends to be
done numerically. And kind of an easy way to do
this-- well, not an easy way-- but a function has
been defined, called the cumulative
distribution function, that is a useful tool for
figuring out this area. So what the cumulative
distribution function is, is essentially-- let me call
it the cumulative distribution function-- it's a function of x. It gives us the area
under this curve. So let's say that
this is x right here. That's our x. It tells you the area
under the curve up to x. Or another way to think
about it-- it tells you, what is the probability
that you land at some value less than your x value? So it's the area
from minus infinity to x of our probability
density function, dx. When you actually use the Excel
normal distribution function, you say, NORMDIST. You have to give
it your x value. You give it the mean. You give it the
standard deviation. And then you say
whether you want the cumulative distribution,
in which case, you say true, or you want just this normal
distribution, which you say, false. So if you wanted to
graph this right here, you would say FALSE, in caps. If you wanted to graph the
cumulative distribution function, which I
do down here-- let me move this down a little bit. Let me get out of the Pen tool. So the cumulative distribution
function is right over here. Then you say true when
you make that Excel call. So this is a cumulative
distribution function for this same-- this is
a normal distribution. Here's a cumulative
distribution. And just so you
get the intuition, is, if you want to know, what
is the probability that I get a value less than 20, right? So I can get any value less than
20, given this distribution. The cumulative
distribution right here-- let me make it so you can
see the-- if you go to 20, you just go right
to that point there. And you say, wow,
the probability of getting 20 or less--
it's pretty high. It's approaching 100%. That makes sense, because most
of the area under this curve is less than 20. Or if you said,
what's the probability of getting less than minus 5? Well, minus 5 was the mean,
so half of your results should be above that,
and half should be below. And if you go to this
point right here, you can see that this
right here is 50%. So the probability of
getting less than minus 5 is exactly 50%. If I wanted to know the
probability of getting between negative
1 and 1, what I do is-- let me get back
to my Pen tool-- what I do is, I figure out,
what is the probability of getting minus
1 or lower, right? So I figure out this whole area. And then I figure out the
probability of getting 1 or lower, which is
this whole area-- well, let me do it in a different
color-- 1 or lower is everything there. And I subtract the yellow
area from the magenta area. And I'll just get
whatever's left over here, right? And that's exactly what
I did in the spreadsheet. Let me scroll down. This might be taxing my
computer by taking the screen capture with it. So what I did is I evaluated
the cumulative distribution function at 1, which
would be right there. And I evaluated the cumulative
distribution function at minus 1, which
is right there. And the difference
between these two-- I subtract this number
from this number, and that tells me,
essentially, the probability that I'm between
those two numbers. Or another way to think about
it-- the area right here. And I really encourage
you to play with this, and explore the Excel
formulas and everything. This area right here,
between minus 1 and 1. Now, one thing that shows up a
lot is, what's the probability that you land within a standard
deviation of-- and just so you know this
graph, the central line right here-- this is the mean. And then these two lines
I drew right here-- these are one standard
deviation below, and one standard deviation
above the mean. And some people think,
what's the probability that I land within one
standard deviation of the mean? Well, that's easy to do. What I can do is, I'll
just click on this. What's the probability that
I land between-- let's see. The mean is minus 5. One standard deviation
below the mean is minus 15. And one standard deviation above
the mean is 10 plus minus 5, which is 5. So that's between minus 15 and 5. So 68.3%, and that's
actually always the case that you have a 68.3%
probability of landing within one standard
deviation of the mean, assuming you have a
normal distribution. So once again, that
number represents the area under the curve here,
this area under the curve. And the way you get it is with
the cumulative distribution function. Let me go down here. Every time I move this, I have
to get rid of the Pen tool. You evaluate it at plus 5,
which is right here, right? This was one standard deviation
above the mean, which-- it's a number
right around there. Looks like it's
like, I don't know, 80-something percent,
maybe 90%, roughly. And then you evaluate it
at one standard deviation below the mean,
which is minus 15. And this one looks like, I
don't know, roughly 15% or so? 15%, 16%, maybe 17%? Let's say 18%. But the big picture is,
when you subtract this value from this value, you
get the probability that you land between those two. And that's because this
value tells you the probability that you're less than it. So when you go to the cumulative
distribution function, you get that right there. That tells you the
probability that you are-- let me get-- it keeps
crawling back and forth. So when you go to 5, and
you just go right over here, this essentially
tells you this area under the curve--
the probability that you're less
than or equal to 5. Everything up there. And then when you evaluate
it at minus 15 down here, it tells you the probability
that you're down back here. So when you subtract this
from the larger thing, you're just left with what's
under the curve right there. And just to understand this
spreadsheet a little bit better, just because I really
want you to play with it, and move the-- see what happens
if I make this distribution. The mean was minus 5. Now let me make it 5. It just shifted to the right. It just moved over to
the right by 5, right? Whoops. I'll use the Pen tool. If I were to try to make the
standard deviation smaller, we'll see that the whole thing
just gets a little bit tighter. Let's make it 6,
and all of a sudden, this looks a little
bit tighter curve. We make it two, it
becomes even tighter. And just so you know how
I calculated everything-- and I really want you
to play with this, and play with the formula. And get an intuitive
feeling for this, the cumulative
distribution function. And think a lot
about how it relates to the binomial distribution. And I cover that
in the last video. To plot this, I just took
each of these points. I went to plot the points
between minus 20 and 20, and I just incremented
by 1, right? I just decided to
increment by 1. It's not a continuous curve. It's actually just plotting
a point at each point, and connecting it with a line. Then I did the
distance between each of those points and
the mean, right? Let's say that this 0 minus
5-- this is this distance. So this just tells
you, the point minus 20 is 25 less than the mean, right? That's all I did there. Then I divided that by
the standard deviation. And this is the
standard z-score, right? So this tells me how many
standard deviations is minus 20 away from the mean. It's 12 and 1/2 standard
deviations below the mean. And then I use that, and I just
plugged it into, essentially, this formula, to figure out
the height of the function. So let's say, at minus 20,
the height is very low. Well, let's say, at minus 2, the
height's a little bit better. The height's going to be
someplace right there. And so that gives me that value. But then to actually figure out
the probability of that-- what I do is, I calculate the
cumulative distribution function between-- well,
this is the probability that you're less than that,
so the area under the curve below that, which
is very, very small. It's not 0. I know it looks like 0 here, but
that's only because I round it. It's going to be 0.0001. It's going to be a really,
really small number. There's some probability
that we even get minus 1,000. And another intuitive
thing that you really should have a sense for
is, the integral over this, or the entire area
under the curve, has to be 1, because that
takes into account all possible circumstances. And that should happen if we put
a suitably small number here, and a suitably
large number here. There you go. We get 100%, although
this isn't 100%. We would have to go from minus
infinity to plus infinity to really get 100%. It's just rounding to 100%. It's probably 99.999999%,
or something like that. And so to actually
calculate this, what I do is, I take the
cumulative distribution function of this
point, and I subtract from that the cumulative
distribution function of that point. And that's where I
got this 100% from. Anyway, hopefully
that'll give you a good feel for the
normal distribution. And I really encourage you
to play with the spreadsheet, and to even make a spreadsheet
like this yourself. And in a future
exercise, we'll actually use this type of a spreadsheet
as an input into other models. So if we're doing
a financial model, and if we say our revenue
has a normal distribution around some expected value,
what is the distribution of our net income? Or we could think of 100 other
different types of examples. Anyway, see you
in the next video.
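
The height formula walked through in the transcript, 1 over sigma times the square root of 2 pi, times e to the minus (x minus the mean) squared over 2 sigma squared, can be sketched in Python. This is a minimal sketch and not the spreadsheet's own formulas; the names `normal_pdf`, `normal_pdf_z`, `rectangle_area`, and `trapezoid_area` are made up for illustration. It also checks the rewritten z-score form against the usual form, and tries the rectangle and trapezoid approximations on a narrow range such as 4.9 to 5.1.

```python
import math


def normal_pdf(x, mean, sd):
    """Height of the normal curve at x (a density, not a probability)."""
    coeff = 1.0 / (sd * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mean) ** 2) / (2.0 * sd ** 2))


def normal_pdf_z(x, mean, sd):
    """Same height via the z-score form: 1 / sqrt(2*pi*variance * e**(z**2)).

    Illustrative only: e**(z**2) overflows a float once |z| is around 27.
    """
    z = (x - mean) / sd  # how many standard deviations x is from the mean
    return 1.0 / math.sqrt(2.0 * math.pi * sd ** 2 * math.exp(z ** 2))


def rectangle_area(a, b, mean, sd):
    """Approximate area over [a, b]: height at the midpoint times the base."""
    return normal_pdf((a + b) / 2.0, mean, sd) * (b - a)


def trapezoid_area(a, b, mean, sd):
    """Approximate area over [a, b]: average of the end heights times the base."""
    return (normal_pdf(a, mean, sd) + normal_pdf(b, mean, sd)) / 2.0 * (b - a)


if __name__ == "__main__":
    # Assumed settings: mean 0 and standard deviation 4, as in the first chart.
    print(normal_pdf(5, 0, 4), normal_pdf_z(5, 0, 4))
    print(rectangle_area(4.9, 5.1, 0, 4), trapezoid_area(4.9, 5.1, 0, 4))
```

The two height functions should agree to within floating-point error, and for a range this narrow the rectangle and the trapezoid give nearly the same area.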
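
The cumulative distribution function step, where the probability of landing between two values is the CDF at the upper value minus the CDF at the lower value, can be sketched the same way. The video does this with Excel's normal distribution function and the cumulative flag set to TRUE; the sketch below swaps in Python's `math.erf` instead, which is an assumption of this example and not what the spreadsheet does. `normal_cdf` and `prob_between` are hypothetical names.

```python
import math


def normal_cdf(x, mean, sd):
    """P(X <= x): area under the normal curve from minus infinity up to x."""
    z = (x - mean) / sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_between(low, high, mean, sd):
    """P(low <= X <= high), computed as CDF(high) - CDF(low)."""
    return normal_cdf(high, mean, sd) - normal_cdf(low, mean, sd)


if __name__ == "__main__":
    mean, sd = -5.0, 10.0  # the settings used in that part of the video

    # Between minus 1 and 1.
    print(round(prob_between(-1, 1, mean, sd), 4))
    # Below the mean: half of the area.
    print(round(normal_cdf(mean, mean, sd), 4))
    # Within one standard deviation of the mean (minus 15 to 5).
    print(round(prob_between(mean - sd, mean + sd, mean, sd), 4))
    # A very wide range gets close to, but is not exactly, the whole area.
    print(prob_between(-1000, 1000, mean, sd))
```

With a mean of minus 5 and a standard deviation of 10, the range from minus 1 to 1 comes out at roughly 0.07, in line with the 7% read off the spreadsheet, and one standard deviation either side of the mean gives roughly 68.3%.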
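
The claim that the sum of many coin flips approaches the normal distribution can also be checked numerically. This is a rough sketch, assuming a fair coin by default and Python 3.8 or later for `math.comb`; `binomial_pmf`, `normal_pdf`, and `max_gap` are illustrative names. It compares the exact binomial bars with a normal curve that has the same mean and standard deviation.

```python
import math


def binomial_pmf(k, n, p=0.5):
    """Exact probability of k heads in n independent flips."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def normal_pdf(x, mean, sd):
    """Height of the normal curve at x."""
    coeff = 1.0 / (sd * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mean) ** 2) / (2.0 * sd ** 2))


def max_gap(n, p=0.5):
    """Largest difference between a binomial bar and the matching normal height."""
    mean = n * p
    sd = math.sqrt(n * p * (1.0 - p))
    return max(
        abs(binomial_pmf(k, n, p) - normal_pdf(k, mean, sd))
        for k in range(n + 1)
    )


if __name__ == "__main__":
    # Plain floats limit this sketch to n of about 1000 or fewer flips.
    for n in (10, 100, 1000):
        print(n, max_gap(n))
```

The gap should shrink as the number of flips grows, which is the sense in which the binomial bars settle onto the bell curve.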