# Introduction to the normal distribution

Exploring the normal distribution. Created by Sal Khan.
Video transcript
The normal distribution is arguably the most important concept in statistics. Everything we do, or almost everything we do in inferential statistics, which is essentially making inferences based on data points, is to some degree based on the normal distribution. And so what I want to do in this video and in this spreadsheet is to essentially give you as deep an understanding of the normal distribution as possible. And for the rest of your life, if someone says, oh, we're assuming a normal distribution, you'll say, oh, I know what that is. This is the formula, and I understand how to use it, et cetera, et cetera. So this spreadsheet, just so you know, is downloadable at www.khanacademy.org/downloads/-- and if you just type that part in, you'll see everything that's downloadable-- but then download/normalintro.xls. And then you'll get this spreadsheet right here. And I think I did this in the right standard. But anyway, if you go onto Wikipedia, and if you were to type in normal distribution, or you were to do a search for normal distribution-- let me actually get my Pen tool going-- this is what you would see. I literally copied and pasted this right here from Wikipedia, and I know it looks daunting. You have all these Greek letters there. But the sigma right here, that is just the standard deviation of the distribution. We'll play with that a little bit in this chart, and see what that means. I mean, you know what the standard deviation is in general, but this is the standard deviation of this distribution, which is a probability density function. And I encourage you to re-watch the video on probability density functions, because it's a little bit of a transition going from the binomial distribution, which is discrete, right? In the binomial distribution, you say, oh, what is the probability of getting a 5? And you just look at that histogram or that bar chart, and you say, oh, that's the probability.
But in a continuous probability distribution, or a continuous probability density function, you can't just say, what is the probability of me getting a 5? You have to say, what is the probability of me getting between, let's say, a 4.5 and a 5.5? You have to give it some range. And then your probability isn't given by just reading this graph. The probability is given by the area under that curve, right? It'd be given by this area. And for those of you all who know calculus, if p of x is our probability density function-- it doesn't have to be a normal distribution, although it often is a normal distribution-- here's the way you actually figure out the probability of, let's say, between 4 and 1/2 and 5 and 1/2. What is the probability this is, whatever-- the odds of me getting between 4 and 1/2 and 5 and 1/2 inches of rain tomorrow? It'll actually be the integral from 4 and 1/2 to 5 and 1/2 of this probability density function, dx, right? So that's just the area under the curve. For those of you who don't know calculus yet, I encourage you to watch that playlist. But all this is saying is, the area under the curve from here to here. And it actually turns out, for the normal distribution, this isn't an easy thing to evaluate analytically. And so you do it numerically. You don't have to feel bad about doing it numerically, because you're like, oh, how do I take the integral of this? There are actually functions for it, and you can even approximate it. I mean, one way you could approximate it is the way you approximate integrals in general, where you could say, well, what is the area of this? Well, it's roughly the area of this trapezoid. So you could figure out the area of that trapezoid, taking the average of that point and that point, and multiplying it by the base. Let me change colors, just because I think I'm overdoing it with the green.
Or you could just take the height of this line right here, and multiply it by the base, and you'll get the area of this rectangle, which might be a pretty good approximation for the area under the curve, right, because you'll have a little bit extra over here, but you're going to miss a little bit over there. So it might be a pretty good approximation. And that's actually what I do in the other video, just to approximate the area under the curve, and give you a good sense that the normal distribution is what the binomial distribution becomes, essentially, if you have many, many, many, many trials. And what's interesting about the normal distribution, just so you know-- I don't know if I mentioned this already-- this right here, this is the graph. And then this is just another word. People might talk about the central limit theorem. But this is really one of the most important or interesting things about our universe-- the central limit theorem. And I won't prove it here, but it essentially tells us-- and you could understand it by looking at the other video, where we talk about flipping coins. And if we were to do many, many, many flips of coins, right, those are trials that are independent of each other. And if you take the sum of all of your flips-- if you were to give yourself one point every time you got a head-- and if you were to take the sum of them, as you approach an infinite number of flips, you approach the normal distribution. And what's interesting about that is, each of those trials, in the case of flipping a coin-- each trial is a flip of the coin-- each of those trials doesn't have to have a normal distribution. So we could be talking about molecular interactions, and every time compound x interacts with compound y, what might result doesn't have to be normally distributed. But what happens is, if you take a sum of a ton of those interactions, then, all of a sudden, the end result will be normally distributed.
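To make the coin-flipping claim concrete, here is a minimal sketch, not the calculation from the other video. It assumes a fair coin and compares the exact binomial probability of getting exactly half heads with the height of a normal curve that has the same mean and standard deviation. The function names (`binomial_pmf`, `normal_pdf`, `center_ratio`) are made up for this illustration.

```python
import math

def binomial_pmf(k, n, p=0.5):
    """Exact probability of k heads in n independent flips (p = chance of a head)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(x, mean, sd):
    """Height of the normal density at x."""
    z = (x - mean) / sd  # number of standard deviations from the mean
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def center_ratio(n, p=0.5):
    """Binomial probability at n/2 heads divided by the matching normal height.

    Assumes n is even so that n // 2 is exactly the mean for a fair coin.
    """
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return binomial_pmf(n // 2, n, p) / normal_pdf(n / 2, mean, sd)

# The ratio moves toward 1 as the number of flips grows.
for flips in (10, 100, 1000):
    print(flips, center_ratio(flips))
```

The ratio getting closer to 1 is only a spot check at the center of the distribution, so it illustrates the convergence rather than proving it.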
And this is why this is such an important distribution. It shows up in nature all of the time. If you do take data points from something that is very, very complex, and it is the sum of, arguably, many, many, almost infinite, individual, independent trials, it's a pretty good assumption to assume the normal distribution. We'll do other videos where we talk about when it is a good assumption, and when it isn't a good assumption. But anyway, just to digest this a little bit-- and let me actually rewrite it. This is what you'll see on Wikipedia, but this could be rewritten as 1 over sigma times the square root of 2 pi, times-- exp is just e to that power. So it's just e to this whole thing over here, minus x minus the mean squared over 2 sigma squared. This is the standard deviation. Standard deviation squared is just the variance, right? And just so you know how to use this-- you're like, oh, wow, there's so many Greek letters here. What do I do? This tells you the height of the normal distribution function. Let's say that this is the distribution of, I don't know, of people's, I don't know, how far north they live from my house, or something. I don't know. Well, no. That's not a good one. Let's say it's people's heights above 5' 9". Let's say that this was 5' 9" and not 0, right? What this tells you is, if you were to say, what percentage of people, or I guess, if you wanted to figure out, what is the probability of finding someone who is roughly 5 inches taller than the average right here, what you would do is, you would put in this number here, this 5, into x. And then you know the standard deviation, because you've taken a bunch of samples. You know the variance, which is the standard deviation squared. You know the mean. And you just put your x in there, and it'll tell you the height of the function. And then you have to give it a range. You can't just say, how many people are exactly 5 inches taller than average?
You would have to say, how many people are between 4.9 inches and 5.1 inches taller than the average? You have to give it a little bit of range, because no one is exactly-- or, it's almost infinitely impossible, to the atom, to be exactly 5' 9". Even the definition of an inch isn't defined that particularly. So that's how you use this function. I think this is so heavily used in-- one, it shows up in nature. But in all of inferential statistics, I think it behooves you to become as familiar with this formula as possible. And I guess to make that happen, let me play around a little bit with this formula, just to give you an intuition of how everything works out, et cetera, et cetera. So if I were to take this-- and I'd like to just maybe help you memorize it-- this could be rewritten as, if we take the sigma into the square root sign, if we take the standard deviation in there, it becomes 1 over the square root of 2 pi sigma squared. I've never seen it written this way, but it gives me a little intuition that sigma squared-- it's always written as sigma squared, but it's really just the variance. And the variance is what you calculate before you calculate the standard deviation. So that's interesting. And then this top right here, this could be written as e to the minus 1/2 times-- both of these things here are squared, so we could just say, x minus the mean over sigma, all of that squared. And this kind of clarifies a little bit better what's going on here, because what's this? x minus mu is the distance between whatever point we want to find and the mean. Let's say we're here, at x. Mu is the mean, so that's here. So x minus mu is this distance. And then this is a standard deviation, which is this distance. So this in here tells me how many standard deviations I am away from the mean. And that's actually called the standard z-score. I talk about it in the other video. And then we square that. And then we take e to the minus 1/2 times that. Well, let me rewrite that.
If I were to write, e to the minus 1/2 times a, that's the same thing as e to the a, all of that to the minus 1/2 power, right? If you take something to an exponent, and then take that to an exponent, you can just multiply these exponents. So likewise, this could be rewritten as, this is equal to 1 over the square root of 2 pi sigma squared, where sigma squared is just the variance. And I'm just playing around with the formula, because I really want to see all the ways that-- maybe you'll get a little intuition. And I encourage you to email me if you see some insight on why this exists, and all of that. But once again, I think it is cool that all of a sudden, we have this other formula that has pi and e in it, and so many phenomena are described by this. And once again, pi and e show up together, right? Just like e to the i pi is equal to negative 1. Tells you something about our universe. But anyway, I could rewrite this as e to the x minus mu over sigma, squared, and all of that to the minus 1/2. Something to the minus 1/2 power-- that's just 1 over the square root, which is already going on here. So we could just rewrite this over here as 1 over the square root of 2 pi times the variance, times e to, essentially, our z-score squared, right? If we say z is this thing in here-- z is how many standard deviations we are from the mean-- z-score squared. And all of a sudden, this becomes very clean-- we just say 2 pi times our variance, times e to the number of standard deviations we are away from the mean, squared. You take the square root of that whole thing, and invert it, and that's the normal distribution. So anyway, I wanted to do that, just because I thought it was neat, and it's interesting to play around with it. And that way, if you see it in any of these other forms in the rest of your life, you won't say, what's that? I thought the normal distribution was this, or it was this. And now you know. But with that said, let's play around a little bit with this normal distribution.
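Written out, the rearrangements described so far look like the following. This is a sketch of the same algebra in symbols, assuming the usual notation: mu for the mean, sigma for the standard deviation, and z for the z-score. The name p(x) for the density is borrowed from earlier in the video.

```latex
% z is the number of standard deviations that x lies from the mean
z = \frac{x - \mu}{\sigma}

% Three equivalent ways of writing the normal density
p(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}
     = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2} z^2}
     = \frac{1}{\sqrt{2\pi\sigma^2 \, e^{z^2}}}
```

The last step uses the fact that e to the minus 1/2 times a is the same as 1 over the square root of e to the a, which lets everything sit under a single square root.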
So in this spreadsheet, I've plotted the normal distribution, and you can see the assumptions that are in this kind of green-blue color. So right now it's plotting it with a mean of 0 and a standard deviation of 4. And I just write the variance here, just for your information. The variance is just the standard deviation squared. And so what happens when you change the mean? So if the mean goes from 0 to-- let's say it goes to 5. Notice, this graph just shifted to the right by 5, right? It was centered here. Now it's centered over here. If we make it minus 5, what happens? The whole bell curve just shifts 5 to the left from the center. Now, what happens when you change the standard deviation, right? The standard deviation is a measure of-- the variance is the average squared distance from the mean. The standard deviation is the square root of that. So it's kind of-- not exactly, but kind of-- the average distance from the mean. So the smaller the standard deviation, the closer a lot of the points are going to be to the mean. So we should get a narrower graph, and let's see if that happens. So when the standard deviation is 2, we see that. With this graph, you're more likely to be really close to the mean than further away. And if you make the standard deviation-- I don't know, if you make it 10-- all of a sudden, you get a really flat graph. And this thing keeps going on forever. And that's a key difference. The binomial distribution is always finite. You can only have a finite number of values, while the normal distribution is defined over the entire real number line. So the probability, if you have a mean of minus 5, and a standard deviation of 10, the probability of getting 1,000 here is very, very low, but there is some probability. There is some probability that I fall, that all of the atoms in my body just arrange perfectly, that I fall through the seat I'm sitting on. It's very unlikely, and it probably won't happen in the life of the universe, but it can happen.
And that could be described by a normal distribution, because it says, anything can happen, although it could be very, very, very improbable. So the thing I talked about at the beginning of the video is, when you figure out a normal distribution, you can't just look at this point on the graph. Let me get the Pen tool back. You have to figure out the area under the curve between two points, right? Let's say this was our distribution, and I said, what is the probability that I get 0? I don't know what phenomenon this is describing, but that 0 happened. If I say exactly 0, the probability is 0, because-- I shouldn't use 0 too much-- because the area under the curve just at 0-- there's no area. It's just a line. You have to give a range. So you have to say the probability between, let's say, minus-- and actually, I can type it in here on our-- I can say, the probability between, let's say, minus 0.005 and plus 0.05 is-- well, it rounded, so it says there, close to 0. Let me do it between minus 1 and 1, all right? It calculated it at 7%, and I'll show you how I calculated this in a second. So let me get the Screen Draw tool. So what did I just do? This between minus 1 and 1-- and I'll show you behind the scenes what Excel is doing-- we're going from minus 1, which is roughly right here, to 1. And we're calculating the area under the curve, all right? We're calculating this area. Or, for those of you who know calculus, we're calculating the integral from minus 1 to 1 of this function, where the standard deviation, right here, is 10, and the mean is minus 5. And actually, let me put that in. So we're calculating for this example, the way it's drawn right here, the normal distribution function. Let's see. It's 1 over-- our standard deviation is 10-- so 1 over 10 times the square root of 2 pi, times e to the minus 1/2 times x minus our mean, squared. Our mean is negative right now, right? Our mean is minus 5. So it's x plus 5, squared, over the standard deviation squared, which is the variance.
So that's 100, and then dx. This is what this number is right here. This 7%, or actually 0.07, is the area right under there. Now unfortunately, for us in the world, this isn't an easy integral to evaluate analytically, even for those of us who know our calculus. So this tends to be done numerically. And kind of an easy way to do this-- well, not an easy way-- but a function has been defined, called the cumulative distribution function, that is a useful tool for figuring out this area. So what the cumulative distribution function is, is essentially-- let me call it the cumulative distribution function-- it's a function of x. It gives us the area under this curve. So let's say that this is x right here. That's our x. It tells you the area under the curve up to x. Or another way to think about it-- it tells you, what is the probability that you land at some value less than your x value? So it's the integral from minus infinity to x of our probability density function, dx. When you actually use the Excel normal distribution function, you say, NORMDIST. You have to give it your x value. You give it the mean. You give it the standard deviation. And then you say whether you want the cumulative distribution, in which case, you say true, or you want just this normal distribution, in which case you say false. So if you wanted to graph this right here, you would say FALSE, in caps. If you wanted to graph the cumulative distribution function, which I do down here-- let me move this down a little bit. Let me get out of the Pen tool. So the cumulative distribution function is right over here. Then you say TRUE when you make that Excel call. So this is a cumulative distribution function for this same-- this is a normal distribution. Here's a cumulative distribution. And just so you get the intuition, say you want to know, what is the probability that I get a value less than 20, right? So I can get any value less than 20, given this distribution.
The cumulative distribution right here-- let me make it so you can see the-- if you go to 20, you just go right to that point there. And you say, wow, the probability of getting 20 or less-- it's pretty high. It's approaching 100%. That makes sense, because most of the area under this curve is less than 20. Or if you said, what's the probability of getting less than minus 5? Well, minus 5 was the mean, so half of your results should be above that, and half should be below. And if you go to this point right here, you can see that this right here is 50%. So the probability of getting less than minus 5 is exactly 50%. If I wanted to know the probability of getting between negative 1 and 1, what I do is-- let me get back to my Pen tool-- what I do is, I figure out, what is the probability of getting minus 1 or lower, right? So I figure out this whole area. And then I figure out the probability of getting 1 or lower, which is this whole area-- well, let me do it in a different color-- 1 or lower is everything there. And I subtract the yellow area from the magenta area. And I'll just get what's ever left over here, right? And that's exactly what I did in the spreadsheet. Let me scroll down. This might be taxing my computer by taking the screen capture with it. So what I did is I evaluated the cumulative distribution function at 1, which would be right there. And I evaluated the cumulative distribution function at minus 1, which is right there. And the difference between these two-- I subtract this number from this number, and that tells me, essentially, the probability that I'm between those two numbers. Or another way to think about it-- the area right here. And I really encourage you to play with this, and explore the Excel formulas and everything. This area right here, between minus 1 and 1. 
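The subtraction of two cumulative values can be sketched outside of Excel as well. What follows is an illustration under stated assumptions, not what the spreadsheet does internally: it computes the normal cumulative distribution through the error function in Python's standard library, and cross-checks the result with the trapezoid approximation mentioned earlier. The function names are made up for this sketch; the mean of minus 5 and the standard deviation of 10 are the values used in the video.

```python
import math

def normal_pdf(x, mean, sd):
    """Height of the normal density at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def normal_cdf(x, mean, sd):
    """Probability of landing at or below x (the area from minus infinity to x)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def prob_between(lo, hi, mean, sd):
    """Area under the curve between lo and hi: CDF at hi minus CDF at lo."""
    return normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)

def trapezoid_area(lo, hi, mean, sd, steps=1000):
    """Numerical cross-check: sum the areas of thin trapezoids under the density."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        left = lo + i * width
        right = left + width
        total += 0.5 * (normal_pdf(left, mean, sd) + normal_pdf(right, mean, sd)) * width
    return total

exact = prob_between(-1, 1, mean=-5, sd=10)
approx = trapezoid_area(-1, 1, mean=-5, sd=10)
print(f"CDF difference: {exact:.4f}")   # about 0.07, the 7% from the spreadsheet
print(f"Trapezoid sum:  {approx:.4f}")
```

Evaluating `normal_cdf(-5, -5, 10)` gives 0.5, which matches the point made above that half of the results fall below the mean.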
Now, one thing that shows up a lot is, what's the probability that you land within a standard deviation of-- and just so you know this graph, the central line right here-- this is the mean. And then these two lines I drew right here-- these are one standard deviation below, and one standard deviation above the mean. And some people ask, what's the probability that I land within one standard deviation of the mean? Well, that's easy to do. What I can do is, I'll just click on this. What's the probability that I land between-- let's see. The mean is minus 5. One standard deviation below the mean is minus 15. And one standard deviation above the mean is 10 plus minus 5, which is 5. So that's between minus 15 and 5. So 68.3%, and that's actually always the case that you have a 68.3% probability of landing within one standard deviation of the mean, assuming you have a normal distribution. So once again, that number represents the area under the curve here, this area under the curve. And the way you get it is with the cumulative distribution function. Let me go down here. Every time I move this, I have to get rid of the Pen tool. You evaluate it at plus 5, which is right here, right? This was one standard deviation above the mean, which-- it's a number right around there. Looks like it's like, I don't know, 80-something percent, maybe 90%, roughly. And then you evaluate it at one standard deviation below the mean, which is minus 15. And this one looks like, I don't know, roughly 15% or so? 15%, 16%, maybe 17%? Let's say 18%. But the big picture is, when you subtract this value from this value, you get the probability that you land between those two. And that's because this value tells you the probability that you're less than that point. So when you go to the cumulative distribution function, you get that right there. That tells you the probability that you are-- let me get-- it keeps crawling back and forth.
So when you go to 5, and you just go right over here, this essentially tells you this area under the curve-- the probability that you're less than or equal to 5. Everything up there. And then when you evaluate it at minus 15 down here, it tells you the probability that you're down back here. So when you subtract this from the larger thing, you're just left with what's under the curve right there. And just to understand this spreadsheet a little bit better, just because I really want you to play with it, and move the-- see what happens if I make this distribution. The mean was minus 5. Now let me make it 5. It just shifted to the right. It just moved over to the right by 5, right? Whoops. I'll use the Pen tool. If I were to try to make the standard deviation smaller, we'll see that the whole thing just gets a little bit tighter. Let's make it 6, and all of a sudden, this looks a little bit tighter curve. We make it two, it becomes even tighter. And just so you know how I calculated everything-- and I really want you to play with this, and play with the formula. And get an intuitive feeling for this, the cumulative distribution function. And think a lot about how it relates to the binomial distribution. And I cover that in the last video. To plot this, I just took each of these points. I went to plot the points between minus 20 and 20, and I just incremented by 1, right? I just decided to increment by 1. It's not a continuous curve. It's actually just plotting a point at each point, and connecting it with a line. Then I did the distance between each of those points and the mean, right? Let's say that this 0 minus 5-- this is this distance. So this just tells you, the point minus 20 is 25 less than the mean, right? That's all I did there. Then I divided that by the standard deviation. And this is the standard z-score, right? So this tells me how many standard deviations is minus 20 away from the mean. It's 12 and 1/2 standard deviations below the mean. 
And then I use that, and I just plugged it into, essentially, this formula, to figure out the height of the function. So let's say, at minus 20, the height is very low. Well, let's say, at minus 2, the height's a little bit better. The height's going to be someplace right there. And so that gives me that value. But then to actually figure out the probability of that-- what I do is, I calculate the cumulative distribution function between-- well, this is the probability that you're less than that, so the area under the curve below that, which is very, very small. It's not 0. I know it looks like 0 here, but that's only because I round it. It's going to be 0.0001. It's going to be a really, really small number. There's some probability that we even get minus 1,000. And another intuitive thing that you really should have a sense for is, the integral over this, or the entire area of the curve, has to be 1, because that takes into account all possible circumstances. And that should happen if we put a suitably small number here, and a suitably large number here. There you go. We get 100%, although this isn't 100%. We would have to go from minus infinity to plus infinity to really get 100%. It's just rounding to 100%. It's probably 99.999999%, or something like that. And so to actually calculate this, what I do is, I take the cumulative distribution function of this point, and I subtract from that the cumulative distribution function of that point. And that's where I got this 100% from. Anyway, hopefully that'll give you a good feel for the normal distribution. And I really encourage you to play with the spreadsheet, and to even make a spreadsheet like this yourself. And in a future exercise, we'll actually use this type of a spreadsheet as an input into other models. So if we're doing a financial model, and if we say our revenue has a normal distribution around some expected value, what is the distribution of our net income? 
Or we could think of 100 other different types of examples. Anyway, see you in the next video.
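If you would rather rebuild the spreadsheet's columns in code than in Excel, here is a minimal sketch. It assumes the layout described in the video: points from minus 20 to 20 in steps of 1, the distance of each point from the mean, the z-score, the height of the curve, and the cumulative probability. The function names and the choice of a mean of 5 with a standard deviation of 2 (the last values used above) are assumptions for illustration.

```python
import math

def normal_pdf(x, mean, sd):
    """Height of the normal density at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def normal_cdf(x, mean, sd):
    """Probability of landing at or below x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def build_table(mean, sd, lo=-20, hi=20, step=1):
    """One row per plotted point: (x, distance from mean, z-score, height, cumulative)."""
    rows = []
    for x in range(lo, hi + 1, step):
        distance = x - mean   # how far the point is from the mean
        z = distance / sd     # the same distance in standard deviations
        rows.append((x, distance, z, normal_pdf(x, mean, sd), normal_cdf(x, mean, sd)))
    return rows

table = build_table(mean=5, sd=2)
for x, distance, z, height, cumulative in table[::5]:
    print(f"x={x:4d}  dist={distance:4d}  z={z:6.2f}  height={height:.6f}  cdf={cumulative:.6f}")

# Probability of landing within one standard deviation of the mean.
within_one_sd = normal_cdf(5 + 2, 5, 2) - normal_cdf(5 - 2, 5, 2)
print(f"within one standard deviation: {within_one_sd:.3f}")
```

Changing `mean` and `sd` in the call to `build_table` plays the same role as editing the green-blue cells in the spreadsheet. Note that very small cumulative values far out in the tail can round to 0 in floating point, just as they round to 0 in the spreadsheet's display, even though the true probability is not exactly 0.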