Video transcript

The normal distribution is arguably the most important concept in statistics. Everything we do, or almost everything we do, in inferential statistics, which is essentially making inferences based on data points, is to some degree based on the normal distribution. So what I want to do in this video and in this spreadsheet is to give you as deep an understanding of the normal distribution as possible, so that for the rest of your life, if someone says, "Oh, we're assuming a normal distribution," you can say, "Oh, I know that. This is the formula, and I understand how to use it," et cetera.

This spreadsheet, just so you know, is downloadable at www.khanacademy.org/downloads/. If you just type that part you'll see everything that's downloadable, but if you go to downloads/normalintro.xls you'll get this spreadsheet right here. I think I did this in the right standard.

If you go onto Wikipedia and do a search for "normal distribution" (let me actually get my pen tool going), this is what you'd see. I literally copied and pasted this right here from Wikipedia, and it looks daunting, with all these Greek letters there. But this sigma right here is just the standard deviation of the distribution. We'll play with that a little bit in this chart and see what that means. You know what the standard deviation is in general, but this is the standard deviation of this distribution, which is a probability density function.

I encourage you to rewatch the video on probability density functions, because it's a bit of a transition going from the binomial distribution, which is discrete. In the binomial distribution you say, "What is the probability of getting a 5?" and you just look at that histogram or bar chart and say, "Oh, that's the probability." But in a continuous probability distribution, or
a continuous probability density function, you can't just say, "What is the probability of me getting a 5?" You have to say, "What is the probability of me getting between, let's say, 4.5 and 5.5?" You have to give it some range. And then your probability isn't given by just reading this graph; the probability is given by the area under that curve. It would be given by this area.

For those of you who know calculus: if p(x) is our probability density function (it doesn't have to be a normal distribution, although it often is), the way you actually figure out the probability of landing between four and a half and five and a half (say, the odds of me getting between four and a half and five and a half inches of rain tomorrow) is the integral from 4.5 to 5.5 of this probability density function, dx. So that's just the area under the curve. For those of you who don't know calculus yet, I encourage you to watch that playlist, but all this is saying is the area under the curve from here to here.

It actually turns out that for the normal distribution this isn't an easy thing to evaluate analytically, so you do it numerically. You don't have to feel bad about doing it numerically, or wonder, "How do I take the integral of this?" There are actually functions for it. And you can even approximate it. One way is the way you approximate integrals in general: you could ask, roughly, what the area of this trapezoid is. You can figure out the area of that trapezoid by taking the average of that point and that point and multiplying it by the base. Or (let me change colors, because I think I'm overdoing it with the green) you could just take the height of this line right
here and multiply it by the base, and you'll get the area of this rectangle, which might be a pretty good approximation for the area under the curve, because you'll have a little bit extra over here but you're going to miss a little bit over there. That's actually what I do in the other video, just to approximate the area under the curve and give you a good sense that the normal distribution is essentially what the binomial distribution becomes if you have many, many, many trials.

What's interesting about the normal distribution (just so you know, I'm not sure I've mentioned it already; this right here is the graph) is the central limit theorem, which people talk about a lot. This is really one of the most important or interesting things about our universe. I won't prove it here, but you can kind of understand it by looking at the other video, where we talk about flipping coins. If we were to do many, many flips of a coin, those are trials independent of each other. If you give yourself one point every time you get a head and take the sum over all of your flips, then as you approach an infinite number of flips, the distribution of that sum approaches the normal distribution.

What's interesting about that is that each of those trials (in the case of flipping a coin, each trial is a flip of the coin) doesn't have to have a normal distribution. So we could be talking about molecular interactions: every time compound X interacts with compound Y, what results doesn't have to be normally distributed. But if you take the sum of a ton of those interactions, then all of a sudden the end result will be normally distributed. This is why this is such an important distribution; it shows up in nature all of the time. And if people are trying to...
well, if you do take data points from something that is very, very complex, something that is the sum of arguably many, almost infinitely many, individual independent trials, it's a pretty good assumption to assume the normal distribution. We'll do other videos where we talk about when it is a good assumption and when it isn't.

But anyway, just to digest this a little bit, let me actually rewrite it. This is what you'll see on Wikipedia, but this could be rewritten as 1 over sigma times the square root of 2 pi, times (exp is just e to that power, so it's just) e to this whole thing over here: minus, x minus the mean, squared, over 2 sigma squared. Sigma is the standard deviation, and the standard deviation squared is just the variance.

Just so you know how to use this (you might think, "There are so many Greek letters here, what do I do?"): this tells you the height of the normal distribution function. Let's say that this is the distribution of, I don't know, how far north people live from my house or something. Well, that's not a good one. Let's say it's people's heights above five nine; let's say that this was five nine and not zero. If you want to figure out the probability of finding someone who is roughly five inches taller than the average, right here, what you would do is put this number, this 5, in for x. You know the standard deviation because you've taken a bunch of samples, you know the variance, which is the standard deviation squared, and you know the mean, so you just put your x in there and it'll tell you the height of the function. And then you have to give it a range. You can't just say, "How many people are exactly 5 inches taller than average?" You would actually say, "How many people are between 4.9 inches and 5.1 inches taller than
the average?" to give it a little bit of range, because no one is exactly that height. It's almost infinitely impossible, down to the atom, to be exactly a given height; even the definition of an inch isn't defined that precisely. So that's how you use this function.

This is so heavily used, both because it shows up in nature and throughout inferential statistics, that I think it behooves you to become as familiar with this formula as possible. To make that happen, let me play around a little bit with this formula, just to give you an intuition of how everything works out.

Maybe to help you memorize it, this could be rewritten as follows. If we take the sigma into the square root sign, it becomes 1 over the square root of 2 pi sigma squared. I've never seen it written this way, but it gives me a little intuition: it's always written as sigma squared, but it's really just the variance, and the variance is what you calculate before you calculate the standard deviation. So that's interesting.

Then this top part right here could be written as e to the minus 1/2 times... and since both of these things here are squared, we can just say x minus the mean, over sigma, all squared. This clarifies what's going on a little bit better. x minus mu (mu is the mean) is the distance between whatever point we want to find, let's say we're here, and the mean, so that's this distance. And sigma is the standard deviation, which is this distance. So this in here tells me how many standard deviations I am away from the mean, and that's actually called the standard z-score, which I talked about in the other video. Then we square that, and then we take e to the minus 1/2 times that. Well, let me rewrite that. If I were to write e to the minus 1/2
times a, that's the same thing as e to the a, all to the minus 1/2 power. If you take something to an exponent and then take that to an exponent, you can just multiply the exponents. So likewise this could be rewritten as follows: this is equal to 1 over the square root of 2 pi sigma squared, where sigma squared is just the variance.

I'm just playing around with the formula because I really want you to see all the ways it can be written; maybe you'll get a little intuition, and I encourage you to email me if you see some insight into why this exists and all of that. But once again, I think it is cool that all of a sudden we have this other formula that has pi and e in it. So many phenomena are described by this, and once again pi and e show up together, just like e to the i pi equals negative 1, which tells you something about our universe.

Anyway, I could rewrite this part as e to the quantity x minus mu over sigma, squared, and all of that to the minus 1/2. Something to the minus 1/2 power is just 1 over the square root, which is already going on here. So we could rewrite the whole thing as 1 over the square root of 2 pi times the variance times e to, essentially, our z-score squared, if we say z is the thing in here, how many standard deviations we are from the mean. All of a sudden this becomes very clean: take 2 pi times the variance times e to the number of standard deviations we are away from the mean, squared; take the square root of that whole thing; and invert it. That's the normal distribution.

Anyway, I wanted to do that just because I thought it was neat, and it's interesting to play around with it in that way. If you see it in any of these other forms in the rest of your life, you won't say, "What's that? I thought the normal distribution was this, or was this." Now you know. That said, let's play around a little bit with this normal distribution.
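To make the three rewritings above concrete, here is a minimal Python sketch. It is not part of the video or the spreadsheet; the function names are made up for illustration, and it only assumes the standard `math` module. It evaluates the density in the Wikipedia form, the z-score form, and the "one over the square root of everything" form so you can check that they give the same height.

```python
import math


def pdf_wikipedia(x, mu, sigma):
    """Height of the normal curve: 1/(sigma*sqrt(2*pi)) * exp(-(x-mu)^2 / (2*sigma^2))."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def pdf_zscore(x, mu, sigma):
    """Same height, written with the z-score z = (x - mu) / sigma."""
    z = (x - mu) / sigma  # how many standard deviations x is from the mean
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi * sigma ** 2)


def pdf_inverse_root(x, mu, sigma):
    """Same height again: 1 / sqrt(2*pi*variance * e^(z^2)).

    Note: math.exp(z*z) overflows for very large |z|, so this form is only
    for illustration on moderate inputs.
    """
    z = (x - mu) / sigma
    return 1.0 / math.sqrt(2 * math.pi * sigma ** 2 * math.exp(z * z))


if __name__ == "__main__":
    # Mean and standard deviation chosen to match values used later in the video.
    for x in (-20, -5, 0, 5):
        print(x, pdf_wikipedia(x, -5, 10), pdf_zscore(x, -5, 10), pdf_inverse_root(x, -5, 10))
```

Each row printed should show three matching numbers (up to floating-point rounding), which is the whole point of the algebra: the forms differ only in how the exponent and the square root are grouped.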
So in this spreadsheet I've plotted the normal distribution, and you can change the assumptions that are in this kind of green-blue color. Right now it's plotting it with a mean of 0 and a standard deviation of 4, and I write the variance here just for your information; the variance is just the standard deviation squared.

So what happens when you change the mean? If the mean goes from 0 to, let's say, 5, notice this graph just shifted to the right by 5. It was centered here; now it's centered over here. If we make it minus 5, what happens? The whole bell curve just shifts 5 to the left from the center.

Now what happens when you change the standard deviation? The variance is the average squared distance from the mean, and the standard deviation is the square root of that, so it's kind of, not exactly but kind of, the average distance from the mean. So the smaller the standard deviation, the closer a lot of the points are going to be to the mean, and we should get a narrower graph. Let's see if that happens. When the standard deviation is 2, we see that in the graph: you're more likely to be really close to the mean than further away. And if you make the standard deviation, I don't know, 10, all of a sudden you get a really flat graph, and this thing keeps going on forever.

That's a key difference. The binomial distribution is always finite; you can only have a finite number of values, while the normal distribution is defined over the entire real number line. So if you have a mean of minus 5 and a standard deviation of 10, the probability of getting a thousand here is very, very low, but there is some probability. It's like how there's some probability that all of the atoms in my body arrange themselves perfectly so that I fall through the seat I'm sitting on. It's very unlikely, and it probably won't happen in the
life of the universe, but it can happen. And that could be described by a normal distribution, because it says anything can happen, although it could be very, very, very improbable.

The thing I talked about at the beginning of the video is that with a normal distribution you can't just look at a point on the graph (let me get the pen tool back). You have to figure out the area under the curve between two points. So let's say this was our distribution and I asked, "What is the probability that I get 0?" I don't know what phenomenon this is describing, but suppose I ask about exactly zero. The probability is zero (I shouldn't use "zero" too much), because there's no area under the curve at just that one value; it's just a line. You have to give a range. And actually I can type it in here. I can say the probability between, let's say, minus 0.005 and plus 0.05 is... well, it rounded, so it says it's close to zero. Let me do it between minus 1 and 1. It calculated it at 7 percent, and I'll show you how I calculated this in a second.

Let me get the screen-drawing tool going. So what did I just do? This is between minus 1 and 1, and I'll show you behind the scenes what Excel is doing. We're going from minus 1, which is roughly right here, to 1, and we're calculating the area under the curve. Or, for those of you who know calculus, we're calculating the integral from minus 1 to 1 of this function, where the standard deviation right here is 10 and the mean is minus 5. Actually, let me put that in. For this example, the way it's drawn right here, the normal distribution function is 1 over (let's see, our standard deviation is 10) 10 times the square root of 2 pi, times e to the minus 1/2 times x minus our mean, and our mean is negative right now, our mean is minus 5, so
it's x plus 5, squared, over the standard deviation squared, which is the variance, so that's 100, dx. This is what this number is right here. This 7 percent, or actually 0.07, is the area right under there.

Now, unfortunately for us in the world, this isn't an easy integral to evaluate analytically, even for those of us who know our calculus, so this tends to be done numerically. To make that easier, a function has been defined called the cumulative distribution function, which is a useful tool for figuring out this area.

What the cumulative distribution function is, essentially (let me call it the CDF): it's a function of x, and it gives us the area under this curve. Let's say that this is x right here. It tells you the area under the curve up to x. Or, another way to think about it, it tells you the probability that you land at some value less than your x value. So it's the integral from minus infinity to x of our probability density function, dx.

When you actually use the Excel normal distribution function, NORMDIST, you have to give it your x value, the mean, and the standard deviation, and then you say whether you want the cumulative distribution, in which case you say TRUE, or just this normal distribution, in which case you say FALSE. So if you wanted to graph this right here, you would say FALSE, in caps. If you wanted to graph the cumulative distribution function, which I do down here (let me move this down a little bit, and let me get out of the pen tool), then you say TRUE when you make that Excel call. So this is a normal distribution, and here's the cumulative distribution for the same parameters. Just so you get the intuition: if you want to know the probability that
I'm going to get a value less than 20 (so I can get any value less than 20, given this distribution), then on the cumulative distribution right here (let me make it so you can see it), you just go to 20, right to that point there, and you say, "Wow, the probability of getting 20 or less is pretty high. It's approaching 100 percent." That makes sense, because most of the area under this curve is less than 20. Or if you asked, "What's the probability of getting less than minus 5?", well, minus 5 was the mean, so half of your results should be above that and half should be below. And if you go to this point right here, you can see that this is 50 percent. So the probability of getting less than minus 5 is exactly 50 percent.

So if I want to know the probability of getting between negative 1 and 1, what I do is (let me get back to my pen tool) figure out the probability of getting minus 1 or lower, so I figure out this whole area. Then I figure out the probability of getting 1 or lower, which is this whole area (let me do it in a different color; 1 or lower is everything there). Then I subtract the yellow area from the magenta area, and I'll just get whatever's left over here.

That's exactly what I did in the spreadsheet. Let me scroll down; this might be taxing my computer, taking the screen capture with it. What I did is I evaluated the cumulative distribution function at 1, which should be right there, and I evaluated the cumulative distribution function at minus 1, which is right there, and I subtract this number from this number. That tells me essentially the probability that I'm between those two numbers, or, another way to think about it, the area right here between minus 1 and 1. I really encourage you to play with this and explore the Excel formulas and everything.

Now, one thing that shows up a lot is, you
know, what's the probability that you land within a standard deviation of the mean? Just so you know, in this graph the central line right here is the mean, and these two lines I drew right here are one standard deviation below and one standard deviation above the mean. Some people ask, "What's the probability that I land within one standard deviation of the mean?" Well, that's easy to do. I'll just click on this and ask for the probability that I land between... let's see, the mean is minus 5, so one standard deviation below the mean is minus 15, and one standard deviation above the mean is 10 plus minus 5, which is 5. So this is between minus 15 and 5: 68.3 percent. That's actually always the case: you have a 68.3 percent probability of landing within one standard deviation of the mean, assuming you have a normal distribution.

Once again, that number represents the area under the curve here. And the way you get it is with the cumulative distribution function. Let me go down here (every time I move this I have to get rid of the pen tool). You evaluate it at plus 5, which is right here; this is one standard deviation above the mean, and that's a number right around there. It looks like, I don't know, 80-something percent, maybe 90 percent roughly. And then you evaluate it at one standard deviation below the mean, which is minus 15, and this one looks like, I don't know, roughly 15 percent or so; 15, 16, maybe 17 percent, let's say 18 percent. But the big picture is that when you subtract this value from this value, you get the probability that you land between those two. That's because each value tells you the probability that you're less than that point. So when you go to the cumulative distribution function, you get that right there (I keep scrolling back and forth), so
when you go to 5 and you go right over here, this essentially tells you this area under the curve, the probability that you're less than or equal to 5, everything up to there. And then when you evaluate it at minus 15 down here, it tells you the probability that you're down back here. So when you subtract this from the larger value, you're just left with what's under the curve right there.

Just to understand the spreadsheet a little bit better, because I really want you to play with it and see what happens: the mean was minus 5; now let me make it 5. It just shifted to the right (whoops, I'll use the pen tool), so it's now centered at 5. And if I were to try to make the standard deviation smaller, we'll see that the whole thing just gets a little bit tighter. Let's make it 6, and all of a sudden this looks like a little bit tighter curve. We make it 2, and it becomes even tighter.

Just so you know how I calculated everything (and I really want you to play with this, play with the formula, get an intuitive feeling for the cumulative distribution function, and think a lot about how it relates to the binomial distribution, which I cover in the last video): to plot this, I wanted to plot the points between minus 20 and 20, and I just decided to increment by 1. So it's not a continuous curve; it's actually just plotting a point at each value and connecting them with lines. Then I took the distance between each of those points and the mean. So this just tells you that the point minus 20 is 25 less than the mean; that's all I did there. Then I divided that by the standard deviation, and this is the z-score, the standard z-score. So this tells me how many
standard deviations minus 20 is away from the mean: it's twelve and a half standard deviations below the mean. Then I just plugged that into essentially this formula to figure out the height of the function. So at minus 20 the height is very low. At minus 2 the height's a little bit better; it's going to be someplace right there. So that gives me that value.

But then to actually figure out the probability, what I do is calculate the cumulative distribution function. This is the probability that you're less than that value, so the area under the curve below it, which is very, very small. It's not zero; I know it looks like zero here, but that's only because I rounded it. It's going to be a really, really small number. There's some probability that we even get minus a thousand.

Another intuitive thing that you really should have a sense for is that the integral over this, the entire area under the curve, has to be 1, because that takes into account all possible circumstances. And that should happen if we put a suitably small number here and a suitably large number here. There you go, we get 100 percent. Although this isn't exactly 100 percent; we would have to go from minus infinity to plus infinity to really get 100 percent. It's just rounding to 100 percent; it's probably 99.999999 percent or something like that. To actually calculate this, what I do is take the cumulative distribution function at this point and subtract from it the cumulative distribution function at that point, and that's where I got this 100 percent from.

Anyway, hopefully that'll give you a good feel for the normal distribution. I really encourage you to play with the spreadsheet, and even to make a spreadsheet like this
yourself. In future exercises we'll actually use this type of spreadsheet as an input into other models. So if we're doing a financial model and we say our revenue has a normal distribution around some expected value, what is the distribution of our net income? Or we could think of a hundred other types of examples. Anyway, see you in the next video.
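If you would rather experiment in code than in a spreadsheet, the following is a rough Python sketch of the calculations described in the video. It is an illustration under stated assumptions, not the spreadsheet's actual formulas: the function names are hypothetical, and `normal_cdf` uses the standard identity CDF(x) = 0.5 * (1 + erf((x - mean) / (sd * sqrt(2)))) to play the role that Excel's NORMDIST with TRUE plays in the video. The single-rectangle and single-trapezoid functions mirror the rough approximations mentioned early on.

```python
import math


def normal_pdf(x, mean, sd):
    """Height of the normal curve at x (the role of NORMDIST(..., FALSE) in the video)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def normal_cdf(x, mean, sd):
    """P(X <= x): area under the curve from minus infinity up to x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2))))


def prob_between(lo, hi, mean, sd):
    """P(lo <= X <= hi): the CDF at the upper end minus the CDF at the lower end."""
    return normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)


def rectangle_area(lo, hi, mean, sd):
    """One-rectangle approximation: height at the midpoint times the base."""
    return normal_pdf((lo + hi) / 2, mean, sd) * (hi - lo)


def trapezoid_area(lo, hi, mean, sd):
    """One-trapezoid approximation: average of the two end heights times the base."""
    return (normal_pdf(lo, mean, sd) + normal_pdf(hi, mean, sd)) / 2 * (hi - lo)


if __name__ == "__main__":
    mean, sd = -5, 10  # the values used for most of the video

    print("P(-1 < X < 1):   ", prob_between(-1, 1, mean, sd))
    print("  one rectangle: ", rectangle_area(-1, 1, mean, sd))
    print("  one trapezoid: ", trapezoid_area(-1, 1, mean, sd))
    print("P(X < mean):     ", normal_cdf(mean, mean, sd))
    print("within 1 sd:     ", prob_between(mean - sd, mean + sd, mean, sd))
    print("almost all area: ", prob_between(-1000, 1000, mean, sd))
```

Run as a script, the first probability should come out near 0.07 and the one-standard-deviation probability near 0.683, in line with the 7 percent and 68.3 percent read off the spreadsheet. The two crude approximations land close to the first value, which is why a single rectangle or trapezoid is a reasonable sanity check over a narrow range.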