In the last video we talked about different ways to represent the central tendency, or the average, of a data set. What we're going to do in this video is expand on that a little bit to understand how spread apart the data is as well. So let's think about this a little bit. Let's say I have negative 10, 0, 10, 20, and 30. That's one data set right there. And let's say the other data set is 8, 9, 10, 11, and 12.

Now let's calculate the arithmetic mean for both of these data sets. When you go further on in statistics, you're going to understand the difference between a population and a sample. We're assuming that this is the entire population of our data, so we're going to be dealing with the population mean and, as you'll see, the population measures of dispersion. I know these are all fancy words. In the future you're not going to have all of the data; you're going to have some samples of it, and you're going to try to estimate things for the entire population. So I don't want you to worry too much about that just now, but if you are going to go further in statistics, I just want to make that clarification.

Now, the population mean, or the arithmetic mean, of this data set right here is negative 10 plus 0 plus 10 plus 20 plus 30, over 5, because we have five data points. And what is this equal to? That negative 10 cancels out with that 10, and 20 plus 30 is 50. Divided by 5, it's equal to 10.

Now what's the mean of this data set? 8 plus 9 plus 10 plus 11 plus 12, all of that over 5. The way we could think about it: 8 plus 12 is 20, 9 plus 11 is another 20, so that's 40, and then we have another 10 there, so that's 50. So this once again is going to be 50 over 5, which is 10. So they have the exact same population means. Or, if you don't want to worry about the words population or sample and all of that, both of these data sets have the
exact same arithmetic mean. When you take the sum of these numbers and divide by 5, you get 10; take the sum of these numbers and divide by 5, and you get 10 as well.

But clearly these sets of numbers are different. If you just looked at this number, you'd say, oh, maybe these sets are very similar to each other. But when you look at these two data sets, one thing might pop out at you. All of these numbers are very close to 10. I mean, the farthest number here is 2 away from 10; 12 is only 2 away from 10. Here, these numbers are further away from 10. Even the closer ones are still 10 away, and then these guys are 20 away from 10. So this data set right here is more dispersed. These guys are further away from our mean than these guys are from this mean.

So let's think about different ways we can measure dispersion, or how far away we are from the center, on average. Now, one way, and this is kind of the simplest way, is the range. You won't see it used too often, but it's a very simple way of understanding how far the spread is between the largest and the smallest number. You literally take the largest number, which is 30 in our example, and from that you subtract the smallest number. So 30 minus negative 10 is equal to 40, which tells us that the difference between the largest and the smallest number is 40. So we have a range of 40 for this data set. Here, the range is the largest number, 12, minus the smallest number, which is 8, which is equal to 4.

So here the range is actually a pretty good measure of dispersion. We say, OK, both of these guys have a mean of 10, but when I look at the range, this guy has a much larger range, so that tells me this is a more dispersed set. But the range is not always going to tell you the whole picture. You might have two data sets with the exact same range where, based on how things are bunched up, they could still have very different distributions of where the
numbers lie.

Now, the one that you'll see used most often is called the variance. Actually, we're going to see the standard deviation in this video, and that's probably what's used most often, but it has a very close relationship to the variance. We're dealing with the population variance; once again, we're assuming that this is all of the data for our whole population, that we're not just sampling, taking a subset of the data. The symbol for the variance is literally this Greek letter sigma, squared. That is the symbol for variance. And we'll see that the letter sigma by itself is the symbol for standard deviation, and that is for a reason.

But anyway, the definition of variance is: you literally take each of these data points, find the difference between those data points and your mean, square them, and then take the average of those squares. I know that sounds very complicated, but when I actually calculate it, you're going to see it's not too bad.

So remember, the mean here is 10. I take the first data point, negative 10. From that I'm going to subtract our mean, and I'm going to square that. So I just found the difference from that first data point to the mean and squared it, and that's essentially to make it positive. Plus the second data point, 0, minus 10, minus the mean, squared. Plus 10 minus 10, squared; that's the middle 10 right there. Plus 20 minus 10, squared. Plus 30 minus 10, squared. So these are the squared differences between each number and the mean. I'm finding the difference between every data point and the mean, squaring them, summing them up, and then dividing by the number of data points. I'm taking the average of these squared distances. So when you say it kind of
verbally, it sounds very complicated, but you're just taking each number, finding its difference from the mean, squaring it, and taking the average of those. I have one, two, three, four, five squared differences, so I divide by 5.

So what is this going to be equal to? Negative 10 minus 10 is negative 20, and negative 20 squared is 400. 0 minus 10 is negative 10, squared is 100, so plus 100. 10 minus 10, squared, is just 0 squared, which is 0. Plus 20 minus 10 is 10, squared is 100. Plus 30 minus 10, which is 20, squared is 400. All of that over 5. And what do we have here? 400 plus 100 is 500, plus another 500 is 1,000. It's equal to 1,000 over 5, which is equal to 200. So in this situation, our variance is going to be 200. That's our measure of dispersion there.

And let's compare it to the variance of this less dispersed data set. Let me scroll over a little bit so we have some real estate. So let me calculate the variance of this data set. We already know its mean. So the variance of this data set is going to be equal to 8 minus 10, squared, plus 9 minus 10, squared, plus 10 minus 10, squared, plus 11 minus 10, squared, plus 12 minus 10, squared. Remember, that 10 is just the mean that we calculated; you have to calculate the mean first. All of that divided by 5, since we have one, two, three, four, five squared differences.

So this is going to be equal to: 8 minus 10 is negative 2, squared is positive 4. 9 minus 10 is negative 1, squared is positive 1. 10 minus 10 is 0, squared you still get 0. 11 minus 10 is 1, squared you get 1. 12 minus 10 is 2, squared you get 4. All of that over 5. Now what is this equal to? 4 plus 1 plus 0 plus 1 plus 4 is 10, so this is 10 over 5, which is equal to 2. So the variance of this less dispersed data set is a lot
smaller. The variance of this data set right here is only 2. So that gives you a sense; that tells you, look, this is definitely a less dispersed data set than that one there.

Now, the problem with the variance is that you're taking these numbers, taking the difference between them and the mean, and then squaring it. It kind of gives you a bit of an arbitrary number. And if you're dealing with units, let's say these are distances, so this is negative 10 meters, 0 meters, 10 meters, this is 8 meters, and so on and so forth, then when you square it, you get your variance in terms of meters squared. It's kind of an odd set of units.

So what people like to do is talk in terms of standard deviation, which is just the square root of the variance, or the square root of sigma squared. And the symbol for the standard deviation is just sigma. So now that we've figured out the variance, it's very easy to figure out the standard deviation of both of these characters. The standard deviation of this first data set up here is going to be the square root of 200. The square root of 200 is the square root of 2 times 100, which is equal to 10 times the square root of 2. That's the first data set. Now, the standard deviation of the second data set is going to be the square root of its variance, which is 2, so it's just the square root of 2. So the second data set has 1/10 the standard deviation of the first data set. This is 10 root 2; this is just root 2. So this has 10 times the standard deviation.

And this hopefully will make a little bit more sense if we think about it. Let's remember how we calculated variance. We just took each data
point, found how far it is away from the mean, squared that, and took the average of those. Then we took the square root, really just to make the units look nice. But the end result is that the first data set has 10 times the standard deviation of the second data set.

So let's look at the two data sets. This one has 10 times the standard deviation, which makes sense intuitively, right? I mean, they both have a 10 in here, but 9 is only 1 away from the 10, while 0 is 10 away from the 10. 8 is only 2 away; this guy is 20 away. So it's 10 times, on average, further away. So the standard deviation, at least in my sense, is giving a much better sense of how far away, on average, we are from the mean. Anyway, hopefully you found that useful.
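
If you'd like to check these numbers yourself, here is a minimal sketch in Python of the same steps: the mean, the range, the population variance, and the population standard deviation for the two data sets from the video. The function names (mean, data_range, population_variance, population_std) are just illustrative choices, not anything from the video, and the sketch assumes, as the video does, that each list is the entire population, so it divides by the number of data points.

```python
import math


def mean(data):
    """Arithmetic mean: the sum of the data points divided by how many there are."""
    return sum(data) / len(data)


def data_range(data):
    """Range: the largest number minus the smallest number."""
    return max(data) - min(data)


def population_variance(data):
    """Average of the squared differences between each data point and the mean."""
    mu = mean(data)
    return sum((x - mu) ** 2 for x in data) / len(data)


def population_std(data):
    """Population standard deviation: the square root of the population variance."""
    return math.sqrt(population_variance(data))


# The two data sets used in the video (assumed to be whole populations).
first = [-10, 0, 10, 20, 30]
second = [8, 9, 10, 11, 12]

for name, data in (("first", first), ("second", second)):
    print(
        f"{name}: mean={mean(data):.1f}, range={data_range(data)}, "
        f"variance={population_variance(data):.1f}, "
        f"std={population_std(data):.2f}"
    )
```

Both sets come out with a mean of 10, but the variances are 200 and 2, and the ratio of the two standard deviations is 10, which matches the hand calculation. For a sample rather than a whole population, the usual estimate divides by one less than the number of data points; that is outside what this sketch does.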