# Sample standard deviation and bias

AP.STATS: UNC‑1 (EU), UNC‑1.J (LO), UNC‑1.J.3 (EK)

## Video transcript

Let's say that you're a watermelon farmer and you want to study how dense the seeds are in your watermelons. Perhaps you want to do this because, over time, you're trying to breed watermelons that have fewer seeds, and you want to see whether you're actually making progress. You don't want to cut open every watermelon in your watermelon farm, or patch, or whatever it might be called, because you want to sell most of them. You just want to sample a few watermelons, take samples of those watermelons to figure out how dense the seeds are, and hope that you can calculate statistics on those samples that are decent estimates of the parameters for the population.

So let's start doing that. Let's say that you take little cubic-inch chunks out of a random sample of your watermelons and count the number of seeds in them. You have eight samples like this: in one of them you found 4 seeds, and in the others you found 3, 5, 7, 2, 9, 11, and 7.

So this is a sample. Just to make sure we're visualizing it right: if this is the population of all of the cubic-inch chunks in my entire watermelon farm, I'm sampling a very small part of it. Maybe a million chunks of watermelon could have been produced from my farm, but I'm only sampling eight of them. So capital N would be 1 million, and lowercase n is equal to 8. Once again, you might want to have more samples, but this will make our math easy.

Now let's think about what statistics we can measure. The first one that we often do is a measure of central tendency, and that's the arithmetic mean. Here we're trying to estimate the population mean by coming up with the sample mean. So what is the sample mean going to be? All we have to do is add up these measurements and then divide by the number of measurements we have. Maybe I don't need my calculator. Let's see: 4 plus 3 is 7, 7 plus 5 is 12, 12 plus 7 is 19, 19 plus 2 is 21, plus 9 is 30, plus 11 is 41, plus 7 is 48. So I get 48 over 8 data points, and this worked out quite well: 48 divided by 8 is equal to 6. So our sample mean is 6. It's our estimate of what the population mean might be.

But we also want to estimate how much spread there is in our population: how much do our measurements vary from this mean? There we say, well, we can try to estimate the population variance by calculating the sample variance, and we're going to calculate the unbiased sample variance. Hopefully we're fairly convinced at this point why we divide by n minus 1. If we do that, what do we get? I'll do this in a different color. It's going to be (4 − 6)² + (3 − 6)² + (5 − 6)² + (7 − 6)² + (2 − 6)² + (9 − 6)² + (11 − 6)² + (7 − 6)², all of that divided not by 8. Remember, we want the unbiased sample variance, so we're going to divide by 8 minus 1, which is 7.

Let me give myself a little bit more real estate. The unbiased sample variance (I could even denote it in a way that makes it clear that we're dividing by lowercase n minus 1) is going to be equal to, let's see: 4 minus 6 is negative 2, and that squared is positive 4. 3 minus 6 is negative 3, and that squared is 9. 5 minus 6 is negative 1, and that squared is 1. 7 minus 6 is 1, and that squared is 1. 2 minus 6 is negative 4, and negative 4 squared is 16. 9 minus 6 is 3, and that squared is 9. 11 minus 6 is 5, and that squared is 25. And then finally, 7 minus 6 squared is another 1. And we're going to divide all of it by 7.

Now let's see if we can add this up in our heads: 4 plus 9 is 13, plus 1 is 14, 15, 31, 40, 65, 66. So this is going to be equal to 66 over 7. We could write that as 9 and 3/7, or if we want to write it as a decimal, 66 divided by 7 gives us, and I'll just round it, approximately 9.43.

That gave us our unbiased sample variance. How could we calculate a sample standard deviation? We want to somehow get at an estimate of what the population standard deviation might be. The logic, I guess, is reasonable: this is our unbiased sample variance, our best estimate of what the true population variance is. When we think about population parameters, to get the population standard deviation we just take the square root of the population variance. So if we want an estimate of the population standard deviation, why don't we just take the square root of the unbiased sample variance? That's what we'll do. We're going to define the sample standard deviation to be equal to the square root of the unbiased sample variance. So it's going to be the square root of this quantity, and we can take our calculator out. It's the square root of what I just typed in (I can do second answer, which uses the last entry), and I'll just round: it's approximately equal to 3.07.

Now I'm going to tell you something very counterintuitive, or at least initially counterintuitive, but hopefully you'll appreciate it over time. We've already talked about this in some depth, and people have even created simulations to show that this is an unbiased estimate of the population variance when we divide by n minus 1. So that's a good starting point if we're going to take the square root of something.

But it actually turns out that, because the square root function is nonlinear, this sample standard deviation is not an unbiased estimate of the true population standard deviation. This is how the sample standard deviation tends to be defined: the square root of our unbiased sample variance, that is, the square root of the sum from i equals 1 to n of the squared deviations from the sample mean, divided by n minus 1. I encourage people to make simulations of that if they're interested.

But then you might say: okay, we went through great pains to divide by n minus 1 here in order to get an unbiased estimate of the population variance. Why don't we go through similar pains and somehow figure out a formula for an unbiased estimate of the population standard deviation? The reason that's difficult is that to unbias the sample variance, we just have to divide by n minus 1 instead of n, and that works for any probability distribution for our population. It turns out that doing the same thing for the standard deviation is not that easy: it actually depends on how the population is distributed. So in statistics we just define the sample standard deviation, and the one that we typically use is based on the square root of the unbiased sample variance. When you take that square root, though, it gives you a biased result when you're trying to use it to estimate the population standard deviation. But it's the simplest, best tool we have.
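The watermelon arithmetic from the transcript can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative choices and do not come from the video.

```python
import math
import statistics

# Seed counts from the eight cubic-inch chunks in the transcript.
seeds = [4, 3, 5, 7, 2, 9, 11, 7]

n = len(seeds)                                      # lowercase n = 8
sample_mean = sum(seeds) / n                        # 48 / 8
squared_devs = [(x - sample_mean) ** 2 for x in seeds]
unbiased_variance = sum(squared_devs) / (n - 1)     # 66 / 7, dividing by n - 1
sample_std = math.sqrt(unbiased_variance)           # square root of the unbiased variance

print(f"sample mean         = {sample_mean}")
print(f"sum of squared devs = {sum(squared_devs)}")
print(f"unbiased variance   = {unbiased_variance:.2f}")
print(f"sample std dev      = {sample_std:.2f}")

# Cross-check: statistics.variance and statistics.stdev also divide by n - 1.
assert math.isclose(unbiased_variance, statistics.variance(seeds))
assert math.isclose(sample_std, statistics.stdev(seeds))
```

The standard library's `statistics.variance` and `statistics.stdev` use the same n minus 1 divisor, so they serve as an independent check on the hand calculation.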
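The transcript encourages simulating the bias of the sample standard deviation. One way that might look is sketched below, under assumptions that are not in the video: the population is taken to be normal with a true standard deviation of 1, samples have size 8, and the helper names `sample_variance` and `simulate`, the trial count, and the seed are all hypothetical choices for illustration.

```python
import math
import random

def sample_variance(xs):
    """Unbiased sample variance: squared deviations divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def simulate(trials=20_000, n=8, sigma=1.0, seed=0):
    """Average the sample variance and sample std dev over many samples.

    Assumption: the population is normal with mean 0 and std dev sigma.
    """
    rng = random.Random(seed)
    total_var = 0.0
    total_std = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        v = sample_variance(xs)
        total_var += v
        total_std += math.sqrt(v)
    return total_var / trials, total_std / trials

avg_var, avg_std = simulate()
print(f"true variance = 1.0, average sample variance = {avg_var:.3f}")
print(f"true std dev  = 1.0, average sample std dev  = {avg_std:.3f}")
```

Under these assumptions, the average sample variance should land close to the true variance, while the average sample standard deviation should come out noticeably below the true standard deviation. The size of that gap depends on the sample size and on the shape of the population, which is the reason the transcript gives for there being no single correction like n minus 1.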