# Reference: Conditions for inference on a mean

AP.STATS: UNC‑4 (EU), UNC‑4.P (LO), UNC‑4.P.1 (EK), VAR‑7 (EU), VAR‑7.D (LO), VAR‑7.D.1 (EK)

When we want to carry out inference (build a confidence interval or do a significance test) on a mean, the accuracy of our methods depends on a few conditions. Before doing the actual computations of the interval or test, it's important to check whether or not these conditions have been met. Otherwise the calculations and conclusions that follow may not be correct.
The conditions we need for inference on a mean are:
• Random: A random sample or randomized experiment should be used to obtain the data.
• Normal: The sampling distribution of $\bar{x}$ (the sample mean) needs to be approximately normal. This is true if our parent population is normal or if our sample is reasonably large ($n \geq 30$).
• Independent: Individual observations need to be independent. If sampling without replacement, our sample size shouldn't be more than 10% of the population.
Let's look at each of these conditions a little more in-depth.

## The random condition

Random samples give us unbiased data from a population. When we don't use random selection, the resulting data usually has some form of bias, so using it to infer something about the population can be risky.
More specifically, sample means are unbiased estimators of their population mean. For example, suppose we have a bag of ping pong balls individually numbered from 0 to 30, so the population mean of the bag is 15. We could take random samples of balls from the bag and calculate the mean of each sample. Some samples would have a mean higher than 15 and some would have a mean lower than 15. But across many random samples, the sample means average out to 15. We write this property as $\mu_{\bar{x}} = \mu$, which holds true as long as we are taking random samples.
This won't necessarily happen if we use a non-random sample. Biased samples can lead to inaccurate results, so they shouldn't be used to create confidence intervals or carry out significance tests.
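The ping pong ball example can be checked with a short simulation. The sketch below is only an illustration: the sample size of 5, the 20,000 repetitions, the seed, and the function name `mean_of_sample_means` are all assumptions chosen for the demo, not part of the original example.

```python
import random
import statistics

def mean_of_sample_means(population, sample_size, num_samples, seed=0):
    """Average the means of many simple random samples drawn from `population`."""
    rng = random.Random(seed)  # fixed seed so the demo is repeatable
    sample_means = [
        # rng.sample draws without replacement, like pulling balls from a bag
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]
    return statistics.mean(sample_means)

balls = list(range(0, 31))  # balls numbered 0 through 30; population mean is 15
average_of_means = mean_of_sample_means(balls, sample_size=5, num_samples=20_000)

print(f"population mean:         {statistics.mean(balls)}")
print(f"average of sample means: {average_of_means:.2f}")
```

Individual sample means vary quite a bit, but their average should land very close to 15, which is what $\mu_{\bar{x}} = \mu$ describes.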

## The normal condition

The sampling distribution of $\bar{x}$ (a sample mean) is approximately normal in a few different cases. The shape of the sampling distribution of $\bar{x}$ mostly depends on the shape of the parent population and the sample size $n$.

### Case 1: Parent population is normally distributed

If the parent population is normally distributed, then the sampling distribution of $\bar{x}$ is approximately normal regardless of sample size. So if we know that the parent population is normally distributed, we pass this condition even if the sample size is small. In practice, however, we usually don't know if the parent population is normally distributed.

### Case 2: Not normal or unknown parent population; sample size is large ($n \geq 30$)

The sampling distribution of $\bar{x}$ is approximately normal as long as the sample size is reasonably large. Because of the central limit theorem, when $n \geq 30$, we can treat the sampling distribution of $\bar{x}$ as approximately normal regardless of the shape of the parent population.
There are a few rare cases where the parent population has such an unusual shape that the sampling distribution of the sample mean $\bar{x}$ isn't quite normal for sample sizes near 30. These cases are rare, so in practice, we are usually safe to assume approximate normality in the sampling distribution when $n \geq 30$.
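One way to see the central limit theorem at work is to simulate sample means from a clearly non-normal population and watch the skew fade as $n$ grows. The sketch below is a rough illustration under stated assumptions: the exponential parent population, the 5,000 repetitions, the seed, and the helper names `skewness` and `simulate_sample_means` are all choices made for this demo.

```python
import random
import statistics

def skewness(values):
    """Average cubed z-score: near 0 for a symmetric shape, positive for right skew."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def simulate_sample_means(sample_size, num_samples=5_000, seed=1):
    """Draw many samples from a right-skewed (exponential) population; return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(num_samples)
    ]

skew_n2 = skewness(simulate_sample_means(sample_size=2))
skew_n30 = skewness(simulate_sample_means(sample_size=30))

print(f"skewness of sample means, n = 2:  {skew_n2:.2f}")
print(f"skewness of sample means, n = 30: {skew_n30:.2f}")
```

The sample means for $n = 30$ should be far less skewed than those for $n = 2$, though with a parent population this skewed some skew can remain, which matches the caution about unusual shapes near $n = 30$.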

### Case 3: Not normal or unknown parent population; sample size is small ($n < 30$)

As long as the parent population doesn't have outliers or strong skew, even smaller samples will produce a sampling distribution of $\bar{x}$ that is approximately normal. In practice, we can't usually see the shape of the parent population, but we can try to infer its shape from the distribution of data in the sample. If the sample data show skew or outliers, we should doubt that the parent is approximately normal, and so the sampling distribution of $\bar{x}$ may not be normal either. But if the sample data are roughly symmetric and don't show outliers or strong skew, we can assume that the sampling distribution of $\bar{x}$ will be approximately normal.
The big idea is that we need to graph our sample data when $n < 30$ and then make a decision about the normal condition based on the appearance of the sample data.
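As a rough sketch of that check, the code below prints a text dot plot of a small sample and flags possible outliers. Two things here are assumptions rather than part of this article: the $1.5 \times \text{IQR}$ rule used to flag outliers, and the made-up sample values. The helper names `text_dotplot` and `find_outliers` are hypothetical.

```python
import math
import statistics

def text_dotplot(sample, bin_width=5):
    """Return a crude text dot plot: one row per bin, one '*' per observation."""
    indices = [math.floor(x / bin_width) for x in sample]
    rows = []
    for i in range(min(indices), max(indices) + 1):  # include empty bins so gaps show
        rows.append(f"{i * bin_width:>6.1f} | {'*' * indices.count(i)}")
    return "\n".join(rows)

def find_outliers(sample):
    """Flag values beyond 1.5 * IQR from the quartiles (an assumed rule of thumb)."""
    q1, _, q3 = statistics.quantiles(sample, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in sample if x < low or x > high]

# Made-up samples for illustration only
symmetric_sample = [12, 14, 15, 15, 16, 17, 18, 19, 20]
skewed_sample = [12, 14, 15, 15, 16, 17, 18, 19, 41]

print(text_dotplot(skewed_sample))
print("possible outliers:", find_outliers(skewed_sample))
print("possible outliers:", find_outliers(symmetric_sample))
```

A flagged value or a long tail in the plot is a reason to doubt the normal condition for a small sample; the decision still rests on looking at the graph, not on the rule alone.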

## The independence condition

To use the formula for the standard deviation of $\bar{x}$, we need individual observations to be independent. In an experiment, good design usually takes care of independence between subjects (control, different treatments, randomization).
In an observational study that involves sampling without replacement, individual observations aren't technically independent, since removing each observation changes the population. However, the 10% condition says that if we sample 10% or less of the population, we can treat individual observations as independent, since removing each observation doesn't change the population all that much as we sample. For instance, if our sample size is $n = 30$, there should be at least $N = 300$ members in the population for the sample to meet the independence condition.
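The 10% check is simple arithmetic, and a minimal sketch of it might look like this. The function name `meets_ten_percent_condition` is hypothetical.

```python
def meets_ten_percent_condition(sample_size, population_size):
    """True when the sample is at most 10% of the population (n <= 0.10 * N)."""
    # Comparing 10 * n with N uses integer arithmetic and avoids float rounding.
    return 10 * sample_size <= population_size

print(meets_ten_percent_condition(30, 300))  # n = 30 needs N of at least 300
print(meets_ten_percent_condition(30, 299))  # population slightly too small
```

Equivalently, the smallest population that supports a sample of size $n$ is $N = 10n$.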
Assuming independence between observations allows us to use this formula for the standard deviation of $\bar{x}$ when we're making confidence intervals or doing significance tests:
$$\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$$
We usually don't know the population standard deviation $\sigma$, so we substitute the sample standard deviation $s_x$ as an estimate for $\sigma$. When we do this, we call it the standard error of $\bar{x}$ to distinguish it from the standard deviation.
So our formula for the standard error of $\bar{x}$ is:
$$\sigma_{\bar{x}} \approx \dfrac{s_x}{\sqrt{n}}$$
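As a small worked sketch of the standard error formula, the code below computes $s_x / \sqrt{n}$ for a made-up sample. The data values and the function name `standard_error_of_mean` are assumptions for illustration only.

```python
import math
import statistics

def standard_error_of_mean(sample):
    """Estimate the standard deviation of x-bar as s_x / sqrt(n)."""
    s_x = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    return s_x / math.sqrt(len(sample))

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data, n = 8
standard_error = standard_error_of_mean(sample)

print(f"sample mean:    {statistics.mean(sample)}")
print(f"standard error: {standard_error:.3f}")
```

Note that `statistics.stdev` computes the sample standard deviation $s_x$, while `statistics.pstdev` would treat the data as an entire population, which is not what the formula calls for.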

## Summary

If all three of these conditions are met, then we can feel good about using $t$ distributions to make a confidence interval or do a significance test. Satisfying these conditions makes our calculations accurate and our conclusions reliable.
The random condition is perhaps the most important. If we break the random condition, there is probably bias in the data. The only reliable way to correct for a biased sample is to recollect the data in an unbiased way.
The other two conditions are important, but if we don't meet the normal or independence conditions, we may not need to start over. For example, there is a way to correct for the lack of independence when we sample more than 10% of a population, but it's beyond the scope of what we're learning right now.
The main idea is that it's important to verify certain conditions are met before we make these confidence intervals or do these significance tests.