Reference: Conditions for inference on a mean

When we want to carry out inference (build a confidence interval or do a significance test) on a mean, the accuracy of our methods depends on a few conditions. Before doing the actual computations of the interval or test, it's important to check whether or not these conditions have been met. Otherwise the calculations and conclusions that follow may not be correct.

The conditions we need for inference on a mean are:

Random: A random sample or randomized experiment should be used to obtain the data.
Normal: The sampling distribution of $\bar{x}$ (the sample mean) needs to be approximately normal. This is true if our parent population is normal or if our sample is reasonably large ($n \geq 30$).
Independent: Individual observations need to be independent. If sampling without replacement, our sample size shouldn't be more than $10\%$ of the population.

Let's look at each of these conditions a little more in-depth.

The random condition

Random samples give us unbiased data from a population. When we don't use random selection, the resulting data usually has some form of bias, so using it to infer something about the population can be risky.

For example, suppose a university wants to report the average starting salary of their graduates. How do they obtain the data? They can't access the salaries of all graduates, and they can't realistically get salaries from a random sample of graduates. The university could rely on graduates who are willing to share their salaries to calculate the average, but using voluntary response will likely lead to a biased estimate of the true average. Graduates with higher starting salaries will probably be more willing to report their salaries than graduates with low salaries (or graduates without salaries). Also, graduates who participate may claim their salary is higher than it really is, but they'd be unlikely to say it's lower than it really is.

The big idea is that data that came from a non-random sample may not be representative of its population.

More specifically, sample means are unbiased estimators of their population mean. For example, suppose we have a bag of ping pong balls individually numbered from $0$ to $30$, so the population mean of the bag is $15$. We could take random samples of balls from the bag and calculate the mean from each sample. Some samples would have a mean higher than $15$ and some would be lower. But on average, the mean of each sample will equal $15$. We write this property as $\mu_{\bar{x}} = \mu$, which holds true as long as we are taking random samples.
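The ping pong ball example can be checked with a short simulation. This is only a sketch: the function name, the sample size of 5, the number of repetitions, and the seed are all choices made here for illustration, not part of the original example.

```python
import random
import statistics


def mean_of_sample_means(population, sample_size, num_samples, seed=0):
    """Average the means of many simple random samples (drawn without replacement)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    sample_means = [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]
    return statistics.mean(sample_means)


balls = list(range(31))                   # balls numbered 0 through 30
population_mean = statistics.mean(balls)  # the population mean, 15
estimate = mean_of_sample_means(balls, sample_size=5, num_samples=20_000)

print("population mean:", population_mean)
print("average of the sample means:", round(estimate, 2))
```

Individual sample means land above and below the population mean, but their average should sit very close to it.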

This won't necessarily happen if we use a non-random sample. Biased samples can lead to inaccurate results, so they shouldn't be used to create confidence intervals or carry out significance tests.

The normal condition

The sampling distribution of $\bar{x}$ (a sample mean) is approximately normal in a few different cases. The shape of the sampling distribution of $\bar{x}$ mostly depends on the shape of the parent population and the sample size $n$.

Case 1: Parent population is normally distributed

If the parent population is normally distributed, then the sampling distribution of $\bar{x}$ is approximately normal regardless of sample size. So if we know that the parent population is normally distributed, we pass this condition even if the sample size is small. In practice, however, we usually don't know if the parent population is normally distributed.

Case 2: Not normal or unknown parent population; sample size is large ($n \geq 30$)

The sampling distribution of $\bar{x}$ is approximately normal as long as the sample size is reasonably large. Because of the central limit theorem, when $n \geq 30$, we can treat the sampling distribution of $\bar{x}$ as approximately normal regardless of the shape of the parent population.

There are a few rare cases where the parent population has such an unusual shape that the sampling distribution of the sample mean $\bar{x}$ isn't quite normal for sample sizes near $30$. These cases are rare, so in practice, we are usually safe to assume approximate normality in the sampling distribution when $n \geq 30$.
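To see the central limit theorem at work, we can simulate sample means from a strongly right-skewed parent population and measure how skewed the resulting sampling distribution is. The sketch below assumes an exponential parent population and uses a hand-written skewness measure; both are illustrative choices made here, and the helper names are hypothetical.

```python
import random
import statistics


def skewness(values):
    """Population skewness: the average cubed z-score (0 for a symmetric shape)."""
    center = statistics.fmean(values)
    spread = statistics.pstdev(values)
    return sum(((v - center) / spread) ** 3 for v in values) / len(values)


def sample_mean_skewness(n, num_samples=5_000, seed=1):
    """Skewness of the simulated sampling distribution of the mean for sample size n."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(num_samples)
    ]
    return skewness(means)


skew_small = sample_mean_skewness(2)   # tiny samples keep much of the parent's skew
skew_large = sample_mean_skewness(30)  # n = 30 should be far closer to symmetric

print("skewness of sample means, n = 2: ", round(skew_small, 2))
print("skewness of sample means, n = 30:", round(skew_large, 2))
```

Some skew remains at $n = 30$ for a parent this lopsided, which fits the caution above that the sampling distribution is only approximately normal.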

Case 3: Not normal or unknown parent population; sample size is small ($n < 30$)

As long as the parent population doesn't have outliers or strong skew, even smaller samples will produce a sampling distribution of $\bar{x}$ that is approximately normal. In practice, we can't usually see the shape of the parent population, but we can try to infer shape based on the distribution of data in the sample. If the data in the sample shows skew or outliers, we should doubt that the parent is approximately normal, and so the sampling distribution of $\bar{x}$ may not be normal either. But if the sample data are roughly symmetric and don't show outliers or strong skew, we can assume that the sampling distribution of $\bar{x}$ will be approximately normal.

The big idea is that we need to graph our sample data when $n < 30$ and then make a decision about the normal condition based on the appearance of the sample data.
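A graph is the main tool here, but a numeric screen can back it up. The sketch below flags outliers with the common 1.5 × IQR rule, which is an assumption brought in for illustration and not something this article prescribes. The function name and both data sets are made up.

```python
import statistics


def flag_outliers(data):
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles, default 'exclusive' method
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]


# Hypothetical small samples (n = 10 each)
with_outlier = [12, 14, 15, 15, 16, 17, 18, 19, 20, 45]
roughly_symmetric = [12, 14, 15, 15, 16, 17, 18, 19, 20, 21]

print(flag_outliers(with_outlier))       # values we would want to investigate
print(flag_outliers(roughly_symmetric))  # an empty list means nothing was flagged
```

An empty result doesn't prove the normal condition is met; it only means this particular rule found nothing, so we should still look at a dot plot or histogram of the sample.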

The independence condition

To use the formula for standard deviation of $\bar{x}$, we need individual observations to be independent. In an experiment, good design usually takes care of independence between subjects (control, different treatments, randomization).

In an observational study that involves sampling without replacement, individual observations aren't technically independent since removing each observation changes the population. However, the $10\%$ condition says that if we sample $10\%$ or less of the population, we can treat individual observations as independent since removing each observation doesn't change the population all that much as we sample. For instance, if our sample size is $n = 30$, there should be at least $N = 300$ members in the population for the sample to meet the independence condition.
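The $10\%$ check is simple arithmetic: the sample size times ten should not exceed the population size. A minimal sketch, with function names chosen here purely for illustration:

```python
def meets_ten_percent_condition(sample_size, population_size):
    """True when the sample is at most 10% of the population."""
    # Integer arithmetic avoids floating-point rounding at the boundary
    return 10 * sample_size <= population_size


def smallest_allowed_population(sample_size):
    """Smallest population size for which the 10% condition holds."""
    return 10 * sample_size


print(meets_ten_percent_condition(30, 300))  # the example from the text
print(meets_ten_percent_condition(30, 299))  # population just too small
print(smallest_allowed_population(30))
```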

Assuming independence between observations allows us to use this formula for standard deviation of $\bar{x}$ when we're making confidence intervals or doing significance tests:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

We usually don't know the population standard deviation $\sigma$, so we substitute the sample standard deviation $s_{x}$ as an estimate for $\sigma$. When we do this, we call it the standard error of $\bar{x}$ to distinguish it from the standard deviation.

So our formula for standard error of $\bar{x}$ is:

$$\sigma_{\bar{x}} \approx \frac{s_{x}}{\sqrt{n}}$$
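Here is how the standard error might be computed from a sample, along with the $t$ interval it feeds into. This is a sketch under stated assumptions: the data are made up, the function names are hypothetical, and the critical value $t^* = 2.365$ (95% confidence, 7 degrees of freedom) is read from a standard $t$ table and passed in by hand.

```python
import math
import statistics


def standard_error(sample):
    """Estimate the standard deviation of the sample mean as s_x / sqrt(n)."""
    s_x = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return s_x / math.sqrt(len(sample))


def t_interval(sample, t_star):
    """Confidence interval for the mean: x-bar +/- t* * SE. The caller supplies t*."""
    x_bar = statistics.mean(sample)
    margin = t_star * standard_error(sample)
    return x_bar - margin, x_bar + margin


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, n = 8
se = standard_error(sample)
low, high = t_interval(sample, t_star=2.365)  # t* for 95% confidence, df = 7

print("standard error:", round(se, 3))
print("95% interval:", round(low, 2), "to", round(high, 2))
```

Note that the sample standard deviation describes the spread of individual observations, while the standard error describes the spread of sample means, which is why it shrinks as $n$ grows.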

Summary

If all three of these conditions are met, then we can feel good about using $t$ distributions to make a confidence interval or do a significance test. Satisfying these conditions makes our calculations accurate and conclusions reliable.

The random condition is perhaps the most important. If we break the random condition, there is probably bias in the data. The only reliable way to correct for a biased sample is to recollect the data in an unbiased way.

The other two conditions are important, but if we don't meet the normal or independence conditions, we may not need to start over. For example, there is a way to correct for the lack of independence when we sample more than $10\%$ of a population, but it's beyond the scope of what we're learning right now.

The main idea is that it's important to verify certain conditions are met before we make these confidence intervals or do these significance tests.

Want to join the conversation?

Brian Bale
Posted 5 years ago.
What happened to the normal condition np ≥ 10 and n(1-p) ≥ 10
(22 votes)
- ronaldoamulya
  Posted 5 years ago.
  it is for sampling distribution of sample proportion
  (45 votes)
Pramoth Viswan
Posted 6 years ago.
Shouldn't the standard error of sample be just sample standard deviation instead of (sample standard deviation)/(sqrt(n)) ??
(9 votes)
- Abe
  Posted 6 years ago.
  When you take a sample of a population, the sd should be sd/sqrt(n). What stays the same is the mean. The mean is the same both for the population and the sample. I think you're confusing the two.
  (4 votes)
Alba Soma
Posted 3 years ago.
It's said: n should be >= 30, for calculating t-intervals (using t-statistics). But in another video (https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/more-significance-testing-videos/v/z-statistics-vs-t-statistics) Sal is saying that t-statistics should be used instead of z-statistics only IF n < 30.
(6 votes)
- Jerry Nilsson
  Posted 3 years ago.
  z-statistics will give a narrower confidence interval than t-statistics, but the larger 𝑛 is the less that difference will be, and for 𝑛 ≥ 30, the difference can be considered negligible.
  (7 votes)
John Ostrowski
Posted 5 years ago.
I'm very curious about the method for correcting the independence factor of samples when n> 10%N. It was mentioned as "beyond the scope", does anyone have references for that? Maybe Stat trek?
(4 votes)
- daniella
  Posted 7 days ago.
  When you sample a big chunk (more than 10%) of a population, the choices you make can affect the remaining population, which can mess with the independence of the data points. To fix this, you use something called the Finite Population Correction (FPC). The formula for FPC is:
  
  FPC = sqrt((N - n) / (N - 1))
  
  Where:
  
  N is the total population size
  n is the size of your sample
  You then adjust the standard error (SE) of your statistic by multiplying it by FPC:
  
  Adjusted SE = FPC x SE
  
  This helps correct the independence issue when you're sampling a large part of the population.
  (1 vote)
Kyle Wright
Posted 4 years ago.
If the population is known to likely not be normal and a sample of n<30, then does the sample have to be transformed to normal to make an inference on CI? E.g. if you want to determine a CI on the number of customers that arrive at a drive through window during lunch - I believe this would be a Poisson counting process, therefore not normal. So wouldn't this be skewed to the right since the number of customers is bounded to >=0 and likely not symmetric. Basically, how can you correct a sample to make an inference on CI mean if dealing with Case #3. Thanks!
(3 votes)
- daniella
  Posted 7 days ago.
  If your data isn't normally distributed and you have a small sample size, you have a few options:
  
  a) Transformation Methods: You can try transforming the data using logarithms, square roots, or other functions to make the distribution more symmetrical and closer to normal.
  
  b) Non-parametric Methods: These methods don't assume your data follows any specific distribution. Examples include the Mann-Whitney U test or the Wilcoxon signed-rank test.
  
  c) Bootstrapping: This technique involves repeatedly resampling from your data to build up a "bootstrap" distribution of your statistic. It's very flexible and doesn't assume normality.
  
  For something like the number of customers at a drive-through (which could be modeled by a Poisson distribution), you might look at Poisson regression or a generalized linear model (GLM) with a Poisson link, which are good for count data. Or, you could use bootstrapping or a non-parametric method if you need to estimate something like a confidence interval.
  (1 vote)
Renzo ChaseC
Posted 4 days ago.
wow this was very easy
(1 vote)
