# Hypothesis testing and p-values

Sal walks through an example about a neurologist testing the effect of a drug to discuss hypothesis testing and p-values. Created by Sal Khan.

## Want to join the conversation?

- Why do we reject the null hypothesis when we have 99.7% of the area under the curve supporting the null hypothesis? (64 votes)
- The bell curve is saying that if you test groups of 100 (undrugged) rats over and over again, the average reaction times will range between 1.05 and 1.35 seconds almost every time. The average outcome is 1.2 seconds, but just by chance you could get a group of faster-reacting rats, or slower-reacting rats. Each rat can be a little different from the average. Now suppose you get a group of rats and they average out to 1.05 seconds. That would be really, really rare for normal rats. So that bunch of rats must have been drinking Starbucks coffee or something, because they are not normal rats. Think of the 99.7% part of the curve as caffeine-free rats. Null = no = caffeine free. Then along comes a bunch of rats that are so "hyper" they are "off the charts". Anyone claiming these are normal rats can be pretty safely "rejected" (called a liar). (112 votes)
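
The "test groups of 100 rats over and over again" picture can be tried out with a small simulation. This is only a sketch: it assumes untreated response times are roughly normal with mean 1.2 s and standard deviation 0.5 s (the SD is borrowed from the sample in the video), and the seed and number of groups are arbitrary choices.

```python
import random
import statistics

# Assumed population of untreated rats: normal, mean 1.2 s, SD 0.5 s.
rng = random.Random(0)
NULL_MEAN, SD, N = 1.2, 0.5, 100

def group_average():
    """Average response time of one simulated group of N untreated rats."""
    return statistics.fmean([rng.gauss(NULL_MEAN, SD) for _ in range(N)])

averages = [group_average() for _ in range(5000)]

# Share of group averages that land within 3 standard errors of 1.2 s.
inside = sum(1.05 <= a <= 1.35 for a in averages) / len(averages)
print(f"share of group averages between 1.05 and 1.35: {inside:.4f}")
```

Almost all simulated group averages should fall inside the 1.05 to 1.35 band, which is why an observed average right at its edge is hard to explain as ordinary chance.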

- Starting at 4:22, why do you need to estimate the sample standard deviation when you already have it (0.5)? He goes on to say that you put a hat on it to show that you estimated the population standard deviation by using the sample, but why does the sigma have a hat for the population estimate and an x bar for the sample? Is the notation correct in that section? (69 votes)
- Don't forget, we don't really care about the standard deviation of the sample itself; we care about its relationship to the population. So we have to use measures that involve the actual population. You should first watch the video "standard error of the mean" to get this one. (8 votes)

- Shouldn't it be the other way around when calculating the Z value?

(1.05-1.2)/0.05 instead of (1.2-1.05)/0.05?

My professor always told me to do it that way. The final conclusion doesn't change in this case, but I just wanted to check whether that's the proper way. (59 votes)
- Since the normal probability distribution (bell curve) is symmetric around the mean, it doesn't matter for a two-tailed test. Both orders give the same area under the curve in the tails; only the sign of z changes. The usual convention is sample mean minus hypothesized mean, which here gives -3 and tells you the sample fell below the mean. If we were dealing with a non-symmetric distribution, like the F distribution, then the order would matter.

Hope that helps. (16 votes)
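
A quick check of the symmetry argument, as a sketch using Python's standard library. The helper name `two_sided_p` is made up for this illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def two_sided_p(z):
    """Probability of a result at least |z| standard errors from the mean."""
    return 2 * (1 - std_normal.cdf(abs(z)))

z_textbook = (1.05 - 1.2) / 0.05   # sample mean minus hypothesized mean
z_video = (1.2 - 1.05) / 0.05      # the order used in the video

print(z_textbook, z_video)
print(two_sided_p(z_textbook), two_sided_p(z_video))
```

The two z values differ only in sign, and the two-sided tail area is the same for both.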

- Why are you not using a t-distribution to find the probability of getting the sample result? I know that when the sample size is large (n = 100), a t-distribution is essentially the same as a normal distribution, but I think this lesson can be misleading when we are taught to use a t-distribution in the common case when the population standard deviation is not known and we are estimating it from the sample.(57 votes)
- The t-test is more conservative when the sample size is small. I think you would opt for the more conservative test, knowing that with a larger sample size there is essentially no difference between t and z. In general, when comparing two means, the t-test is used. Note from the results given above by ericp that the conclusion from either test is the same: the two groups differ significantly. In scientific reports, the p-value is reported to 2 decimal places. So using either the z or t test, you would report a significant difference "with p < .01". (20 votes)
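
To see how close the two tests are at n = 100, here is a rough comparison that uses only the Python standard library. The t tail area is approximated numerically with Simpson's rule rather than taken from a statistics package, and the function names are invented for this sketch.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_two_sided_p(t, df, steps=2000):
    """Two-sided tail area beyond |t|, via Simpson's rule on [0, |t|]."""
    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3   # area between 0 and |t|
    return 2 * (0.5 - central_half)

z_p = 2 * (1 - NormalDist().cdf(3.0))
t_p = t_two_sided_p(3.0, df=99)    # df = n - 1 for 100 rats
print(f"normal: {z_p:.4f}   t (99 df): {t_p:.4f}")
```

The t-based p-value comes out slightly larger than the normal one, but both are well below 0.01, so the conclusion does not change.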

- Is it valid to assume the sample SD is close to the population SD? Even if the sample size is high, the rats in the sample have been injected, how do we know that doesn't affect the sample SD?(15 votes)
- It is an assumption you are making, justified by the fact that your H0 is that the drug has no effect, so the populations (drug vs. no drug) would actually be identical. If the drug has no effect, then the standard deviation of drug and no-drug rats should be the same. It is an assumption, justified with some logic, but not proven.

In a research paper, this would be recognized as a weakness, but an unavoidable one, because it is impossible to know the true standard deviation of either population - you only know the samples.(11 votes)

- I have a very fundamental question:
**Short formulation of the question**: Why is the hypothesis test designed the way it is? I want to know exactly why we can't calculate the probability of the alternative hypothesis given the sample directly and why we have to assume the null hypothesis is true?

**Long formulation of the question**: When conducting an experiment and setting up hypotheses about its outcome, what we actually want to know is whether our alternative hypothesis is true or at least how likely it is (i.e. the probability of the alternative hypothesis to be true), right?

The hypothesis test presented here only gives us the probability of the sample mean being extreme, not the probability of the real underlying population mean being extreme.

So why do we have to go through this process of calculating the probability of the sample given the null-hypothesis is true and then use this result to infer the likelihood of the alternative hypothesis?

Unfortunately I have not found one textbook on statistics yet which answers this fundamental question but I was reading so many enlightening answers here and hope to get an answer!

From my understanding of the hypothesis test, I would answer my own question like this:

Since we don't know anything about the underlying population except the tested sample, we just are not able to do any calculations of it. This includes calculating the probabilities of the alternative hypothesis because it is a hypothesis about the population.

We have to work under the assumption that the null hypothesis is true because otherwise we cannot really do anything: we wouldn't know where to center the normally distributed curve which we use to calculate significance... but somehow I am not entirely convinced by my own answer.

Even if my own answer to the question happens to be not far away from the truth, I would appreciate it very much if someone could elaborate a bit.

Thank you! (12 votes)
- H_0: the population mean that someone (possibly you) insists is true.

H_A: the claim from someone else (possibly you again) that H_0 can't be right, because their own mean is the true one.

sample_mean (and sample_std): the only evidence either side has for checking who is right.

In short, what you're doing with a significance test is attacking someone's mean with a different mean, based on gathered data.

If the evidence is good enough to support you, you can kill H_0 and put forward your own H_A as the next H_0 (that's how scientific theories get developed and challenged, and so forth).

If not, you can't kill it. That's it, no more, no less. (What about your precious H_A? Just forget about it: not enough evidence.) (1 vote)

- I don't understand where Sal got 99.7%... can anyone explain? (8:50) (8 votes)
- He mentioned this a couple of videos ago, but he is using the empirical rule, which states that, for a normal distribution, 99.7% of all values lie within 3 standard deviations of the mean. Similarly, 68.27% lie within 1 standard deviation and 95.45% within 2. See: http://en.wikipedia.org/wiki/68-95-99.7_rule (9 votes)
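
The three percentages in the empirical rule can be reproduced from the standard normal distribution. A short sketch using `statistics.NormalDist` from the Python standard library:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Area within k standard deviations of the mean, for k = 1, 2, 3.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within {k} SD: {area:.2%}")
```

Whatever falls outside the 3 SD band is the part that gets split between the two tails in the video.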

- Shouldn't we say that the alternative hypothesis is just μ<1.2s and not in both directions?(8 votes)
- That's an important question. In the end, it comes down to the reason that you are conducting the experiment. In this case, the null hypothesis is that the drug doesn't have an effect on response time, so you want to measure both tails. If your null hypothesis was that the drug doesn't have a **negative** effect on response time, then you would only measure one tail. (8 votes)
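
For the numbers in the video, the difference between measuring one tail and both tails looks like this. It is a sketch under the same normal approximation the video uses; the variable names are illustrative.

```python
from statistics import NormalDist

z = (1.05 - 1.2) / 0.05                  # about -3: three standard errors below 1.2

lower_tail = NormalDist().cdf(z)         # one tail: alternative is mean < 1.2
both_tails = 2 * NormalDist().cdf(-abs(z))  # two tails: alternative is mean != 1.2

print(f"one-tailed p: {lower_tail:.5f}, two-tailed p: {both_tails:.5f}")
```

The one-tailed p-value is half the two-tailed one, so choosing the direction after seeing the data would make a result look stronger than it is.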

- How do you calculate the critical value? I can't find an explanation for it in your video list. Thank you! (6 votes)
- short answer: Critical values are generally chosen or looked up in a table (based on a chosen alpha).

longer answer:

--------------------

In this video there was no critical value set for the experiment. In the last seconds of the video, Sal briefly mentions a p-value threshold of 5% (0.05), which for a two-tailed test corresponds to critical values of z = (+/-) 1.96. Since the experiment produced a result 3 standard deviations from the mean, which is more extreme than 1.96, we reject the null hypothesis.

Generally, one would choose an alpha (a percentage) which represents the "tolerance level for making a mistake.*" Then the corresponding critical value can be looked up in a table. [* The "mistake" being to incorrectly reject the null hypothesis. In other words, we made the error of claiming that the experiment had an effect when it did not.]

The critical value is the cut-off point that corresponds to that alpha; if the null hypothesis is true, a value beyond the critical value has less than an alpha chance of occurring.

See the Wikipedia page on z-tables and how to read them:

http://en.wikipedia.org/wiki/Standard_normal_table

Note that for an alpha of 5%, in a cumulative table, you would first divide your alpha in half for a two-tailed test, then subtract that from 1. That is the value you are looking for in the table. So we get 1 - (0.05/2) = 1 - 0.025 = 0.9750.

We find 0.9750 in our table, look at the row: 1.9; look at the column: 0.06; add the two together to get the corresponding z-score: 1.96.(6 votes)
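
The same table lookup can be done in code. A sketch using `NormalDist.inv_cdf` from the Python standard library, with alpha set to the 5% mentioned above:

```python
from statistics import NormalDist

alpha = 0.05

# Two-tailed test: put alpha/2 in each tail, so look up 1 - alpha/2 = 0.975.
critical_z = NormalDist().inv_cdf(1 - alpha / 2)

z_observed = 3.0  # distance from the mean found in the video
print(round(critical_z, 2), abs(z_observed) > critical_z)
```

Changing `alpha` gives the cut-off for any other tolerance level without needing the table.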

- If we assume that the null hypothesis is true, then why do we assume that the sample mean is 1.2 sec? We already know that it's 1.05 sec.(5 votes)
- Because that **is** the null hypothesis (**H0**).

What we are testing is how likely we are to have seen the data, under the assumption that **H0** is true. Null hypothesis testing follows a somewhat backward seeming logic, but this is apparently pretty standard in mathematics.

1) We calculate how probable it is that we would have seen the observed data if **H0** is true.

2) We then either reject **H0** (or fail to reject it) depending on how often we are willing to wrongly reject **H0** (this is the Type I error rate).

3) If we reject **H0** then we *provisionally* conclude that our alternate hypothesis could be true ... (7 votes)
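
What "how often we are willing to wrongly reject H0" means in practice can be illustrated by simulating many experiments in which H0 really is true. This is only a sketch: the normal population (mean 1.2 s, SD 0.5 s), the seed, and the number of trials are assumptions made for the illustration.

```python
import math
import random
import statistics

rng = random.Random(1)
NULL_MEAN, SD, N = 1.2, 0.5, 100
CRITICAL_Z = 1.96  # two-tailed cut-off for alpha = 0.05

def rejects_null():
    """Run one simulated experiment in which H0 is true; True if it rejects H0."""
    sample = [rng.gauss(NULL_MEAN, SD) for _ in range(N)]
    std_error = statistics.stdev(sample) / math.sqrt(N)
    z = (statistics.fmean(sample) - NULL_MEAN) / std_error
    return abs(z) > CRITICAL_Z

trials = 2000
type_i_rate = sum(rejects_null() for _ in range(trials)) / trials
print(f"share of true-null experiments that reject H0: {type_i_rate:.3f}")
```

The share of wrong rejections should land near the chosen alpha of 5%, which is the Type I error rate from step 2.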

## Video transcript

A neurologist is testing the
effect of a drug on response time by injecting 100 rats with
a unit dose of the drug, subjecting each to neurological
stimulus and recording its response time. The neurologist knows that the
mean response time for rats not injected with the
drug is 1.2 seconds. The mean of the 100 injected
rats' response times is 1.05 seconds with a
sample standard deviation of 0.5 seconds. Do you think that the drug has
an effect on response time? So to do this we're going to
set up two hypotheses. We're going to say, one, the
first hypothesis is we're going to call it the null
hypothesis, and that is that the drug has no effect
on response time. And your null hypothesis is
always going to be-- you can view it as a status quo. You assume that whatever you're
researching has no effect. So the drug has no effect. Or another way to think about
it is that the mean of the rats taking the drug should be
the mean with the drug-- let me write it this way-- the
mean is still going to be 1.2 seconds even
with the drug. So that's essentially saying it
has no effect, because we know that if you don't give
the drug the mean response time is 1.2 seconds. Now, what you want is an
alternative hypothesis. The hypothesis is no,
I think the drug actually does do something. So the alternative hypothesis,
right over here, that the drug has an effect. Or another way to think about
it is that the mean does not equal 1.2 seconds when
the drug is given. So how do we think about this? How do we know whether we should
accept the alternative hypothesis or whether we should
just default to the null hypothesis because the
data isn't convincing? And the way we're going to do it
in this video, and this is really the way it's done in
pretty much all of science, is you say OK, let's assume that
the null hypothesis is true. If the null hypothesis was true,
what is the probability that we would have gotten these
results with the sample? And if that probability is
really, really small, then the null hypothesis probably
isn't true. We could probably reject the
null hypothesis and we'll say well, we kind of believe in the
alternative hypothesis. So let's think about that. Let's assume that the null
hypothesis is true. So if we assume the null
hypothesis is true, let's try to figure out the probability
that we would have actually gotten this result, that we
would have actually gotten a sample mean of 1.05 seconds with
a standard deviation of 0.5 seconds. So I want to see if we assumed
the null hypothesis is true, I want to figure out the
probability-- and actually what we're going to do is
not just figure out the probability of this, but the
probability of getting something like this or even
more extreme than this. So how likely of an
event is that? To think about that let's just
think about the sampling distribution if we assume
the null hypothesis. So the sampling distribution
is like this. It'll be a normal
distribution. We have a good number
of samples, we have 100 samples here. So this is the sampling
distribution. It will have a mean. Now if we assume the null
hypothesis, that the drug has no effect, the mean of our
sampling distribution will be the same thing as the meaning
of the population distribution, which would
be equal to 1.2 seconds. Now, what is the standard
deviation of our sampling distribution? The standard deviation of our
sampling distribution should be equal to the standard
deviation of the population distribution divided by the
square root of our sample size, so divided by the
square root of 100. We do not know what the standard
deviation of the entire population is. So what we're going to do is
estimate it with our sample standard deviation. And it's a reasonable thing to
do, especially because we have a nice sample size. The sample size is
100. So this is going to be a pretty
good approximation for this over here. So we could say that this is
going to be approximately equal to our sample standard
deviation divided by the square root of 100, which is
going to be equal to-- our sample standard deviation is
0.5 seconds, and we want to divide that by the square
root of 100, which is 10. So 0.5 divided by 10 is 0.05. So the standard deviation of our
sampling distribution is going to be-- and we'll put a
little hat over it to show that we approximated it with--
we approximated the population standard deviation with the
sample standard deviation. So it is going to be equal
to 0.5 divided by 10. So 0.05. So what is the probability--
so let's think about it this way. What is the probability of
getting 1.05 seconds? Or another way to think about
it is how many standard deviations away from this mean
is 1.05 seconds, and what is the probability of getting a
result at least that many standard deviations away
from the mean. So let's figure out how many
standard deviations away from the mean that is. Now essentially we're just
figuring out a Z-score, a Z-score for this result
right over there. So let me pick a nice color--
I haven't used orange yet. So our Z-score-- you could
even do the Z-statistic. It's being derived from these
other sample statistics. So our Z-statistic, how far
are we away from the mean? Well the mean is 1.2. And we are at 1.05, so I'll
subtract that from the mean just so that it'll be a positive distance. So that's how far away we are. And if we wanted it in terms
of standard deviations, we want to divide it by our best
estimate of the sampling distribution's standard
deviation, which is this 0.05. So this is 0.05, and what is
this going to be equal to? This result right here,
1.05 seconds. 1.2 minus 1.05 is 0.15. So this is 0.15 in the numerator
divided by 0.05 in the denominator, and so
this is going to be 3. So this result right here
is 3 standard deviations away from the mean. So let me draw this. This is the mean. If I did 1 standard deviation,
2 standard deviations, 3 standard deviations-- that's
in the positive direction. Actually let me draw
it a little bit different than that. This wasn't a nicely drawn
bell curve, but I'll do 1 standard deviation, 2 standard
deviations, and then 3 standard deviations in the positive
direction. And then we have 1 standard
deviation, 2 standard deviations, and 3 standard
deviations in the negative direction. So this result right here, 1.05
seconds that we got for our 100 rat sample is
right over here. 3 standard deviations
below the mean. Now what is the probability
of getting a result this extreme by chance? And when I talk about this
extreme, it could be either a result less than this or a
result of that extreme in the positive direction. More than 3 standard
deviations. So this is essentially, if we
think about the probability of getting a result more extreme
than this result right over here, we're thinking about
this area under the bell curve, both in the negative
direction and in the positive direction. What is the probability
of that? Well we go from the empirical
rule that 99.7% of the probability is within 3
standard deviations. So this thing right here-- you
can look it up on a Z-table as well, but 3 standard deviations
is a nice clean number that doesn't hurt to remember. So we know that this area right
here that I'm shading in reddish-orange, that area
right over here is 99.7%. So what is left for these two
magenta or pink areas? Well if this is 99.7%, then
both of these combined are going to be 0.3%. So both of these combined are
0.3-- I should write it this way-- are 0.3%. Or if we wrote it as a decimal
it would be 0.003 of the total area under the curve. So to answer our question, if we
assume that the drug has no effect, the probability of
getting a sample this extreme or actually more extreme
than this is only 0.3%. Less than 1 in 300. So if the null hypothesis was
true, there's only a 1 in 300 chance that we would have
gotten a result this extreme or more. So at least from my point of
view this result seems to favor the alternative
hypothesis. I'm going to reject the
null hypothesis. I don't know this 100% for sure. But if the null hypothesis was
true there's only 1 in 300 chance of getting this. So I'm going to go with the
alternative hypothesis. And just to give you a little
bit of some of the names or the labels you might see in some
statistics or in some research papers, this value, the
probability of getting a result at least as extreme as this
given the null hypothesis is called a P-value. So the P-value here, and that
really just stands for probability value, the P-value
right over here is 0.003. So there's a very, very small
probability that we could have gotten this result if the null
hypothesis was true, so we will reject it. And in general, most people
have some type of a threshold here. If you have a P-value less than
5%, which means less than a 1 in 20 shot, you say, you
know what, I'm going to reject the null hypothesis. There's less than a 1 in 20
chance of getting that result. Here we got much less
than 1 in 20. So this is a very strong
indicator that the null hypothesis is incorrect,
and the drug definitely has some effect.
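
The arithmetic in the transcript can be retraced in a few lines of Python. This is a sketch rather than part of the video: the variable names are made up, and the z-score is written as sample mean minus null mean, so it comes out negative where the transcript uses the positive distance.

```python
import math
from statistics import NormalDist

null_mean = 1.2      # mean response time without the drug (seconds)
sample_mean = 1.05   # mean response time of the 100 injected rats
sample_sd = 0.5      # sample standard deviation, used to estimate sigma
n = 100              # number of rats in the sample

std_error = sample_sd / math.sqrt(n)        # 0.5 / 10
z = (sample_mean - null_mean) / std_error   # about 3 standard errors below the mean
p_value = 2 * NormalDist().cdf(-abs(z))     # area in both tails

print(f"standard error = {std_error}, z = {z:.2f}, p-value = {p_value:.4f}")
```

The exact normal tail area is about 0.0027, slightly below the 0.003 obtained from the rounded 99.7% in the empirical rule, and far under the 5% threshold mentioned at the end.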