---
title: "Replication crisis"
chunk: 2/15
source: "https://en.wikipedia.org/wiki/Replication_crisis"
category: "reference"
tags: "science, encyclopedia"
date_saved: "2026-05-05T03:17:03.520061+00:00"
instance: "kb-cron"
---

A null hypothesis test is a decision procedure which takes in some data, and outputs either $H_0$ or $H_1$. If it outputs $H_1$, it is usually stated as "there is a statistically significant effect" or "the null hypothesis is rejected".

Often, the statistical test is a (one-sided) threshold test, which is structured as follows:

1. Gather data $D$.
2. Compute a test statistic $t[D]$ for the data.
3. Compare the test statistic against a critical value/threshold $t_{\text{threshold}}$. If $t[D] > t_{\text{threshold}}$, then output $H_1$, else, output $H_0$.

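The three steps above can be sketched in Python. This is only an illustration under stated assumptions, not something given in the source: the test statistic is taken to be a z-statistic for the sample mean with known standard deviation, $H_0$ is taken to be "the mean equals `mu0`", and the names `z_statistic` and `one_sided_test` are hypothetical.

```python
import math
from statistics import NormalDist, fmean


def z_statistic(data, mu0=0.0, sigma=1.0):
    """Test statistic t[D]: standardized distance of the sample mean from mu0.

    Assumption: sigma (the population standard deviation) is known.
    """
    return (fmean(data) - mu0) / (sigma / math.sqrt(len(data)))


def one_sided_test(data, alpha=0.05):
    """Return "H1" if t[D] exceeds the threshold, otherwise "H0"."""
    # Threshold chosen so that a standard-normal statistic exceeds it
    # with probability alpha when H0 is true.
    t_threshold = NormalDist().inv_cdf(1.0 - alpha)
    return "H1" if z_statistic(data) > t_threshold else "H0"
```

With these assumptions, data centred near zero yields `"H0"`, while data centred well above zero yields `"H1"`.
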
A two-sided threshold test is similar, but with two thresholds, such that it outputs $H_1$ if either $t[D] < t_{\text{threshold}}^{-}$ or $t[D] > t_{\text{threshold}}^{+}$.

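A two-sided decision rule might look like the following sketch. It assumes, purely for illustration, that the test statistic is standard normal under $H_0$ and that the significance level is split equally between the two tails; the function name `two_sided_test` is hypothetical.

```python
from statistics import NormalDist


def two_sided_test(t_value, alpha=0.05):
    """Return "H1" if t_value is below the lower or above the upper threshold."""
    # Assumption: standard-normal statistic under H0, alpha/2 in each tail.
    t_minus = NormalDist().inv_cdf(alpha / 2.0)
    t_plus = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return "H1" if (t_value < t_minus or t_value > t_plus) else "H0"
```

Under these assumptions the thresholds at `alpha=0.05` are roughly -1.96 and +1.96, so extreme values in either direction lead to `"H1"`.
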
There are 4 possible outcomes of a null hypothesis test: false negative, true negative, false positive, true positive. A false positive means that $H_0$ is true, but the test outcome is $H_1$; a true negative means that $H_0$ is true, and the test outcome is $H_0$; a false negative means that $H_1$ is true, but the test outcome is $H_0$; a true positive means that $H_1$ is true, and the test outcome is $H_1$.

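The four outcomes can be written as a small lookup, using the standard convention that "positive" refers to an output of $H_1$ and "true" to an output that matches reality. The function name `classify_outcome` and the string labels are hypothetical choices for this sketch.

```python
def classify_outcome(truth, decision):
    """Name the outcome given the true hypothesis and the test's output."""
    if truth not in ("H0", "H1") or decision not in ("H0", "H1"):
        raise ValueError("truth and decision must each be 'H0' or 'H1'")
    # "true"/"false": whether the decision matches the true hypothesis.
    correct = "true" if truth == decision else "false"
    # "positive"/"negative": whether the test output was H1 or H0.
    sign = "positive" if decision == "H1" else "negative"
    return f"{correct} {sign}"
```
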
Significance level, false positive rate, or the alpha level, is the probability of finding the alternative to be true when the null hypothesis is true:

$$(\text{significance}) := \alpha := \Pr(\text{find } H_1 \mid H_0)$$

For example, when the test is a one-sided threshold test, then

$$\alpha = \Pr_{D \sim H_0}(t[D] > t_{\text{threshold}})$$

where $D \sim H_0$ means "the data is sampled from $H_0$".

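The significance level of a one-sided threshold test can be checked by simulation. The sketch below assumes the test statistic is standard normal under $H_0$, so it draws $t[D]$ directly rather than simulating raw data; `estimate_alpha`, the trial count, and the seed are hypothetical choices.

```python
import random
from statistics import NormalDist


def estimate_alpha(t_threshold, n_trials=100_000, seed=0):
    """Monte Carlo estimate of Pr_{D ~ H0}(t[D] > t_threshold)."""
    # Assumption: under H0 the test statistic is standard normal.
    rng = random.Random(seed)
    exceed = sum(rng.gauss(0.0, 1.0) > t_threshold for _ in range(n_trials))
    return exceed / n_trials


# Threshold whose upper-tail probability is 0.05 for a standard normal.
t_threshold = NormalDist().inv_cdf(0.95)
alpha_hat = estimate_alpha(t_threshold)
```

Under these assumptions `alpha_hat` should land close to 0.05, up to simulation noise.
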
Statistical power, true positive rate, is the probability of finding the alternative to be true when the alternative hypothesis is true:

$$(\text{power}) := 1 - \beta := \Pr(\text{find } H_1 \mid H_1)$$

where $\beta$ is also called the false negative rate. For example, when the test is a one-sided threshold test, then

$$1 - \beta = \Pr_{D \sim H_1}(t[D] > t_{\text{threshold}}).$$

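For a concrete alternative, the power of a one-sided threshold test can be computed in closed form. The sketch below assumes a z-test with known unit variance, where under $H_1$ the statistic is normal with mean `effect_size * sqrt(n)`; the function name and parameters are hypothetical.

```python
import math
from statistics import NormalDist


def power_one_sided_z(effect_size, n, alpha=0.05):
    """Pr_{D ~ H1}(t[D] > t_threshold) for a one-sided z-test."""
    # Assumption: under H1 the z-statistic is Normal(effect_size * sqrt(n), 1).
    t_threshold = NormalDist().inv_cdf(1.0 - alpha)
    shift = effect_size * math.sqrt(n)
    return 1.0 - NormalDist(mu=shift, sigma=1.0).cdf(t_threshold)
```

With an effect size of zero the power reduces to the significance level, and for a fixed positive effect size it grows with the sample size `n`.
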
Given a statistical test and a data set $D$, the corresponding p-value is the probability that the test statistic is at least as extreme as the one observed, conditional on $H_0$. For example, for a one-sided threshold test,

$$p[D] = \Pr_{D' \sim H_0}(t[D'] > t[D])$$

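When the null distribution of the statistic is known, the one-sided p-value is just an upper-tail probability. This sketch assumes the statistic is standard normal under $H_0$ (as in a z-test); the function name `p_value_one_sided` is hypothetical.

```python
from statistics import NormalDist


def p_value_one_sided(t_observed):
    """p[D] = Pr_{D' ~ H0}(t[D'] > t[D]) for a standard-normal statistic."""
    # Assumption: t[D'] is standard normal when D' is sampled from H0.
    return 1.0 - NormalDist().cdf(t_observed)
```

Larger observed statistics give smaller p-values; an observed statistic of zero gives a p-value of one half under this assumption.
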
If the null hypothesis is true, then the p-value is distributed uniformly on $[0,1]$. Otherwise, it is typically peaked at $p = 0.0$ and roughly exponential, though the precise shape of the p-value distribution depends on what the alternative hypothesis is.

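The difference between the two p-value distributions can be seen in a small simulation. The sketch assumes a one-sided test whose statistic is Normal(`shift`, 1), with `shift = 0` standing for $H_0$ and `shift = 2` for one arbitrary alternative; all names, the trial count, and the seed are hypothetical.

```python
import random
from statistics import NormalDist


def simulate_p_values(shift, n_trials=20_000, seed=1):
    """Draw one-sided p-values when the statistic is Normal(shift, 1)."""
    # shift = 0 corresponds to H0; shift > 0 is one hypothetical alternative.
    rng = random.Random(seed)
    std_normal = NormalDist()
    return [1.0 - std_normal.cdf(rng.gauss(shift, 1.0)) for _ in range(n_trials)]


def fraction_below(p_values, cutoff):
    """Share of simulated p-values that fall below the cutoff."""
    return sum(p < cutoff for p in p_values) / len(p_values)


null_p = simulate_p_values(shift=0.0)
alt_p = simulate_p_values(shift=2.0)
```

Under the null, about 5% of the simulated p-values fall below 0.05 and about half fall below 0.5, as uniformity implies. Under this particular alternative, the p-values pile up near zero.
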
Because the p-values are distributed uniformly on $[0,1]$ under the null hypothesis, researchers can set any significance level $\alpha$ by computing the p-value and then outputting $H_1$ if $p[D] < \alpha$. This is usually stated as "the null hypothesis is rejected at significance level $\alpha$", or "$H_1 \; (p < \alpha)$", such as "smoking is correlated with cancer (p < 0.001)".

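For a one-sided test, rejecting when $p[D] < \alpha$ gives the same decision as comparing the statistic with the threshold whose upper-tail probability is $\alpha$. The sketch below illustrates this for a standard-normal statistic, which is an assumption of the example; both function names are hypothetical.

```python
from statistics import NormalDist


def reject_by_p_value(t_observed, alpha):
    """Output "H1" when the one-sided p-value is below alpha."""
    p = 1.0 - NormalDist().cdf(t_observed)  # assumes a standard-normal statistic
    return "H1" if p < alpha else "H0"


def reject_by_threshold(t_observed, alpha):
    """Output "H1" when the statistic exceeds the alpha-level threshold."""
    t_threshold = NormalDist().inv_cdf(1.0 - alpha)
    return "H1" if t_observed > t_threshold else "H0"
```

Away from the exact boundary, where floating-point rounding could matter, the two functions return the same answer for any statistic and any $\alpha$.
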
=== History ===
The replication crisis dates to a number of events in the early 2010s. Felipe Romero identified four precursors to the crisis: