kb/data/en.wikipedia.org/wiki/Replication_crisis-0.md

6.8 KiB
Raw Blame History

title chunk source category tags date_saved instance
Replication crisis 1/15 https://en.wikipedia.org/wiki/Replication_crisis reference science, encyclopedia 2026-05-05T03:45:08.741659+00:00 kb-cron

The replication crisis, also known as the reproducibility or replicability crisis, refers to widespread failures to reproduce published scientific results. Because the reproducibility of empirical results is the cornerstone of the scientific method, such failures undermine the credibility of theories and challenge substantial parts of scientific knowledge. Psychology and medicine have been focal points for replication efforts, with researchers systematically reexamining classic studies to verify their reliability and, when failures emerge, to identify the underlying causes. Data strongly indicates that other natural and social sciences are also affected. The phrase "replication crisis" was coined in the early 2010s as part of a growing awareness of the problem. Considerations of causes and remedies have given rise to a new scientific discipline known as metascience, which uses methods of empirical research to examine empirical research practice. Researchers distinguish two forms of reproducibility. Reproducibility in a narrow sense refers to reexamining and validating the analysis of a set of data. The second category, replication, involves repeating an experiment or study with new, independent data to verify the original conclusions.

== Background ==

=== Replication === Replication has been called "the cornerstone of science". Environmental health scientist Stefan Schmidt began a 2009 review with this description of replication:

Replication is one of the central issues in any empirical science. To confirm results or hypotheses by a repetition procedure is at the basis of any scientific conception. A replication experiment to demonstrate that the same findings can be obtained in any other place by any other researcher is conceived as an operationalization of objectivity. It is the proof that the experiment reflects knowledge that can be separated from the specific circumstances (such as time, place, or persons) under which it was gained. But no universal definition of replication or related concepts has been agreed on. Replication types include:

direct (repeating procedures as closely as possible), systematic (repeating with intentional changes), and conceptual (testing hypotheses using different procedures to assess generalizability). Reproducibility can also be distinguished from replication, as referring to reproducing the same results using the same data set. Reproducibility of this type is why many researchers make their data available to others for testing. Replication failures do not indicate that affected fields lack scientific rigor. Rather, they reflect the normal operation of science—a mechanism by which unsupported hypotheses are eliminated, but which often functions slowly and inconsistently. A hypothesis is generally considered supported when the results match the predicted pattern and that pattern is found to be statistically significant. Under null hypothesis assumption, results are deemed statistically significant when their probability falls below a predetermined threshold (the significance level). This generally answers the question of how unlikely such results would be by chance alone if no true effect existed in the statistical population. If the probability associated with the test statistic exceeds the chosen critical value, the results are considered statistically significant. The p-value represents the probability of obtaining results at least as extreme as observed, assuming the null hypothesis is true. The standard threshold p < 0.05 means accepting a 5% false positive rate. Some fields use smaller p-values, such as p < 0.01 (1% chance of a false positive) or p < 0.001 (0.1% chance of a false positive). But a smaller chance of a false positive often requires greater sample sizes or a greater chance of a false negative (a correct hypothesis being erroneously found incorrect). Although p-value testing is the most commonly used method, it is not the only one.

=== Statistics === Certain terms commonly used in discussions of the replication crisis have technically precise meanings, which are presented here. In the most common case, null hypothesis testing, there are two hypotheses, a null hypothesis

      H
      
        0
      
    
  

{\displaystyle H_{0}}

and an alternative hypothesis

      H
      
        1
      
    
  

{\displaystyle H_{1}}

. The null hypothesis is typically of the form "X and Y are statistically independent". For example, the null hypothesis might be "taking drug X does not change 1-year recovery rate from disease Y", and the alternative hypothesis is that it does change. As testing for full statistical independence is difficult, the full null hypothesis is often reduced to a simplified null hypothesis "the effect size is 0", where "effect size" is a real number that is 0 if the full null hypothesis is true, and the larger the effect size is, the more the null hypothesis is false. For example, if X is binary, then the effect size might be defined as the change in the expectation of Y upon a change of X:

    (
    
      effect size
    
    )
    =
    
      E
    
    [
    Y
    
      |
    
    X
    =
    1
    ]
    
    
      E
    
    [
    Y
    
      |
    
    X
    =
    0
    ]
  

{\displaystyle ({\text{effect size}})=\mathbb {E} [Y|X=1]-\mathbb {E} [Y|X=0]}

Note that the effect size as defined above might be zero even if X and Y are not independent, such as when their relationship is non-linear (such as

    Y
    
    
      
        N
      
    
    (
    0
    ,
    1
    +
    X
    )
  

{\displaystyle Y\sim {\mathcal {N}}(0,1+X)}

) or when one variable affects different subgroups oppositely. Since different definitions of "effect size" capture different ways for X and Y to be dependent, there are many definitions of effect size. In practice, effect sizes cannot be directly observed, but must be measured by statistical estimators. For example, the above definition of effect size is often measured by Cohen's d estimator. The same effect size might have multiple estimators, as they have tradeoffs between efficiency, bias, variance, etc. This further increases the number of possible statistical quantities that can be computed on a single dataset. When an estimator for an effect size is used for statistical testing, it is called a test statistic.