kb/data/en.wikipedia.org/wiki/Replication_crisis-7.md


title chunk source category tags date_saved instance
Replication crisis 8/15 https://en.wikipedia.org/wiki/Replication_crisis reference science, encyclopedia 2026-05-05T03:45:08.741659+00:00 kb-cron

=== Statistical issues ===

==== Low statistical power ====

Low statistical power hinders replication for three reasons: (1) low-power replications have reduced ability to detect true effects, (2) low-power original studies produce biased effect size estimates, leading to undersized replications, and (3) low-power original studies yield results unlikely to reflect true effects. Mathematically, the probability of replicating a previous publication that rejected a null hypothesis $H_0$ in favor of an alternative $H_1$ is

$$(\text{significance})\,\Pr(H_0 \mid \text{publication}) + (\text{power})\,\Pr(H_1 \mid \text{publication}) \leq (\text{power})$$

assuming significance is less than power. Thus, low power implies a low probability of replication, regardless of how the previous publication was designed and regardless of which hypothesis is really true.

Stanley and colleagues estimated the average statistical power of the psychological literature by analyzing data from 200 meta-analyses. They found that, on average, psychology studies have between 33.1% and 36.4% statistical power. These values are quite low compared to the 80% considered adequate statistical power for an experiment. Across the 200 meta-analyses, the median proportion of studies with adequate statistical power was between 7.7% and 9.1%, implying that a positive result would replicate with probability less than 10%, regardless of whether the positive result was a true positive or a false positive.

The statistical power of neuroscience studies is also quite low. The estimated statistical power of fMRI research is between .08 and .31, and that of studies of event-related potentials was estimated as .72 to .98 for large effect sizes, .35 to .73 for medium effects, and .10 to .18 for small effects. In a study published in Nature, psychologist Katherine Button and colleagues conducted a similar analysis of 49 meta-analyses in neuroscience, estimating a median statistical power of 21%. Meta-scientist John Ioannidis and colleagues computed an estimate of average power for empirical economic research, finding a median power of 18% based on literature drawing upon 6,700 studies. In light of these results, it is plausible that a major reason for widespread failures to replicate in several scientific fields is very low statistical power on average.

The same statistical test with the same significance level will have lower statistical power if the effect size is small under the alternative hypothesis. Complex inheritable traits are typically correlated with a large number of genes, each of small effect size, so high power requires a large sample size.
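The dependence of power on effect size and sample size can be made concrete with a small calculation. The following is a minimal sketch, not a method taken from the studies cited above: it assumes a two-sided, two-sample z-test with equal group sizes and unit variance, and the function name `two_sample_power` and the chosen effect sizes are illustrative.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def two_sample_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample z-test.

    Assumptions (for illustration only): equal group sizes, unit variance in
    both groups, and effect_size given as a standardized mean difference.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative hypothesis.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability that |Z| exceeds the critical value when Z ~ N(shift, 1).
    return (1 - _STD_NORMAL.cdf(z_crit - shift)) + _STD_NORMAL.cdf(-z_crit - shift)


if __name__ == "__main__":
    for d in (0.8, 0.5, 0.2):
        for n in (20, 100, 1000):
            print(f"d={d:.1f}  n/group={n:5d}  power={two_sample_power(d, n):.3f}")
```

Under these assumptions, power at a fixed significance level falls as the effect size shrinks and recovers only as the sample size grows, which is the pattern described in the text.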
In particular, many results from the candidate gene literature suffered from small effect sizes and small sample sizes and would not replicate. More data from genome-wide association studies (GWAS) come close to solving this problem. As a numeric example, most genes associated with schizophrenia risk have low effect size (genotypic relative risk, GRR). A statistical study with 1000 cases and 1000 controls has 0.03% power for a gene with GRR = 1.15, which is already large for schizophrenia. In contrast, the largest GWAS to date has ~100% power for it.
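The replication-probability expression given at the start of this subsection can also be evaluated numerically. The sketch below assumes that the replication uses the stated significance level and power, and that $H_0$ and $H_1$ are the only possibilities, so that the two conditional probabilities sum to 1. The function name and the example numbers are illustrative and do not come from the cited studies.

```python
def replication_probability(significance: float, power: float,
                            prob_h1_given_publication: float) -> float:
    """Probability that a replication again rejects H0.

    Assumption: H0 and H1 are exhaustive, so
    Pr(H0 | publication) = 1 - Pr(H1 | publication).
    """
    if not 0.0 <= prob_h1_given_publication <= 1.0:
        raise ValueError("prob_h1_given_publication must lie in [0, 1]")
    prob_h0_given_publication = 1.0 - prob_h1_given_publication
    # A replication rejects H0 with probability `significance` if H0 is true
    # and with probability `power` if H1 is true.
    return (significance * prob_h0_given_publication
            + power * prob_h1_given_publication)


if __name__ == "__main__":
    # Hypothetical inputs: 5% significance level, 35% power.
    for prob_h1 in (0.25, 0.5, 1.0):
        p = replication_probability(0.05, 0.35, prob_h1)
        print(f"Pr(H1 | publication)={prob_h1:.2f}  replication probability={p:.4f}")
```

Because the result is a weighted average of the significance level and the power, it can never exceed the power when significance is less than power, whatever the value of the conditional probability.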

==== Positive effect size bias ====

Even when the study replicates, the replication typically has a smaller effect size. Underpowered studies have a large effect size bias.

In studies that statistically estimate a regression factor, such as the $k$ in $Y = kX + b$, when the dataset is large, noise tends to cause the regression factor to be underestimated, but when the dataset is small, noise tends to cause the regression factor to be overestimated.
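One way to see why underpowered studies tend to report inflated effects is a small simulation. This sketch rests on an assumption that is not spelled out in the text above: only estimates that reach significance in the expected direction are kept, as if they were the ones that get published. The function name, the true effect of 0.2, and the sample sizes are all hypothetical.

```python
import random
from statistics import NormalDist, mean


def simulate_significant_estimates(true_effect: float, n_per_group: int,
                                   n_studies: int = 20000, alpha: float = 0.05,
                                   seed: int = 0) -> list[float]:
    """Return the effect estimates of simulated studies that reach significance.

    Assumptions (illustrative): two groups with unit variance, so the estimated
    mean difference is normal with standard error sqrt(2 / n_per_group).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    std_error = (2 / n_per_group) ** 0.5
    significant = []
    for _ in range(n_studies):
        # Draw the estimated effect directly from its sampling distribution.
        estimate = rng.gauss(true_effect, std_error)
        # Keep only results that are significant in the expected direction.
        if estimate / std_error > z_crit:
            significant.append(estimate)
    return significant


if __name__ == "__main__":
    true_effect = 0.2
    for n in (20, 200, 2000):
        kept = simulate_significant_estimates(true_effect, n)
        print(f"n/group={n:5d}  significant={len(kept):6d}  "
              f"mean significant estimate={mean(kept):.3f}  (true effect {true_effect})")
```

In this setup, a small study can only reach significance when its estimate lands far above the true effect, so the surviving estimates are biased upward. With large samples nearly every study is significant and the average stays close to the true value.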

==== Problems of meta-analysis ====

Meta-analyses have their own methodological problems and disputes, which leads researchers whose theories are challenged by a meta-analysis to reject the meta-analytic method.

Rosenthal proposed the "fail-safe number" (FSN) to address the publication bias against null results. It is defined as follows: suppose the null hypothesis is true; how many unpublished null results would be required to make the current result indistinguishable from the null hypothesis? Rosenthal's point is that certain effect sizes are large enough that, even if there is a total publication bias against null results (the "file drawer problem"), the number of unpublished null results needed to swamp out the effect size would be impossibly large. Thus, the effect size must be statistically significant even after accounting for unpublished null results.

One objection to the FSN is that it is calculated as if unpublished results are unbiased samples from the null hypothesis. But if the file drawer problem is real, then unpublished results would have effect sizes concentrated around 0. Thus fewer unpublished null results would be necessary to swamp out the effect size, and so the FSN is an overestimate.

Another problem with meta-analysis is that bad studies are "infectious" in the sense that one bad study might cause the entire meta-analysis to overestimate statistical significance.
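The fail-safe number described above can be sketched in a few lines. This example uses the commonly cited form of Rosenthal's calculation, which combines the published z-scores with Stouffer's method and assumes that the unpublished studies average a z-score of zero. The function name and the z-scores are hypothetical, and this is an illustration, not a reference implementation.

```python
from statistics import NormalDist


def fail_safe_number(z_scores: list[float], alpha: float = 0.05) -> float:
    """Number of unpublished null studies needed to erase significance.

    Solves sum(Z) / sqrt(k + X) = z_alpha for X, where k is the number of
    published studies. Assumes unpublished studies have mean z-score 0 and a
    one-tailed test at level alpha.
    """
    k = len(z_scores)
    if k == 0:
        raise ValueError("at least one published z-score is required")
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return sum(z_scores) ** 2 / z_alpha ** 2 - k


if __name__ == "__main__":
    # Hypothetical z-scores from five published studies.
    published = [2.0, 2.5, 1.8, 2.2, 3.0]
    print(f"fail-safe number: {fail_safe_number(published):.1f}")
```

The assumption that unpublished studies average a z-score of exactly zero is the step that the objection in the text targets.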

==== P-hacking ====