kb/data/en.wikipedia.org/wiki/Replication_crisis-8.md

title chunk source category tags date_saved instance
Replication crisis 9/15 https://en.wikipedia.org/wiki/Replication_crisis reference science, encyclopedia 2026-05-05T03:17:03.520061+00:00 kb-cron

Various statistical methods can be applied to make the p-value appear smaller than it really is. This need not be malicious: moderately flexible data analysis, routine in research, can increase the false-positive rate to above 60%. For example, if one collects some data, applies several different significance tests to it, and publishes only the one that happens to have a p-value less than 0.05, then the true probability that "at least one significance test reaches p < 0.05" can be much larger than 0.05. Even if the null hypothesis were true, the probability that at least one out of many significance tests gives an extreme result is not itself small.

Typically, a statistical study has multiple steps, with several choices at each step, such as data collection, outlier rejection, choice of test statistic, and choice of one-tailed or two-tailed test. These choices in the "garden of forking paths" multiply, creating many "researcher degrees of freedom". The effect is similar to the file-drawer problem, as the paths not taken are not published.

Consider a simple illustration. Suppose the null hypothesis is true, and we have 20 possible significance tests to apply to the dataset. Also suppose the outcomes of the significance tests are independent. By definition of "significance", each test has probability 0.05 of passing at significance level 0.05. The probability that at least 1 out of 20 is significant is, by the assumption of independence,

$$1-(1-0.05)^{20}\approx 0.64$$
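The arithmetic in this illustration can be checked numerically. The sketch below is only an illustration under the same assumptions as the text (a true null hypothesis and independent tests); the function names are hypothetical, and the simulation relies on the standard fact that a continuous test's p-value is uniformly distributed under the null.

```python
import random


def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Closed form: P(at least one of n_tests independent tests has p < alpha)."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulated_familywise_error_rate(
    alpha: float, n_tests: int, n_studies: int = 20_000, seed: int = 0
) -> float:
    """Monte Carlo estimate of the same probability when the null is true."""
    rng = random.Random(seed)
    studies_with_a_hit = 0
    for _ in range(n_studies):
        # Under a true null, each p-value is modelled as uniform on [0, 1].
        p_values = [rng.random() for _ in range(n_tests)]
        # The study "finds something" if its smallest p-value is below alpha.
        if min(p_values) < alpha:
            studies_with_a_hit += 1
    return studies_with_a_hit / n_studies


if __name__ == "__main__":
    print(f"closed form: {familywise_error_rate(0.05, 20):.3f}")
    print(f"simulated:   {simulated_familywise_error_rate(0.05, 20):.3f}")
```

Both numbers should land near 0.64, far above the nominal 0.05 of any single test.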

Another possibility is the multiple comparisons problem. In 2009, it was twice noted that fMRI studies had a suspicious number of positive results with large effect sizes, more than would be expected given the studies' low power (one example had only 13 subjects). It was pointed out that over half of the studies tested for correlation between a phenomenon and individual fMRI voxels, and reported only on voxels exceeding chosen thresholds.
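The selection effect described here can be reproduced with pure noise. The following sketch is a toy model, not a reconstruction of any actual fMRI analysis: the 13 subjects come from the example in the text, while the voxel count, the threshold of 0.7, and all function names are assumptions chosen for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def noise_voxel_correlations(n_subjects=13, n_voxels=5000, seed=0):
    """Correlate a random behavioural score with voxels that are pure noise."""
    rng = random.Random(seed)
    behaviour = [rng.gauss(0, 1) for _ in range(n_subjects)]
    correlations = []
    for _ in range(n_voxels):
        voxel = [rng.gauss(0, 1) for _ in range(n_subjects)]
        correlations.append(pearson_r(behaviour, voxel))
    return correlations


def select_above(correlations, threshold=0.7):
    """Keep only the voxels whose |r| exceeds the (hypothetical) threshold."""
    return [r for r in correlations if abs(r) > threshold]


if __name__ == "__main__":
    all_r = noise_voxel_correlations()
    reported = select_above(all_r)
    mean_all = sum(abs(r) for r in all_r) / len(all_r)
    mean_reported = sum(abs(r) for r in reported) / len(reported)
    print(f"voxels reported: {len(reported)} of {len(all_r)}")
    print(f"mean |r| over all voxels:      {mean_all:.2f}")
    print(f"mean |r| over reported voxels: {mean_reported:.2f}")
```

Although no voxel has any real relationship with the behavioural score, the reported subset shows large correlations by construction, because the threshold itself selects for them.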

Optional stopping is a practice where one collects data until some stopping criterion is reached. Though a valid procedure, it is easily misused. The problem is that the p-value of an optionally stopped statistical test is larger than it seems. Intuitively, this is because the p-value is supposed to be the total probability of all events at least as rare as what is observed. With optional stopping, there are even rarer events that are difficult to account for, namely not triggering the optional stopping rule and collecting even more data before stopping. Neglecting these events leads to a p-value that is too low. In fact, if the null hypothesis is true, any significance level can be reached if one is allowed to keep collecting data and stop when the desired p-value (calculated as if one had always planned to collect exactly this much data) is obtained. For a concrete example of testing for a fair coin, see the optional stopping section of the p-value article.

More succinctly, the proper calculation of a p-value requires accounting for counterfactuals, that is, what the experimenter could have done in reaction to data that might have been. Accounting for what might have been is hard even for honest researchers. One benefit of preregistration is that it accounts for all counterfactuals, allowing the p-value to be calculated correctly.

The problem of early stopping is not limited to researcher misconduct. There is often pressure to stop early if the cost of collecting data is high. Some animal ethics boards even mandate early stopping if the study obtains a significant result midway.

Such practices are widespread in psychology. In a 2012 survey, 56% of psychologists admitted to early stopping, 46% to only reporting analyses that "worked", and 38% to post hoc exclusion, that is, removing some data after analysis had already been performed and then reanalyzing the remaining data (often on the premise of "outlier removal").
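The inflation caused by optional stopping can be seen in a small simulation of the fair-coin setting. This is a minimal sketch under stated assumptions, not the example from the p-value article: the coin is truly fair, the p-value uses a normal approximation to the binomial, and the parameters (at most 200 flips, peeking from the 10th flip onward) and function names are hypothetical.

```python
import math
import random


def two_sided_p_value(heads: int, n: int) -> float:
    """Normal-approximation p-value for H0: the coin is fair."""
    z = (heads - n / 2) / math.sqrt(n / 4)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), the two-sided tail area.
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(
    peek: bool,
    n_max: int = 200,
    n_min: int = 10,
    alpha: float = 0.05,
    n_experiments: int = 2000,
    seed: int = 0,
) -> float:
    """Fraction of fair-coin experiments that end up declaring p < alpha.

    peek=False: test once, after exactly n_max flips.
    peek=True:  test after every flip from n_min on, stop at the first p < alpha.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        heads = 0
        for n in range(1, n_max + 1):
            if rng.random() < 0.5:
                heads += 1
            may_test = (n >= n_min) if peek else (n == n_max)
            # The p-value is computed as if n had been fixed in advance.
            if may_test and two_sided_p_value(heads, n) < alpha:
                rejections += 1
                break
    return rejections / n_experiments


if __name__ == "__main__":
    print(f"fixed sample size: {false_positive_rate(peek=False):.3f}")
    print(f"optional stopping: {false_positive_rate(peek=True):.3f}")
```

With a fixed sample size the rejection rate stays close to the nominal 0.05, while stopping at the first "significant" peek rejects a true null several times as often. Raising `n_max` pushes the second figure higher still, in line with the statement that any significance level can eventually be reached.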