| title | chunk | source | category | tags | date_saved | instance |
|---|---|---|---|---|---|---|
| Replication crisis | 13/15 | https://en.wikipedia.org/wiki/Replication_crisis | reference | science, encyclopedia | 2026-05-05T03:45:08.741659+00:00 | kb-cron |
In response to concerns in psychology about publication bias and data dredging, more than 140 psychology journals have adopted result-blind peer review. In this approach, studies are accepted before they are conducted, on the basis of the methodological rigor of their experimental designs and the theoretical justification for their statistical analysis techniques, rather than after completion on the basis of their findings. Early analysis of this procedure has estimated that 61% of result-blind studies have led to null results, in contrast to an estimated 5% to 20% in earlier research. In addition, large-scale collaborations between researchers working in multiple labs in different countries, which regularly make their data openly available for other researchers to assess, have become much more common in psychology.
==== Pre-registration of studies ====
Scientific publishing has begun using pre-registration reports to address the replication crisis. The registered report format requires authors to submit a description of the study methods and analyses prior to data collection. Once the method and analysis plan has been vetted through peer review, publication of the findings is provisionally guaranteed, provided the authors follow the proposed protocol. One goal of registered reports is to circumvent the publication bias toward significant findings that can lead to the use of questionable research practices. Another is to encourage publication of studies with rigorous methods. The journal Psychological Science has encouraged the preregistration of studies and the reporting of effect sizes and confidence intervals. The editor in chief also noted that, before allowing manuscripts to be published, the editorial staff will ask for replication of studies that report surprising findings based on small sample sizes.
==== Metadata and digital tools for tracking replications ====
It has been suggested that "a simple way to check how often studies have been repeated, and whether or not the original findings are confirmed" is needed. Categorizations and ratings of reproducibility at the study or results level, as well as addition of links to and rating of third-party confirmations, could be conducted by the peer-reviewers, the scientific journal, or by readers in combination with novel digital platforms or tools.
=== Statistical reform ===
==== Requiring smaller p-values ====
Many publications require a p-value of p < 0.05 to claim statistical significance. The paper "Redefine statistical significance", signed by a large number of scientists and mathematicians, proposes that in "fields where the threshold for defining statistical significance for new discoveries is p < 0.05, we propose a change to p < 0.005. This simple step would immediately improve the reproducibility of scientific research in many fields." Their rationale is that "a leading cause of non-reproducibility (is that the) statistical standards of evidence for claiming new discoveries in many fields of science are simply too low. Associating 'statistically significant' findings with p < 0.05 results in a high rate of false positives even in the absence of other experimental, procedural and reporting problems."
This call was subsequently criticised by another large group, who argued that "redefining" the threshold would not fix current problems, would lead to some new ones, and that in the end, all thresholds needed to be justified case-by-case instead of following general conventions.
A 2022 follow-up study examined the practical impact of these competing recommendations. Despite high citation rates of both proposals, researchers found limited implementation of either the p < 0.005 threshold or the case-by-case justification approach in practice. This revealed what the authors called a "vicious cycle", in which scientists reject recommendations because they are not standard practice, while the recommendations fail to become standard practice because few scientists adopt them.
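The effect of a stricter threshold on the false-positive rate can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from either paper. It assumes a hypothetical field in which 10% of tested hypotheses are real effects and studies have 80% power, and it applies the standard expression for the share of significant results that are false positives. The function name and all parameter values are illustrative assumptions.

```python
def false_positive_share(alpha: float, power: float, prior: float) -> float:
    """Share of 'significant' results that are false positives.

    alpha: significance threshold (chance a true null is declared significant)
    power: chance a real effect is declared significant
    prior: assumed fraction of tested hypotheses that are real effects
    """
    false_positives = alpha * (1.0 - prior)  # true nulls that pass the threshold
    true_positives = power * prior           # real effects that pass the threshold
    return false_positives / (false_positives + true_positives)


if __name__ == "__main__":
    # Assumed values: 80% power and 10% of hypotheses true (not from the source).
    for alpha in (0.05, 0.005):
        share = false_positive_share(alpha, power=0.8, prior=0.1)
        print(f"threshold p < {alpha}: {share:.1%} of significant results are false positives")
```

Under these assumed values the share falls from 36% at p < 0.05 to roughly 5% at p < 0.005. Holding power at 80% under the stricter threshold implicitly assumes larger samples; with unchanged samples, power would drop and the improvement would be smaller.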
==== Addressing misinterpretation of p-values ====
Although statisticians are unanimous that the use of "p < 0.05" as a standard for significance provides weaker evidence than is generally appreciated, there is a lack of unanimity about what should be done about it. Some have advocated that Bayesian methods should replace p-values. This has not happened on a wide scale, partly because it is complicated and partly because many users distrust the specification of prior distributions in the absence of hard data. A simplified version of the Bayesian argument, based on testing a point null hypothesis, was suggested by pharmacologist David Colquhoun. The logical problems of inductive inference were discussed in "The Problem with p-values" (2016).
The hazards of reliance on p-values arise partly because even an observation of p = 0.001 is not necessarily strong evidence against the null hypothesis. Although the likelihood ratio in favor of the alternative hypothesis over the null is then close to 100, if the hypothesis was implausible, with a prior probability of a real effect of 0.1, even an observation of p = 0.001 would carry a false positive risk of 8 percent. It would still fail to reach the 5 percent level.
It was recommended that the terms "significant" and "non-significant" should not be used. p-values and confidence intervals should still be specified, but they should be accompanied by an indication of the false-positive risk. It was suggested that the best way to do this is to calculate the prior probability that one would need to believe in order to achieve a false positive risk of a certain level, such as 5%. The calculations can be done with various computer software. This reverse Bayesian approach, which physicist Robert Matthews suggested in 2001, is one way to avoid the problem that the prior probability is rarely known.
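The 8 percent figure and the reverse Bayesian calculation can be reproduced with Bayes' theorem in odds form. The following is a minimal sketch, not the software referred to above: it takes the likelihood ratio of about 100 and the prior of 0.1 from the text as given, and the function names are hypothetical.

```python
def false_positive_risk(likelihood_ratio: float, prior: float) -> float:
    """Probability that the null is true given the observation.

    likelihood_ratio: evidence for the alternative over the null
    prior: prior probability that there is a real effect
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = likelihood_ratio * prior_odds  # odds of a real effect
    return 1.0 / (1.0 + posterior_odds)


def prior_needed(likelihood_ratio: float, target_risk: float) -> float:
    """Reverse Bayesian step: prior probability of a real effect that one
    would have to believe to reach the target false positive risk."""
    required_posterior_odds = (1.0 - target_risk) / target_risk
    prior_odds = required_posterior_odds / likelihood_ratio
    return prior_odds / (1.0 + prior_odds)


if __name__ == "__main__":
    # Values from the text: likelihood ratio close to 100, prior of 0.1.
    print(f"false positive risk: {false_positive_risk(100, 0.1):.3f}")
    print(f"prior needed for 5% risk: {prior_needed(100, 0.05):.3f}")
```

With a likelihood ratio of exactly 100, the first call gives 9/109, about 0.083, matching the 8 percent in the text. The second shows that a prior probability of about 0.16 would be needed to bring the false positive risk down to 5%.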
==== Encouraging larger sample sizes ====
To improve the quality of replications, larger sample sizes than those used in the original study are often needed. Larger sample sizes are needed because estimates of effect sizes in published work are often exaggerated due to publication bias and large sampling variability associated with small sample sizes in an original study. Further, using significance thresholds usually leads to inflated effects, because particularly with small sample sizes, only the largest effects will become significant.
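The inflation caused by filtering on significance can be seen in a small simulation. The sketch below makes several simplifying assumptions that are not in the source: two groups with a known standard deviation of 1, a two-sided z-test, and a true effect of 0.2 standard deviations. It keeps only the studies that reach significance and averages their estimated effects. All names and parameter values are illustrative.

```python
import random
from statistics import NormalDist


def significant_estimates(true_effect, n_per_group, n_studies=5000,
                          alpha=0.05, seed=0):
    """Simulate two-group studies and return the effect estimates of
    those that reach significance in a two-sided z-test (sd assumed 1)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    std_error = (2.0 / n_per_group) ** 0.5  # SE of a difference in means
    kept = []
    for _ in range(n_studies):
        control = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
        treated = sum(rng.gauss(true_effect, 1.0) for _ in range(n_per_group)) / n_per_group
        estimate = treated - control
        if abs(estimate) / std_error > z_crit:
            kept.append(estimate)
    return kept


if __name__ == "__main__":
    true_effect = 0.2  # assumed true effect in standard deviation units
    for n in (20, 200):
        kept = significant_estimates(true_effect, n)
        average = sum(kept) / len(kept)
        print(f"n={n} per group: {len(kept)} significant studies, "
              f"mean estimated effect {average:.2f} (true {true_effect})")
```

In this setup a study with 20 participants per group can only reach significance if its estimate is several times the true effect, so the significant studies overstate it heavily. With 200 per group the overstatement is much smaller, which is why a replication planned around a published estimate from a small study may need a considerably larger sample.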