| title | chunk | source | category | tags | date_saved | instance |
| --- | --- | --- | --- | --- | --- | --- |
| Replication crisis | 7/15 | https://en.wikipedia.org/wiki/Replication_crisis | reference | science, encyclopedia | 2026-05-05T03:45:08.741659+00:00 | kb-cron |

==== In medicine ====

Irreproducible medical studies commonly share these characteristics: investigators who were not blinded to the experimental versus the control arms; failure to repeat experiments; lack of positive and negative controls; failure to report all the data; inappropriate use of statistical tests; and use of reagents that were not appropriately validated.

==== In AI research ====

In machine learning research, a range of questionable practices has emerged under intense pressure to achieve state-of-the-art benchmark results. Common questionable evaluation practices include "benchmark overfitting" by repeatedly tuning hyperparameters on held-out test sets, selectively reporting the best of multiple random seeds or experimental runs, and "metric hacking" through unreported post hoc decisions, such as the choice of tokenization or evaluation scripts, that inflate scores. Researchers also frequently cherry-pick tasks or datasets where their proposed model outperforms baselines, omit negative or lower-performing results, and obscure compute budgets to present energy-intensive methods as efficient. Such practices can give a misleading impression of model progress and hinder fair comparisons across methods.

Moreover, many machine learning studies suffer from irreproducible research practices that impede replication and auditing. Key issues include incomplete disclosure of training details (such as dataset preprocessing, exact model hyperparameters, random seeds, and hardware configurations) and failure to release the code or data used in experiments. Researchers often omit the versions of dependencies or rely on proprietary datasets, preventing others from reproducing results. Even for public benchmarks, subtle "data leakage" through inadvertent inclusion of test data in training splits remains widespread. Together, these questionable practices undermine the validity and credibility of reported findings in machine learning and pose substantial barriers to cumulative scientific progress.
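Two of the practices named above, reporting only the best random seed and train/test data leakage, can be illustrated with a minimal Python sketch. It is not drawn from any cited study: the function names (`summarize_seeds`, `find_leaked_examples`), the simulated scores, and the toy sentences are all hypothetical, and the leakage check only catches exact duplicates after case and whitespace normalization, so it would miss paraphrased or partially overlapping examples.

```python
import hashlib
import random
import statistics


def summarize_seeds(scores):
    """Summarize all runs (mean and sample std) instead of only the best one."""
    return {
        "n_runs": len(scores),
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),
        "best": max(scores),  # shown for contrast, not as the headline number
    }


def _fingerprint(text):
    # Normalize case and whitespace so trivially different copies still match.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_leaked_examples(train_texts, test_texts):
    """Return test examples whose normalized text also appears in training."""
    train_hashes = {_fingerprint(t) for t in train_texts}
    return [t for t in test_texts if _fingerprint(t) in train_hashes]


# Hypothetical accuracy scores from 10 seeds of the same model (simulated).
rng = random.Random(0)
scores = [0.80 + rng.gauss(0.0, 0.01) for _ in range(10)]
summary = summarize_seeds(scores)
print(f"mean={summary['mean']:.3f} std={summary['std']:.3f} "
      f"best={summary['best']:.3f} over {summary['n_runs']} runs")

# Hypothetical toy splits: the first test sentence duplicates a training one.
train = ["The cat sat.", "A dog barked"]
test = ["the  cat sat.", "Birds sing"]
leaked = find_leaked_examples(train, test)
print("leaked test examples:", leaked)
```

Reporting the mean and spread over every seed makes the gap between a typical run and the best run visible, which is the information that best-of-seeds reporting hides.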

In October 2024, Communications of the ACM published a peer-reviewed critique of a 2021 Nature paper by researchers from Google. Citing Leech et al., the critique described "a smorgasbord of questionable practices in ML, including irreproducible research practices, multiple variants of cherry-picking, misreporting, and likely data contamination (leakage)". The criticized paper claimed progress in fast chip design using reinforcement learning, but supported the claim with experiments on proprietary training and test datasets whose statistical details were undisclosed, preventing independent reproduction outside the company. Specific execution times for both the proposed approach and prior techniques on individual test cases were not disclosed in the paper, but later studies found that Google's method ran orders of magnitude more slowly than the Cadence Design Systems tools available at the same time.

The critique cross-checked such studies and highlighted the following questionable practices in the Nature paper: (i) selective reporting of results on only a subset of benchmarks, (ii) unfavorable comparisons to weaker baseline methods, (iii) use of inconsistent evaluation metrics, and (iv) undisclosed use of commercial software data that was only admitted years after the original work and multiple rounds of criticism. Additionally, the research may have suffered from data leakage: the separation of training and testing data could not be verified from the published data, which is considered a significant flaw in experimental design.

After the method was rebranded as AlphaChip in 2024, multiple researchers voiced skepticism about the Google approach, as well as about the data and source code released. The questionable research practices and independent researchers' inability to reproduce AlphaChip's performance underscored the need for transparent, reproducible benchmarks and rigorous evaluation standards to ensure genuine progress in the field of electronic design automation.

==== Prevalence ====

According to IU professor Ernest O'Boyle and psychologist Martin Götz, around 50% of researchers surveyed across various studies admitted engaging in HARKing. In a survey of 2,000 psychologists by behavioral scientist Leslie K. John and colleagues, around 94% of psychologists admitted having employed at least one questionable research practice. More specifically, 63% admitted failing to report all of a study's dependent measures, 28% admitted failing to report all of a study's conditions, and 46% admitted selectively reporting studies that produced the desired pattern of results. In addition, 56% admitted having collected more data after having inspected already collected data, and 16% admitted having stopped data collection because the desired result was already visible. According to biotechnology researcher J. Leslie Glick's estimate in 1992, 10% to 20% of research and development studies involved either questionable research practices or outright fraud. The methodology used to estimate the prevalence of questionable research practices has been contested, and more recent studies have suggested lower prevalence rates on average.
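Why inspecting data before deciding whether to keep collecting counts as a questionable practice can be shown with a small simulation. The Python sketch below is an illustration under stated assumptions, not a method from the surveys above: the function names, the batch size of 20, the five looks, and the two-sided z-test with known unit variance are all hypothetical choices. Every simulated dataset is drawn under a true null effect, so any "significant" result is a false positive.

```python
import random
from statistics import NormalDist


def z_test_p_value(sample):
    """Two-sided p-value for H0: mean = 0, assuming known unit variance."""
    n = len(sample)
    z = (sum(sample) / n) * n ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def false_positive_rates(n_sims=2000, batch=20, max_batches=5,
                         alpha=0.05, seed=0):
    """Compare a fixed-sample test with testing after every batch."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_sims):
        # Null data: the true effect is exactly zero.
        data = [rng.gauss(0.0, 1.0) for _ in range(batch * max_batches)]
        # Fixed design: a single test on the full, pre-planned sample.
        if z_test_p_value(data) < alpha:
            fixed_hits += 1
        # Optional stopping: test after each batch, stop at the first p < alpha.
        if any(z_test_p_value(data[: batch * k]) < alpha
               for k in range(1, max_batches + 1)):
            peeking_hits += 1
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = false_positive_rates()
print(f"fixed-sample false positive rate:      {fixed_rate:.3f}")
print(f"optional-stopping false positive rate: {peeking_rate:.3f}")
```

Because the final look of the peeking procedure is the same test as the fixed design, the peeking rate can never be lower than the fixed rate in this setup; every additional look adds another chance to cross the threshold by luck.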

=== Fraud ===

Questionable research practices are considered a separate category from more explicit violations of scientific integrity, such as data falsification. A 2009 meta-analysis found that 2% of scientists across fields admitted falsifying studies at least once and 14% admitted knowing someone who did. Such misconduct was, according to one study, reported more frequently by medical researchers than by others. Prominent examples (see also List of scientific misconduct incidents) include scientific fraud by social psychologist Diederik Stapel, cognitive psychologist Marc Hauser, and social psychologist Lawrence Sanna. In 2018, some researchers viewed scientific fraud as uncommon. A 2025 Northwestern University study found that "the publication of fraudulent science is outpacing the growth rate of legitimate scientific publications". The study also discovered broad networks of organized scientific fraudsters.

In March 2024, Harvard Business School's investigative committee, after reviewing a nearly 1,300-page report unsealed during Professor Francesca Gino's $25 million lawsuit against Harvard and the Data Colada bloggers, found that Gino "committed research misconduct intentionally, knowingly, or recklessly" by falsifying data in four published studies. The report documented that Gino had altered participant responses to make the data conform to her hypotheses, including changing 104 moral-impurity ratings (flipping low values to high in one experimental condition and vice versa in another) and manipulating four networking-intentions items, for a total of 168 modified observations. She also engaged in selective reporting by publishing only the "Posted" dataset while omitting the original Qualtrics archives, and misrepresented the provenance of her data by attributing discrepancies to alleged third-party tampering rather than to deliberate changes by her research team. In June 2023, Gino was placed on unpaid administrative leave, and in May 2025 Harvard revoked her tenure, the first such action in roughly 80 years, citing egregious violations of academic integrity.