Widespread use of invalid statistical tests in… — Plain-Language Explanation


Published 2026-05-22


Preprint hosted on bioRxiv.


Original authors: Zeng, T., Li, H., Zhang, S., Tan, Y. Q., Tian, F., Orban, C., An, L., Che, W., Cheng, J., Chong, J. S. X., Dehestani, N., Dong, Z., Li, X., Li, Z., Lim, M. J. R., Lin, Y., Ling, Q., Ling, Z., Low, X. Z., Mansour L., S., Ng, K. K., Nguyen, T. T., Ooi, L. Q. R., Pande, S., Qian, X., Ruan, J., Wang, Z., Xie, Y., Zhang, C., Zhang, Y., Patil, K., Parkes, L., Dhamala, E., Chopra, S., Zalesky, A., Holmes, A., Eickhoff, S., Zhou, J. H., Renaud, O., Dosenbach, N., Kording, K. P., Bzdok, D., Nichols, T., Yeo, B. T. T.

Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a judge trying to decide which of two new recipes makes the better cake. To be fair, you don't just bake one cake with each recipe and taste them once. Instead, you bake ten cakes with Recipe A and ten with Recipe B, then ask ten different friends to taste them.

The Problem: The "Group Hug" Mistake

In the world of biomedical machine learning (using computers to find patterns in medical data), scientists do something similar called "cross-validation." They split their data into ten chunks, train their computer models on nine chunks, and test them on the tenth, repeating this ten times.
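The ten-chunk procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the dataset size of 200 samples and the helper names `make_folds` and `train_test_splits` are assumptions made up for this example. It shows that the ten test chunks never overlap, while the training sets of any two rounds share most of their samples.

```python
# Hypothetical sizes chosen for illustration only.
N_SAMPLES = 200
N_FOLDS = 10

def make_folds(n_samples, n_folds):
    """Assign sample indices 0..n_samples-1 to n_folds disjoint test chunks."""
    return [list(range(k, n_samples, n_folds)) for k in range(n_folds)]

def train_test_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each round of cross-validation."""
    folds = make_folds(n_samples, n_folds)
    for k, test in enumerate(folds):
        # Train on every chunk except the one held out for testing.
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

splits = list(train_test_splits(N_SAMPLES, N_FOLDS))

# Test chunks from two different rounds have no samples in common ...
test_a, test_b = set(splits[0][1]), set(splits[1][1])
# ... but the two training sets share eight of their nine chunks.
train_a, train_b = set(splits[0][0]), set(splits[1][0])
shared_fraction = len(train_a & train_b) / len(train_a)

print(f"test overlap: {len(test_a & test_b)} samples")   # test overlap: 0 samples
print(f"training overlap: {shared_fraction:.3f}")        # training overlap: 0.889
```

Because each pair of rounds trains on roughly 89% of the same samples, the ten test scores cannot behave like ten unrelated measurements.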

The paper argues that most scientists make a critical error here. When they compare the results of these ten tests, they use standard math tools (like a paired t-test) that assume every test result is completely independent—like asking ten strangers who have never met to taste the cakes.

But in reality, these ten tests are not independent. They all draw on the same underlying data, just sliced up differently: any two of the ten training sets share eight of their nine chunks. It's more like asking the same ten friends to taste the cakes ten times in a row. Because the friends know each other and have similar tastes, their opinions are "correlated."

The paper claims that by ignoring this connection, scientists are using a ruler that is slightly bent. They think they are being very precise, but they are actually seeing "statistical ghosts." They are finding differences between models that aren't really there, leading to a massive number of false alarms (false positives).

The Investigation: A Global Audit

The authors didn't just guess; they went on a detective hunt. They reviewed 210 studies published between June 2020 and June 2025 in journals with an impact factor of at least 15 (a citation-based score; journals at this level are among the most influential in science).

The Finding: A staggering 97% of these studies made the "Group Hug" mistake. They treated their dependent test results as if they were independent.
The Scope: This wasn't a problem for just a few "bad" studies. It happened regardless of how famous the journal was, how strict the rules were, or whether the scientists shared their data openly. It is a widespread habit across the entire field.

The Simulation: How Bad Is It?

To show how dangerous this is, the authors ran computer simulations across 420 scenarios built from four different datasets. They found that when you ignore the fact that your test results are linked:

In most settings, your "false alarm" rate climbs above the level the test is supposed to guarantee.
If you repeat the whole procedure many times (a common practice called "repeated cross-validation"), the chance of getting a false alarm can rise to nearly 100%. At that point the test reports a "significant" difference almost every time, even when the two models are equally good.
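The effect of repetition can be reproduced with a toy simulation. This is a sketch under stated assumptions, not a replication of the paper's 420 scenarios: it assumes every pair of fold-level score differences has the same correlation `RHO` (a hypothetical value of 0.3), draws the differences from a normal distribution with no true difference between the models, and applies a naive t-test that treats all values as independent. The critical values in `T_CRIT` are the standard two-sided 5% cut-offs of Student's t for 9, 49, and 199 degrees of freedom.

```python
import math
import random

K_FOLDS = 10
RHO = 0.3        # assumed correlation between fold-level differences (hypothetical)
N_TRIALS = 2000  # simulated "studies" per setting
# Two-sided 5% critical values of Student's t, keyed by degrees of freedom.
T_CRIT = {9: 2.262, 49: 2.010, 199: 1.972}

def naive_false_positive_rate(n_repeats, rho, rng):
    """Share of no-difference experiments that the naive t-test calls significant."""
    n = K_FOLDS * n_repeats
    crit = T_CRIT[n - 1]
    hits = 0
    for _ in range(N_TRIALS):
        shared = rng.gauss(0.0, 1.0)  # component common to all folds (same dataset)
        diffs = [
            math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(n)
        ]
        mean = sum(diffs) / n
        var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
        # Naive test: standard error computed as if the n values were independent.
        t_stat = mean / math.sqrt(var / n)
        hits += abs(t_stat) > crit
    return hits / N_TRIALS

rng = random.Random(0)
baseline = naive_false_positive_rate(1, 0.0, rng)
rates = {r: naive_false_positive_rate(r, RHO, rng) for r in (1, 5, 20)}

print(f"independent folds, 1 repetition: {baseline:.2f}")
for r, rate in rates.items():
    print(f"correlated folds, {r:>2} repetition(s): {rate:.2f}")
```

With independent values the rate stays near the nominal 5%. With correlated values it is already far higher after a single round, and it keeps growing as repetitions are added, because the naive standard error shrinks with every extra value while the shared component never averages out.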

The Solution: The "SHARP" Test

The paper explains that fixing this is hard because, with standard cross-validation, you can't separate two things: how much the test results naturally vary, and how strongly the ten tests echo one another. It's like trying to figure out if a group of friends agrees because they each reached the same verdict on their own, or just because they are all copying each other.

To solve this, the authors propose a new method called SHARP (Split-HAlf RePeated).

How it works: Imagine instead of asking your ten friends to taste the cakes ten times, you split them into two separate groups. Group 1 tastes the cakes in the first half of the experiment, and Group 2 tastes them in the second half. Because these groups are distinct and separated, you can finally measure how much they agree on their own, without the "echo chamber" effect.
The Result: When the authors tested SHARP against 12 other methods, it showed the best overall balance. It kept false alarms under control, could still detect real differences between models, and produced well-calibrated confidence intervals.

The Conclusion

The paper ends by saying that the current way of comparing medical AI models is broken. It's like using a broken scale to weigh ingredients for a life-saving medicine. The authors are providing a new, simple rulebook (best practices) to help scientists fix their math, ensuring that when they claim one model is better than another, they are actually telling the truth.

Technical Summary: Widespread Use of Invalid Statistical Tests in Biomedical Machine Learning

The Problem
Machine learning has become a cornerstone of biomedical research, frequently employed to benchmark algorithms and identify scientific insights, such as ranking biomarkers. The standard approach for evaluating predictive performance is cross-validation (CV). However, a critical statistical flaw undermines this practice: prediction performance estimates derived from different CV folds are not independent. Standard statistical tests used to compare these performances, such as the paired t-test, rely on the assumption of independence. When this assumption is violated, the tests fail to control false positive rates, leading to invalid inferences. Despite the ubiquity of this issue, it remains largely unaddressed in high-impact scientific literature.

Methodology
The authors employed a multi-faceted methodological approach to diagnose and resolve this issue:

Meta-Analysis: A PRISMA-guided review of 210 studies published in journals with an impact factor $\ge$ 15 between June 1, 2020, and June 1, 2025. The review specifically examined how studies compared prediction performance and whether they accounted for fold dependence.
Simulation Studies: Extensive simulations were conducted across 420 scenarios utilizing four diverse datasets. These simulations tested the performance of standard tests under varying conditions, including the use of repeated cross-validation.
Statistical Analysis: The authors analyzed the theoretical limitations of existing "fold-dependence-aware" tests, noting that under standard cross-validation, the variance of fold-level statistics and the between-fold correlation cannot be disentangled, forcing existing methods to rely on strong, often unverified assumptions.
Proposal of a New Test: To overcome these limitations, the authors proposed the SHARP (Split-HAlf RePeated) test. This method modifies the standard cross-validation procedure to enable the direct estimation of both variance and correlation, thereby satisfying the requirements for valid statistical inference without relying on untestable assumptions.
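The identifiability problem noted under Statistical Analysis can be illustrated with a simplified model. The notation ($d_k$, $\sigma^2$, $\rho$, $K$) and the equal-correlation assumption are introduced here for illustration and are not taken from the paper. Suppose the $K$ fold-level performance differences $d_1, \dots, d_K$ share a common variance $\sigma^2$ and a common pairwise correlation $\rho$, and let $\bar d$ and $s^2 = \frac{1}{K-1}\sum_{k=1}^{K}(d_k - \bar d)^2$ be their sample mean and sample variance. Then

$$
\operatorname{Var}(\bar d) = \frac{\sigma^2}{K}\bigl[1 + (K-1)\rho\bigr],
\qquad
\mathbb{E}\bigl[s^2\bigr] = \sigma^2 (1 - \rho).
$$

The paired t-test estimates $\operatorname{Var}(\bar d)$ by $s^2/K$, whose expectation is $\sigma^2(1-\rho)/K$. For $\rho > 0$ this is too small by the factor

$$
\frac{\operatorname{Var}(\bar d)}{\mathbb{E}\bigl[s^2/K\bigr]} = \frac{1 + (K-1)\rho}{1 - \rho},
$$

which grows with $K$. In this model a single cross-validation run only reveals the product $\sigma^2(1-\rho)$, so $\sigma^2$ and $\rho$ cannot be recovered separately, and any correction has to supply $\rho$ from outside the data. One well-known example, which this summary does not name, is the corrected resampled t-test of Nadeau and Bengio, which in effect assumes $\rho = n_{\text{test}} / (n_{\text{train}} + n_{\text{test}})$.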

Key Results

Prevalence of the Error: The meta-analysis revealed that 97% of the reviewed studies ignored fold dependence when comparing prediction performance. This oversight was found to be ubiquitous across scientific fields and was not mitigated by high impact factors, rigor-promoting policies, or open science practices.
Impact on False Positives: Simulations demonstrated that ignoring fold dependence leads to invalid false positive control in the majority of settings. The problem is exacerbated by repeated cross-validation; as the number of repetitions increases, false positive rates can rise toward 100%.
Performance of SHARP: When benchmarked against 12 other statistical tests, the SHARP test demonstrated the best overall balance across three critical metrics: false-positive control, statistical power, and confidence-interval calibration. It achieved this performance consistently across the various simulation schemes.

Significance and Claims
The paper asserts that the current reliance on standard tests for comparing machine learning models in biomedical research is fundamentally flawed, leading to a high risk of spurious scientific conclusions. By identifying that this issue persists despite high editorial standards, the authors highlight a systemic gap in statistical rigor.

The primary contribution of this work is the introduction of the SHARP test, which offers a practical solution to the problem of fold dependence by enabling direct estimation of necessary statistical parameters. Furthermore, the authors conclude by providing specific best practices and reporting guidelines intended to restore validity to model comparison inference in biomedical machine learning and related fields. The work serves as a call to action for the community to adopt these corrected methodologies to ensure that scientific applications, such as biomarker ranking, are based on statistically sound evidence.
