Drastic changes in collaboration networks and publication patterns in research using the CDC WONDER dataset

This study reports a dramatic surge in publications using the CDC WONDER dataset, driven by a network of researchers, primarily from Pakistan, who are likely producing low-quality, template-based papers to meet medical residency demands. The authors call for proactive editorial screening and stronger critical appraisal skills to safeguard scientific integrity against mass-produced research.

Original authors: Maupin, D., Suchak, T., Sengupta, A., Marra, M., Geifman, N., Spick, M.

Published 2026-01-15

Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine the world of scientific research as a giant, public library. For years, researchers have been able to walk in and use the library's massive, free reference books (Open Data) to write their own stories (research papers). This was supposed to be a good thing, helping everyone learn and discover new things.

However, this paper describes how a specific section of that library—the CDC WONDER dataset (a huge collection of US health statistics)—has recently been overrun by a "factory" of low-quality stories.

Here is the breakdown of what the authors found, using simple analogies:

1. The "Fast-Food" Factory

The authors noticed that starting in 2023, the number of papers using this specific dataset exploded. It went from 88 papers in 2021 to a flood of 1,223 in 2025.

They believe this isn't just a sudden surge of interest; it looks like a factory assembly line.

The Template: Instead of each writer crafting a unique story, they are using a "cookie-cutter" template. The titles all sound the same (e.g., "Trends and Disparities in [Disease]..."), they use the exact same computer software to crunch the numbers, and they even copy-paste the same "limitations" paragraph at the end of the paper.
The Ingredients: They are taking the same public ingredients (the CDC data) and serving up thousands of nearly identical dishes.

2. The "Ghost" Network

Usually, when scientists work together, they have a natural web of connections. But here, the authors found a strange pattern in who is writing these papers:

The "Super-Group": Many of these papers have huge teams—sometimes 15, 20, or even 31 authors.
The Pattern: A typical paper looks like this: A large group of authors from Pakistan and India team up with just one or two authors from the UK or US.
The Suspicion: The authors suggest this might be a "pay-to-play" or "gift authorship" scheme. It's as if someone is buying a spot on a team to make their name look more impressive, or perhaps junior doctors are being rushed through a "course" where they are pushed to churn out papers to get ahead in their careers. The connections between these groups look artificial, like a network of people who only ever meet once to sign a paper and then never work together again.

3. The "Magic Wand" (AI)

The paper suggests that Generative AI is the "magic wand" making this possible. Just as a spell could instantly write a book, AI tools are likely helping these "factories" analyze the data and write the manuscripts incredibly fast. This allows them to mass-produce research that looks professional on the surface but lacks real depth or new discovery.

4. Why This is a Problem

Think of it as a river in flood.

Drowning out the good: When the river is filled with thousands of low-quality, repetitive papers, it becomes impossible to find the few, truly important discoveries.
Trusting the water: If people realize the water (the data) is being used to make fake or low-quality products, they might stop trusting the library entirely.
The Peer-Review Bottleneck: Imagine a gatekeeper at the library trying to check every single entry. With this flood of "fast-churn" papers, the gatekeepers (journal editors and reviewers) are overwhelmed and might accidentally let the bad stuff through.

5. The Solution Proposed

The authors aren't saying we should close the library. Instead, they suggest:

Better Gatekeeping: Editors need to learn to spot these "assembly line" papers quickly and reject them before they are published.
Education: Researchers need to be taught how to spot bad science and understand that just because data is free, it doesn't mean you should use it to churn out low-quality work just to get a publication.

In short: The paper argues that a network of researchers is likely using AI and a "factory" approach to mass-produce low-quality, formulaic papers from public US health data, often through unusual international team-ups, which threatens the quality and trustworthiness of medical research.

Technical Summary: Drastic Changes in Collaboration Networks and Publication Patterns in Research Using the CDC WONDER Dataset

Problem Statement
The paper addresses the exploitation of Open Science datasets by "paper mills" and unethical actors, a phenomenon accelerated by generative AI. While previous studies have identified the systematic misuse of datasets like NHANES (often linked to Chinese-affiliated networks), this work investigates a new, distinct pattern of exploitation targeting the CDC WONDER (Centers for Disease Control and Prevention Wide-ranging Online Data for Epidemiologic Research) dataset. The authors posit that a network of researchers, primarily reporting affiliations in Pakistan and India with Western collaborators, is engaging in "fast-churn" science. This involves mass-producing formulaic manuscripts using publicly available data to meet demands from junior clinicians or trainees seeking publication volume for residency applications or visa purposes, thereby threatening the integrity of the scientific literature.

Methodology
The study employed a quantitative bibliometric and network analysis approach using the OpenAlex database:

Data Collection: A search for "CDC WONDER" in titles and abstracts was conducted for the period January 1, 1996, to December 22, 2025.
Temporal Analysis: The dataset was split into a control group (pre-2023, before the public release of generative AI tools) and a case group (2023–2025).
Trend Forecasting: An AutoRegressive Integrated Moving Average (ARIMA) model (parameters $p=1$, $d=1$, $q=1$) was used to forecast expected publication trends based on pre-2023 data. Deviations between the forecast and actual observed counts were calculated to identify excess production.
Text Analysis: Token usage in titles was analyzed using CountVectorizer (scikit-learn) to detect formulaic language patterns (e.g., specific keywords like "trends" and "disparities").
Network Analysis: VOSviewer was utilized to visualize collaboration networks across countries and institutions. Author data was anonymized, while country and institution data were analyzed to identify clustering patterns.
Qualitative Audit: A random sample of 40 manuscripts was audited to assess methodological similarities, specifically looking for identical statistical methods, software usage, and structural templates.
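
For readers unfamiliar with the notation, the ARIMA model with $p=1$, $d=1$, $q=1$ named in the trend-forecasting step can be written out as follows. This is the textbook form, not an equation taken from the paper. The symbols $y_t$ (publication count in year $t$), $B$ (backshift operator, $B y_t = y_{t-1}$), $\phi_1$, $\theta_1$, and $\varepsilon_t$ (white-noise error) are introduced here for illustration, and leaving out a drift term is an assumption.

$$
(1 - \phi_1 B)(1 - B)\, y_t = (1 + \theta_1 B)\, \varepsilon_t
$$

Writing the first difference as $w_t = y_t - y_{t-1}$, the same model reads:

$$
w_t = \phi_1 w_{t-1} + \varepsilon_t + \theta_1 \varepsilon_{t-1}
$$

The "excess production" for a post-2022 year is then the gap between the observed count and the forecast $\hat{y}_t$ obtained from the model fitted to pre-2023 data:

$$
e_t = y_t^{\text{obs}} - \hat{y}_t
$$

Here $d=1$ means the model is fitted to year-on-year changes rather than raw counts, while $p=1$ and $q=1$ mean that one lagged change and one lagged error enter the equation.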

Key Results

Publication Explosion: The number of publications using CDC WONDER surged from 88 in 2021 to 1,223 in 2025. The ARIMA analysis confirmed that the post-2023 publication volume falls significantly outside the 95% confidence interval of the expected trend.
Geographic Shift: There was a dramatic shift in authorship demographics. The proportion of papers with at least one author from Pakistan rose from 0.5% in 2021 to 27.2% in 2025. Similarly, authors from India increased from 0.1% to 2.4%. In contrast, the proportion of US authors dropped from 60.5% to 30.8%.
Collaboration Networks: Post-2023, 52.7% of US authors co-authored with a Pakistani author, a stark increase from 0.7% pre-2023. Network analysis revealed extensive clusters linking institutions in Pakistan (notably Dow University of Health Sciences) with Western institutions in the UK and US.
Author Inflation: The average number of authors per paper increased from a stable two to eight by 2025, with some papers listing up to 31 authors.
Formulaic Production: The audit of 40 papers revealed identical methodological approaches, specifically the use of Joinpoint regression via NCI software, identical data segmentation, and nearly identical "limitations" sections despite varying author lists. Title analysis showed a spike in formulaic terms like "Trends" and "Disparities."
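
To make the scale of these shifts concrete, the reported figures can be turned into a fold increase and percentage-point changes. The sketch below uses only the numbers quoted in the results above; the variable names are illustrative and nothing here comes from the authors' own code.

```python
# Figures as quoted in the key results above (2021 vs. 2025).
pubs_2021, pubs_2025 = 88, 1223

# Percent shares as reported: (2021 value, 2025 value).
share_pct = {
    "Pakistan": (0.5, 27.2),
    "India": (0.1, 2.4),
    "United States": (60.5, 30.8),
}

# How many times larger the 2025 output is than the 2021 output.
fold_increase = pubs_2025 / pubs_2021

# Change in share, in percentage points (rounded to avoid float noise).
point_change = {
    country: round(after - before, 1)
    for country, (before, after) in share_pct.items()
}

print(f"Publications grew {fold_increase:.1f}-fold between 2021 and 2025")
for country, delta in point_change.items():
    print(f"{country}: {delta:+.1f} percentage points")
```

On these figures, output grew roughly 13.9-fold, the Pakistan share rose by 26.7 percentage points, and the US share fell by 29.7 points.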

Key Contributions

Identification of a New Exploitation Vector: The paper documents the first major case study of CDC WONDER being systematically targeted by a specific network distinct from the NHANES exploitation patterns.
Characterization of a New "Paper Mill" Profile: It identifies a specific operational model involving a "Global South" majority (Pakistan/India) collaborating with a minority of Western authors (UK/US), potentially driven by medical residency or visa-related publication demands.
Methodological Framework for Detection: The study demonstrates how combining ARIMA forecasting, token analysis, and author network visualization can effectively flag "fast-churn" science before individual paper audits are possible.
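
As a rough illustration of the token-analysis component of such a screen, the sketch below measures what share of titles contain a given word in two periods. It is a standard-library stand-in, not the authors' scikit-learn pipeline: it borrows CountVectorizer's default word pattern (runs of two or more word characters, lowercased) but counts each token once per title. The function name `title_token_share` and all example titles are hypothetical and written only for this illustration.

```python
import re
from collections import Counter

# Same word pattern CountVectorizer applies by default: 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def title_token_share(titles):
    """Return {token: fraction of titles that contain the token}."""
    doc_freq = Counter()
    for title in titles:
        # A set ensures each token is counted once per title.
        doc_freq.update(set(TOKEN_PATTERN.findall(title.lower())))
    return {token: count / len(titles) for token, count in doc_freq.items()}


# Hypothetical titles for illustration only; not taken from the dataset.
early_titles = [
    "Opioid overdose mortality in rural counties",
    "Infant mortality after a state policy change",
    "Heat-related deaths among older adults",
    "Seasonal patterns in influenza mortality",
]
late_titles = [
    "Trends and disparities in heart failure mortality",
    "Trends and disparities in sepsis-related mortality",
    "Trends and disparities in stroke mortality among older adults",
    "Regional patterns in asthma deaths",
]

early_share = title_token_share(early_titles)
late_share = title_token_share(late_titles)

# Compare the two periods for the terms flagged as formulaic.
for token in ("trends", "disparities"):
    before = early_share.get(token, 0.0)
    after = late_share.get(token, 0.0)
    print(f"{token}: {before:.2f} -> {after:.2f}")
```

A sharp rise in the share of a few template words, as in this toy example, is a cue for closer inspection and not evidence of misconduct on its own.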

Significance and Claims
The authors claim that this work highlights a critical tension between the democratizing goals of Open Science and the need to safeguard literature integrity. They argue that the CDC WONDER dataset has become a victim of "template-driven, redundant publication" similar to NHANES but with unique geographic and network characteristics.

The paper asserts that identifying these patterns is essential to protect the scientific record from being flooded with low-quality research. It calls for:

Proactive Desk Rejection: Editors and reviewers should be more vigilant regarding formulaic submissions and unusual collaboration networks.
Education: Training is needed to help researchers critically appraise such outputs and understand appropriate use cases for Open Science resources.
Awareness: The community must recognize that the "arms race" between open data availability and unethical mass-production is evolving, requiring continuous monitoring.

The authors maintain a modest tone, acknowledging that their analysis is based on trends rather than a full systematic review of every paper, and that some collaborations may be genuine. However, they emphasize that the statistical anomalies and structural similarities strongly suggest systematic, unethical exploitation of the dataset.