Evidence of Unreliable Data and Poor Data Provenance in Clinical Prediction Model Research and Clinical Practice

This paper warns that widely used clinical prediction models built on two popular Kaggle datasets lack verifiable data provenance and appear to be fabricated, putting both research and clinical applications at risk. The authors call for mandatory data provenance reporting to safeguard patient care.

Gibson, A. D., White, N. M., Collins, G. S., Barnett, A.

Published 2026-02-26

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a chef trying to create a new, life-saving soup recipe. To do this, you need a cookbook with high-quality, verified ingredients.

This paper is essentially a food safety inspection that discovered a terrifying problem: many chefs are using a "cookbook" found on a popular online cooking forum (Kaggle) that turns out to be filled with fake ingredients.

Here is the breakdown of what the researchers found, using simple analogies:

1. The "Fake Ingredient" Problem

The researchers looked at two very popular datasets (collections of information) on Kaggle, a website where people share data to practice their computer skills. One dataset was about strokes, and the other was about diabetes.

  • The Analogy: Imagine these datasets are like a bag of flour sold at a market. The label says "100% Real Wheat Flour." But when you look closely, the flour is actually just chalk dust and sand mixed together.
  • The Evidence: The researchers found that the data looked "too perfect." Real patient data is messy; people forget to fill out forms, numbers get lost, and there are gaps. These datasets had almost no missing data, and the numbers followed strange, robotic patterns (like a computer program guessing the numbers rather than a doctor recording them).
  • The Source: The people who uploaded these files admitted they couldn't tell anyone where the data came from. One even said, "Don't use this for research!" Yet, people did anyway.
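The "too perfect" red flags above can be turned into simple sanity checks. Here is a minimal sketch of that idea in Python, not the authors' actual method: the field names, toy records, and the 0.1% threshold are all illustrative assumptions.

```python
# Minimal sketch of a "too perfect" data check (illustrative, not the paper's method).
# Real patient data usually has gaps; near-zero missingness is a red flag.

def missingness_rate(rows, field):
    """Fraction of records where `field` is empty or absent."""
    missing = sum(1 for r in rows if r.get(field) in (None, "", "NA"))
    return missing / len(rows)

def flag_too_perfect(rows, fields, threshold=0.001):
    """Flag fields whose missingness is implausibly low (threshold is illustrative)."""
    return [f for f in fields if missingness_rate(rows, f) < threshold]

# Toy "dataset" with no gaps at all — both fields get flagged as suspicious.
records = [{"age": 50 + i % 30, "glucose": 90 + i % 40} for i in range(1000)]
print(flag_too_perfect(records, ["age", "glucose"]))
```

A real audit would go further (distribution shape, duplicate rows, impossible value combinations), but even this one-line missingness check would have raised questions about datasets with almost no missing data.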

2. The "Fast-Food" Research Factory

The paper describes a trend called "fast-churn research."

  • The Analogy: Imagine a factory that wants to make as many burgers as possible to get rich, so they stop checking if the meat is safe. They just slap a bun on a rock, call it a burger, and sell it.
  • The Reality: Researchers are downloading these fake datasets, running computer programs on them, and publishing papers very quickly to get famous or get grants. They aren't trying to help patients; they are just trying to publish papers.

3. The Dangerous Consequences

This is where the story gets scary. Because these "fake ingredient" papers were published in scientific journals, other people started trusting them.

  • The Domino Effect:
    • 124 Papers: 124 different studies were written using this fake data.
    • 86 Reviews: These bad papers were cited in 86 other review articles, spreading the misinformation like a virus.
    • Real-World Use: Even worse, three of these models are actually being used in hospitals or medical devices to make decisions about real patients.
  • The Risk: If a doctor uses a model built on fake data, they might tell a patient they are healthy when they are sick, or vice versa. It's like a GPS built on a map of a city that doesn't exist; it will confidently drive you off a cliff.

4. The "No Receipt" Policy

The paper points out that the "supermarkets" (Kaggle) and the "restaurants" (Journals) have no rules about checking receipts.

  • The Analogy: If you buy a car, you expect a VIN (vehicle identification number) and a history report. But in medical research, you can buy a dataset with no history, no owner, and no proof of where it came from.
  • The Failure: The researchers checked the "receipts" (data provenance) for these two datasets using a strict checklist called TRIPOD+AI. Both datasets failed every single item. They had no information on who collected the data, when, where, or why.

5. The Solution: A New "Health Code"

The authors aren't just pointing fingers; they are proposing a new set of rules to fix the kitchen:

  • For the Supermarkets (Kaggle): They must force uploaders to fill out a "Provenance Sheet" (a detailed receipt) that says exactly where the data came from. If you can't prove it's real flour, you can't sell it.
  • For the Restaurants (Journals): They shouldn't accept a paper unless the chef shows the raw ingredients first. If the ingredients look fake, reject the paper immediately.
  • For the Chefs (Researchers): Stop trying to be fast. Check your ingredients. If a dataset looks too good to be true, it probably is.
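The proposed "Provenance Sheet" could work like a required form that blocks an upload until every field is filled in. Here is a hypothetical sketch of that gatekeeping logic; the field names are illustrative assumptions, not taken from the paper or from TRIPOD+AI.

```python
# Hypothetical "Provenance Sheet" gate for a data-sharing platform.
# Field names are illustrative — the paper proposes the idea, not this schema.

REQUIRED_FIELDS = [
    "who_collected",    # the people or institution that gathered the data
    "when_collected",   # the collection period
    "where_collected",  # the setting (hospital, country, registry)
    "why_collected",    # the original purpose of the data
    "ethics_approval",  # evidence the collection was approved
]

def missing_provenance(sheet):
    """Return the required fields that are missing or empty."""
    return [f for f in REQUIRED_FIELDS if not sheet.get(f)]

# An upload with no verifiable history fails the gate and would be rejected.
anonymous_upload = {"who_collected": "", "when_collected": None}
print(missing_provenance(anonymous_upload))
```

Under a rule like this, the two Kaggle datasets in the paper, which provided none of this information, would never have been published in the first place.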

The Bottom Line

This paper is a wake-up call. It says that we cannot trust our medical predictions if the foundation they are built on is made of sand.

If we want to save lives with AI and data, we need to stop using "fake flour" and start demanding verified, real ingredients. Otherwise, we risk building a hospital on a foundation that will collapse the moment a real patient walks through the door.
