Falsification Testing of Sepsis Prediction Models: Evaluating Independent Biological Signal After Controlling for Care-Process Intensity

This pre-registered falsification study across four clinical datasets finds that sepsis prediction models at elite academic centers primarily detect genuine biological signal rather than care-process intensity. At the same time, it reveals a systematic and consequential divergence between clinical sepsis definitions and administrative coding, one that undermines the validity of regulatory metrics and AI benchmarks built on the latter.

Dickens, A. R.

Published 2026-03-18

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to build a weather app that predicts rain. You have two ways to train your app:

  1. The "Real Rain" Method: You teach it to look at clouds, humidity, and barometric pressure (the actual biology of the storm).
  2. The "Umbrella Method": You teach it to look at how many people are running outside with umbrellas (the reaction to the storm).

If your app learns the "Umbrella Method," it will be very accurate at predicting rain after people have already started running for cover. But it won't tell you it's going to rain before the storm hits. Worse, if you use your app to tell people to buy umbrellas, you might just be creating a self-fulfilling prophecy where everyone runs for cover because the app said so, not because it's actually raining.

This paper by Adam Dickens is a "lie detector test" for the computer programs (AI) that hospitals use to predict sepsis (a life-threatening reaction to infection).

The Big Question

For years, researchers have built AI models that claim to predict sepsis with high accuracy. But a nagging doubt remained: Are these models actually detecting the biological signs of a patient getting sick, or are they just learning to spot the pattern of doctors getting nervous and ordering a ton of tests?

If the AI is just spotting "doctor panic" (ordering blood tests, checking vitals every hour), it's not a true early warning system. It's just a mirror reflecting the doctors' own suspicion.

The Experiment: A Four-Part Falsification Test

The author didn't just build a model; he set up a trap to see if the models were cheating. He pre-registered his plan (like writing down the rules of a game before playing) so that he couldn't move the goalposts later. He tested four different "traps" across four massive hospital databases.

Here is what he found, using simple analogies:

1. The "Label Mismatch" (The Most Important Finding)

The study compared three different ways hospitals define sepsis:

  • The Doctors' Definitions (Sepsis-2 and Sepsis-3): Two clinical standards based on actual signs like fever, low blood pressure, and organ failure.
  • The Biller's Definition (CMS SEP-1): Based on the codes doctors write on insurance and regulatory paperwork to get paid.

The Analogy: Imagine trying to count "apples."

  • The Doctors count the actual fruit.
  • The Billers count the fruit only if it was put in a specific red box for shipping.

The Result: The study found that the "Doctors" and the "Billers" were talking about almost completely different groups of people. They only agreed on about 20% of cases.

  • Why this matters: If the AI is trained on the "Biller's" data (which is often used for government reports and hospital rankings), it isn't learning about sick patients; it's learning about billing patterns. It's like training a weather app on "umbrella sales" instead of "rainfall." A sketch of how this overlap can be measured follows below.
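To make the mismatch concrete, here is a minimal Python sketch of how agreement between two labelings of the same patients might be quantified. The prevalence figures and variable names are synthetic illustrations, not the paper's actual cohorts or results.

```python
# Minimal sketch: quantifying agreement between two binary sepsis labelings
# over the same patients. All data here is synthetic and illustrative.
import numpy as np

def label_overlap(clinical: np.ndarray, billing: np.ndarray) -> dict:
    """Compare two binary label vectors covering the same patients."""
    clinical = clinical.astype(bool)
    billing = billing.astype(bool)
    both = np.sum(clinical & billing)
    either = np.sum(clinical | billing)
    return {
        "jaccard": both / either,  # agreement among all flagged patients
        "pct_clinical_also_billed": both / clinical.sum(),
        "pct_billed_also_clinical": both / billing.sum(),
    }

# Toy example: 10,000 patients, two labelings drawn independently,
# so their overlap is low -- the "apples vs. red boxes" situation.
rng = np.random.default_rng(0)
clinical_label = rng.random(10_000) < 0.15  # e.g. clinical criteria met
billing_label = rng.random(10_000) < 0.10   # e.g. billing code present
print(label_overlap(clinical_label, billing_label))
```

The "about 20%" figure above corresponds to roughly this kind of overlap statistic: of everyone flagged by either definition, only about one in five is flagged by both.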

2. The "Biological Signal" Test

The author asked: "If we remove all the data about how busy the doctors were (how many tests they ordered), can the AI still predict sepsis?"

The Result: At the elite academic medical center (the MIMIC-IV database), yes. Even without knowing how many tests the doctors ordered, the AI could still predict sepsis almost perfectly.

  • The Analogy: If you take away the "umbrellas" from the weather app, it can still predict rain just by looking at the clouds. This indicates that at top-tier hospitals, the AI is actually learning the biology of the disease, not just the doctors' behavior. The sketch below shows the shape of this ablation test.
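Here is a minimal, self-contained Python sketch of the ablation idea on synthetic data. The feature names (lactate, order_lab_count, and so on), the model, and the numbers are illustrative assumptions, not the paper's actual variables, datasets, or results.

```python
# Minimal sketch of a feature-ablation test: does performance survive
# when care-process ("order_") features are removed? Synthetic data only.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5_000
df = pd.DataFrame({
    "lactate": rng.normal(2.0, 1.0, n),        # biology-like feature
    "heart_rate": rng.normal(90, 15, n),       # biology-like feature
    "order_lab_count": rng.poisson(5, n),      # care-process feature
    "order_vitals_freq": rng.poisson(8, n),    # care-process feature
})
# Synthetic outcome driven mostly by the "biology" columns, by construction
logit = 0.8 * (df["lactate"] - 2.0) + 0.03 * (df["heart_rate"] - 90.0)
y = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

bio_cols = [c for c in df.columns if not c.startswith("order_")]
X_tr, X_te, y_tr, y_te = train_test_split(df, y, random_state=0)

full = GradientBoostingClassifier().fit(X_tr, y_tr)
bio_only = GradientBoostingClassifier().fit(X_tr[bio_cols], y_tr)

print("AUROC, all features:        ",
      roc_auc_score(y_te, full.predict_proba(X_te)[:, 1]))
print("AUROC, care-process removed:",
      roc_auc_score(y_te, bio_only.predict_proba(X_te[bio_cols])[:, 1]))
```

Because the toy outcome is built from the biology columns, dropping the order counts barely dents the AUROC here; the paper's claim is that something analogous holds on the real MIMIC-IV data.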

3. The "Care-Intensity" Test

The author asked: "Can the AI predict sepsis using only the data about how many tests were ordered?"

The Result: It could do okay, but not great. It wasn't a perfect predictor on its own.

  • The Analogy: If you only look at people running with umbrellas, you can guess it's raining, but you'll miss the light drizzle or the storm that hasn't started yet. The complementary sketch below shows this test.
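The mirror-image test is sketched below, again on synthetic data with illustrative names: a model that sees only ordering-intensity counts. Because sicker patients tend to accumulate more orders, the counts carry some signal, but with heavy overlap between sick and well they cannot carry all of it.

```python
# Minimal sketch: can care-process intensity ALONE predict the label?
# Synthetic stand-ins throughout; not the paper's variables or results.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 5_000
sick = rng.random(n) < 0.15
# Sicker patients draw more orders on average, but the distributions
# overlap heavily, so counts are an imperfect proxy for sickness.
lab_orders = rng.poisson(np.where(sick, 9, 5))
vitals_checks = rng.poisson(np.where(sick, 12, 8))
X = np.column_stack([lab_orders, vitals_checks])

X_tr, X_te, y_tr, y_te = train_test_split(X, sick, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
print("AUROC, care-process only:",
      roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```

This typically lands well above chance but well below the biology-driven model: "okay, but not great," exactly as described above.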

4. The "Fake Patient" Test

The author created 50,000 fake patient records that matched real patients exactly in terms of how many tests were ordered for them. He then asked the AI to tell the difference between a real patient record and a fake one.

The Result: The AI could tell the difference.

  • The Analogy: The AI could tell the difference between a real storm and a fake storm made of umbrellas. This means the "real" biological data contains something the "fake" data (just the ordering patterns) doesn't have. A sketch of this discriminator test follows below.
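Below is a minimal sketch of the discriminator idea, under my illustrative assumption that "fake" records keep the real ordering-intensity values but shuffle the biology values, preserving each column's distribution while breaking the link between them. If a classifier can still separate real from fake, the biology carries information that ordering patterns alone do not.

```python
# Minimal sketch of a real-vs-synthetic discriminator test. The "fake"
# rows match the real ordering column exactly but have their biology
# column shuffled. All names and numbers are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 5_000
orders = rng.poisson(6, n).astype(float)
# In the "real" data, biology is correlated with ordering intensity
lactate = 1.5 + 0.2 * orders + rng.normal(0.0, 0.5, n)
real = np.column_stack([orders, lactate])

# "Fake" patients: identical ordering column, biology decoupled by shuffling
fake = real.copy()
fake[:, 1] = rng.permutation(fake[:, 1])

X = np.vstack([real, fake])
y = np.concatenate([np.ones(n), np.zeros(n)])  # 1 = real, 0 = fake

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
disc = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("Discriminator AUROC (0.5 would mean indistinguishable):",
      roc_auc_score(y_te, disc.predict_proba(X_te)[:, 1]))
```

An AUROC meaningfully above 0.5 here means the joint structure of biology and ordering is detectable, which is the sense in which the real data "contains something" the ordering patterns alone do not.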

The Twist: It Depends on the Hospital

While the "elite" hospital showed that the AI was learning real biology, the study found something interesting when looking at community hospitals (the eICU database).

  • In smaller, diverse hospitals, the AI did rely more heavily on "doctor panic" (ordering patterns).
  • The Analogy: In a small town, if you see a doctor running around ordering tests, it's a huge red flag. In a giant, high-tech hospital, doctors order tests constantly, so the AI has to look deeper to find the real sickness.

The Bottom Line

  1. The AI isn't a liar (at top hospitals): At major academic centers, sepsis prediction models are actually detecting real biological changes in patients, not just copying what doctors do.
  2. The "Bill" is a bad teacher: The biggest problem isn't the AI; it's the data we use to judge it. Government reports and hospital quality scores often use "billing codes" to define sepsis. The study proves these codes identify a totally different group of patients than the actual sick ones.
  3. The Danger: If we train AI or judge hospitals based on billing codes, we might be optimizing for "good paperwork" rather than "saving lives."

In short: The AI is smart enough to smell the storm, but we've been using the wrong map (billing codes) to tell us where the storm is. We need to stop looking at the "umbrellas" (insurance codes) and start looking at the "clouds" (actual patient biology).
