Diagnostics for Individual-Level Prediction Instability in Machine Learning for Healthcare

This paper proposes an evaluation framework that uses empirical prediction interval width and decision flip rate to quantify individual-level prediction instability in healthcare machine learning models. It shows that randomness in optimization and initialization can cause clinically significant variability in individual risk estimates, variability that standard aggregate metrics fail to detect.

Elizabeth W. Miller, Jeffrey D. Blume

Published 2026-03-03

The Core Problem: The "Magic 8-Ball" of Medicine

Imagine you are a doctor trying to decide if a patient needs a life-saving surgery. You ask a computer model for help. The model says, "This patient has a 60% chance of dying within 30 days. You should operate."

You feel confident. But then you ask the same model again, this time pressing "Reset" on the model's random number generator (changing the random seed) before re-running the calculation.

Surprise! The model now says, "This patient has a 40% chance of dying. Do not operate."

If you run it a third time, it might say 55%.

This is the problem Elizabeth W. Miller and Jeffrey D. Blume are highlighting. They discovered that for many modern, complex AI models (especially "overparameterized" ones like deep neural networks), the answer you get for a specific patient often depends on accidental randomness in how the computer started the calculation, rather than just the patient's actual medical data.

The Analogy: The "Rashomon" Effect in AI

In the movie Rashomon, four people witness the same event, but they all tell completely different stories. In machine learning, this is called the Rashomon Set.

Imagine you have 100 different chefs (models). They all taste the same soup (the training data) and agree that the soup is "delicious" (high aggregate accuracy). Two of them stand out:

  • Chef A (Logistic Regression): A simple, old-school chef who follows a strict recipe.
  • Chef B (Neural Network): A fancy, experimental chef with thousands of ingredients and no strict rules.

If you ask both chefs to describe the soup, they might both say, "It's 70% delicious." (This is the Aggregate Performance that doctors usually look at).

However, if you ask them to describe the exact taste of a single spoonful for a specific patient:

  • Chef A says: "It tastes exactly like salt and pepper." (Stable, reliable).
  • Chef B says: "Well, in my first attempt, it tasted like salt. In my second attempt, I added too much pepper, so it tasted spicy. In my third, it was bland." (Unstable, chaotic).

The paper argues that while both chefs get the "average" score right, Chef B is dangerous to trust with a specific patient's life because their answer changes based on how they woke up that morning (random initialization).

The New Tools: Measuring the "Wobble"

The authors propose two new ways to test if a model is "wobbly" before we let it make medical decisions.

1. The "Shaky Ruler" (ePIW)

Imagine you are measuring a patient's risk of heart attack with a ruler.

  • Stable Model: You measure the patient 100 times. The ruler always says "12 inches."
  • Unstable Model: You measure the patient 100 times. The ruler says "11 inches," then "13 inches," then "11.5 inches."

The authors call this the Empirical Prediction Interval Width (ePIW). It measures how much the risk number "jitters" just because the computer restarted. If the number jumps around wildly, the model is unreliable, even if the average of all those numbers looks correct.
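The paper's exact estimator isn't reproduced here, but the idea can be sketched in a few lines of numpy, assuming ePIW is measured as the width of the central 95% interval of a patient's risk estimates across repeated retrainings:

```python
import numpy as np

def epiw(preds, coverage=0.95):
    """Per-patient empirical prediction interval width (a sketch).

    preds: array of shape (n_runs, n_patients) holding predicted risks
    from retraining the same model with different random seeds.
    Returns the width of the central `coverage` interval per patient.
    """
    lo = (1.0 - coverage) / 2.0 * 100
    hi = 100.0 - lo
    lower, upper = np.percentile(preds, [lo, hi], axis=0)
    return upper - lower

# Stable model: every retraining says 0.60 for this patient.
stable = np.full((100, 1), 0.60)
# Unstable model: the risk jitters between runs for the same patient.
rng = np.random.default_rng(0)
unstable = rng.uniform(0.40, 0.60, size=(100, 1))

print(epiw(stable))    # 0.0 — the ruler never moves
print(epiw(unstable))  # close to 0.2 — wide interval from seed noise alone
```

A wide ePIW flags exactly the "shaky ruler" case: the average of the runs can still look right while any single run is far off.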

2. The "Flip-Flop" (eDFR)

Now, imagine the doctor has a rule: "If the risk is over 50%, we operate. If under 50%, we don't."

  • Stable Model: The risk is always 60%. The doctor always operates.
  • Unstable Model: The risk bounces between 49% and 51%.
    • Run 1: 49% -> No surgery.
    • Run 2: 51% -> Surgery.
    • Run 3: 49% -> No surgery.

The authors call this the Empirical Decision Flip Rate (eDFR). It counts how often the model changes its mind about a life-or-death decision purely because of harmless randomness in how the computation was started.
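A minimal sketch of the flip-rate idea, again not the paper's exact estimator: here eDFR is computed per patient as the fraction of retrained runs that disagree with the majority decision, so 0.0 means perfectly consistent and 0.5 means a coin flip:

```python
import numpy as np

def edfr(preds, threshold=0.5):
    """Per-patient empirical decision flip rate (a sketch).

    preds: array of shape (n_runs, n_patients) of predicted risks.
    A patient "flips" when retrained runs disagree on which side of the
    decision threshold the risk falls.
    """
    decisions = preds >= threshold     # (n_runs, n_patients) of True/False
    rate_yes = decisions.mean(axis=0)  # fraction of runs voting "operate"
    return np.minimum(rate_yes, 1.0 - rate_yes)

risks = np.array([
    [0.60, 0.49],   # run 1: patient A at 60%, patient B at 49%
    [0.61, 0.51],   # run 2
    [0.59, 0.49],   # run 3
    [0.60, 0.52],   # run 4
])
print(edfr(risks))  # patient A: 0.0 (never flips); patient B: 0.5
```

Patient B's risk barely moves in absolute terms, yet the surgery decision flips on half the runs, which is precisely the danger the threshold hides.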

What They Found

They tested this on simulated data and on real data from heart attack patients (the GUSTO-I clinical trial dataset).

  1. Simple vs. Complex: Simple models (like Logistic Regression) were very stable. They gave the same answer every time. Complex models (Neural Networks) were very "wobbly."
  2. The Hidden Danger: The complex models often had the same overall accuracy as the simple ones. If you only looked at the "average score," you would think they were equally good. But when you looked at individual patients, the complex models were flipping their recommendations randomly.
  3. The "Random Seed" Effect: Surprisingly, just changing the random starting point of the computer code caused the complex models to change their minds almost as much as if you had given them a completely different set of patient data.
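The seed-only experiment in point 3 can be sketched end to end. The tiny network and synthetic data below are illustrative stand-ins, not the paper's actual models or the GUSTO-I data; the key point is that the data and training procedure are held fixed while only the random initialization changes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, y, seed, hidden=16, lr=0.5, steps=200):
    """Train a one-hidden-layer network; only `seed` varies between runs."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(size=(X.shape[1], hidden))
    W2 = rng.normal(size=(hidden, 1))
    for _ in range(steps):
        h = np.tanh(X @ W1)                     # hidden activations
        p = sigmoid(h @ W2).ravel()             # predicted risks
        g = (p - y)[:, None] / len(y)           # log-loss gradient at output
        grad_W2 = h.T @ g
        grad_W1 = X.T @ ((g @ W2.T) * (1.0 - h**2))
        W2 -= lr * grad_W2
        W1 -= lr * grad_W1
    return W1, W2

def predict(X, W1, W2):
    return sigmoid(np.tanh(X @ W1) @ W2).ravel()

# Fixed training data and one fixed "patient" to score.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)
patient = X[:1]

# Retrain 20 times; nothing changes between runs except the seed.
risks = np.array([predict(patient, *train_mlp(X, y, s))[0]
                  for s in range(20)])
print(risks.min(), risks.max())  # the "wobble" for one patient
```

Running the same loop with a convex model such as logistic regression would collapse the spread to nearly zero, which is the simple-versus-complex contrast in point 1.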

The Takeaway: Why This Matters

In healthcare, consistency is trust.

If a doctor asks an AI, "Should I give this patient a drug?" and the AI says "Yes" today, "No" tomorrow, and "Yes" the next day, the doctor cannot trust it. It doesn't matter if the AI is "smart" on average; if it can't make up its mind for a specific person, it's useless for that person.

The authors' advice:
When choosing an AI for high-stakes medical decisions, don't just ask, "How accurate is it on average?"
Ask, "If I run this model 100 times, does it give me the same answer for the same patient?"

If two models are equally accurate, choose the simpler, more stable one. It's better to have a reliable, slightly less "fancy" tool than a brilliant tool that acts like a Magic 8-Ball.

Summary Checklist for Practitioners

Before using a model in a hospital, ask:

  1. The Jitter Test: If I re-run the model 100 times, does the patient's risk score stay the same, or does it wobble?
  2. The Flip Test: Does the model change its "Yes/No" decision just because I pressed "Restart"?
  3. The Specifics: Is the model unstable for the specific group of patients I care about (e.g., those right on the edge of needing surgery)?

Bottom Line: In medicine, a model that is "right on average" but "wrong for the individual" is a liability. We need models that are consistent, not just clever.
