Diagnostics for Individual-Level Prediction Instability in Machine Learning for Healthcare

This paper proposes an evaluation framework that uses empirical prediction interval width and decision flip rate to quantify individual-level prediction instability in healthcare machine learning models. It shows that randomness in optimization and initialization can cause clinically significant variability in individual risk estimates, variability that standard aggregate metrics fail to detect.

Elizabeth W. Miller, Jeffrey D. Blume

Published 2026-03-03

The Core Problem: The "Magic 8-Ball" of Medicine

Imagine you are a doctor trying to decide if a patient needs a life-saving surgery. You ask a computer model for help. The model says, "This patient has a 60% chance of dying within 30 days. You should operate."

You feel confident. But then you ask the same model again, this time pressing "Reset" on the model's random number generator (changing the random seed) before re-running the calculation.

Surprise! The model now says, "This patient has a 40% chance of dying. Do not operate."

If you run it a third time, it might say 55%.

This is the problem Elizabeth W. Miller and Jeffrey D. Blume are highlighting. They discovered that for many modern, complex AI models (especially "overparameterized" ones like deep neural networks), the answer you get for a specific patient often depends on accidental randomness in how the computer started the calculation, rather than just the patient's actual medical data.

The Analogy: The "Rashomon" Effect in AI

In the movie Rashomon, four people witness the same event, but they all tell completely different stories. In machine learning, this is called the Rashomon Set.

Imagine you have 100 different chefs (models). They all taste the same soup (the training data) and agree that the soup is "delicious" (high aggregate accuracy). Two of them stand out:

  • Chef A (Logistic Regression): A simple, old-school chef who follows a strict recipe.
  • Chef B (Neural Network): A fancy, experimental chef with thousands of ingredients and no strict rules.

If you ask both chefs to describe the soup, they might both say, "It's 70% delicious." (This is the Aggregate Performance that doctors usually look at).

However, if you ask them to describe the exact taste of a single spoonful for a specific patient:

  • Chef A says: "It tastes exactly like salt and pepper." (Stable, reliable).
  • Chef B says: "Well, in my first attempt, it tasted like salt. In my second attempt, I added too much pepper, so it tasted spicy. In my third, it was bland." (Unstable, chaotic).

The paper argues that while both chefs get the "average" score right, Chef B is dangerous to trust with a specific patient's life because their answer changes based on how they woke up that morning (random initialization).

The New Tools: Measuring the "Wobble"

The authors propose two new ways to test if a model is "wobbly" before we let it make medical decisions.

1. The "Shaky Ruler" (ePIW)

Imagine you are measuring a patient's risk of heart attack with a ruler.

  • Stable Model: You measure the patient 100 times. The ruler always says "12 inches."
  • Unstable Model: You measure the patient 100 times. The ruler says "11 inches," then "13 inches," then "11.5 inches."

The authors call this the Empirical Prediction Interval Width (ePIW). It measures how much the risk number "jitters" just because the computer restarted. If the number jumps around wildly, the model is unreliable, even if the average of all those numbers looks correct.
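The paper's exact estimator isn't reproduced here, but the idea can be sketched in a few lines of numpy, assuming ePIW is measured as the width of the central 95% interval of a patient's risk estimates across repeated retrainings:

```python
import numpy as np

def epiw(preds, coverage=0.95):
    """Per-patient empirical prediction interval width (a sketch).

    preds: array of shape (n_runs, n_patients) holding predicted risks
    from retraining the same model with different random seeds.
    Returns the width of the central `coverage` interval per patient.
    """
    lo = (1.0 - coverage) / 2.0 * 100
    hi = 100.0 - lo
    lower, upper = np.percentile(preds, [lo, hi], axis=0)
    return upper - lower

# Stable model: every retraining says 0.60 for this patient.
stable = np.full((100, 1), 0.60)
# Unstable model: the risk jitters between runs for the same patient.
rng = np.random.default_rng(0)
unstable = rng.uniform(0.40, 0.60, size=(100, 1))

print(epiw(stable))    # 0.0 — the ruler never moves
print(epiw(unstable))  # close to 0.2 — wide interval from seed noise alone
```

A wide ePIW flags exactly the "shaky ruler" case: the average of the runs can still look right while any single run is far off.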

2. The "Flip-Flop" (eDFR)

Now, imagine the doctor has a rule: "If the risk is over 50%, we operate. If under 50%, we don't."

  • Stable Model: The risk is always 60%. The doctor always operates.
  • Unstable Model: The risk bounces between 49% and 51%.
    • Run 1: 49% -> No surgery.
    • Run 2: 51% -> Surgery.
    • Run 3: 49% -> No surgery.

The authors call this the Empirical Decision Flip Rate (eDFR). It counts how often the model changes its mind about a life-or-death decision purely because of harmless randomness in how the computation was started.
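A minimal sketch of the flip-rate idea, again not the paper's exact estimator: here eDFR is computed per patient as the fraction of retrained runs that disagree with the majority decision, so 0.0 means perfectly consistent and 0.5 means a coin flip:

```python
import numpy as np

def edfr(preds, threshold=0.5):
    """Per-patient empirical decision flip rate (a sketch).

    preds: array of shape (n_runs, n_patients) of predicted risks.
    A patient "flips" when retrained runs disagree on which side of the
    decision threshold the risk falls.
    """
    decisions = preds >= threshold     # (n_runs, n_patients) of True/False
    rate_yes = decisions.mean(axis=0)  # fraction of runs voting "operate"
    return np.minimum(rate_yes, 1.0 - rate_yes)

risks = np.array([
    [0.60, 0.49],   # run 1: patient A at 60%, patient B at 49%
    [0.61, 0.51],   # run 2
    [0.59, 0.49],   # run 3
    [0.60, 0.52],   # run 4
])
print(edfr(risks))  # patient A: 0.0 (never flips); patient B: 0.5
```

Patient B's risk barely moves in absolute terms, yet the surgery decision flips on half the runs, which is precisely the danger the threshold hides.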

What They Found

They tested this on simulated data and on real data from heart attack patients (the GUSTO-I clinical trial dataset).

  1. Simple vs. Complex: Simple models (like Logistic Regression) were very stable. They gave the same answer every time. Complex models (Neural Networks) were very "wobbly."
  2. The Hidden Danger: The complex models often had the same overall accuracy as the simple ones. If you only looked at the "average score," you would think they were equally good. But when you looked at individual patients, the complex models were flipping their recommendations randomly.
  3. The "Random Seed" Effect: Surprisingly, just changing the random starting point of the computer code caused the complex models to change their minds almost as much as if you had given them a completely different set of patient data.
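The seed-only experiment in point 3 can be sketched end to end. The tiny network and synthetic data below are illustrative stand-ins, not the paper's actual models or the GUSTO-I data; the key point is that the data and training procedure are held fixed while only the random initialization changes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, y, seed, hidden=16, lr=0.5, steps=200):
    """Train a one-hidden-layer network; only `seed` varies between runs."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(size=(X.shape[1], hidden))
    W2 = rng.normal(size=(hidden, 1))
    for _ in range(steps):
        h = np.tanh(X @ W1)                     # hidden activations
        p = sigmoid(h @ W2).ravel()             # predicted risks
        g = (p - y)[:, None] / len(y)           # log-loss gradient at output
        grad_W2 = h.T @ g
        grad_W1 = X.T @ ((g @ W2.T) * (1.0 - h**2))
        W2 -= lr * grad_W2
        W1 -= lr * grad_W1
    return W1, W2

def predict(X, W1, W2):
    return sigmoid(np.tanh(X @ W1) @ W2).ravel()

# Fixed training data and one fixed "patient" to score.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)
patient = X[:1]

# Retrain 20 times; nothing changes between runs except the seed.
risks = np.array([predict(patient, *train_mlp(X, y, s))[0]
                  for s in range(20)])
print(risks.min(), risks.max())  # the "wobble" for one patient
```

Running the same loop with a convex model such as logistic regression would collapse the spread to nearly zero, which is the simple-versus-complex contrast in point 1.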

The Takeaway: Why This Matters

In healthcare, consistency is trust.

If a doctor asks an AI, "Should I give this patient a drug?" and the AI says "Yes" today, "No" tomorrow, and "Yes" the next day, the doctor cannot trust it. It doesn't matter if the AI is "smart" on average; if it can't make up its mind for a specific person, it's useless for that person.

The authors' advice:
When choosing an AI for high-stakes medical decisions, don't just ask, "How accurate is it on average?"
Ask, "If I run this model 100 times, does it give me the same answer for the same patient?"

If two models are equally accurate, choose the simpler, more stable one. It's better to have a reliable, slightly less "fancy" tool than a brilliant tool that acts like a Magic 8-Ball.

Summary Checklist for Practitioners

Before using a model in a hospital, ask:

  1. The Jitter Test: If I re-run the model 100 times, does the patient's risk score stay the same, or does it wobble?
  2. The Flip Test: Does the model change its "Yes/No" decision just because I pressed "Restart"?
  3. The Specifics: Is the model unstable for the specific group of patients I care about (e.g., those right on the edge of needing surgery)?

Bottom Line: In medicine, a model that is "right on average" but "wrong for the individual" is a liability. We need models that are consistent, not just clever.
