This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine you are a detective trying to solve a mystery. In a standard test, the police hand you a complete file with every clue, every witness statement, and every piece of evidence all at once. You read it, think for a moment, and write down your final conclusion. This is how most AI models are currently tested: they get the whole story at once and give an answer.
But in real life, being a doctor (or a detective) isn't like that. Information arrives in dribs and drabs. First, you see the patient. Then, you get the blood test results. Then, the X-ray comes back. Then, a specialist calls with a new opinion. At every step, you have to update your theory.
This paper, "Measuring the Unmeasurable," asks a scary question: What happens when an AI detective tries to solve a mystery as the clues arrive one by one?
The author, Stella Wang, discovered that AI has a very human-like flaw: it gets confused by the latest news and forgets the truth it found earlier.
Here is the breakdown of the study using simple analogies:
1. The "Amnesia" Problem: Convergence Regression
The study found a specific failure mode called Convergence Regression.
- The Scenario: Imagine the AI is solving a case. At Step 2, it correctly identifies the culprit (let's say, "It's the butler!"). It writes this down.
- The Glitch: At Step 3, a new clue arrives that sounds very similar to a different suspect (the "Gardener"). Because the Gardener clue is fresh and loud, the AI's brain flips. It thinks, "Oh, the Gardener makes more sense now!" and it silently deletes "The Butler" from its list of suspects.
- The Result: By the end, the AI confidently accuses the Gardener, even though it knew the Butler was the right answer just a moment ago.
The paper calls this an "Access-Stability Dissociation." The AI accessed the right answer (it saw it), but it couldn't stabilize on it (it couldn't hold onto it). It's like a student who knows the answer to a math problem but gets distracted by a new, confusing variable and changes their answer to the wrong one.
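The summary above doesn't spell out how the paper scores this dissociation, but a toy measurement makes the distinction concrete: "access" asks whether the correct answer ever appeared on the AI's working list, and "stability" asks whether it was still there at the end. The sketch below is a hypothetical illustration, not the author's actual code.

```python
def access_and_stability(step_lists: list[list[str]], truth: str) -> tuple[bool, bool]:
    """Toy measure of the access-stability dissociation.

    accessed   -> the correct answer appeared on the working list at some step
    stabilized -> the correct answer was still there at the final step
    """
    accessed = any(truth in step for step in step_lists)
    stabilized = bool(step_lists) and truth in step_lists[-1]
    return accessed, stabilized

# The butler/gardener story: the right answer appears at step 2, then is silently dropped.
steps = [["gardener"], ["butler", "gardener"], ["gardener"]]
print(access_and_stability(steps, "butler"))  # (True, False) -> accessed but not stabilized
```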
2. The "Safety Net": SIPS
To fix this, the author built a tool called SIPS (Sequential Information Prioritization Scaffold). Think of SIPS as a strict teacher or a flight recorder for the AI.
Without SIPS, the AI is allowed to think freely and change its mind without explaining why. With SIPS, the AI is forced to fill out a worksheet at every single step (a rough code sketch of such a worksheet follows this list):
- List your top suspects.
- Did you add a new one? Why?
- Did you kick a suspect off the list? You must write a paragraph explaining exactly why.
- Are you still sure? What would change your mind?
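The preprint's exact prompt format isn't reproduced in this summary, but here is a minimal sketch of what such a per-step worksheet might look like as a data structure. Every name in it (StepWorksheet, differential, silent_drops) is a hypothetical illustration, not the author's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class StepWorksheet:
    """Hypothetical record the model must fill in after each new piece of information."""
    step: int                    # which information-reveal step this is
    new_evidence: str            # the clue that just arrived
    differential: list[str]      # current ranked list of suspects / diagnoses
    added: dict[str, str] = field(default_factory=dict)    # new candidate -> stated reason
    removed: dict[str, str] = field(default_factory=dict)  # dropped candidate -> stated reason
    what_would_change_my_mind: str = ""                     # the "are you still sure?" entry

def silent_drops(previous: StepWorksheet, current: StepWorksheet) -> list[str]:
    """Candidates that vanished from the list without a written justification."""
    dropped = set(previous.differential) - set(current.differential)
    return [d for d in dropped if d not in current.removed]

# Step 3 drops the butler with no explanation -> the scaffold would flag this and demand a reason.
step2 = StepWorksheet(step=2, new_evidence="blood test", differential=["butler", "gardener"])
step3 = StepWorksheet(step=3, new_evidence="X-ray", differential=["gardener"])
print(silent_drops(step2, step3))  # ['butler']
```

A real scaffold would live in the prompt rather than in Python, but the principle is the same: no suspect leaves the list without a written reason.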
The Magic Effect:
When the AI had to use SIPS, it stopped "forgetting" the right answer. Even when new clues confused it, the "teacher" forced it to keep the correct suspect on the list, even if that suspect was no longer #1.
- Without SIPS: The AI finds the right answer 90% of the time but only keeps it 60% of the time. (That is a 30-percentage-point drop: it abandons roughly a third of the answers it initially found; the snippet after this list works through the arithmetic.)
- With SIPS: The AI finds the right answer 80% of the time and keeps it 80% of the time. (It never loses the right answer once it finds it).
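The short calculation below works through those illustrative numbers (the exact figures reported in the paper may differ), showing why the drop without SIPS amounts to losing about a third of the answers the AI initially found.

```python
def retention(found_rate: float, kept_rate: float) -> tuple[float, float]:
    """Percentage-point drop, and the share of initially-found answers that get lost."""
    drop = found_rate - kept_rate
    lost_share = drop / found_rate if found_rate else 0.0
    return drop, lost_share

print(retention(0.90, 0.60))  # without SIPS: ~(0.30, 0.33) -> about a third of found answers are lost
print(retention(0.80, 0.80))  # with SIPS:     (0.0, 0.0)   -> nothing found is ever lost
```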
3. The "Hesitant Detective" Paradox
Here is the twist. While SIPS stopped the AI from forgetting the right answer, it made the AI too afraid to commit.
Because the AI had to write a long explanation for every change, it became very cautious. It kept too many suspects on the list.
- The Trade-off: The AI became a better "tracker" (it kept the right answer in the mix) but a worse "decider" (it couldn't pick the single best answer).
- The Metaphor: Imagine a judge who is so afraid of making a mistake that they keep every possible suspect in the courtroom, refusing to convict anyone. They are safe, but they aren't decisive.
The paper calls this the Convergence Hesitancy Paradox. The AI is now stable, but it's hesitant.
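One way to picture the trade-off is a toy scoring rule (not taken from the paper): a good "tracker" keeps the correct answer somewhere on its final list, while a good "decider" ranks it #1.

```python
def tracker_and_decider(final_ranked_list: list[str], truth: str) -> tuple[bool, bool]:
    """Toy scoring for the Convergence Hesitancy Paradox.

    tracked -> the correct answer is still somewhere on the final list
    decided -> the correct answer is the single top-ranked choice
    """
    tracked = truth in final_ranked_list
    decided = bool(final_ranked_list) and final_ranked_list[0] == truth
    return tracked, decided

# A hesitant model keeps the butler around but still ranks the gardener first:
print(tracker_and_decider(["gardener", "butler", "cook", "maid"], "butler"))  # (True, False)
```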
4. Why This Matters for the Real World
The author argues that we shouldn't just care if the AI gets the "right answer" at the end. We need to know how it got there.
- The Danger: If a doctor trusts an AI that suffers from "Convergence Regression," the AI might give a confident, well-written explanation for the wrong diagnosis because it forgot the correct one it found earlier. The doctor, seeing a confident AI, might make a fatal mistake (this is called "automation bias").
- The Solution: We need "Scaffolding" (like SIPS) not just to make AI smarter, but to make it auditable. We need to see the "flight recorder" to know if the AI changed its mind for a good reason or just got confused.
Summary in One Sentence
This paper shows that AI gets "distracted" when information arrives piece by piece, causing it to forget the right answer; however, by forcing the AI to write down its reasoning step-by-step (like a strict teacher), we can stop it from forgetting, even if it becomes a little too cautious to commit to a final decision.
The Big Takeaway: In the future, we won't just ask AI "What is the answer?" We will ask, "Show me your notebook, and explain exactly why you changed your mind." That is the only way to trust AI in real medicine.