Imagine you have hired a brilliant but slightly unreliable assistant (an AI) to read thousands of medical documents and pull out important facts, like drug side effects or X-ray findings. The problem is, this assistant is terrible at knowing when it's right and when it's wrong. Sometimes it's 100% sure it's correct when it's actually making a mistake (overconfident), and other times it's shaking in its boots when it's actually spot on (underconfident).
If you let this assistant run a hospital without checking its work, it could silently give dangerous advice.
This paper introduces a "Safety Net System" (called Conformal Prediction) to fix this. Instead of trusting the AI's "gut feeling," the system acts like a strict bouncer at a club, deciding which answers to let through and which to send back for human review, guaranteeing that the rate of mistakes stays below a safe limit.
Here is the breakdown of their discovery using simple analogies:
1. The Two Different Worlds
The researchers tested their safety net in two very different "rooms":
Room A: The Structured Library (FDA Drug Labels)
- The Vibe: These are official government documents. They are rigid, follow strict rules, and look like forms.
- The AI's Behavior: The AI was underconfident. It was like a nervous student who knows the answer but is afraid to raise their hand. It would say, "I'm only 60% sure this drug causes nausea," even though it was actually 100% correct.
- The Result: Because the AI was so cautious, the safety net didn't have to work hard. It let almost everything through because the AI was rarely wrong, just scared.
Room B: The Chaotic ER (Radiology Reports)
- The Vibe: These are doctors' free-text notes. They are messy, use shorthand, and are full of "maybe" and "possibly."
- The AI's Behavior: The AI was overconfident. It was like a know-it-all who guesses wildly but speaks with total authority. It would say, "I'm 99% sure the lung is clear," even when the report said "we can't rule out pneumonia."
- The Result: The safety net had to work overtime. It had to throw out a huge chunk of the AI's answers to keep the error rate low.
2. The "Bouncer" Mechanism (Conformal Prediction)
Think of the AI's confidence score as a ticket.
- In the Structured Library, the tickets were mostly low numbers (nervous AI), but the content was good. The bouncer said, "Okay, you can all come in."
- In the Chaotic ER, the tickets were high numbers (confident AI), but many were fake. The bouncer had to check every single ticket.
- Model A (GPT-4.1): Was so overconfident and messy that the bouncer had to reject 60% of its answers to keep things safe.
- Model B (Llama-4): Was slightly better at knowing its limits. The bouncer only had to reject 20% of its answers.
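Concretely, the bouncer's cutoff is not hand-picked: it is computed from a held-out calibration set of answers whose correctness is already known. A minimal Python sketch of the idea (the data is synthetic, the names are illustrative, and this greedy scan omits the finite-sample correction that full conformal prediction adds):

```python
import numpy as np

def pick_cutoff(conf, correct, alpha=0.10):
    """Find the lowest confidence cutoff whose accepted answers keep an
    empirical error rate <= alpha on the calibration set."""
    order = np.argsort(conf)[::-1]           # most confident answers first
    conf, correct = conf[order], correct[order]
    errors = np.cumsum(~correct)             # running count of mistakes
    accepted = np.arange(1, len(conf) + 1)   # running count of answers let in
    ok = errors / accepted <= alpha          # cutoffs that respect the budget
    if not ok.any():
        return 1.01                          # no safe cutoff: reject everything
    k = np.where(ok)[0].max()                # lowest cutoff = largest accepted set
    return float(conf[k])

# Toy calibration data: 1,000 answers from a roughly calibrated model.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, 1000)
correct = rng.uniform(size=1000) < conf      # right about conf% of the time
cutoff = pick_cutoff(conf, correct, alpha=0.10)
```

An overconfident model (high confidence, low accuracy) pushes the cutoff up and forces many rejections, which is exactly the GPT-4.1 behavior above; a merely underconfident model leaves the cutoff low and lets most answers through.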
3. The Big Surprise: "One Size Does Not Fit All"
The most important lesson from this paper is that you cannot use the same safety rules for every type of medical document.
- If you set the safety net too loose for the messy ER notes, the AI's confident mistakes slip through unreviewed and patients get hurt.
- If you set the safety net too tight for the structured drug labels, you waste time checking answers that were actually perfect.
The researchers found that the AI's "personality" (whether it's a nervous student or a cocky guesser) changes depending on the document type.
4. Why This Matters
In the past, people tried to "calibrate" the AI once and apply it everywhere, like tuning a radio for one station and expecting it to work on all others. This paper proves that doesn't work.
The Solution: We need a dynamic safety net.
- When the AI reads a rigid drug label, we can trust it more.
- When the AI reads a messy doctor's note, we need to be much more skeptical and have a human double-check more of the work.
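In code, the dynamic safety net can be as simple as keeping one calibrated cutoff per document type. The cutoff values below are made up for illustration; in practice each would come from its own calibration run on that domain's held-out data:

```python
# Hypothetical cutoffs, each calibrated separately on its own domain.
CUTOFFS = {
    "drug_label": 0.55,        # underconfident-but-reliable domain: trust more
    "radiology_report": 0.92,  # overconfident-and-messy domain: trust less
}

def needs_human_review(doc_type: str, confidence: float) -> bool:
    """Auto-accept an extraction only if the model's confidence clears
    the cutoff calibrated for this document type."""
    # Unknown domains carry no calibration guarantee: always escalate.
    cutoff = CUTOFFS.get(doc_type, float("inf"))
    return confidence < cutoff

# The same 0.90-confident answer is trusted on a drug label
# but escalated to a human on a radiology report.
print(needs_human_review("drug_label", 0.90))        # False
print(needs_human_review("radiology_report", 0.90))  # True
```

The design choice is the point of the paper: the routing logic stays trivial, and all the safety work lives in calibrating each domain's cutoff separately.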
The Takeaway
This paper gives us a mathematical "seatbelt" for AI in medicine. It ensures that no matter how confident the AI says it is, we can mathematically guarantee that the number of mistakes stays below a safe limit (like 5% or 10%). It teaches us that in medicine, context is king: the rules for trusting an AI must change depending on whether it's reading a form or a messy note.