VERI-DPO: Evidence-Aware Alignment for Clinical Summarization via Claim Verification and Direct Preference Optimization

The paper introduces VERI-DPO, an evidence-aware alignment framework that uses claim verification to mine preference pairs for Direct Preference Optimization (DPO), significantly reducing unsupported claims and improving the faithfulness of clinical summaries while maintaining informative length.

Weixin Liu, Congning Ni, Qingyuan Song, Susannah L. Rose, Christopher Symons, Murat Kantarcioglu, Bradley A. Malin, Zhijun Yin

Published Thu, 12 Ma

Imagine you are a doctor writing a "Brief Hospital Course" (BHC) for a patient's discharge summary. This is a short story about what happened to the patient during their stay, meant to be read by the next doctor who will take over their care. It needs to be accurate, detailed, and trustworthy.

Now, imagine you hire a very smart but slightly overconfident AI assistant to write this story for you. The AI is great at writing, but sometimes it gets too creative. It might invent a surgery that never happened or claim a lab test improved when it actually didn't. In the medical world, these "creative lies" are dangerous.

This paper introduces a new system called VERI-DPO to fix this problem. Think of it as a three-step process to train the AI to be a perfect, honest medical scribe.

The Problem: The "Lazy" or "Hallucinating" AI

Current AI models have two bad habits when writing these summaries:

  1. Hallucination: They make things up to sound impressive (e.g., "The patient took a new drug" when they didn't).
  2. The "Say-Less" Trick: To avoid lying, some AI models learn to just say very little. They write a tiny, vague summary like "The patient was here." This is technically safe (no lies), but it's useless because it doesn't give the next doctor any real information.

The Solution: The "Fact-Checker" and the "Coach"

The authors built a system with three main characters:

1. The Fact-Checker (The Verifier)

Imagine a strict, super-fast librarian who has access to every single note, lab result, and X-ray report from the patient's file.

  • How it works: When the AI writes a sentence (a "claim"), the Fact-Checker looks at the patient's actual records.
  • The Verdict: It gives the sentence one of three stamps:
    • ✅ Supported: "Yes, I found this in the notes."
    • ❌ Not Supported: "No, this never happened. You made this up."
    • ❓ Not Addressed: "I don't see this in the notes, but I also don't see proof it didn't happen."

The researchers trained this Fact-Checker to be very good at spotting the "❌ Not Supported" lies.
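The Fact-Checker's three-way verdict can be sketched as a tiny interface. This is a toy stand-in, not the paper's model: the real verifier is a trained classifier that reads the raw patient notes, while here the record is pre-digested into hypothetical fact sets (`supported_facts`, `contradicted_facts` are illustrative names):

```python
from enum import Enum

class VerifierLabel(Enum):
    SUPPORTED = "supported"          # "Yes, I found this in the notes."
    NOT_SUPPORTED = "not_supported"  # "No, this never happened."
    NOT_ADDRESSED = "not_addressed"  # "The notes are silent on this."

def verify_claim(claim: str,
                 supported_facts: set[str],
                 contradicted_facts: set[str]) -> VerifierLabel:
    """Toy verifier: look the claim up against pre-extracted fact sets.
    The trained model in the paper makes this judgment from the notes directly."""
    if claim in supported_facts:
        return VerifierLabel.SUPPORTED
    if claim in contradicted_facts:
        return VerifierLabel.NOT_SUPPORTED
    return VerifierLabel.NOT_ADDRESSED
```

The key design point survives even in the toy: "not supported" (evidence against) and "not addressed" (no evidence either way) are distinct verdicts, and only the first is treated as a hard lie.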

2. The Coach (Preference Mining)

Now, imagine the AI writes eight different versions of the same hospital story.

  • The Fact-Checker reads all eight versions and stamps them.
  • The Coach looks at the results and picks the "Best" version and the "Worst" version to create a lesson.
    • The Winner (Chosen): A story that is long, detailed, and has very few "❌" stamps.
    • The Loser (Rejected): A story that is either short and vague (the "say-less" trick) or full of "❌" stamps (lies).
  • The Goal: The Coach teaches the AI: "You want to be like the Winner, not the Loser. Be detailed, but don't lie."
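The Coach's selection step amounts to scoring each candidate summary by its stamps and keeping the extremes. The function name and the penalty weight below are assumptions for illustration, not the paper's exact criterion; note that counting supported claims (rather than penalizing length) is what discourages the "say-less" trick:

```python
def mine_preference_pair(candidates):
    """candidates: list of (summary_text, labels), where labels is one
    verifier verdict per claim: 'supported' / 'not_supported' / 'not_addressed'.
    Returns (chosen, rejected) for a DPO training pair."""
    def score(item):
        _, labels = item
        supported = labels.count("supported")       # rewards verified detail
        fabricated = labels.count("not_supported")  # punishes lies
        return supported - 3 * fabricated           # weight 3 is an assumption
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[0][0], ranked[-1][0]  # best as Winner, worst as Loser
```

Under this rule, a short vague summary scores low (few supported claims) and a fabricated one scores even lower, so either can end up as the Loser, exactly the lesson the Coach wants to teach.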

3. The Training (Direct Preference Optimization - DPO)

This is the actual learning phase. Instead of just telling the AI "You were wrong," the system uses the Winner vs. Loser pairs to retrain the AI's brain.

  • It's like a sports coach showing an athlete a video of a perfect play next to a video of a mistake, saying, "Do it exactly like the first one."
  • The AI learns to internalize the Fact-Checker's rules. It learns that being detailed is good, but being honest is better.
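The Winner-vs-Loser lesson is the standard DPO objective. A minimal scalar version, assuming the total sequence log-probabilities have already been computed under the policy being trained and a frozen reference model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.
    logp_*      : sequence log-probs under the policy being trained
    ref_logp_*  : the same sequences under the frozen reference model
    beta        : how hard the policy is pulled toward the preference"""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)): shrinks as the policy raises the
    # Winner's probability relative to the Loser's, versus the reference.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

No separate reward model is needed: the preference pair itself is the training signal, which is why the Fact-Checker's verdicts can flow straight into the loss.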

The Results: A Miracle in the ICU

The researchers tested this on 100 real ICU patients. Here is what happened:

  • Before (The Old AI): The AI wrote summaries with about 11% false claims. If it said a patient had a surgery, there was a 1 in 10 chance it was a lie.
  • After (VERI-DPO): The new AI dropped the false claims to just 1.9% (using their internal Fact-Checker) and 6.4% (using a different, powerful AI judge).
  • The "Say-Less" Problem Solved: Crucially, the new AI didn't just get shorter to avoid lying. It actually wrote longer, more detailed summaries that were packed with useful, verified facts.

Why This Matters

Think of this like a quality control inspector in a factory.

  • Old AI: The factory produces 100 widgets, but 10 are broken.
  • Old Fix: The factory stops making widgets to ensure 0 are broken (but then you have no widgets).
  • VERI-DPO: The factory installs a smart inspector who catches the broken ones while the machine is learning. The machine learns to make 100 perfect widgets without slowing down.

The Bottom Line

VERI-DPO is a way to teach AI to be a truthful, detailed, and helpful medical writer. It uses a "Fact-Checker" to catch lies, a "Coach" to pick the best examples, and a "Training" method to make the AI learn from those examples. The result is a system that can write hospital summaries that doctors can actually trust, without having to read through pages of lies or vague nonsense.