Imagine you have a brilliant, fast-talking medical intern named "AI-Robot." This intern is incredibly good at looking at X-rays and describing exactly what they see in perfect, fluent English. They can say, "I see a shadow here" or "The heart looks big there" with great confidence.
However, there's a catch: AI-Robot is a terrible logician.
Sometimes, AI-Robot sees a shadow but forgets to conclude, "This means there is fluid in the lung." Other times, it sees a perfectly healthy lung but confidently writes, "I diagnose pneumonia," just because it thinks that's what usually happens. It's like a student who memorized the answers to a test but didn't learn the math behind them. They might get the right answer by luck, or they might write a beautiful essay that makes no logical sense.
In the medical world, this is dangerous. Doctors can't just trust the intern's "fluent" report; they have to double-check every single sentence to make sure the conclusion actually follows from the evidence. This is exhausting and slows everything down.
The Problem: The "Fluency Trap"
Currently, we test these AI interns by comparing their reports to a "gold standard" report written by a human doctor. We use computer metrics (like counting how many words match) to grade them.
But this is like grading a math student only on their handwriting. If the student writes a beautiful, long essay that says "2 + 2 = 5," but uses fancy words, a word-counting system might give them a decent score because the words look similar to a real report. It fails to catch that the logic is broken.
The Solution: The "Logic Police" (Neurosymbolic Verification)
The researchers in this paper built a new system to act as a Logic Police Officer for these AI interns. They call it a "Neurosymbolic Verification Framework."
Here is how it works, using a simple analogy:
1. The Translator (Autoformalization)
First, the AI's messy, free-flowing English report is translated into a strict, rigid language that a computer can understand perfectly.
- The Analogy: Imagine the AI writes a story: "The sky is dark and the ground is wet."
- The Translator converts this into a strict rule: IF (Sky == Dark) AND (Ground == Wet) THEN (It is Raining).
- This removes all the ambiguity and "fluff" from the text.
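The paper's actual autoformalization is done by a learned model, but the basic idea of translating free text into structured facts can be sketched with a toy keyword lookup. Everything here (the patterns, the fact names, the `autoformalize` function) is invented for illustration:

```python
import re

# Toy autoformalizer: map free-text findings to structured facts.
# The real system uses a learned model; this keyword lookup only
# illustrates the "English in, logic out" translation step.
FINDING_PATTERNS = {
    r"dark shadow|opacity": "Opacity",
    r"heart looks big|enlarged heart": "Cardiomegaly",
    r"fluid in the lung|effusion": "PleuralEffusion",
}

def autoformalize(report: str) -> set[str]:
    """Translate a free-text report into a set of atomic facts."""
    facts = set()
    for pattern, fact in FINDING_PATTERNS.items():
        if re.search(pattern, report.lower()):
            facts.add(fact)
    return facts

facts = autoformalize("There is a dark shadow; the heart looks big.")
# facts == {"Opacity", "Cardiomegaly"}
```

Once the report is a set of facts instead of sentences, a computer can reason about it with no ambiguity left.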
2. The Rulebook (Clinical Knowledge Base)
The system has a giant, perfect rulebook written by real doctors.
- The Analogy: This rulebook says things like: "If you see a dark shadow at the bottom of the lung, it MUST mean there is fluid there." It's the absolute law of the land.
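A rulebook like this can be represented as "if all these premises hold, this conclusion must hold" pairs. The specific rules below are made-up examples, not the paper's actual clinical knowledge base:

```python
# Toy clinical rulebook: each entry reads "if every premise on the
# left is present, the conclusion on the right is forced."
# Rule contents are invented examples for illustration only.
RULES = [
    ({"Opacity", "BluntedCostophrenicAngle"}, "PleuralEffusion"),
    ({"Consolidation", "AirBronchograms"}, "Pneumonia"),
]

def forced_conclusions(facts: set[str]) -> set[str]:
    """Apply the rules repeatedly until no new conclusion appears."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in RULES:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived - facts

new = forced_conclusions({"Opacity", "BluntedCostophrenicAngle"})
# new == {"PleuralEffusion"}: the rulebook forces that diagnosis
```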
3. The Judge (The SMT Solver)
Now, the system brings the AI's translated code and the Rulebook to a super-strict judge (a computer program called Z3). The judge asks a simple question: "Does the evidence force the conclusion?"
The judge checks the logic mathematically, not by guessing. It can find three types of errors that normal tests miss:
- The Hallucination: The AI says, "I see a broken bone," but the evidence section never mentioned a bone. The Judge says: "FALSE. You made this up."
- The Missed Deduction: The AI says, "I see a broken bone," but forgets to write, "Therefore, the patient needs a cast." The Judge says: "FALSE. You saw the evidence but failed to do the math."
- The Consistent Report: The AI sees the evidence, applies the rules, and writes the correct conclusion. The Judge says: "TRUE. This logic is sound."
What Did They Find?
The researchers tested this "Logic Police" on seven different AI models using chest X-rays.
- Old Tests Failed: The standard word-matching tests gave high scores to models that were actually making logical mistakes. They were like a teacher who only checked if the essay looked pretty.
- The Logic Police Exposed Flaws: The new system found that many AI models were "stochastic hallucinators"—they were just guessing diagnoses based on patterns, not reasoning.
- The "Safety Filter": When they used this system to fix the reports (by deleting any diagnosis that couldn't be logically proven by the evidence), the reports became much safer.
- The Trade-off: The AI became slightly less "chatty" (it stopped guessing some things), but it became much more accurate when it did speak. It stopped lying.
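The safety-filter step above can be sketched in a few lines: keep a claimed diagnosis only if the rulebook's premises for it are all present in the evidence, and delete it otherwise. The rules and names are invented examples, not the paper's pipeline:

```python
# Toy safety filter: a diagnosis survives only if the evidence
# contains every premise its rule requires. Rule contents are
# invented examples for illustration only.
RULES = {
    "PleuralEffusion": {"Opacity", "BluntedCostophrenicAngle"},
    "Pneumonia": {"Consolidation"},
}

def safety_filter(evidence: set[str], claimed: list[str]) -> list[str]:
    """Return only the claims whose rule premises all hold."""
    return [d for d in claimed
            if d in RULES and RULES[d] <= evidence]

kept = safety_filter({"Opacity", "BluntedCostophrenicAngle"},
                     ["PleuralEffusion", "Pneumonia"])
# "Pneumonia" is dropped: nothing in the evidence forces it.
```

This is exactly the trade-off described above: the filtered report says less, but every diagnosis it keeps is backed by evidence.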
The Big Picture
This paper proposes a shift in how we trust AI in medicine. Instead of asking, "Does this report sound like a human wrote it?" we should ask, "Can this report be proven true using strict logic?"
It's the difference between trusting a magician who makes things look real, and trusting an engineer who can prove, with math, that the bridge won't collapse. By adding this "Logic Police" step, we can finally use AI assistants in hospitals without worrying that they are confidently making things up.