PSF-Med: Measuring and Explaining Paraphrase Sensitivity in Medical Vision Language Models

This paper introduces PSF-Med, a benchmark revealing significant paraphrase sensitivity in medical Vision Language Models, and demonstrates that a causal intervention on a single sparse feature, identified with a Sparse Autoencoder, can substantially reduce this instability while largely preserving accuracy.

Binesh Sadanandan, Vahid Behzadan

Published 2026-02-26

Imagine you have a very smart, highly trained medical assistant named "Dr. AI." This assistant can look at X-ray images and answer questions like, "Is there a broken bone?" or "Is the heart enlarged?"

You would hope that Dr. AI is consistent. If you ask, "Is there a broken bone?" and get a "Yes," you'd expect that if you asked, "Do I have a fracture?" (which means the exact same thing), the answer would still be "Yes."

But this paper reveals a scary glitch: Dr. AI is easily confused by how you phrase your question.

Here is the breakdown of the research, explained simply with some analogies.

1. The Problem: The "Paraphrase Sensitivity" Glitch

The researchers built a giant test called PSF-Med. They took thousands of chest X-rays and asked the same medical questions in dozens of different ways.

  • Question A: "Is there a collapsed lung?"
  • Question B: "Does this X-ray show a pneumothorax?" (Same meaning, different words).

The Result: The AI models often gave contradictory answers.

  • To Question A, it said: "No."
  • To Question B, it said: "Yes."

This is like a weather forecaster saying, "It's going to rain," when you ask, "Will it be wet?" but saying, "It will be dry," when you ask, "Is there a chance of precipitation?" In a hospital, this inconsistency is dangerous. If two doctors ask the same question in different ways and get opposite answers, they can't trust the machine.
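
If you like to see this as a number, the core idea behind a "flip rate" is easy to sketch. The snippet below is an illustration of the concept, not the paper's actual code: for every image-question pair, count how often the model's answer disagrees between two paraphrases of the same question.

```python
from itertools import combinations

def flip_rate(answers_per_question):
    """Fraction of paraphrase pairs where the model contradicts itself.

    `answers_per_question` maps each (image, clinical question) to the
    answers the model gave across its paraphrases, e.g.
    {"xray_01/pneumothorax": ["no", "yes", "yes"], ...}
    """
    flips = pairs = 0
    for answers in answers_per_question.values():
        for a, b in combinations(answers, 2):  # every pair of paraphrases
            pairs += 1
            flips += (a != b)
    return flips / pairs if pairs else 0.0

# Toy example: Question A got "no", Question B got "yes" -> flip rate 1.0
print(flip_rate({"xray_01/pneumothorax": ["no", "yes"]}))
```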

2. The Trap: "Low Flip Rates" Don't Mean "Good"

The researchers found something even more surprising. Some models were very consistent (they rarely changed their answer when you rephrased the question). You might think, "Great! That model is reliable!"

But wait. The researchers discovered that some of these "reliable" models weren't actually looking at the X-ray at all. They were just guessing based on the words.

  • The Analogy: Imagine a student who has memorized past exam questions so thoroughly that they can answer without ever opening the textbook. If you ask, "Is the sky blue?" they say "Yes." If you ask, "Is the atmosphere blue?" they still say "Yes." They are consistent, but they aren't actually seeing the sky.
  • The Finding: The most consistent models often ignored the image entirely and relied on "language shortcuts" (e.g., "If the question asks about a broken bone, the answer is usually 'Yes'"). A quick way to check for this is sketched right after this list.
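
One simple, hedged way to catch this "not looking at the X-ray" behaviour, assuming you can query the model both with the real scan and with a blank image, is to see how often its answers stay the same when the picture is taken away. The helper below is purely illustrative (`model_qa` is a hypothetical callable, not something from the paper); a score near 1.0 means the wording, not the image, is driving the answers.

```python
def language_shortcut_score(model_qa, questions, xray_image, blank_image):
    """Fraction of questions answered identically with and without the real image.

    `model_qa(image, question)` is assumed to return the model's answer string.
    """
    same = sum(model_qa(xray_image, q) == model_qa(blank_image, q) for q in questions)
    return same / len(questions)

# Usage idea: language_shortcut_score(my_model, paraphrased_questions, scan, black_image)
```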

3. The Detective Work: Finding the "Switch"

To understand why the AI was flipping its answers, the researchers used a special tool called a Sparse Autoencoder (SAE). Think of this as an X-ray for the AI's brain: it breaks the model's internal activity into thousands of individual "switches" (called features), so you can watch which ones turn on for a given question.

They found one specific switch, which they named Feature 3818.

  • What does this switch do? It acts like a tone detector.
    • When the question is formal and clinical (e.g., "Is there radiographic evidence of..."), the switch turns ON. The AI becomes conservative and says "No" (to be safe).
    • When the question is casual (e.g., "Can you see..."), the switch turns OFF. The AI becomes permissive and says "Yes."

The Glitch: The AI wasn't looking at the lung; it was just reacting to whether the doctor sounded like a professor or a friend.
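
To make that "switch" concrete: a Sparse Autoencoder turns a model's hidden state into thousands of mostly-zero features, and you can read any one of them off directly. The sketch below assumes you already have a trained SAE (`sae`) and a way to grab the model's hidden state for a prompt (`get_hidden_state`); those names are placeholders, and only the feature number 3818 comes from the paper.

```python
FEATURE_ID = 3818  # the tone-sensitive feature reported in the paper

def feature_activation(sae, hidden_state, feature_id=FEATURE_ID):
    """Encode a hidden state with the SAE and return one feature's average value."""
    features = sae.encode(hidden_state)   # shape (..., n_features), mostly zeros
    return features[..., feature_id].mean().item()

# Hypothetical usage: compare a clinical vs. a casual phrasing of the same question
# formal = feature_activation(sae, get_hidden_state("Is there radiographic evidence of pneumothorax?"))
# casual = feature_activation(sae, get_hidden_state("Can you see a collapsed lung?"))
# print(formal, casual)  # per the paper, the formal phrasing lights this feature up
```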

4. The Fix: Flipping the Switch Back

Once they found this "tone detector" switch, they tried to fix it. They essentially told the AI: "Ignore this switch. Don't let the tone of the question change your answer."

  • The Result: By "clamping" this specific switch (forcing it to stay off), they reduced the number of contradictory answers by 31%. The mechanics of this intervention are sketched below.
  • The Trade-off: The model got slightly less accurate overall (by about 1%), but it became much more reliable. It stopped guessing based on word choice and started actually looking at the picture.
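
Mechanically, "clamping" means editing the model's hidden state mid-computation: encode it with the SAE, force Feature 3818 to zero, decode back, and let the model carry on from the edited state. Here is a rough sketch of that idea as a PyTorch-style forward hook; the layer choice and the SAE interface are assumptions for illustration, not the authors' released code.

```python
FEATURE_ID = 3818

def clamp_feature_hook(sae, feature_id=FEATURE_ID):
    """Build a forward hook that zeroes one SAE feature in a layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        feats = sae.encode(hidden)        # sparse feature activations
        feats[..., feature_id] = 0.0      # clamp the "tone detector" off
        edited = sae.decode(feats)        # back to the model's hidden space
        # (a gentler variant subtracts only this feature's decoder direction
        #  instead of replacing the whole hidden state with the reconstruction)
        return (edited,) + output[1:] if isinstance(output, tuple) else edited
    return hook

# Hypothetical usage on one intermediate layer of the language model:
# handle = model.language_model.layers[TARGET_LAYER].register_forward_hook(clamp_feature_hook(sae))
# ... run the evaluation ...
# handle.remove()
```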

5. The Big Lesson

The paper concludes that we can't just measure if an AI is "accurate" or "consistent" in isolation.

  • Consistency is good, but only if the AI is actually looking at the image.
  • Accuracy is good, but only if the AI isn't just guessing based on how the question is phrased.

The Takeaway:
Before we let AI doctors into our hospitals, we need to test them not just on what they know, but on how they react to different ways of asking. We need to make sure they are looking at the X-ray, not just listening to the tone of our voice.

In short: The paper teaches us that a smart AI that changes its mind based on your vocabulary is dangerous, and a consistent AI that ignores the picture is useless. We need an AI that does both: looks at the image and stays calm, no matter how you ask.
