Mechanistically Guided LoRA Improves Paraphrase Consistency in Medical Vision-Language Models

This paper demonstrates that a mechanistically guided LoRA fine-tuning approach, leveraging transferred Sparse Autoencoders to balance paraphrase consistency with answer accuracy, significantly reduces response flip rates and margin differences in medical Vision-Language Models while maintaining stable diagnostic performance.

Binesh Sadanandan, Vahid Behzadan

Published 2026-03-03

Imagine you have a highly trained medical AI assistant, like a super-smart digital radiologist. You show it an X-ray and ask, "Is there a pneumothorax (collapsed lung)?" It confidently says, "Yes."

But then, you rephrase the question slightly: "Does this image show a pneumothorax?"

Surprisingly, the AI suddenly changes its mind and says, "No."

This is the problem the paper tackles. Even though the two questions mean the exact same thing to a human, the AI gets confused by the way the words are arranged. In a real hospital, this is dangerous. If a doctor asks the same question in two different ways and gets two different answers, they can't trust the machine.

Here is how the researchers fixed it, explained through simple analogies:

1. The Problem: The "Fickle Friend"

Think of the AI as a friend who is great at diagnosing problems but has a very short attention span. If you ask them a question politely, they give a great answer. If you ask the same question casually, they get confused and give a different answer.

The researchers found that this AI (called MedGemma) was flipping its answers about 15% of the time just because the wording changed. It wasn't that the AI didn't know the answer; it was that its internal "brain" was reacting to the style of the question rather than the meaning.
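A flip rate like this is simple to measure once you have the model's answers for each paraphrase pair. Here is a minimal sketch with made-up illustrative answers (not the paper's data):

```python
# Toy sketch: measuring paraphrase flip rate, assuming we already have
# the model's answers for each (original, rephrased) question pair.
# The answer lists below are invented for illustration only.

def flip_rate(answers_a, answers_b):
    """Fraction of paraphrase pairs where the model's answer changes."""
    assert len(answers_a) == len(answers_b)
    flips = sum(a != b for a, b in zip(answers_a, answers_b))
    return flips / len(answers_a)

original = ["yes", "no", "yes", "yes", "no", "yes", "no", "yes", "no", "yes",
            "yes", "no", "yes", "no", "yes", "no", "yes", "no", "yes", "no"]
rephrased = ["yes", "no", "no", "yes", "no", "yes", "no", "yes", "no", "yes",
             "yes", "no", "yes", "no", "yes", "no", "yes", "yes", "yes", "no"]

print(flip_rate(original, rephrased))  # → 0.1 (2 flips out of 20 pairs)
```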

2. The Detective Work: Finding the "Glitch"

To fix this, the researchers didn't just guess; they acted like detectives using a special tool called a Sparse Autoencoder (SAE). Think of this tool as an X-ray for the AI's brain. It lets you see which specific neurons are firing when the AI processes a question.

They discovered a specific "neuron" (Feature 3818) in the middle of the AI's brain (Layer 17) that was acting like a sensitive mood ring.

  • When the question was phrased as "Is there...?" (a presence question), this neuron lit up like a Christmas tree.
  • When the question was phrased as "Can you rule out...?" (an exclusion question), this neuron went completely dark.

This neuron was so sensitive to the style of the question that it was actually pushing the AI's final decision in the wrong direction, causing it to flip from "Yes" to "No."
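The detective work boils down to a simple comparison: encode the model's hidden states through the SAE, then look for the feature whose average activation differs most between the two question styles. This toy sketch (random weights and invented hidden states, not the paper's actual SAE or MedGemma activations) shows the shape of that search:

```python
import numpy as np

# Toy illustration: a sparse autoencoder encodes a hidden state h into
# sparse features f = relu(W_enc @ h + b_enc). To find a "style-sensitive"
# feature, compare each feature's mean activation on presence-style
# ("Is there...?") vs exclusion-style ("Can you rule out...?") prompts.
# All weights and hidden states here are random stand-ins.

rng = np.random.default_rng(0)
d_model, d_sae = 8, 16
W_enc = rng.normal(size=(d_sae, d_model))
b_enc = np.zeros(d_sae)

def sae_features(h):
    return np.maximum(W_enc @ h + b_enc, 0.0)  # ReLU zeroes most features

# Made-up hidden states for the two phrasings of the same question.
presence_h = rng.normal(size=(50, d_model)) + 1.0
exclusion_h = rng.normal(size=(50, d_model)) - 1.0

mean_presence = np.mean([sae_features(h) for h in presence_h], axis=0)
mean_exclusion = np.mean([sae_features(h) for h in exclusion_h], axis=0)

# The feature with the largest activation gap is the "mood ring" candidate.
gap = mean_presence - mean_exclusion
print("most style-sensitive feature:", int(np.argmax(np.abs(gap))))
```

In the paper's setting, this kind of comparison is what singled out Feature 3818 at Layer 17.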

3. The Fix: The "Balanced Diet" Training

The researchers tried to fix this by teaching the AI to ignore the wording. However, they hit a snag.

The Trap: When they tried to force the AI to be consistent, the AI got lazy. It realized that if it just answered "Yes" to every single question, it would be perfectly consistent (since "Yes" always equals "Yes"). But this is useless because it stops being a doctor and becomes a broken record. This is called Mode Collapse.

The Solution: The researchers created a new training recipe called a "Combined Loss."
Think of this like training a student with two rules:

  1. Rule A (Consistency): "If I ask you the same question in two different ways, you must give the same answer."
  2. Rule B (Accuracy): "But you still have to get the right answer based on the X-ray."

By balancing these two rules, the AI learned to be consistent without giving up on being accurate. It stopped guessing randomly and started paying attention to the image, not just the grammar.
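One plausible way to write the two rules as a single training objective (the paper's exact formulation may differ) is: an accuracy term that is just cross-entropy against the true label, plus a consistency term that penalizes disagreement between the answer distributions for the two paraphrases, weighted by a coefficient `lam`:

```python
import numpy as np

# Sketch of a combined loss under assumed definitions:
#   Rule B (accuracy):    cross-entropy against the true label
#   Rule A (consistency): symmetric KL between the two paraphrases'
#                         answer distributions
# The exact loss in the paper may be formulated differently.

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, label):
    return -np.log(p[label] + 1e-12)

def sym_kl(p, q):
    eps = 1e-12
    return float(np.sum(p * np.log((p + eps) / (q + eps))) +
                 np.sum(q * np.log((q + eps) / (p + eps))))

def combined_loss(logits_a, logits_b, label, lam=1.0):
    p, q = softmax(logits_a), softmax(logits_b)
    accuracy = cross_entropy(p, label) + cross_entropy(q, label)  # Rule B
    consistency = sym_kl(p, q)                                    # Rule A
    return accuracy + lam * consistency

# Two paraphrases whose yes/no logits disagree get penalized twice over:
logits_is_there = np.array([2.0, -1.0])   # leans "yes" (index 0)
logits_rule_out = np.array([-0.5, 1.5])   # leans "no"
print(combined_loss(logits_is_there, logits_rule_out, label=0))
```

Note why this blocks Mode Collapse: answering "yes" to everything zeroes out the consistency term, but the accuracy term keeps punishing every wrong "yes", so the lazy strategy is never the cheapest one.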

4. The Surprise: Fixing the Foundation, Not the Roof

The researchers expected to fix the problem by tweaking the specific "mood ring" neuron they found (Layer 17). They thought, "If we fix the glitch in the middle, the problem goes away."

But when they tested different layers of the AI, they found something surprising: The best fix was actually at the very beginning (Layers 0–10).

The Analogy: Imagine a factory assembly line making cars.

  • The glitch was happening at the painting station (Layer 17), where the car was getting the wrong color because of the worker's mood.
  • The researchers thought they needed to fire the painter.
  • But they discovered that if they fixed the blueprint at the very start of the line (Layer 0–10), the painter never got confused in the first place.

By making small adjustments to the early layers of the AI, they prevented the confusion from ever happening, rather than trying to correct it after the AI had already made a mistake.
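Mechanically, "fixing the blueprint" means attaching the LoRA adapters only to the early layers and leaving the rest frozen. Libraries such as Hugging Face PEFT expose this through `LoraConfig`'s `layers_to_transform` argument; the sketch below (made-up shapes, not the paper's configuration) just shows the core idea: a frozen weight `W` plus a low-rank delta `B @ A` that is only applied on layers 0 to 10:

```python
import numpy as np

# Minimal LoRA sketch with invented dimensions: the base weight W stays
# frozen everywhere, and a trainable low-rank update B @ A is attached
# only to the early layers (0-10). Later layers, including Layer 17,
# keep their original weights untouched.

rng = np.random.default_rng(0)
d, r = 32, 4                      # hidden size and LoRA rank (assumed)
early_layers = set(range(0, 11))  # layers 0-10 receive adapters

def lora_weight(W, layer_idx, A, B, alpha=8):
    """Effective weight: W plus a scaled low-rank update on early layers."""
    if layer_idx in early_layers:
        return W + (alpha / r) * (B @ A)
    return W                      # later layers stay frozen as-is

W = rng.normal(size=(d, d))
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-init

# At initialization B = 0, so the adapted model matches the base model.
assert np.allclose(lora_weight(W, 3, A, B), W)

B = rng.normal(size=(d, r)) * 0.01  # pretend a training step made B nonzero
assert not np.allclose(lora_weight(W, 3, A, B), W)  # early layer adapted
assert np.allclose(lora_weight(W, 17, A, B), W)     # Layer 17 untouched
```

The zero-initialized `B` is the standard LoRA trick: training starts from exactly the base model and then nudges only the early-layer representations.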

The Results

After this training:

  • Confusion dropped: The AI's answer flip rate fell from about 15% to just 4%.
  • Stability increased: Even when the answer didn't flip, the AI's confidence level became much more stable.
  • Accuracy stayed high: The AI didn't get lazy; it remained a good doctor, keeping its accuracy high.

Why This Matters

This paper shows that we can make medical AI safer and more reliable. By understanding how the AI thinks (mechanistic interpretability) and training it with a balanced approach, we can ensure that a doctor gets the same trustworthy answer, no matter how they phrase their question. It turns a fickle, confusing assistant into a steady, reliable partner.