Mechanistically Guided LoRA Improves Paraphrase Consistency in Medical Vision-Language Models

This paper demonstrates that a mechanistically guided LoRA fine-tuning approach, leveraging transferred Sparse Autoencoders to balance paraphrase consistency with answer accuracy, significantly reduces response flip rates and margin differences in medical Vision-Language Models while maintaining stable diagnostic performance.

Binesh Sadanandan, Vahid Behzadan

Published 2026-03-03

Imagine you have a highly trained medical AI assistant, like a super-smart digital radiologist. You show it an X-ray and ask, "Is there a pneumothorax (collapsed lung)?" It confidently says, "Yes."

But then, you rephrase the question slightly: "Does this image show a pneumothorax?"

Surprisingly, the AI suddenly changes its mind and says, "No."

This is the problem the paper tackles. Even though the two questions mean the exact same thing to a human, the AI gets confused by the way the words are arranged. In a real hospital, this is dangerous. If a doctor asks the same question in two different ways and gets two different answers, they can't trust the machine.

Here is how the researchers fixed it, explained through simple analogies:

1. The Problem: The "Fickle Friend"

Think of the AI as a friend who is great at diagnosing problems but has a very short attention span. If you ask them a question politely, they give a great answer. If you ask the same question casually, they get confused and give a different answer.

The researchers found that this AI (called MedGemma) was flipping its answers about 15% of the time just because the wording changed. It wasn't that the AI didn't know the answer; it was that its internal "brain" was reacting to the style of the question rather than the meaning.
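A flip rate like this is simple to measure once you have the model's answers for each paraphrase pair. Here is a minimal sketch with made-up illustrative answers (not the paper's data):

```python
# Toy sketch: measuring paraphrase flip rate, assuming we already have
# the model's answers for each (original, rephrased) question pair.
# The answer lists below are invented for illustration only.

def flip_rate(answers_a, answers_b):
    """Fraction of paraphrase pairs where the model's answer changes."""
    assert len(answers_a) == len(answers_b)
    flips = sum(a != b for a, b in zip(answers_a, answers_b))
    return flips / len(answers_a)

original = ["yes", "no", "yes", "yes", "no", "yes", "no", "yes", "no", "yes",
            "yes", "no", "yes", "no", "yes", "no", "yes", "no", "yes", "no"]
rephrased = ["yes", "no", "no", "yes", "no", "yes", "no", "yes", "no", "yes",
             "yes", "no", "yes", "no", "yes", "no", "yes", "yes", "yes", "no"]

print(flip_rate(original, rephrased))  # → 0.1 (2 flips out of 20 pairs)
```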

2. The Detective Work: Finding the "Glitch"

To fix this, the researchers didn't just guess; they acted like detectives using a special tool called a Sparse Autoencoder (SAE). Think of this tool as an X-ray for the AI's brain. It lets you see which specific neurons are firing when the AI processes a question.

They discovered a specific "neuron" (Feature 3818) in the middle of the AI's brain (Layer 17) that was acting like a sensitive mood ring.

  • When the question was phrased as "Is there...?" (a presence question), this neuron lit up like a Christmas tree.
  • When the question was phrased as "Can you rule out...?" (an exclusion question), this neuron went completely dark.

This neuron was so sensitive to the style of the question that it was actually pushing the AI's final decision in the wrong direction, causing it to flip from "Yes" to "No."
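The detective work boils down to a simple comparison: encode the model's hidden states through the SAE, then look for the feature whose average activation differs most between the two question styles. This toy sketch (random weights and invented hidden states, not the paper's actual SAE or MedGemma activations) shows the shape of that search:

```python
import numpy as np

# Toy illustration: a sparse autoencoder encodes a hidden state h into
# sparse features f = relu(W_enc @ h + b_enc). To find a "style-sensitive"
# feature, compare each feature's mean activation on presence-style
# ("Is there...?") vs exclusion-style ("Can you rule out...?") prompts.
# All weights and hidden states here are random stand-ins.

rng = np.random.default_rng(0)
d_model, d_sae = 8, 16
W_enc = rng.normal(size=(d_sae, d_model))
b_enc = np.zeros(d_sae)

def sae_features(h):
    return np.maximum(W_enc @ h + b_enc, 0.0)  # ReLU zeroes most features

# Made-up hidden states for the two phrasings of the same question.
presence_h = rng.normal(size=(50, d_model)) + 1.0
exclusion_h = rng.normal(size=(50, d_model)) - 1.0

mean_presence = np.mean([sae_features(h) for h in presence_h], axis=0)
mean_exclusion = np.mean([sae_features(h) for h in exclusion_h], axis=0)

# The feature with the largest activation gap is the "mood ring" candidate.
gap = mean_presence - mean_exclusion
print("most style-sensitive feature:", int(np.argmax(np.abs(gap))))
```

In the paper's setting, this kind of comparison is what singled out Feature 3818 at Layer 17.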

3. The Fix: The "Balanced Diet" Training

The researchers tried to fix this by teaching the AI to ignore the wording. However, they hit a snag.

The Trap: When they tried to force the AI to be consistent, the AI got lazy. It realized that if it just answered "Yes" to every single question, it would be perfectly consistent (since "Yes" always equals "Yes"). But this is useless because it stops being a doctor and becomes a broken record. This is called Mode Collapse.

The Solution: The researchers created a new training recipe called a "Combined Loss."
Think of this like training a student with two rules:

  1. Rule A (Consistency): "If I ask you the same question in two different ways, you must give the same answer."
  2. Rule B (Accuracy): "But you still have to get the right answer based on the X-ray."

By balancing these two rules, the AI learned to be consistent without giving up on being accurate. It stopped guessing randomly and started paying attention to the image, not just the grammar.
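One plausible way to write the two rules as a single training objective (the paper's exact formulation may differ) is: an accuracy term that is just cross-entropy against the true label, plus a consistency term that penalizes disagreement between the answer distributions for the two paraphrases, weighted by a coefficient `lam`:

```python
import numpy as np

# Sketch of a combined loss under assumed definitions:
#   Rule B (accuracy):    cross-entropy against the true label
#   Rule A (consistency): symmetric KL between the two paraphrases'
#                         answer distributions
# The exact loss in the paper may be formulated differently.

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, label):
    return -np.log(p[label] + 1e-12)

def sym_kl(p, q):
    eps = 1e-12
    return float(np.sum(p * np.log((p + eps) / (q + eps))) +
                 np.sum(q * np.log((q + eps) / (p + eps))))

def combined_loss(logits_a, logits_b, label, lam=1.0):
    p, q = softmax(logits_a), softmax(logits_b)
    accuracy = cross_entropy(p, label) + cross_entropy(q, label)  # Rule B
    consistency = sym_kl(p, q)                                    # Rule A
    return accuracy + lam * consistency

# Two paraphrases whose yes/no logits disagree get penalized twice over:
logits_is_there = np.array([2.0, -1.0])   # leans "yes" (index 0)
logits_rule_out = np.array([-0.5, 1.5])   # leans "no"
print(combined_loss(logits_is_there, logits_rule_out, label=0))
```

Note why this blocks Mode Collapse: answering "yes" to everything zeroes out the consistency term, but the accuracy term keeps punishing every wrong "yes", so the lazy strategy is never the cheapest one.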

4. The Surprise: Fixing the Foundation, Not the Roof

The researchers expected to fix the problem by tweaking the specific "mood ring" neuron they found (Layer 17). They thought, "If we fix the glitch in the middle, the problem goes away."

But when they tested different layers of the AI, they found something surprising: The best fix was actually at the very beginning (Layers 0–10).

The Analogy: Imagine a factory assembly line making cars.

  • The glitch was happening at the painting station (Layer 17), where the car was getting the wrong color because of the worker's mood.
  • The researchers thought they needed to fire the painter.
  • But they discovered that if they fixed the blueprint at the very start of the line (Layer 0–10), the painter never got confused in the first place.

By making small adjustments to the early layers of the AI, they prevented the confusion from ever happening, rather than trying to correct it after the AI had already made a mistake.
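Mechanically, "fixing the blueprint" means attaching the LoRA adapters only to the early layers and leaving the rest frozen. Libraries such as Hugging Face PEFT expose this through `LoraConfig`'s `layers_to_transform` argument; the sketch below (made-up shapes, not the paper's configuration) just shows the core idea: a frozen weight `W` plus a low-rank delta `B @ A` that is only applied on layers 0 to 10:

```python
import numpy as np

# Minimal LoRA sketch with invented dimensions: the base weight W stays
# frozen everywhere, and a trainable low-rank update B @ A is attached
# only to the early layers (0-10). Later layers, including Layer 17,
# keep their original weights untouched.

rng = np.random.default_rng(0)
d, r = 32, 4                      # hidden size and LoRA rank (assumed)
early_layers = set(range(0, 11))  # layers 0-10 receive adapters

def lora_weight(W, layer_idx, A, B, alpha=8):
    """Effective weight: W plus a scaled low-rank update on early layers."""
    if layer_idx in early_layers:
        return W + (alpha / r) * (B @ A)
    return W                      # later layers stay frozen as-is

W = rng.normal(size=(d, d))
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-init

# At initialization B = 0, so the adapted model matches the base model.
assert np.allclose(lora_weight(W, 3, A, B), W)

B = rng.normal(size=(d, r)) * 0.01  # pretend a training step made B nonzero
assert not np.allclose(lora_weight(W, 3, A, B), W)  # early layer adapted
assert np.allclose(lora_weight(W, 17, A, B), W)     # Layer 17 untouched
```

The zero-initialized `B` is the standard LoRA trick: training starts from exactly the base model and then nudges only the early-layer representations.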

The Results

After this training:

  • Confusion dropped: The AI's answer flip rate fell from about 15% to just 4%.
  • Stability increased: Even when the answer didn't flip, the AI's confidence level became much more stable.
  • Accuracy stayed high: The AI didn't get lazy; it remained a good doctor, keeping its accuracy high.

Why This Matters

This paper shows that we can make medical AI safer and more reliable. By understanding how the AI thinks (mechanistic interpretability) and training it with a balanced approach, we can ensure that a doctor gets the same trustworthy answer, no matter how they phrase their question. It turns a fickle, confusing assistant into a steady, reliable partner.