See-in-Pairs: Reference Image-Guided Comparative Vision-Language Models for Medical Diagnosis

The Big Idea: Don't Look at the Patient Alone; Look at the "Normal" Too

Imagine you are trying to find a single typo in a massive, 500-page book. If you just look at that one page in isolation, it's hard to tell if a word is misspelled or if it's just a weird font choice. But, if you have a perfect, error-free copy of the same page right next to it, the typo jumps out immediately.

This is exactly what doctors do every day. When they look at an X-ray or a skin scan, they rarely just stare at the patient's image. They subconsciously (or explicitly) compare it to what a "healthy" version of that body part looks like. They ask: "Is this shadow in the lung normal, or is it different from a healthy lung?"

The Problem:
Current AI models (called Vision-Language Models or VLMs) are like students who have only ever studied single pages in isolation. They are great at describing what they see, but they struggle to spot subtle differences because they haven't been taught to compare. They try to diagnose a disease based on one image alone, which is like trying to find a needle in a haystack without knowing what a needle looks like.

The Solution: "See-in-Pairs" (SiP)
The researchers behind this paper created a new way to teach AI. Instead of showing the AI just the "sick" image, they show it two images at once:

The Query: The patient's image (the one with the potential problem).
The Reference: A healthy image from a different person (the "perfect copy").

They then ask the AI: "Compare these two. What is different?"

How It Works (The Analogy of the Art Critic)

Think of the AI as an art critic trying to spot a forgery.

The Old Way (Single Image): The critic looks at one painting and tries to guess if it's fake. They might get confused by the lighting, the frame, or the artist's unique style. It's a hard guess.
The New Way (SiP): The critic is given the painting in question and a known authentic painting right next to it. They are told, "Look at the brushstrokes here. Are they the same?" Suddenly, the forgery is obvious because the AI can ignore the "noise" (like the frame or lighting) and focus purely on the difference.

What Did They Do?

Tested the "Zero-Shot" Idea: First, they asked existing AI models (which hadn't been trained on this specific task) to just look at pairs of images. Surprisingly, even without special training, the AI got better at diagnosing diseases just by having a healthy reference image to compare against.
The "Lightweight" Upgrade (SFT): To make it even better, they gave the AI a small amount of extra training. They showed it thousands of pairs of (Sick Image + Healthy Image) and told it the answer. This is like giving the art critic a crash course in spotting forgeries. They didn't need to retrain the whole brain of the AI; they just tweaked the part that makes decisions.
Testing Different "References": They wondered, "Does the healthy image have to be a perfect match?"
- Does the healthy person need to be the same age?
- Does the photo need to be taken with the same machine?
- The Result: It turns out, it doesn't matter much! Whether they picked a random healthy image, a matching one, or one from a different hospital, the AI still got better. This is great news because it means the system is robust and easy to use in the real world.

Why Is This a Big Deal?

It Mimics Real Doctors: It finally makes AI think like a human doctor, who always compares the sick to the healthy.
It Catches Subtle Clues: Many diseases look very similar to normal anatomy. By comparing, the AI learns to ignore the "normal" stuff and focus only on the "weird" stuff.
It's Efficient: They didn't need millions of new labeled images. They just used the healthy images that already exist in hospitals and paired them up.
It's More Trustworthy: When the researchers looked at where the AI was looking (using heatmaps), they saw that the "See-in-Pairs" AI stopped looking at random background noise and started focusing exactly on the disease, just like a human would.

The Bottom Line

This paper introduces a simple but powerful trick: Don't let the AI diagnose in a vacuum. Give it a healthy friend to compare against. By doing this, the AI becomes a sharper, more reliable diagnostician, capable of spotting the tiny, life-saving differences that were previously invisible to it. It's a shift from "What do I see?" to "What is different here?"

1. Problem Statement

Medical image diagnosis is inherently challenging due to the subtle nature of pathological findings, which often appear as minor deviations within large amounts of normal anatomy, compounded by significant inter-patient variability.

Clinical Gap: In clinical practice, clinicians routinely perform comparative diagnosis, juxtaposing a patient's query image with healthy control images or prior exams to isolate abnormalities.
AI Limitation: Existing Medical Vision-Language Models (VLMs) are primarily optimized for single-image or single-series (longitudinal) analysis. They lack explicit mechanisms for cross-subject comparative diagnosis (comparing a patient against a different healthy individual).
Research Question: Can incorporating clinically motivated cross-subject comparisons (query image + healthy reference image) enhance the diagnostic performance of VLMs, even when trained on limited data?

2. Methodology: See-in-Pairs (SiP)

The authors propose the See-in-Pairs (SiP) framework, which integrates reference images into the VLM inference and training pipeline.

A. Inference Strategy (Zero-Shot)

The model is prompted with a triplet: (Query Image $X$ , Reference Image $X'$ , Question $Q$ ).

Reference Selection: The reference image $X'$ $X^{'}$ is typically a "healthy control" (negative label). The paper evaluates five selection strategies:
1. Random Sampling: Uniformly sampling from healthy controls.
2. Demographic Matching: Matching gender, view, or projection.
3. Embedding-based Retrieval: Selecting the most similar healthy image in feature space.
4. Cross-Center Sampling: Using images from a different dataset/institution to test domain robustness.
5. Bagging: Using multiple references per query and aggregating predictions via majority voting.
Input Serialization: The images are fused (e.g., via co-attention or scale-then-compress architectures) and concatenated with the text prompt before being fed to the language decoder.

B. Training Strategy (Supervised Fine-Tuning - SFT)

To adapt general-purpose VLMs (like Qwen-VL, Phi-3, NVILA) for this task without prohibitive computational costs:

Lightweight SFT: Only the language decoder is fine-tuned using LoRA (Low-Rank Adaptation), while the vision encoder and projection layers remain fixed.
Data Construction: Training data is constructed as (Query, Reference, Label) triplets. Crucially, the reference images are selected to share diagnosis-irrelevant features (e.g., scanner type, demographics) but lack the pathology, forcing the model to focus on the difference.
Objective: The model learns to generate diagnostic answers conditioned on the comparative context, effectively learning to identify deviations from a normative baseline.

3. Key Contributions

New Perspective: Identifies cross-subject comparative diagnosis as an essential, overlooked direction for medical VLMs, arguing that models should mimic the clinical practice of comparing patient scans against reference cases.
Zero-Shot Feasibility: Demonstrates that general-purpose VLMs with multi-image priors (e.g., Qwen-VL-2.5, Phi-3.5) can outperform single-image baselines in zero-shot settings when provided with structured (query, reference) prompts, even without specific medical training.
Scalable SFT Framework: Proposes a lightweight fine-tuning method using (query, reference, label) triplets. This injects comparative medical knowledge into general-purpose models using small amounts of labeled data.
Robustness to Reference Selection: Shows that the method is robust across various reference selection strategies (random, demographic, embedding-based, cross-center), meaning strict metadata matching is not always required for success.
Mechanistic Insight: Provides theoretical and empirical evidence that SiP improves sample efficiency and feature alignment, helping models focus on pathology-specific deviations while suppressing nuisance variations (e.g., scanner differences, anatomical variations).

4. Experimental Results

The study was evaluated across six diagnostic tasks spanning four modalities: Radiology (Pneumonia, Edema), Ophthalmology (Glaucoma, Retinopathy), and Dermatology (Melanoma, DermaTri).

Zero-Shot Performance:
- General-purpose VLMs (Qwen, Phi) showed significant improvements in Balanced Accuracy (BAcc) and F1 scores when using SiP compared to single-image inputs, particularly in tasks with subtle morphological differences (e.g., Retinopathy, Glaucoma).
- Medical-specific VLMs (NVILA, LLaVA-Med) showed mixed results; while they performed well on single-image radiology tasks, SiP significantly boosted their performance on non-radiology tasks where single-image priors were less aligned with the data distribution.
Supervised Fine-Tuning (SFT) Results:
- SiP-based SFT consistently outperformed single-image SFT baselines (including random, cluster, and spatial sampling baselines) across all three tested backbones (Qwen-VL-7B, Phi-3-4B, NVILA-8B).
- Performance Gains: SiP achieved the best or second-best performance in nearly all tasks. For example, on the Retinopathy task, SiP improved BAcc from 50% (single-image) to **79%**.
- Bagging: Using multiple references per query (Bagging) further stabilized performance and reduced variance.
Reference Selection Analysis:
- Performance remained stable even when using Cross-Center references (different datasets) or Embedding-based retrieval, proving the method does not rely on perfect metadata matching.
- Increasing the number of references (up to 20-30x) showed diminishing returns but generally improved robustness.
Qualitative Analysis (Attribution):
- Heatmap analysis (Occlusion Sensitivity) revealed that single-image models often rely on spurious, global correlations (e.g., background markers).
- SiP models produced more spatially coherent, anatomically plausible attributions, focusing specifically on the lesion areas and ignoring background noise.

5. Significance and Conclusion

Clinical Alignment: The SiP framework bridges the gap between AI and clinical workflow by formalizing the "compare-and-contrast" diagnostic reasoning used by human doctors.
Data Efficiency: It demonstrates that high-performance medical diagnosis can be achieved with lightweight fine-tuning on small datasets by leveraging abundant healthy control images as references, rather than requiring massive, disease-specific training sets.
Interpretability: By forcing the model to compare against a healthy baseline, SiP reduces reliance on nuisance variables (scanner artifacts, demographics) and improves the interpretability of the model's decision-making process.
Future Impact: The authors advocate for a paradigm shift in medical AI toward comparative inference architectures, suggesting that future VLMs should be natively designed to handle multi-subject comparisons for more reliable and robust diagnosis.

See-in-Pairs: Reference Image-Guided Comparative Vision-Language Models for Medical Diagnosis

The Big Idea: Don't Look at the Patient Alone; Look at the "Normal" Too

How It Works (The Analogy of the Art Critic)

What Did They Do?

Why Is This a Big Deal?

The Bottom Line

1. Problem Statement

2. Methodology: See-in-Pairs (SiP)

A. Inference Strategy (Zero-Shot)

B. Training Strategy (Supervised Fine-Tuning - SFT)

3. Key Contributions

4. Experimental Results

5. Significance and Conclusion

More like this

Conversational Successes and Breakdowns in Everyday Smart Glasses Use

EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents

GVGS: Gaussian Visibility-Aware Multi-View Geometry for Accurate Surface Reconstruction

PyEncode: An Open-Source Library for Structured Quantum State Preparation

DOne: Decoupling Structure and Rendering for High-Fidelity Design-to-Code Generation