The Big Picture: The "Confident but Clueless" Doctor
Imagine a brilliant medical AI doctor (a Vision-Language Model, or VLM) that has read every medical textbook in the world. It can look at an X-ray or a tissue slide and often guess what's wrong—without ever being trained on that specific task. This is its "zero-shot" ability: it applies general knowledge to cases it has never explicitly practiced.
But here's the problem: When this AI is unsure, it doesn't know how unsure it is. Sometimes it guesses confidently when it's actually wrong. In medicine, that's dangerous. If a doctor says, "It's definitely a broken bone," but it's actually a tumor, the patient gets the wrong treatment.
We need a system that says: "I'm 90% sure it's a broken bone, but there's a small chance it's a tumor. Let's check both." This is called Conformal Prediction. Instead of a single guess, it returns a "safety net" of possible answers—a prediction set that is statistically guaranteed to contain the true answer at a chosen rate (say, 90% of the time).
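To make the "safety net" concrete, here is a minimal sketch of standard split conformal prediction. The function name, score choice, and thresholds are illustrative assumptions, not code from the paper:

```python
# Illustrative sketch of split conformal prediction (not the paper's code).
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Build prediction sets that contain the true label ~(1 - alpha) of the time.

    cal_probs:  (n_cal, n_classes) softmax scores on held-out calibration data
    cal_labels: (n_cal,) true labels for the calibration data
    test_probs: (n_test, n_classes) softmax scores on new patients
    """
    # Nonconformity score: 1 minus the probability assigned to the TRUE class.
    cal_scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]

    # Threshold: the (1 - alpha) quantile of calibration scores,
    # with the standard finite-sample correction.
    n = len(cal_scores)
    level = np.ceil((n + 1) * (1 - alpha)) / n
    q = np.quantile(cal_scores, level, method="higher")

    # Keep every class whose score clears the threshold.
    return [np.where(1.0 - p <= q)[0] for p in test_probs]
```

A confident model yields short sets; an uncertain one yields longer sets—but either way, the coverage guarantee holds.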
The Two Big Problems with Current Safety Nets
Even with safety nets, current methods have two annoying flaws:
- The "Shotgun" Approach (Inefficiency): To be safe, the AI often throws a huge net. Instead of saying "It's likely a broken bone or a tumor," it might say, "It could be a broken bone, a tumor, a bruise, a cyst, or a scar." The list is so long it's useless.
- The "Unfair Net" (Imbalance): The safety net works great for common diseases (like a cold) but is terrible for rare ones (like a rare cancer). It might miss the rare disease entirely while being overly cautious about the common one.
The Catch: To fix these nets, people usually try to "teach" the AI new tricks using a few labeled examples. But if you teach it and then test it on the same examples, you cheat the system. It's like a student studying the exact test questions before taking the exam; their score looks great, but they aren't actually smarter. This breaks the mathematical guarantee that the safety net is real.
The Solution: LATA (The "Group Chat" Refinement)
The authors propose LATA (Laplacian-Assisted Transductive Adaptation). Think of LATA as a smart group chat that happens after the AI makes its initial guesses but before it gives the final answer.
Here is how it works, step-by-step:
1. The "Group Chat" (Transductive Adaptation)
Imagine the AI looks at 100 patients. It makes a quick guess for each one.
- Patient A has a rash that looks like Poison Ivy.
- Patient B has a rash that looks exactly like Patient A's.
- Patient C has a rash that looks like Poison Ivy, but the AI is confused.
In the old way, the AI assesses each patient in isolation. In LATA, the AI puts all 100 patients in a "group chat." It looks at the images and says, "Hey, Patient A and Patient B look identical. If I'm confident about A, I should probably be confident about B too."
It smooths out the guesses. If the AI was confused about Patient C, but Patient C looks just like the confident Patient A, the AI realizes, "Oh, I should probably be more confident about C too."
The Magic Trick: LATA does this without changing the AI's brain (no training) and without looking at the correct answers (no labels). It just uses the visual similarities between the patients to refine the guesses. Because it treats the "test" patients and the "calibration" patients exactly the same way, it doesn't cheat. The safety net remains valid.
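The "group chat" idea can be sketched as smoothing predictions over a patient-similarity graph. The graph construction (cosine kNN) and the iterative update rule below are simplified assumptions for illustration, not the paper's exact Laplacian formulation:

```python
# Minimal sketch of transductive smoothing over a similarity graph
# (illustrative assumptions, not the paper's exact update rule).
import numpy as np

def smooth_predictions(features, probs, k=5, lam=0.5, iters=10):
    """Blend each sample's predictions with those of its k nearest neighbors.

    features: (n, d) image embeddings (test + calibration pooled together,
              so both are treated identically and validity is preserved)
    probs:    (n, c) initial zero-shot class probabilities
    """
    # Cosine-similarity kNN graph over the image embeddings.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)          # no self-edges
    nbrs = np.argsort(-sim, axis=1)[:, :k]  # k most similar other samples

    smoothed = probs.copy()
    for _ in range(iters):
        neighbor_avg = smoothed[nbrs].mean(axis=1)         # average over neighbors
        smoothed = (1 - lam) * probs + lam * neighbor_avg  # stay anchored to originals
        smoothed /= smoothed.sum(axis=1, keepdims=True)    # keep rows as probabilities
    return smoothed
```

In this sketch, a "confused" sample surrounded by confident look-alikes gets pulled toward their predictions—no labels and no retraining involved.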
2. The "Stress Detector" (Failure-Aware Scoring)
Sometimes, a patient has a weird, rare condition that the AI has never seen before. The AI might guess confidently, but it's actually a "hard" case.
LATA has a special "Stress Detector" (called ViLU). It looks at the image and the text description and asks: "Is this a tricky case?"
- If yes: It widens the safety net. It says, "This is hard, so let's include more possibilities to be safe."
- If no: It tightens the net. It says, "This is easy and clear, so let's give a short, precise list."
This prevents the AI from being overly cautious on easy cases (saving time) and overly reckless on hard cases (saving lives).
3. The "Prior Knowledge" Knob (Optional)
Sometimes, we know that in a specific hospital, a certain disease is very rare. LATA has a little "knob" (a prior) that can gently nudge the AI to remember this fact. It's like a doctor saying, "Remember, we rarely see this specific tumor here, so don't guess it unless you're really sure." This can be done without looking at the specific patient's diagnosis, just the general statistics of the hospital.
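One simple way such a "knob" can work is a Bayesian reweighting of the model's probabilities by known class frequencies. This is a generic sketch of the idea, not necessarily the paper's exact mechanism:

```python
# Sketch of a class-prior "knob": reweight predictions by known
# hospital-level class frequencies (illustrative, not the paper's code).
import numpy as np

def apply_prior(probs, prior):
    """Reweight class probabilities by aggregate class frequencies.

    probs: (n, c) zero-shot probabilities
    prior: (c,) known class frequencies at this site
    Uses only aggregate statistics—never any individual patient's diagnosis.
    """
    adjusted = probs * prior[None, :]
    return adjusted / adjusted.sum(axis=1, keepdims=True)
```

For a patient the model finds ambiguous, a rare class is gently down-weighted unless the image evidence strongly favors it.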
Why is this a Big Deal?
- It's a "Black Box" Upgrade: You don't need to retrain the massive AI model (which takes weeks and supercomputers). You just run this "group chat" step on the results. It's fast and cheap.
- It's Fairer: It fixes the "Unfair Net" problem. Rare diseases get better safety nets, and common diseases don't get bogged down in huge lists.
- It's Honest: Unlike other methods that "cheat" by studying the test data, LATA keeps the mathematical promise that the safety net actually works.
The Bottom Line
LATA is like giving a brilliant but slightly arrogant medical AI a team of peers to double-check its work.
- It looks at the group of patients to see who looks like whom.
- It uses a "stress detector" to know when to be extra careful.
- It does all this without changing the AI's personality or cheating on the test.
The result? Smaller, more accurate lists of possible diagnoses, fewer missed rare diseases, and a system that doctors can actually trust.