🏥 The Big Picture: AI Doctors Need a "Second Opinion"
Imagine you have a team of AI doctors (called Vision-Language Models or VLMs) that can look at microscope images of tissue and tell you what's wrong. They are incredibly smart, but like any human doctor, they can sometimes be unsure, make mistakes, or "hallucinate" (make things up).
In a hospital, if a doctor is unsure, they ask for a second opinion or double-check their work. This paper asks a crucial question: How do we know when an AI doctor is unsure?
The authors built a special "uncertainty meter" to test three different AI models. They wanted to see which model stays calm and consistent, and which one starts panicking and giving random answers when the questions get hard.
🧪 The Experiment: The "Temperature" Test
To test these AI doctors, the researchers used a setting called Temperature. Think of it as a "creativity dial" on the model (a short code sketch of what the dial actually does follows this list):
- Low Temperature (0.0): The AI is robotic and strict. It always gives the exact same answer, no matter how many times you ask. It's like a calculator.
- High Temperature (1.0): The AI is chaotic and creative. It takes risks and might give a different answer every time you ask. It's like a jazz musician improvising.
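Under the hood, "temperature" rescales the model's token probabilities before it samples. Here's a minimal Python sketch of temperature-scaled softmax sampling; the logits values are made up for illustration, and a real VLM samples one token at a time rather than one whole answer:

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng=np.random.default_rng()):
    """Pick one answer index from logits rescaled by temperature."""
    if temperature == 0.0:
        # Temperature 0 degenerates to greedy decoding: always the top answer.
        return int(np.argmax(logits))
    scaled = np.asarray(logits, float) / temperature  # low T sharpens, high T flattens
    probs = np.exp(scaled - scaled.max())             # softmax, shifted for stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Made-up logits for three candidate answers.
logits = [2.0, 1.0, 0.5]
print(sample_with_temperature(logits, 0.0))  # always 0: the calculator
print(sample_with_temperature(logits, 1.0))  # varies run to run: the jazz musician
```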
The researchers turned this dial from 0 to 1, showed the AI 100 different tissue images, and asked three types of questions about each:
- Easy: "What does this cell look like?"
- Medium: "Is this tissue cancerous?"
- Hard: "Give me a detailed, quantitative analysis of the tumor."
They then measured how much the AI's answers changed when they turned the dial.
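Put together, the protocol is just a nested sweep. Here's a rough sketch of that loop; ask_model, the five repeats per setting, and the exact temperature grid are stand-ins for illustration, not the paper's actual harness:

```python
# Hypothetical harness: ask_model() and tissue_images are stubs so the
# sketch runs end to end; the real study queried actual VLMs.
def ask_model(image, question, temperature):
    return f"stub answer to {question!r} at T={temperature}"

tissue_images = [f"slide_{i:03d}.png" for i in range(100)]  # the 100 images

temperatures = [0.0, 0.25, 0.5, 0.75, 1.0]  # illustrative grid from 0 to 1
questions = {
    "easy":   "What does this cell look like?",
    "medium": "Is this tissue cancerous?",
    "hard":   "Give me a detailed, quantitative analysis of the tumor.",
}

responses = {}  # (image, difficulty, temperature) -> repeated answers
for image in tissue_images:
    for difficulty, question in questions.items():
        for t in temperatures:
            # Ask several times at the same setting to expose answer spread.
            answers = [ask_model(image, question, temperature=t) for _ in range(5)]
            responses[(image, difficulty, t)] = answers
```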
🤖 The Three Contestants
The study tested three different AI models, each with a different personality:
1. VILA-M3-8B (The Generalist Student)
- Who it is: A smart AI trained on everything (general internet data, not just medicine).
- The Result: It's okay at simple tasks, but when the questions get hard, it gets very confused.
- The Analogy: Imagine a brilliant high school student who knows a little bit about everything. If you ask them to solve a basic math problem, they get it right. But if you ask them to perform advanced surgery, they start sweating, their hands shake, and they give you a different, wild answer every time you ask.
- Verdict: High uncertainty on complex medical tasks.
2. LLaVA-Med v1.5 (The Medical Intern)
- Who it is: An AI trained specifically on medical textbooks and papers.
- The Result: It's a superstar for simple questions but falls apart on complex ones.
- The Analogy: Think of a medical intern who has memorized the textbook perfectly. If you ask, "What is a red blood cell?" they answer instantly and correctly. But if you ask them to analyze a rare, complex tumor pattern, they freeze up. They try to guess, and their answers swing wildly from one extreme to another.
- Verdict: Great for basics, dangerous for complex diagnoses because it gets too "creative" when stressed.
3. PRISM (The Specialized Surgeon)
- Who it is: An AI built only for pathology (the study of disease).
- The Result: It is incredibly stable. Even when the researchers turned the "chaos dial" all the way up, this AI barely changed its answer.
- The Analogy: Imagine a veteran surgeon who has done this specific operation 10,000 times. No matter how much you shake the table or turn up the noise, their hand remains steady. They give the same precise answer every time, regardless of how "random" the environment gets.
- Verdict: The most trustworthy for this specific job. It is effectively "deterministic": its answers stay nearly identical no matter how much randomness you inject.
📊 The "Uncertainty Meter" Results
The researchers used four different ways to measure how much the AI's answers changed (like checking a car's engine for vibrations); a code sketch of these metrics follows the list:
- Cosine Similarity: Do the answers point in the same direction? (PRISM said "Yes" almost always; the others said "No" when things got hard).
- Divergence (KL & JS): How different are the probability clouds? (The generalist and medical intern models had huge clouds of uncertainty; PRISM had a tiny, tight dot).
- Mean Absolute Error: How far apart are the raw numbers from one run to the next?
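All four metrics are a few lines each in code. A sketch using NumPy and SciPy, where p and q are answer-probability distributions from two runs; the numbers here are toy values, not the paper's data:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def cosine_similarity(a, b):
    """Do two answer vectors point in the same direction? (1.0 = identical)"""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def kl_divergence(p, q, eps=1e-12):
    """How much does distribution p diverge from q? (0 = identical)"""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    return float(np.sum(p * np.log(p / q)))

def mean_absolute_error(x, y):
    """Average gap between the raw numbers from two runs."""
    return float(np.mean(np.abs(np.asarray(x, float) - np.asarray(y, float))))

# Toy example: the same question answered at temperature 0.0 vs 1.0.
p = [0.70, 0.20, 0.10]   # answer probabilities at T = 0.0
q = [0.40, 0.35, 0.25]   # answer probabilities at T = 1.0
print(cosine_similarity(p, q))     # near 1.0 = answers "point the same way"
print(kl_divergence(p, q))         # bigger = wider "probability clouds"
print(jensenshannon(p, q) ** 2)    # JS divergence (scipy returns its square root)
print(mean_absolute_error(p, q))   # raw gap between the two runs
```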
The Big Discovery:
When the questions got hard (like complex cancer analysis), the general AI and the medical intern started acting like they were drunk: swaying, stumbling, and giving different answers every time. The specialized AI (PRISM) stayed sober and steady.
💡 Why This Matters
In the real world, you don't want an AI doctor that gets "drunk" (random) when a patient has a complex illness.
- Trust: If an AI says "I'm 90% sure," but its internal numbers are jumping around wildly, you shouldn't trust it.
- Safety: This study shows that for serious medical work, you need a model that is specialized (like PRISM) and stable.
- The "Second Opinion": This framework acts like a digital second opinion. If the AI's "uncertainty meter" spikes, the system can flag it and tell a human doctor, "Hey, the AI is confused here. Please look at this yourself."
🏁 The Takeaway
This paper is a warning and a guide. It tells us that while AI is amazing, we can't just trust any AI with medical data. We need to test them to see if they stay calm under pressure.
- General AIs: Good for chat, bad for complex surgery.
- Medical AIs: Good for basics, risky for the hard stuff.
- Specialized AIs: The safest bet for critical medical decisions.
By measuring how "jittery" an AI is when the temperature rises, doctors can tell when to trust the machine and when to take the wheel themselves.