Imagine you have a very smart, multi-talented assistant (a Multimodal Large Language Model, or MLLM) who can look at pictures, listen to audio, watch videos, and read text. This assistant is great at answering questions, but sometimes, like a confident student who doesn't actually know the answer, it makes up a plausible-sounding but completely wrong story. This is called a "hallucination."
The problem is: How do you know when your assistant is guessing versus when it actually knows the truth?
This paper introduces a new tool called UMPIRE (Uncertainty using Model Probability Indicators and Response Embeddings) to solve this. Think of UMPIRE as a "Confidence Detective" that doesn't need to ask the assistant to explain itself or hire a second expert to check the work. It just watches the assistant's own behavior to figure out if it's unsure.
Here is how UMPIRE works, broken down into simple concepts:
1. The "Group Brainstorming" Analogy
Imagine you ask your assistant a question: "What animal is in this picture?"
- If the assistant is sure: You ask it to answer the same question 50 times (sampling). It will likely give you the same answer every time, like "It's a dog." The answers are all identical and fit perfectly with the picture.
- If the assistant is unsure: You ask it 50 times. It might say "Dog" 10 times, "Cat" 10 times, "Bear" 10 times, and "A weird blob" 20 times. The answers are all over the place (diverse), and they don't seem to agree with the picture (incoherent).
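This brainstorming idea can be sketched in a few lines. Shannon entropy over the sampled answers stands in here for the paper's richer embedding-based measures, and the sampled strings are invented for illustration:

```python
from collections import Counter
import math

def answer_entropy(answers):
    """Shannon entropy (in bits) over the distribution of sampled answers.
    0.0 means every sample agreed; higher means more disagreement."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A confident assistant: 50 identical samples.
sure = ["dog"] * 50
# An unsure assistant: samples scattered across several answers.
unsure = ["dog"] * 10 + ["cat"] * 10 + ["bear"] * 10 + ["blob"] * 20

print(answer_entropy(sure))    # 0.0
print(answer_entropy(unsure))  # ~1.92 bits
```

Counting exact string matches is of course too crude for free-form answers ("It's a dog" vs. "A dog"), which is why UMPIRE works with embeddings instead, as the next section explains.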
2. The Two Clues UMPIRE Looks For
UMPIRE measures uncertainty by looking at two specific things during this "group brainstorming":
Clue A: The "Semantic Volume" (How spread out are the answers?)
Imagine you plot all 50 answers on a giant map.
- Low Uncertainty: All the dots are clustered tightly together in one spot (e.g., all say "Dog"). The "volume" of this cluster is tiny.
- High Uncertainty: The dots are scattered across the entire map (e.g., some say "Dog," some "Cat," some "Car"). The "volume" they cover is huge.
- The Metaphor: Think of this like a balloon. If the answers are all the same, the balloon is small and deflated. If the answers are all different, the balloon is blown up huge. A bigger balloon means more confusion.
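The "balloon" can be made concrete with a toy volume measure. This is an illustrative sketch, not the paper's exact formula: it treats each answer as an embedding vector and uses the log-determinant of a regularised Gram matrix as the log-volume of the cloud; the function name and random embeddings are hypothetical:

```python
import numpy as np

def semantic_volume(embeddings):
    """Log-volume of the cloud of response embeddings ("how big is the
    balloon"). Tight clusters give a small value, scattered answers a
    large one. Uses the log-det of a regularised Gram matrix so the
    determinant is always well-defined."""
    X = np.asarray(embeddings, dtype=float)
    X = X - X.mean(axis=0)             # centre the cloud of answers
    gram = X @ X.T / X.shape[1]        # pairwise similarity of deviations
    gram += 1e-6 * np.eye(X.shape[0])  # keep all eigenvalues positive
    return np.linalg.slogdet(gram)[1]

rng = np.random.default_rng(0)
centre = rng.normal(size=16)
tight = centre + 0.01 * rng.normal(size=(10, 16))  # ten near-identical answers
scattered = rng.normal(size=(10, 16))              # ten unrelated answers

print(semantic_volume(tight) < semantic_volume(scattered))  # True
```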
Clue B: The "Incoherence Score" (How shaky is the assistant's confidence?)
Even if the answers are all "Dog," the assistant might be saying it with a shaky voice. UMPIRE checks the assistant's internal "confidence meter" (probability) for each answer.
- High Confidence: The assistant says "Dog" with 99% certainty.
- Low Confidence: The assistant says "Dog" but only with 20% certainty, while also thinking "Cat" is possible.
- The Metaphor: Think of this as a shaky hand. If the assistant is confident, its hand is steady; if it's unsure, its hand trembles. UMPIRE weighs the "spread out" answers (Clue A) by how much the assistant is "shaking" (Clue B).
3. The Magic Formula: "Incoherence-Adjusted Semantic Volume"
UMPIRE combines these two clues into one score.
- If the answers are spread out (Big Balloon) AND the assistant is shaking (Low Confidence), UMPIRE screams: "STOP! This is a guess! Don't trust this answer!"
- If the answers are clustered (Small Balloon) and the assistant is steady (High Confidence), UMPIRE says: "Go ahead, this looks reliable."
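Putting the two clues together can be sketched as follows. This is a toy interpretation of "incoherence-adjusted semantic volume," not the paper's formula: each answer's deviation from the centroid is scaled by its incoherence (one minus the model's own probability for that answer), so shaky answers inflate the balloon more than confident ones. All names and numbers here are illustrative:

```python
import numpy as np

def umpire_score(embeddings, seq_probs):
    """Toy incoherence-adjusted semantic volume: scale each response's
    deviation from the centroid by its incoherence (1 - probability),
    then measure the log-volume of the weighted cloud."""
    X = np.asarray(embeddings, dtype=float)
    w = 1.0 - np.asarray(seq_probs, dtype=float)  # shaky answers weigh more
    D = (X - X.mean(axis=0)) * w[:, None]         # incoherence-weighted spread
    gram = D @ D.T / X.shape[1]
    gram += 1e-6 * np.eye(X.shape[0])             # keep eigenvalues positive
    return np.linalg.slogdet(gram)[1]

rng = np.random.default_rng(1)
# Small balloon + steady hand: clustered answers, 99% confidence each.
reliable = umpire_score(rng.normal(size=8) + 0.01 * rng.normal(size=(6, 8)),
                        [0.99] * 6)
# Big balloon + shaky hand: scattered answers, 20% confidence each.
guessing = umpire_score(rng.normal(size=(6, 8)), [0.2] * 6)

print(reliable < guessing)  # True: the guess gets the higher risk score
```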
Why is UMPIRE Special?
Most other tools for checking confidence have big problems:
- They are picky: Some only work for text, others only for images. UMPIRE works for everything (images, audio, video, text) because it uses the assistant's own internal brain features.
- They are slow: Some tools require hiring a second, expensive AI to check the work. UMPIRE does it alone, very quickly.
- They need training: Some tools need to be taught how to spot lies. UMPIRE is training-free; it works right out of the box.
Real-World Impact
Imagine a doctor using an AI to read X-rays.
- Without UMPIRE: The AI might confidently say, "This is a broken bone," but it's actually a shadow. The doctor trusts it, and the patient gets the wrong treatment.
- With UMPIRE: The AI tries to answer, but UMPIRE notices the AI is "shaking" and giving "spread out" answers. UMPIRE flags the case as "High Risk." The system then says, "Hey, I'm not sure about this one. Let's ask a human doctor to double-check."
Summary
UMPIRE is a smart, fast, and universal way to tell when a multi-talented AI is confident and when it's just making things up. It does this by watching how much the AI's answers vary and how shaky its confidence is, ensuring that we only trust the AI when it truly knows what it's talking about.