Imagine you have a super-smart robot assistant that can see pictures and talk about them. It's amazing at describing a sunset or a cat playing with yarn. But sometimes, when you show it a tricky picture or a weirdly edited image, the robot starts to hallucinate. It might confidently say, "That's a purple elephant," when it's actually a dog, or it might get tricked by a hidden message in the image into saying something mean or dangerous.
The authors of this paper asked a simple question: How can we tell when the robot is confused, lying, or being tricked before it gives us a bad answer?
Most current methods try to guess whether the robot is unsure by asking it the same question multiple times and checking whether the answers agree, or by looking at how confident its final words sound. But the authors realized these methods are like trying to guess why a car broke down just by listening to the engine noise. You know something is wrong, but you don't know what is wrong.
The Two Types of Confusion: "The Argument" vs. "The Blank Mind"
The researchers discovered that when these AI models mess up, the failure usually traces back to one of two specific types of mental confusion:
- The Internal Argument (Conflict): Imagine the robot is looking at a picture of a goldfish bowl. One part of its brain says, "That's a fish!" but another part, looking at the text written on the bowl, says, "No, that's a car!" The robot is stuck in a tug-of-war. It has too much information, but the information is fighting against itself. This is Conflict.
- The Blank Mind (Ignorance): Now imagine the robot is shown a picture of a strange, futuristic flying machine it has never seen before. It looks at the shape and color, but it has absolutely no idea what it is. It's not arguing; it's just empty-handed. It's guessing because it lacks the necessary knowledge. This is Ignorance.
The Solution: A "Truth Detective" (EUQ)
The paper introduces a new tool called Evidential Uncertainty Quantification (EUQ). Think of this as a special "Truth Detective" that sits inside the robot's brain.
Instead of waiting for the robot to speak, this detective looks at the raw signals the robot is processing before it decides on an answer. It treats every piece of information the robot sees as a "witness" giving testimony.
- Positive Witnesses: "I saw a fish!"
- Negative Witnesses: "Wait, the text says 'car'!"
The detective uses a mathematical rulebook (called Dempster-Shafer Theory) to weigh these witnesses; a tiny code sketch of this rulebook follows the list below.
- If the positive and negative witnesses are screaming at each other, the detective flags High Conflict.
- If the witnesses are silent or there are no witnesses at all, the detective flags High Ignorance.
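Here is a tiny, self-contained Python sketch of that rulebook. The two hand-written "witnesses" are stand-ins for the model's internal signals (which is where the paper's detective actually listens); the combination rule itself is textbook Dempster-Shafer.

```python
FRAME = frozenset({"fish", "car"})  # everything the image could show

def combine(m1, m2):
    """Dempster's rule: fuse two mass functions and return the fused
    masses plus K, the mass lost to outright contradiction."""
    fused, conflict = {}, 0.0
    for a, w1 in m1.items():
        for b, w2 in m2.items():
            overlap = a & b
            if overlap:
                fused[overlap] = fused.get(overlap, 0.0) + w1 * w2
            else:
                conflict += w1 * w2  # witnesses flatly contradict each other
    # Renormalize what survives by the mass not lost to conflict.
    fused = {s: w / (1.0 - conflict) for s, w in fused.items()}
    return fused, conflict

# Witness 1: the visual features strongly say "fish" (0.2 left undecided).
m_vision = {frozenset({"fish"}): 0.8, FRAME: 0.2}
# Witness 2: the text painted on the bowl says "car" (0.3 left undecided).
m_text = {frozenset({"car"}): 0.7, FRAME: 0.3}

fused, K = combine(m_vision, m_text)
print(f"conflict K = {K:.2f}")                     # 0.56: the witnesses are arguing
print(f"ignorance m(FRAME) = {fused[FRAME]:.2f}")  # 0.14: little left undecided
```

A high K means the witnesses argued; a large mass left on FRAME would mean nobody had much to say at all, which is exactly the Ignorance case.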
Why This Matters: The "One-Pass" Magic
Old methods were like asking the robot to write the same story ten times and comparing the versions to see if they match. This is slow and expensive.
The new method is like a single glance. The detective looks at the robot's internal signals once and instantly knows (a rough sketch of the difference follows the list below):
- "Ah, this hallucination is happening because the robot is arguing with itself."
- "This failure is happening because the robot has no idea what it's looking at."
The Results: A Smarter Safety Net
The researchers tested this on four different super-smart robots. They found that their "Truth Detective" was much better at spotting errors than previous methods (a toy version of the resulting safety check appears after the list below).
- Hallucinations (making things up) were almost always caught by the Conflict detector.
- Out-of-Distribution failures (seeing something totally new) were almost always caught by the Ignorance detector.
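Putting the two detectors together suggests a simple triage rule. This is a toy sketch with invented thresholds, not the paper's evaluation procedure:

```python
def diagnose(conflict, ignorance, tau_conflict=0.5, tau_ignorance=0.5):
    """Toy safety net; both thresholds are invented for illustration."""
    if conflict > tau_conflict:
        return "likely hallucination: the evidence is arguing with itself"
    if ignorance > tau_ignorance:
        return "likely out-of-distribution: not enough evidence to answer"
    return "answer looks trustworthy enough to show the user"
```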
The Big Picture
This paper gives us a new way to understand AI. Instead of just saying "The AI is wrong," we can now say, "The AI is wrong because it's confused by conflicting clues," or "The AI is wrong because it's out of its depth."
This is a huge step forward for safety. If we know why an AI is misbehaving, we can fix it better. We can teach it to resolve arguments or tell it when to say, "I don't know," instead of guessing. It's like giving the robot a mirror so it can see its own confusion and stop before it causes trouble.