Imagine you have a very smart, but slightly overconfident robot assistant. You show it a picture of a cat sitting on a mat, and you ask, "Is the cat wearing a hat?"
A normal human might look closely, realize there's no hat, and say, "No." But this robot, eager to please and trained on millions of books, might confidently say, "Yes, the cat is wearing a tiny red beret!" even though there is no hat in the picture. This is called a hallucination.
For a long time, the only way to catch the robot lying was to wait until it finished its whole sentence, read it, and then check if it was true. But by then, the robot has already wasted time and energy "talking," and if it's a critical situation (like a self-driving car or a medical diagnosis), waiting for the full lie to be spoken is too late.
Enter HALP: The "Lie Detector" that works before the robot opens its mouth.
The Core Idea: Reading the Robot's Mind
The researchers behind this paper, HALP, realized they don't need to wait for the robot to speak. They can peek at the robot's "brain" (its internal computer signals) before it generates a single word.
Think of it like this:
- The Old Way: You wait for the robot to tell a story. Once it's done, you check the facts. If it lied, you have to delete the whole story and start over.
- The HALP Way: You watch the robot's brain activity while it's thinking about the picture but before it starts talking. You can see the "stress signals" or "confusion sparks" in its brain that say, "I'm not sure about this!" or "I'm about to make something up!"
How It Works (The Three "Sensors")
The researchers built a tiny, lightweight "detector" (a probe) that checks three different parts of the robot's brain during a single quick scan (one forward pass of the model, before any words are generated):
- The "Eyes" Sensor (Visual Features): This checks what the robot sees before it even tries to understand the question. It's like checking if the robot's eyes are blurry or if it's seeing things that aren't there.
- The "Bridge" Sensor (Vision Tokens): This checks how the robot is trying to mix the picture with the question. It's like watching the robot try to connect a puzzle piece from the picture to a puzzle piece from the question. If the pieces don't fit, the bridge sensor lights up.
- The "Thinking" Sensor (Query Tokens): This is the most powerful one. It checks the robot's brain right after it has looked at the picture and read the question, but before it starts typing the answer. It's like catching the robot just as it's about to speak, sensing the hesitation or the "fake confidence" in its thoughts.
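The three "sensors" above can be pictured as small classifiers trained on internal activations. Here is a minimal, hypothetical sketch of one such probe: a logistic regression over a hidden-state vector, trained to predict whether the eventual answer will hallucinate. The data, dimensions, and variable names are illustrative assumptions, not the paper's actual setup.

```python
# Hypothetical sketch: a lightweight linear probe that scores hallucination
# risk from a model's internal hidden state, before any text is generated.
# All features here are synthetic stand-ins for real VLM activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend hidden states: one 64-dim vector per example (e.g., the final
# query-token activation), with a label recording whether the eventual
# answer hallucinated (1) or stayed faithful (0).
n, d = 400, 64
hidden_states = rng.normal(size=(n, d))
# Synthetic signal: assume two dimensions happen to correlate with lying.
labels = (hidden_states[:, 0] + 0.5 * hidden_states[:, 1] > 0).astype(int)

probe = LogisticRegression(max_iter=1000).fit(hidden_states, labels)

# At inference time: one forward pass yields the hidden state,
# and the probe turns it into a hallucination-risk score in [0, 1].
risk = probe.predict_proba(hidden_states[:5])[:, 1]
print(risk)
```

The key design point is cheapness: because the probe is linear and reads activations the model computes anyway, it adds essentially no latency on top of the forward pass.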
The Results: A Crystal Ball for Truth
The team tested this on eight different modern AI models (like Llama, Gemma, and Qwen). Here is what they found:
- It Works: The "Thinking Sensor" (Query Tokens) was incredibly good at predicting lies. For some models, it was 93% accurate at knowing if the robot was about to hallucinate, without the robot ever saying a word.
- It's Fast: Because it only does one quick scan of the brain, it's super fast. It adds almost no delay to the process.
- Different Robots, Different Brains: They found that different AI models "think" differently.
  - Some models (like Gemma) show their "lie signals" clearly right at the very end of their thinking process.
  - Others (like Qwen) show the signals clearly just from looking at the picture, even before they start reasoning about the question.
  - One model (FastVLM) was an outlier: it showed its "lie signals" in the middle of its thinking process, not at the end.
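Finding where in the model the "lie signal" lives amounts to a layer sweep: train one probe per layer and see where prediction is easiest. The sketch below fakes this with synthetic data in which only one "layer" carries a label-correlated signal; the layer count, dimensions, and the planted signal are all assumptions for illustration.

```python
# Hypothetical per-layer probe sweep: fit a small probe on each layer's
# hidden states and compare cross-validated accuracy across layers.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, d, n_layers, signal_layer = 300, 32, 6, 4

labels = rng.integers(0, 2, size=n)
# Pure noise everywhere, except one layer whose first dimension
# carries a label-correlated offset (the planted "lie signal").
layers = rng.normal(size=(n_layers, n, d))
layers[signal_layer, :, 0] += 2.0 * (labels - 0.5)

scores = [
    cross_val_score(LogisticRegression(max_iter=1000), layers[i], labels, cv=3).mean()
    for i in range(n_layers)
]
best = int(np.argmax(scores))
print(best)  # the layer where the probe separates best
```

Run on real activations, a sweep like this is what reveals the per-model differences: some models peak at the last layer, others (like the FastVLM case above) peak mid-stack.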
Why This Matters: The "Do Not Enter" Sign
Imagine a security guard at a gate.
- Before HALP: The guard lets everyone in, listens to their story, and then kicks them out if they are lying. This is slow and annoying.
- With HALP: The guard has a magic scanner that beeps if someone is about to lie. If the scanner beeps, the guard says, "Stop! I don't think you know the answer. Let's not waste time."
This allows AI systems to:
- Refuse to Answer: If the risk of lying is high, the AI can simply say, "I'm not sure," instead of making up a fake fact.
- Route Smartly: If the risk is high, the system can send the question to a super-smart (but slow) AI, while letting easy questions go to the fast, cheap AI.
- Save Money and Time: It stops the AI from wasting energy generating long, fake stories that have to be thrown away later.
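The three behaviors above reduce to simple thresholding on the pre-generation risk score. A minimal sketch, where the threshold values and action names are assumptions chosen for the example:

```python
# Illustrative decision logic for acting on a pre-generation
# hallucination-risk score in [0, 1]. Thresholds are assumptions.
def decide(risk: float, abstain_above: float = 0.9, escalate_above: float = 0.6) -> str:
    """Map a risk score to an action before any answer is generated."""
    if risk >= abstain_above:
        return "abstain"   # say "I'm not sure" instead of inventing a fact
    if risk >= escalate_above:
        return "escalate"  # route the question to a larger, slower model
    return "answer"        # let the fast, cheap model answer directly

print(decide(0.95))  # abstain
print(decide(0.70))  # escalate
print(decide(0.20))  # answer
```

Because the score arrives before generation starts, the "abstain" and "escalate" branches cost no wasted tokens: the system never pays for a long answer it would have to throw away.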
The Bottom Line
HALP is like a pre-crime detector for AI. It doesn't stop the AI from having bad ideas, but it gives us a way to spot those bad ideas before they become words. This makes AI safer, faster, and more trustworthy, especially in situations where getting the facts right is a matter of life and death.