Causal Decoding for Hallucination-Resistant Multimodal Large Language Models

This paper proposes a causal decoding framework that intervenes during generation to attenuate spurious dependencies, effectively reducing object hallucinations in Multimodal Large Language Models while maintaining high descriptive quality and achieving state-of-the-art faithfulness.

Shiwei Tan, Hengyi Wang, Weiyi Qin, Qi Xu, Zhigang Hua, Hao Wang

Published 2026-02-26

Imagine you have a very smart, well-read robot assistant named MLLM (Multimodal Large Language Model). You show it a picture of a pizza on a table and ask, "What's in this picture?"

Ideally, the robot should say, "I see a pizza on a plate."

But sometimes, this robot gets a little too creative. Because it has read millions of stories about pizza parties, it might confidently say, "I see a pizza, a plate, a knife, and a fork."

The problem? There is no knife or fork in the picture. The robot is hallucinating. It's making up objects that aren't there because it's relying too much on what it expects to see based on the words it just wrote, rather than what it's actually seeing.

This paper introduces a new method called COAD (Causal Object-Aware Decoding) to fix this. Here is how it works, explained simply.

The Problem: The "Echo Chamber" Effect

Think of the robot's brain like a noisy room.

  1. The Image: You show the robot a picture of a pizza.
  2. The Text: The robot starts writing, "A slice of pizza..."
  3. The Glitch: As soon as it writes "pizza," its brain starts thinking about "pizza parties." It starts imagining a knife and fork, even though they aren't in the photo. It then writes, "...with a knife and fork."

The robot is trapped in an echo chamber. It hears its own previous words ("pizza") and assumes they must be followed by other words associated with pizza ("knife"), ignoring the actual visual evidence. It's like a person telling a story who gets so carried away with the plot that they invent characters who never showed up.

The Solution: COAD (The "Fact-Checker" Robot)

The authors built a system called COAD to stop this. It uses two main tricks: a Detective and a Time-Traveler.

1. The Detective (The Object Detector)

Before the robot starts writing its story, COAD brings in a specialized Detective (an object detection tool).

  • The Detective looks at the photo and says, "I see a pizza. I see a plate. I do not see a knife. I do not see a fork."
  • This list of facts is handed to the robot as a strict rulebook.
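If you like code, here is a tiny sketch of the Detective step. The detector itself is mocked (the paper pairs COAD with an object detection tool, but the function and threshold names below are made up for illustration):

```python
# Sketch of the "Detective": turn raw detector output into a strict
# rulebook of objects that are actually visible in the photo.

CONF_THRESHOLD = 0.5  # illustrative: keep only confident detections

def build_rulebook(detections):
    """detections: list of (label, confidence) pairs from any detector."""
    return {label for label, conf in detections if conf >= CONF_THRESHOLD}

# Mock detector output for the pizza photo:
detections = [("pizza", 0.97), ("plate", 0.91), ("fork", 0.12)]
rulebook = build_rulebook(detections)
print(sorted(rulebook))  # the low-confidence "fork" is excluded
```

The rulebook is just a set of object names; later decoding steps consult it before committing to a word.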

2. The Time-Traveler (Causal Inference)

This is the clever part. The robot usually writes its story word-by-word. If it writes "pizza," it might get tempted to write "knife" next.

  • Normal Robot: "I see a pizza... maybe a knife?" (It guesses based on its memory).
  • COAD Robot: The system uses a mathematical trick called Causal Inference. It essentially asks the robot: "If we magically removed the influence of the words you just wrote, and looked only at the Detective's list and the picture, what would you say?"

It's like asking a witness in a courtroom: "Ignore what the lawyer just said to you. Based only on what you saw with your own eyes, what happened?"

By "cutting the wire" between the robot's previous guesses and its next guess, COAD forces the robot to rely on the Detective's list (the visual truth) rather than its own imagination (the text echo).
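One simple way to picture "cutting the wire" in code is a contrastive-style adjustment: score the next word twice, once with the image and once from the text echo alone, and subtract the echo. This is a hedged sketch of the idea, not the paper's exact intervention; all names are illustrative:

```python
def causal_decode_step(full_logits, text_only_logits, alpha=1.0):
    """Hypothetical intervention: subtract the text-only prior so the
    next word depends on visual evidence, not on the word echo."""
    adjusted = {t: full_logits[t] - alpha * text_only_logits[t]
                for t in full_logits}
    return max(adjusted, key=adjusted.get)

# Toy next-word scores after the robot has written "A slice of pizza...":
full = {"plate": 2.0, "knife": 1.8}       # conditioned on image + text
text_only = {"plate": 0.2, "knife": 1.9}  # language prior alone ("echo")
print(causal_decode_step(full, text_only))  # -> "plate"
```

Without the subtraction, "knife" is nearly as likely as "plate"; once the echo is removed, only the visually grounded word survives.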

How It Works in Practice

Imagine you are writing a story about a beach.

  • Without COAD: You write "The sun is hot." Your brain thinks, "Hot sun means beach," so you write "There is a surfboard." But wait, there is no surfboard in the photo! You just hallucinated it because "sun" and "surfboard" often go together in your training data.
  • With COAD:
    1. A Detective checks the photo and says: "Sun: Yes. Surfboard: No."
    2. The Causal System intervenes. It tells the writer: "Stop guessing based on the word 'sun'. Look at the Detective's note. The note says 'No surfboard'. Therefore, do not write 'surfboard'."
    3. The writer says: "The sun is hot. The sand is golden." (Accurate and truthful).
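The beach walkthrough above combines both tricks: consult the Detective's note, then penalize any object word the note rules out. A minimal sketch, with an illustrative penalty strength (the real method's weighting is not spelled out here):

```python
PENALTY = 10.0  # illustrative: how hard the intervention pushes back

def constrained_step(logits, object_tokens, rulebook):
    """Downweight object words the detector did not find in the photo."""
    adjusted = dict(logits)
    for token in object_tokens:
        if token not in rulebook:
            adjusted[token] -= PENALTY  # "the note says: no surfboard"
    return max(adjusted, key=adjusted.get)

logits = {"surfboard": 3.1, "sand": 2.7, "golden": 1.5}
rulebook = {"sun", "sand"}        # the Detective's note for this photo
objects = {"surfboard", "sand"}   # which candidate words name objects
print(constrained_step(logits, objects, rulebook))  # -> "sand"
```

"Surfboard" started out as the most likely word, but the rulebook check knocks it out of contention, so the writer describes the sand instead.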

The Results

The researchers tested this on many different pictures and questions.

  • Old Way: The robot made up objects (like the knife and fork) about 30% of the time in some tests.
  • COAD Way: The robot made up objects less than 6% of the time.
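Hallucination rates like the ones above are typically measured by comparing the objects a caption mentions against the objects actually in the image (the CHAIR-style metrics are a common choice; the paper's exact benchmark setup may differ). A toy version:

```python
def hallucination_rate(mentioned, ground_truth):
    """Fraction of mentioned objects that are absent from the image."""
    made_up = [obj for obj in mentioned if obj not in ground_truth]
    return len(made_up) / len(mentioned)

# The knife-and-fork caption versus what is really in the photo:
rate = hallucination_rate({"pizza", "plate", "knife", "fork"},
                          {"pizza", "plate"})
print(rate)  # -> 0.5 (two of four mentioned objects were invented)
```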

It's like taking a storyteller who loves to make things up and giving them a strict editor who holds a magnifying glass over the photo, ensuring every single word matches reality.

Why This Matters

This is a big deal for real-world uses.

  • Medical: If a robot looks at an X-ray, it shouldn't invent a broken bone that isn't there.
  • Legal: If a robot describes a crime scene photo, it shouldn't hallucinate a weapon that wasn't present.

COAD doesn't just "try harder" to be accurate; it changes how the robot thinks. It forces the robot to stop listening to its own internal chatter and start listening to the visual evidence, making it a much more reliable assistant.
