Imagine you have a very smart, well-read friend who loves to look at photos and describe what they see. This friend is an expert at reading, but they have a funny habit: sometimes, when they look at a picture of a quiet beach, they confidently say, "I see a pirate ship and a parrot!" even though neither is there.
This is exactly what happens with Large Vision-Language Models (LVLMs)—the AI systems that look at images and talk about them. They often "hallucinate" objects that don't exist.
The paper behind this summary, titled NoLan, asks a simple but crucial question: where does the problem lie? Is it the "eyes" (the part that sees the image) failing to see the truth? Or is it the "brain" (the part that speaks) getting too confident in its own guesses?
The Detective Work: Eyes vs. Brain
The researchers decided to play detective. They tested the "eyes" (the Vision Encoder) and found they were actually doing a great job. If you showed the AI a picture with a bear, the "eyes" correctly identified the bear.
So, the problem wasn't the eyes. The problem was the Brain (the Language Decoder).
Think of the AI's brain like a person who has read millions of books but has never actually left their house. If you show them a picture of a snowy mountain, their brain might immediately jump to, "Ah, a polar bear!" because in all the books they've read, snowy mountains and polar bears always go together. They are relying on Language Priors—their internal database of "what usually goes with what"—instead of actually looking at the picture.
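The "language prior" idea can be made concrete with a toy next-word guesser built purely from text co-occurrence counts. Everything here is invented for illustration (the scene names, the counts, the `guess_from_memory` helper); it just shows how a model that has only read about the world will pick whatever usually co-occurs, with no image involved at all:

```python
# Toy "language prior": next-object guesses learned from text alone.
# The counts below are made up purely for illustration.
prior = {
    "snowy mountain": {"polar bear": 8, "ski lift": 5, "rock": 2},
    "quiet beach": {"pirate ship": 9, "umbrella": 6, "sand": 4},
}

def guess_from_memory(scene):
    """Return the object the text-only prior finds most likely.

    Note that no image is consulted anywhere: this is pure memory,
    which is exactly how hallucinated objects sneak in.
    """
    options = prior[scene]
    return max(options, key=options.get)

print(guess_from_memory("snowy mountain"))  # polar bear (prior-driven guess)
print(guess_from_memory("quiet beach"))     # pirate ship (prior-driven guess)
```

A real language decoder does the same thing with probabilities over a huge vocabulary, but the failure mode is identical: the most "textually typical" object wins, whether or not it is in the picture.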
The Solution: NoLan (No-Language-Hallucination)
The researchers created a simple, clever trick called NoLan. They didn't need to retrain the AI or teach it new things. Instead, they gave it a "reality check" during the thinking process.
Here is how it works, using a Chef Analogy:
- The Old Way (Regular Decoding): Imagine a chef (the AI) trying to cook a dish based on a photo of ingredients on a counter. The chef is so used to cooking "Spaghetti Carbonara" that even if the photo only shows eggs and bacon, the chef's brain automatically adds "pasta" and "cheese" because that's what usually goes with bacon. The chef is ignoring the photo and following their memory.
- The NoLan Way: Now, imagine the chef has a second, smaller assistant.
  - First, the chef looks at the photo and starts listing ingredients: "Eggs, bacon, and..." Out of habit, the chef is already tempted to add "pasta."
  - Then, the assistant asks, "If you couldn't see the photo and I just mentioned bacon, what would you list?" From memory alone, the answer is, "Bacon usually goes with eggs, pasta, and cheese."
  - The Magic Step: NoLan compares the two answers word by word. "Pasta" ranks highly in both, which means the chef would have said it even without looking at the photo. That is the signature of a guess driven by memory, not by observation.
  - So NoLan says, "Stop! Ignore the pasta and cheese. Stick to what you actually see." It dynamically suppresses the chef's urge to add the extra ingredients.
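Stripped of the analogy, this "reality check" is a contrast between two sets of next-word scores: one computed with the image, one without. Below is a minimal, illustrative sketch; the function name, the fixed `alpha` weight, and the toy numbers are all assumptions for illustration, not the paper's exact formulation (NoLan adjusts the suppression dynamically):

```python
import numpy as np

def softmax(x):
    """Turn raw scores into probabilities (numerically stable)."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def nolan_style_step(logits_with_image, logits_without_image, alpha=1.0):
    """One decoding step of a NoLan-style 'reality check' (sketch only).

    Words that stay likely even when the image is hidden are prior-driven,
    so their scores are pushed down before picking the next word.
    """
    p_img = softmax(np.array(logits_with_image, dtype=float))
    p_txt = softmax(np.array(logits_without_image, dtype=float))
    # Contrast: reward what the image supports, penalize pure language priors.
    adjusted = np.log(p_img + 1e-9) - alpha * np.log(p_txt + 1e-9)
    return int(np.argmax(adjusted))

vocab = ["eggs", "bacon", "pasta"]
# Hypothetical scores: "pasta" narrowly wins with the image only because
# the language prior (the no-image scores) pushes it so hard.
with_image = [1.5, 1.4, 1.8]
without_image = [0.5, 0.5, 2.5]

print("naive pick: ", vocab[int(np.argmax(with_image))])                       # pasta
print("contrasted pick:", vocab[nolan_style_step(with_image, without_image)])  # eggs
```

The key design choice is that nothing is retrained: the same model is simply queried twice (with and without the image), and the two score vectors are compared at decoding time.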
Why is this cool?
- It's Training-Free: You don't need to spend weeks teaching the AI new things. You just add this "reality check" step when it's generating an answer.
- It's Fast: It doesn't slow the AI down much.
- It Works Everywhere: They tested it on different AI models (like LLaVA and Qwen-VL) and different tasks, and it consistently stopped the AI from making things up.
The Result
Before NoLan, the AI might look at a picture of a dog and say, "I see a dog, a ball, and a frisbee." (The frisbee wasn't there).
After NoLan, the AI looks at the same picture and says, "I see a dog." (Accurate!).
In short, NoLan is like giving the AI a pair of glasses that forces it to trust what it sees in front of it, rather than what it thinks it should see based on its past reading. It makes the AI more honest, reliable, and less prone to making things up.