Imagine you have a very smart, artistic assistant named LVLM (Large Vision-Language Model). You show it a picture of a busy street, and you ask, "What do you see?"
Ideally, it should say, "I see a red car, a dog, and a tree." But often, this assistant gets a bit "dreamy." It might confidently say, "I see a red car, a dog, a tree, and a flying elephant," even though there is no elephant in the picture. This is called a hallucination.
The Old Fix: The "Flashlight" Problem
Previously, researchers tried to fix this by shining a giant, blinding flashlight on the picture itself. They told the assistant: "Look at the photo! Look at the photo! Ignore everything else!"
- The Good: The assistant stopped seeing the flying elephant.
- The Bad: Because the assistant was so focused on the photo, it forgot how to speak properly. It started repeating itself like a broken record: "I see a red car. I see a red car. I see a red car." It lost its ability to tell a smooth, interesting story.
The New Idea: Listening to Your Own Voice
The authors of this paper, who call their method AdaIAT, realized something clever. They noticed that when the assistant is telling the truth, it pays attention to what it just said. When it starts hallucinating (making things up), it stops listening to its own previous words.
Think of it like a conversation:
- Truthful mode: "I see a car. And next to the car, there is a dog." (It remembers the car to describe the dog).
- Hallucination mode: It forgets the car and suddenly says, "There is an elephant!" because it's not listening to the context it just built.
The Solution: AdaIAT (The Smart Editor)
Instead of just shining a flashlight on the photo, the authors teach the assistant to pay more attention to its own voice.
IAT (The Basic Version): They tell the assistant, "Hey, when you are describing the picture, listen carefully to the words you just wrote." This helps the assistant stay grounded in reality without forgetting how to speak fluently. It stops the "flying elephant" without making the assistant repeat "red car" a thousand times.
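The paper's exact formulation isn't reproduced here, but the basic idea, nudging the model's attention toward its own previously generated words, can be sketched in a few lines of toy numpy. Everything below is illustrative: the function name `iat_attention`, the additive boost `alpha`, and the example numbers are assumptions, not the paper's actual math.

```python
import numpy as np

def softmax(x):
    """Standard softmax over a 1-D array of logits."""
    e = np.exp(x - x.max())
    return e / e.sum()

def iat_attention(logits, text_mask, alpha=1.5):
    """Toy sketch of 'listen to your own voice': before softmax,
    add a boost (alpha, a hypothetical knob) to the attention logits
    of previously generated text tokens, so they claim a larger
    share of attention relative to the image tokens."""
    boosted = logits + alpha * text_mask.astype(float)
    return softmax(boosted)

logits = np.array([2.0, 1.0, 0.5, 0.2])         # two image tokens, two text tokens
text_mask = np.array([False, False, True, True])
plain = softmax(logits)
boosted = iat_attention(logits, text_mask)
# boosted gives the two text tokens a larger share than plain does
```

The design choice worth noticing: the boost is applied to the logits *before* the softmax, so the result is still a proper attention distribution (it sums to 1); attention on the image tokens shrinks proportionally rather than being zeroed out.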
AdaIAT (The Advanced Version): The basic version is good, but sometimes the assistant gets too excited about its own voice and starts ignoring the picture entirely.
- The Fix: AdaIAT acts like a smart editor with a special rulebook.
- When to intervene: It only steps in when it senses the assistant is about to drift off into a daydream (hallucinate). If the assistant is doing a good job, the editor stays quiet.
- How to intervene: It doesn't use a "one-size-fits-all" volume knob. Instead, it has a different volume knob for every part of the assistant's brain (called "attention heads"). If one part of the brain is struggling, it turns up the volume just for that part. If another part is doing fine, it leaves it alone.
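The per-head "volume knob" idea above can also be sketched as toy code. Caveat: the signal used here for deciding when a head needs help (how little attention it currently gives to prior text) and the knobs `threshold` and `max_alpha` are stand-ins invented for illustration; the paper's actual intervention criterion and scaling rule may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    """Softmax along the last axis, numerically stabilized."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaiat_attention(logits, text_mask, threshold=0.2, max_alpha=2.0):
    """Toy per-head adaptive boost. logits: (num_heads, num_tokens)
    attention logits for one query. Heads that already give at least
    `threshold` of their attention to prior text are left alone;
    the rest get a boost scaled by how far below threshold they are.
    (threshold and max_alpha are hypothetical knobs.)"""
    probs = softmax(logits)
    text_share = (probs * text_mask).sum(axis=-1)     # per-head attention on text
    gap = np.clip(threshold - text_share, 0.0, None)  # 0 for heads doing fine
    alpha = max_alpha * gap / threshold               # per-head boost strength
    boosted = logits + alpha[:, None] * text_mask
    return softmax(boosted)

heads = np.array([[0.1, 0.1, 2.0, 2.0],   # head already listening to text
                  [2.0, 2.0, 0.1, 0.1]])  # head ignoring text
mask = np.array([0.0, 0.0, 1.0, 1.0])
out = adaiat_attention(heads, mask)
# head 0 is left untouched; head 1 has its text attention turned up
```

Note how this captures both rules from the list above: "when to intervene" (the `gap` is zero for well-behaved heads, so the editor stays quiet) and "how to intervene" (each head gets its own `alpha`, not one shared volume knob).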
The Result
By using this "Smart Editor" approach:
- Fewer Lies: The assistant stops inventing flying elephants.
- Better Stories: The assistant doesn't get stuck in a loop of repeating words. It tells a rich, diverse, and accurate story about the image.
- Balance: It finds the perfect sweet spot between looking at the picture and remembering what it just said.
In short: Instead of forcing the assistant to stare harder at the photo (which makes it stutter), they taught it to listen to its own story (which keeps it honest and fluent).