MAP: Mitigating Hallucinations in Large Vision-Language Models with Map-Level Attention Processing

This paper introduces MAP, a training-free decoding method that mitigates hallucinations in Large Vision-Language Models by interpreting hidden states as a 2D semantic map and employing layer-wise criss-cross attention and global-local logit fusion to aggregate widely distributed factual information for improved factual consistency.

Chenxi Li, Yichen Guo, Benfang Qian, Jinhao You, Kai Tang, Yaosong Du, Zonghao Zhang, Xiande Huang

Published 2026-03-09

Imagine you have a very smart, well-read robot assistant that can look at pictures and describe them. This is called a Large Vision-Language Model (LVLM). It's great at talking, but sometimes it "hallucinates." This means it confidently describes things that aren't actually in the picture, like saying there's a cat in a photo of a dog, or claiming a car is red when it's blue.

For a long time, scientists tried to fix this by looking at the robot's brain in a very narrow way. They thought, "Maybe if we just check the robot's thoughts layer by layer (like checking floors of a building) or token by token (like reading along a single line of text), we can find the truth."

The Paper's Big Idea: The "2D Map" Perspective

The authors of this paper, MAP, realized that looking at the robot's brain in just one dimension (like a single line or a single floor) was missing the bigger picture.

They proposed a new way to look at the robot's thinking process: as a giant 2D Semantic Map.

  • The Old Way (1D): Imagine the robot's thoughts as a long, single-file line of people passing notes. If you want to find the truth, you only look at the person right in front of you or the person on the floor above you.
  • The MAP Way (2D): Imagine those same people standing in a massive grid, like a chessboard or a spreadsheet.
    • Rows represent different layers of the robot's brain (different levels of thinking).
    • Columns represent different words or parts of the image.

The authors discovered something amazing: The truth isn't hiding in just one spot. If the robot is looking at a "bed," the signal for "bed" isn't just in one specific layer or one specific word. It's scattered all over the map, like little breadcrumbs spread across the entire chessboard. Some breadcrumbs are on the top floor, some on the bottom, some near the start, some near the end.
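The "chessboard" above is easy to picture in code: if you collect a model's hidden states from every layer and stack them, you get exactly the rows-by-columns grid the authors describe. This is a toy sketch with random placeholder data and made-up shapes, not the paper's actual models or dimensions.

```python
import numpy as np

# Hypothetical stand-in for an LVLM forward pass that exposes every
# layer's hidden states. We fake them with random data:
# L layers ("floors"), T token positions, D hidden dimensions.
rng = np.random.default_rng(0)
num_layers, seq_len, hidden_dim = 6, 10, 16
per_layer_states = [rng.normal(size=(seq_len, hidden_dim))
                    for _ in range(num_layers)]

# The "2D semantic map": rows = layers, columns = token positions.
# Each cell of the grid is one hidden-state vector.
semantic_map = np.stack(per_layer_states)   # shape (L, T, D) = (6, 10, 16)
```

Under this view, a "breadcrumb" of truth is just a cell anywhere in this `(L, T)` grid whose vector carries evidence about, say, the "bed."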

How MAP Fixes the Problem

The paper introduces a method called MAP (Map-Level Attention Processing) to gather all these scattered breadcrumbs. It does this in two main steps, using some clever metaphors:

1. The "Criss-Cross" Detective (Layer-Wise Criss-Cross Attention)

Imagine you are a detective trying to solve a mystery about a "bed." Instead of just asking the person standing right next to you, you draw a giant plus sign (+) over the entire grid of people.

  • You ask everyone in the same row (same thinking layer).
  • You ask everyone in the same column (same word position).

By gathering clues from everyone on this "plus sign," the detective (the model) gets a much clearer picture of what's actually there. It ignores the noise and focuses on the scattered "truth" signals that were hiding in plain sight. This happens layer by layer, refining the answer as it goes.
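The "plus sign" the detective draws can be sketched as attention over one row and one column of the map. This is a simplified, assumed formulation (plain scaled dot-product attention, with the centre cell counted once), not the paper's exact layer-wise mechanism.

```python
import numpy as np

def criss_cross_attend(semantic_map, layer, pos):
    """Aggregate clues along the '+' through cell (layer, pos):
    the full row (same layer, every token position) and the full
    column (same position, every layer). A sketch, not the paper's
    precise formulation."""
    query = semantic_map[layer, pos]            # the detective's cell, (D,)
    row = semantic_map[layer]                   # same layer,   (T, D)
    col = semantic_map[:, pos]                  # same position, (L, D)
    # Remove the duplicate copy of the centre cell from the column.
    keys = np.concatenate([row, np.delete(col, layer, axis=0)])
    # Standard scaled dot-product attention over the '+'-shaped set.
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ keys                       # pooled clue vector, (D,)

# Tiny demo map: 4 layers, 5 positions, 8-dim states (placeholder data).
rng = np.random.default_rng(1)
demo_map = rng.normal(size=(4, 5, 8))
pooled = criss_cross_attend(demo_map, layer=2, pos=3)
```

Applied layer by layer, each pass pools scattered signals from the whole row and column into the current cell, which is the refining-as-it-goes behaviour described above.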

2. The "Local vs. Global" Panel (Global-Local Logit Fusion)

Once the detective has gathered the clues, they need to make a final decision.

  • Local View: This is like looking at the immediate neighborhood. It's great for details (e.g., "There are 3 cats").
  • Global View: This is like looking at the whole city from a helicopter. It's great for the big picture (e.g., "This is a park").

The paper found that sometimes the "Local" view is better, and sometimes the "Global" view is better. So, MAP acts like a wise judge who listens to both perspectives and combines their opinions to make the final, most accurate statement.
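The judge's job can be sketched as blending two sets of logits. Here we assume a simple fixed mixing weight `alpha`; the paper's actual fusion rule may weight the views adaptively, so treat this as an illustration of the idea only.

```python
import numpy as np

def log_softmax(x):
    """Put raw logits on a comparable log-probability scale."""
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def fuse_logits(local_logits, global_logits, alpha=0.4):
    """Blend the neighbourhood view and the helicopter view.
    alpha is a hypothetical mixing weight, not taken from the paper."""
    return (alpha * log_softmax(local_logits)
            + (1 - alpha) * log_softmax(global_logits))

# Toy vocabulary and made-up scores for the next word.
vocab = ["bed", "cat", "lamp"]
local_view  = np.array([1.0, 2.0, 0.1])   # detail view leans "cat"
global_view = np.array([3.0, 0.2, 0.1])   # scene context strongly says "bed"
fused = fuse_logits(local_view, global_view)
prediction = vocab[int(np.argmax(fused))]  # the "wise judge" verdict
```

In this toy run the global view outvotes the local one, so the fused verdict is "bed": exactly the kind of correction that stops a hallucinated "cat" from being emitted.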

Why This Matters

  • No Retraining Needed: The best part? You don't have to teach the robot a new language or feed it millions of new pictures. MAP is like putting on a new pair of glasses for the robot. It works immediately with existing models.
  • Better Results: In tests, this method reduced the robot's made-up objects far more effectively than previous methods. It was especially good at catching the robot when it guessed at things that weren't there.
  • Efficient: It's fast. Because it only looks at the "last token" (the current word being predicted) and its cross-section of the map, it doesn't slow the robot down significantly.

In a Nutshell

Previous methods tried to fix hallucinations by looking at the robot's brain through a narrow keyhole. MAP kicks the door open and looks at the whole room, realizing that the truth is scattered everywhere. By using a 2D map to gather clues from all directions (crossing rows and columns) and combining local details with global context, MAP helps the robot see the world exactly as it is, without making things up.