MAP: Mitigating Hallucinations in Large Vision-Language Models with Map-Level Attention Processing

This paper introduces MAP, a training-free decoding method that mitigates hallucinations in Large Vision-Language Models by interpreting hidden states as a 2D semantic map and employing layer-wise criss-cross attention and global-local logit fusion to aggregate widely distributed factual information for improved factual consistency.

Chenxi Li, Yichen Guo, Benfang Qian, Jinhao You, Kai Tang, Yaosong Du, Zonghao Zhang, Xiande Huang

Published 2026-03-09

Imagine you have a very smart, well-read robot assistant that can look at pictures and describe them. This is called a Large Vision-Language Model (LVLM). It's great at talking, but sometimes it "hallucinates." This means it confidently describes things that aren't actually in the picture, like saying there's a cat in a photo of a dog, or claiming a car is red when it's blue.

For a long time, scientists tried to fix this by looking at the robot's brain in a very narrow way. They thought, "Maybe if we just check the robot's thoughts layer by layer (like checking floors of a building) or token by token (like reading along a single line of text), we can find the truth."

The Paper's Big Idea: The "2D Map" Perspective

The authors of this paper, MAP, realized that looking at the robot's brain in just one dimension (like a single line or a single floor) was missing the bigger picture.

They proposed a new way to look at the robot's thinking process: as a giant 2D Semantic Map.

  • The Old Way (1D): Imagine the robot's thoughts as a long, single-file line of people passing notes. If you want to find the truth, you only look at the person right in front of you or the person on the floor above you.
  • The MAP Way (2D): Imagine those same people standing in a massive grid, like a chessboard or a spreadsheet.
    • Rows represent different layers of the robot's brain (different levels of thinking).
    • Columns represent different words or parts of the image.

The authors discovered something amazing: The truth isn't hiding in just one spot. If the robot is looking at a "bed," the signal for "bed" isn't just in one specific layer or one specific word. It's scattered all over the map, like little breadcrumbs spread across the entire chessboard. Some breadcrumbs are on the top floor, some on the bottom, some near the start, some near the end.
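The "chessboard" above is easy to picture in code: if you collect a model's hidden states from every layer and stack them, you get exactly the rows-by-columns grid the authors describe. This is a toy sketch with random placeholder data and made-up shapes, not the paper's actual models or dimensions.

```python
import numpy as np

# Hypothetical stand-in for an LVLM forward pass that exposes every
# layer's hidden states. We fake them with random data:
# L layers ("floors"), T token positions, D hidden dimensions.
rng = np.random.default_rng(0)
num_layers, seq_len, hidden_dim = 6, 10, 16
per_layer_states = [rng.normal(size=(seq_len, hidden_dim))
                    for _ in range(num_layers)]

# The "2D semantic map": rows = layers, columns = token positions.
# Each cell of the grid is one hidden-state vector.
semantic_map = np.stack(per_layer_states)   # shape (L, T, D) = (6, 10, 16)
```

Under this view, a "breadcrumb" of truth is just a cell anywhere in this `(L, T)` grid whose vector carries evidence about, say, the "bed."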

How MAP Fixes the Problem

The paper introduces a method called MAP (Map-Level Attention Processing) to gather all these scattered breadcrumbs. It does this in two main steps, using some clever metaphors:

1. The "Criss-Cross" Detective (Layer-Wise Criss-Cross Attention)

Imagine you are a detective trying to solve a mystery about a "bed." Instead of just asking the person standing right next to you, you draw a giant plus sign (+) over the entire grid of people.

  • You ask everyone in the same row (same thinking layer).
  • You ask everyone in the same column (same word position).

By gathering clues from everyone on this "plus sign," the detective (the model) gets a much clearer picture of what's actually there. It ignores the noise and focuses on the scattered "truth" signals that were hiding in plain sight. This happens layer by layer, refining the answer as it goes.
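The "plus sign" the detective draws can be sketched as attention over one row and one column of the map. This is a simplified, assumed formulation (plain scaled dot-product attention, with the centre cell counted once), not the paper's exact layer-wise mechanism.

```python
import numpy as np

def criss_cross_attend(semantic_map, layer, pos):
    """Aggregate clues along the '+' through cell (layer, pos):
    the full row (same layer, every token position) and the full
    column (same position, every layer). A sketch, not the paper's
    precise formulation."""
    query = semantic_map[layer, pos]            # the detective's cell, (D,)
    row = semantic_map[layer]                   # same layer,   (T, D)
    col = semantic_map[:, pos]                  # same position, (L, D)
    # Remove the duplicate copy of the centre cell from the column.
    keys = np.concatenate([row, np.delete(col, layer, axis=0)])
    # Standard scaled dot-product attention over the '+'-shaped set.
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ keys                       # pooled clue vector, (D,)

# Tiny demo map: 4 layers, 5 positions, 8-dim states (placeholder data).
rng = np.random.default_rng(1)
demo_map = rng.normal(size=(4, 5, 8))
pooled = criss_cross_attend(demo_map, layer=2, pos=3)
```

Applied layer by layer, each pass pools scattered signals from the whole row and column into the current cell, which is the refining-as-it-goes behaviour described above.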

2. The "Local vs. Global" Panel (Global-Local Logit Fusion)

Once the detective has gathered the clues, they need to make a final decision.

  • Local View: This is like looking at the immediate neighborhood. It's great for details (e.g., "There are 3 cats").
  • Global View: This is like looking at the whole city from a helicopter. It's great for the big picture (e.g., "This is a park").

The paper found that sometimes the "Local" view is better, and sometimes the "Global" view is better. So, MAP acts like a wise judge who listens to both perspectives and combines their opinions to make the final, most accurate statement.
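The judge's job can be sketched as blending two sets of logits. Here we assume a simple fixed mixing weight `alpha`; the paper's actual fusion rule may weight the views adaptively, so treat this as an illustration of the idea only.

```python
import numpy as np

def log_softmax(x):
    """Put raw logits on a comparable log-probability scale."""
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def fuse_logits(local_logits, global_logits, alpha=0.4):
    """Blend the neighbourhood view and the helicopter view.
    alpha is a hypothetical mixing weight, not taken from the paper."""
    return (alpha * log_softmax(local_logits)
            + (1 - alpha) * log_softmax(global_logits))

# Toy vocabulary and made-up scores for the next word.
vocab = ["bed", "cat", "lamp"]
local_view  = np.array([1.0, 2.0, 0.1])   # detail view leans "cat"
global_view = np.array([3.0, 0.2, 0.1])   # scene context strongly says "bed"
fused = fuse_logits(local_view, global_view)
prediction = vocab[int(np.argmax(fused))]  # the "wise judge" verdict
```

In this toy run the global view outvotes the local one, so the fused verdict is "bed": exactly the kind of correction that stops a hallucinated "cat" from being emitted.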

Why This Matters

  • No Retraining Needed: The best part? You don't have to teach the robot a new language or feed it millions of new pictures. MAP is like putting on a new pair of glasses for the robot. It works immediately with existing models.
  • Better Results: In tests, this method reduced the robot's made-up objects far more effectively than previous methods. It was especially good at catching the robot when it guessed at things that weren't there.
  • Efficient: It's fast. Because it only looks at the "last token" (the current word being predicted) and its cross-section of the map, it doesn't slow the robot down significantly.

In a Nutshell

Previous methods tried to fix hallucinations by looking at the robot's brain through a narrow keyhole. MAP kicks the door open and looks at the whole room, realizing that the truth is scattered everywhere. By using a 2D map to gather clues from all directions (crossing rows and columns) and combining local details with global context, MAP helps the robot see the world exactly as it is, without making things up.