Revealing and Enhancing Core Visual Regions: Harnessing Internal Attention Dynamics for Hallucination Mitigation in LVLMs

The paper proposes PADE, a training-free method that mitigates hallucinations in Large Vision-Language Models by leveraging internal Positive Attention Dynamics to identify and enhance core visual regions while adaptively scaling interventions and compensating for system tokens to ensure instruction adherence.

Guangtao Lyu, Qi Liu, Chenghao Xu, Jiexi Yan, Muli Yang, Xueting Li, Fen Fang, Cheng Deng

Published 2026-02-18

Imagine you have a very smart, well-read assistant (a Large Vision-Language Model, or LVLM) who can look at pictures and describe them. This assistant is brilliant, but it has a quirky habit: sometimes, it gets so distracted by its own internal "noise" that it starts making things up. It might look at a picture of a red apple and confidently say, "That's a green pear," or describe a dog running toward water when it's actually running away.

This paper introduces a new, clever trick called PADE (Positive Attention Dynamics Enhancement) to fix this problem without needing to retrain the assistant or hire extra helpers.

Here is the breakdown of how it works, using simple analogies:

1. The Problem: The "Loud Roommate" (Attention Sinks)

When the assistant looks at a picture, it breaks the image down into tiny pieces (tokens) and tries to decide which pieces are important.

  • The Issue: In the assistant's brain, there are certain "Loud Roommates" (called Attention Sinks). These are usually boring, generic parts of the image (like the background or the start of the sentence) that scream the loudest.
  • The Result: Because these "Loud Roommates" are so loud, the assistant ignores the actual interesting stuff (the apple, the dog) and focuses on the noise. This causes it to hallucinate (make things up) because it's not actually looking at the important details.

2. The Old Solutions: The "Heavy Hammers"

Scientists tried to fix this before, but their methods were clunky:

  • The "Double-Check" Method: They made the assistant look at the picture twice (once normally, once with the picture deliberately degraded) and contrasted the two answers to cancel out the made-up parts. It works, but it roughly doubles the work for every answer.
  • The "Hire a Detective" Method: They brought in a separate, smaller AI to point out what's actually in the picture. This is like hiring a security guard to watch your house while you sleep. It works, but it's expensive, and the guard's report can conflict with what the assistant itself sees.
  • The "Volume Knob" Method: They tried to just turn up the volume on the parts of the image that seemed important. But because the "Loud Roommates" were already so loud, this just made the noise even louder, making the hallucinations worse.

3. The New Solution: PADE (The "Spotlight Tracker")

The authors realized that while the "Loud Roommates" are always loud, the important parts of the image (like the apple) have a different behavior. They don't just stay loud; they grow louder as the assistant thinks deeper.

Think of it like a detective in a crowded room:

  • Static Attention (Old Way): You look at who is shouting the loudest right now. That's usually the "Loud Roommate" (the background noise).
  • PADE (New Way): You watch who is getting more interested as the conversation goes on. If the assistant starts paying more attention to the apple as it moves from the first layer of thinking to the last, that's a sign the apple is real and important.
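The difference between the two ways of looking can be sketched with toy numbers (the values and the simple early-vs-late comparison here are illustrative; the paper works with real cross-layer attention maps inside the model):

```python
import numpy as np

# Toy attention that 4 image tokens receive at an early and a late layer.
# Token 0 is an "attention sink": loud at every layer but not growing.
# Token 2 (say, the apple) starts quiet but gains attention with depth.
early = np.array([0.70, 0.10, 0.05, 0.15])
late  = np.array([0.65, 0.08, 0.22, 0.05])

static_pick  = int(np.argmax(late))    # loudest right now -> the sink
growth       = late - early            # cross-layer attention change
dynamic_pick = int(np.argmax(growth))  # fastest-growing -> the apple

print(static_pick, dynamic_pick)  # prints: 0 2
```

The static view keeps pointing at the sink (token 0), while the growth signal singles out the token that is actually gaining importance as the model thinks.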

PADE works in three simple steps:

  1. Track the Growth (Positive Attention Dynamics): Instead of asking "Who is loudest?", PADE asks, "Who is getting more attention as we think?" It ignores the static noise and focuses only on the parts of the image that are gaining importance. This reveals the "True Core" of the image.
  2. Adjust the Volume (MAD Scaling): The assistant's brain is messy; in some layers attention runs super loud, in others it's quiet. PADE uses a robust "volume knob" (based on the Median Absolute Deviation) to adaptively size the boost, so the important parts get amplified without blowing out the rest of the system.
  3. Don't Forget the Instructions (System-Token Compensation): Sometimes, if you boost the image too much, the assistant forgets what you asked it to do (e.g., "Describe the apple"). PADE has a safety net: it takes a tiny bit of attention away from the "System Tokens" (the boring, generic parts of the prompt that don't matter much) and gives it to the apple. This way, the assistant sees the apple clearly without forgetting your question.
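Put together, the three steps can be sketched roughly as follows. This is a minimal toy version: the function name, the `alpha` strength, and the score cap are all illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def pade_sketch(attn_early, attn_late, image_idx, system_idx, alpha=0.2):
    """Illustrative rescaling of one attention row (toy version of PADE).
    attn_early / attn_late: attention each token receives at an early / late layer.
    image_idx / system_idx: index arrays of image and system-prompt tokens.
    alpha: boost strength -- a hypothetical hyperparameter, not from the paper."""
    attn = attn_late.copy()

    # 1. Positive attention dynamics: keep only attention *growth* across
    #    depth, which filters out the always-loud attention sinks.
    dyn = np.maximum(attn_late - attn_early, 0.0)

    # 2. MAD scaling: normalize the growth robustly, so one outlier token
    #    (or a degenerate spread) cannot blow up the boost.
    med = np.median(dyn[image_idx])
    mad = np.median(np.abs(dyn[image_idx] - med)) + 1e-8
    score = np.clip((dyn[image_idx] - med) / mad, 0.0, 3.0)  # cap is illustrative

    boost = alpha * score * attn[image_idx]
    attn[image_idx] += boost

    # 3. System-token compensation: pay for the boost by dimming the system
    #    tokens proportionally, so the row still sums to (about) one and the
    #    instruction is weakened gently rather than forgotten.
    attn[system_idx] *= max(1.0 - boost.sum() / (attn[system_idx].sum() + 1e-8), 0.0)
    return attn / attn.sum()  # final renormalization for safety
```

On a toy attention row where one image token is gaining attention across layers, this boosts that token's share, dims the system tokens to pay for it, and keeps the row summing to one.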

The Result

By using this "Spotlight Tracker," the assistant stops making things up.

  • It stops saying the apple is green.
  • It stops saying there is a cup in the picture when there isn't one.
  • It does all this without retraining and cheaply, at inference time, with no extra models or second passes needed.

In short: PADE teaches the AI to ignore the background noise and focus on the parts of the image that are actively becoming more interesting as it thinks, ensuring it tells the truth about what it sees.
