Imagine you have a very smart robot assistant (a Large Vision-Language Model, or LVLM) that looks at a photo and describes it to you. Sometimes, this robot gets a little too confident and starts making things up. It might say, "There's a dog in the picture," even though there isn't one. This is called a hallucination.

The robot does this because it relies too much on its "memory" of how the world usually sounds (language priors) and not enough on what it is actually seeing in the photo at that specific moment.

The Old Way: The "One-Size-Fits-All" Volume Knob

Recently, researchers tried to fix this by turning up the "volume" on the robot's visual attention. Imagine the robot has a volume knob for "looking at the picture" and another for "listening to its own thoughts."

Previous methods used a fixed volume knob. They would just crank up the "look at the picture" volume by the same amount for every single word the robot wrote.

The Problem: This is like trying to listen to a quiet whisper and a loud shout with the same volume setting.
- Sometimes the robot needs to look harder at the picture to stop a lie, but the fixed volume is too low (Under-amplification).
- Other times, the robot is already looking closely, but the volume is turned up so high it starts "seeing" things that aren't there just because it's staring too hard (Over-amplification).

The paper calls this the "Dual Failure Pattern": the fixed setting is sometimes too weak and sometimes too strong.

The New Solution: SAVAA (The Smart, Adaptive Pilot)

The authors propose a new method called SAVAA (Step-wise Adaptive Visual Attention Amplification). Instead of a fixed knob, SAVAA acts like a smart pilot who adjusts the controls in real-time, word by word.

Here is how it works, step-by-step:

1. The "Risk Radar" (Visual Grounding Entropy)

Before the robot writes the next word, SAVAA checks a "Risk Radar" to see if the robot is about to lie. It uses a special tool called Visual Grounding Entropy (VGE).

How it works: It asks two questions:
1. Is the robot unsure? (High uncertainty = Risk).
2. Is the robot looking at the picture for this word? (If the robot is confident but the picture doesn't support the word, that's a huge risk).
The Analogy: Imagine the robot is describing a beach. If it says "sand," the picture supports it, so the risk is low. If it says "snow" (and the picture is clearly a beach), the risk is high, even if the robot feels very confident about the word "snow."

2. The "Dynamic Volume Knob"

Based on the Risk Radar, SAVAA adjusts the "look at the picture" volume for the next word:

High Risk: If the word the robot just wrote looked risky, SAVAA cranks up the visual attention for the next word. It forces the robot to look closely at the photo before it continues.
Low Risk: If the robot is safe and the picture clearly supports the word, SAVAA leaves the volume near its normal level. This prevents the robot from getting "too intense" and inventing new details.

3. The "Mute Button" for Old Thoughts

Sometimes, even with a perfect visual focus, the robot still relies too much on its internal "storytelling" habits. To fix this, SAVAA adds a Text Attention Suppression feature.

The Analogy: Imagine the robot is trying to describe a photo, but it keeps getting distracted by a radio playing a story in the background. SAVAA turns the radio (the text input) down so the robot pays more attention to the photo. This stops the robot from finishing a sentence just because it "sounds right" in a story, rather than because it's in the photo.

Why This Matters

The paper tested this on three different smart robots (LLaVA, Qwen, and InternVL) using various tests where the robots had to describe images.

The Result: SAVAA significantly reduced the number of made-up details (hallucinations) compared to the old "fixed volume" methods.
The Efficiency: It does all this without retraining the robot and with only a marginal slowdown. It just tweaks the attention knobs while the robot is thinking.

Summary

Think of the old method as a driver who keeps the gas pedal pressed at 50% whether they are driving on a straight highway or a winding mountain road. They might crash (hallucinate) because they didn't slow down for the curve or speed up enough for the hill.

SAVAA is the driver who constantly checks the road, the speed, and the conditions, adjusting the gas and brakes for every single turn. The goal is for the robot to describe what it actually sees, nothing more and nothing less.

Technical Summary: SAVAA – Mitigating Hallucinations in LVLMs via Step-wise Adaptive Visual Attention Amplification

1. Problem Statement

Large Vision-Language Models (LVLMs) frequently suffer from hallucinations, generating content that is inconsistent with the visual input. A primary cause during autoregressive generation is the model's over-reliance on language priors, particularly when visual evidence is under-utilized.

Recent training-free approaches, collectively termed Visual Attention Amplification (VAA), attempt to mitigate this by amplifying attention weights assigned to visual tokens during a single forward pass. However, existing VAA methods employ a fixed amplification factor across all generation steps. The authors identify a dual failure pattern inherent in this static approach:

Under-amplification: At certain steps, the fixed factor is too weak to resolve existing hallucinations.
Over-amplification: At other steps, the same factor is too strong, introducing new hallucinations that were not present in the original output.

This phenomenon occurs because the appropriate level of visual attention varies dynamically depending on the specific token being generated and the model's current state of uncertainty and grounding.

2. Methodology: SAVAA

The paper proposes Step-wise Adaptive Visual Attention Amplification (SAVAA), a framework that dynamically calibrates the visual amplification factor at each generation step based on an estimated hallucination risk.

2.1 Visual Grounding Entropy (VGE)

To enable risk-guided calibration, SAVAA requires a lightweight hallucination-risk estimator. The authors introduce Visual Grounding Entropy (VGE), which augments standard predictive entropy with visual grounding signals.

Predictive Entropy ( $\bar{H}_t$ ): Measures the model's uncertainty based on output logits. However, entropy alone fails when the model is highly confident but weakly grounded in the image (i.e., "confident hallucinations").
Visual Grounding Score ( $G_t$ ): During the prefilling stage, a grounding vector $G$ is constructed by pooling the vocabulary logits of visual tokens. For a predicted token $v^*_t$ , the grounding score $G_t$ is the entry of $G$ for that token and represents the strength of visual evidence supporting it.
VGE Calculation: The risk score is defined as a weighted combination:
$VGE_t = \alpha \bar{H}_t + (1 - \alpha)(1 - G_t)$
where $\alpha$ balances uncertainty and grounding. A higher VGE indicates a token is uncertain, weakly grounded, or both.
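
The VGE formula above can be illustrated with a minimal sketch. It assumes that $\bar{H}_t$ is Shannon entropy divided by the log of the vocabulary size and that $G_t$ already lies in $[0, 1]$; the summary states neither detail, and the function names and the default `alpha=0.5` are illustrative rather than taken from the paper.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of a next-token distribution, scaled to [0, 1].

    Assumption: the bar in H-bar means division by log(vocabulary size).
    Requires a distribution over at least two tokens.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))

def visual_grounding_entropy(probs, grounding_score, alpha=0.5):
    """VGE_t = alpha * H_bar_t + (1 - alpha) * (1 - G_t).

    grounding_score is G_t for the predicted token, assumed to lie in [0, 1].
    alpha=0.5 is an illustrative default, not a value from the paper.
    """
    h_bar = normalized_entropy(probs)
    return alpha * h_bar + (1.0 - alpha) * (1.0 - grounding_score)

# Same confident (low-entropy) distribution, different visual support.
confident = [0.97, 0.01, 0.01, 0.01]
weak = visual_grounding_entropy(confident, grounding_score=0.1)
strong = visual_grounding_entropy(confident, grounding_score=0.9)
```

In this toy case `weak` exceeds `strong` although the entropy term is identical, which is the "confident hallucination" case that entropy alone would miss.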

2.2 Step-wise Adaptive VAA Factor Calibration

SAVAA utilizes a one-step-lagged calibration mechanism. The risk $r_t$ is computed from the logits of the current step $t$ but is used to calibrate the amplification factor for the next step $t+1$ .

Risk Normalization: The VGE is normalized to a risk score $r_t \in [0, 1]$ .
Factor Calibration: The visual amplification factor $m_t$ for step $t$ is calculated as:
$m_t = 1 + (m_{vis}^{max} - 1) \cdot r_{t-1}$
If the previous step had low risk, $m_t \approx 1$ (no intervention). If the previous step had high risk, $m_t$ increases, applying stronger visual amplification to the current step.
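
As a worked illustration of the calibration rule, the sketch below maps a risk score to an amplification factor. The normalization step is an assumption (min-max scaling with clipping, using hypothetical bounds `low` and `high`), since the summary does not say how VGE is normalized, and `m_vis_max = 2.0` is an arbitrary example value.

```python
def normalize_risk(vge, low=0.0, high=1.0):
    """Map a VGE value to r in [0, 1] by min-max scaling and clipping.

    Assumption: the normalization scheme and the bounds low/high are
    hypothetical; the summary only says r_t lies in [0, 1].
    """
    scaled = (vge - low) / (high - low)
    return min(1.0, max(0.0, scaled))

def amplification_factor(prev_risk, m_vis_max):
    """m_t = 1 + (m_vis_max - 1) * r_{t-1}."""
    return 1.0 + (m_vis_max - 1.0) * prev_risk

m_vis_max = 2.0  # illustrative value only
factors = [amplification_factor(r, m_vis_max) for r in (0.0, 0.5, 1.0)]
# factors == [1.0, 1.5, 2.0]
```

The factor interpolates linearly between 1 (no intervention) and $m_{vis}^{max}$, so a single hyperparameter bounds how aggressive the amplification can become.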

2.3 Text Attention Suppression

To address residual hallucinations that persist even under strong visual amplification (often due to language priors), SAVAA incorporates a complementary Text Attention Suppression mechanism.

This mechanism suppresses the pre-softmax attention scores assigned to input text tokens by a fixed factor $m_{txt}^{max}$ .
Unlike the adaptive visual factor, this suppression is static and lightweight, designed to reduce the relative influence of language priors without requiring step-wise risk estimation.
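
To make the effect on attention concrete, here is a toy sketch of modulating pre-softmax scores for a single query position. The exact operator is not given in this summary, so the sketch assumes that "amplify by $m$" means adding $\log m$ to a score (which multiplies that token's unnormalized attention weight by $m$) and that suppression subtracts $\log m_{txt}^{max}$. The token layout, scores, and factor values are made up for illustration.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def modulate_scores(scores, is_visual, is_text, m_vis, m_txt):
    """Rescale pre-softmax scores for one query position.

    Assumption: amplification adds log(m_vis) to visual-token scores and
    suppression subtracts log(m_txt) from input-text scores. The paper's
    exact operator may differ.
    """
    out = []
    for s, vis, txt in zip(scores, is_visual, is_text):
        if vis:
            s += math.log(m_vis)
        elif txt:
            s -= math.log(m_txt)
        out.append(s)
    return out

# Toy layout: 2 visual tokens, 2 input-text tokens, 1 generated token.
scores = [0.2, 0.1, 0.5, 0.4, 0.3]
is_visual = [True, True, False, False, False]
is_text = [False, False, True, True, False]

before = softmax(scores)
after = softmax(modulate_scores(scores, is_visual, is_text, m_vis=1.5, m_txt=1.2))
visual_mass_before, visual_mass_after = sum(before[:2]), sum(after[:2])
text_mass_before, text_mass_after = sum(before[2:4]), sum(after[2:4])
```

Under these assumptions the attention mass on visual tokens grows and the mass on input-text tokens shrinks, while the weights still sum to one after the softmax.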

2.4 Overall Workflow

Prefilling: Compute the visual grounding vector $G$ once.
Generation Loop:
- Compute the VAA factor $m_t$ from the risk $r_{t-1}$ of the previous step; the text suppression factor $m_{txt}^{max}$ stays fixed.
- Modulate pre-softmax attention scores: amplify visual tokens by $m_t$ and suppress text tokens by $1/m_{txt}^{max}$ .
- Generate the next token.
- Compute new risk $r_t$ using VGE for the next iteration.
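
The workflow above can be sketched as a one-step-lagged loop. This is a skeleton under stated assumptions, not the authors' implementation: `decode_step` is a hypothetical callable standing in for one forward pass with modulated attention, the initial risk is assumed to be 0, VGE is clipped to $[0, 1]$ as the risk, and all numeric values are illustrative.

```python
import math

def vge_risk(probs, grounding_score, alpha=0.5):
    """Risk r_t in [0, 1]: VGE with entropy normalized by log(vocab size).

    Assumption: normalization and clipping are illustrative choices.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    h_bar = entropy / math.log(len(probs))
    vge = alpha * h_bar + (1.0 - alpha) * (1.0 - grounding_score)
    return min(1.0, max(0.0, vge))

def generate(decode_step, grounding, max_steps, m_vis_max=2.0, m_txt_max=1.2):
    """Run the one-step-lagged calibration loop.

    decode_step(m_vis, m_txt) is hypothetical: it performs one decoding step
    with the given factors and returns (token_id, next_token_probs).
    grounding maps token_id -> G value in [0, 1], computed once at prefill.
    """
    tokens, factors = [], []
    prev_risk = 0.0  # assumption: no amplification before the first estimate
    for _ in range(max_steps):
        m_vis = 1.0 + (m_vis_max - 1.0) * prev_risk  # uses r_{t-1}
        token, probs = decode_step(m_vis, m_txt_max)  # text factor is static
        tokens.append(token)
        factors.append(m_vis)
        prev_risk = vge_risk(probs, grounding[token])  # becomes r_t
    return tokens, factors

# Toy stand-in for a model: a fixed script of (token, distribution) pairs.
script = iter([
    (0, [0.97, 0.01, 0.01, 0.01]),  # confident, well grounded
    (1, [0.01, 0.97, 0.01, 0.01]),  # confident, weakly grounded
    (2, [0.25, 0.25, 0.25, 0.25]),  # maximally uncertain
])
toy_grounding = {0: 0.9, 1: 0.1, 2: 0.5}
tokens, factors = generate(lambda m_vis, m_txt: next(script), toy_grounding, max_steps=3)
```

In the toy run the first step uses a factor of 1, and the factor applied at the third step is larger than at the second because the second token was confident but weakly grounded.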

3. Key Contributions

Identification of Dual Failure Pattern: The paper demonstrates that fixed amplification factors in existing VAA methods lead to both under-amplification (leaving hallucinations unresolved) and over-amplification (introducing new hallucinations), varying across generation steps and model architectures.
SAVAA Framework: A new VAA framework that estimates hallucination risk step-wise using Visual Grounding Entropy (VGE) and adaptively calibrates the amplification factor.
Complementary Text Suppression: The introduction of a lightweight text attention suppression mechanism to address residual language-prior dominance, working in tandem with visual amplification.
Empirical Validation: Comprehensive evaluation across three diverse LVLMs (LLaVA-NeXT-7B, Qwen3-VL-8B, InternVL3.5-8B) and multiple benchmarks (CHAIR, SHR, AMBER, POPE), showing consistent improvements over state-of-the-art baselines.

4. Experimental Results

SAVAA was evaluated on standard hallucination benchmarks:

CHAIR: SAVAA achieved the best CHAIRs (sentence-level) and CHAIRi (instance-level) scores across all three models (e.g., lowering CHAIRs by 7.00 points on InternVL3.5-8B) without sacrificing generation quality (F1 scores remained comparable to vanilla models).
SHR: SAVAA outperformed baselines on all four metrics (HSR, HWR, HSPI, HWPI), demonstrating effectiveness in reducing both sentence-level and word-level hallucinations.
AMBER: SAVAA achieved the lowest hallucination rates (Hal) while maintaining high content coverage (Cover), indicating a favorable trade-off between mitigation and semantic completeness.
POPE: On discriminative tasks, SAVAA preserved original model utility, achieving accuracy and F1 scores comparable to vanilla baselines, though gains were modest compared to generative tasks.
Efficiency: SAVAA incurs only marginal inference-time overhead compared to vanilla models, as VGE requires no additional forward passes (grounding vector is computed once) and the calibration logic is lightweight.

5. Significance and Claims

The paper claims that SAVAA represents a shift from static, heuristic-based visual amplification to risk-guided, step-wise attention calibration. By explicitly modeling the varying need for visual grounding across different generation steps, SAVAA addresses the limitations of fixed-factor methods.

The authors position SAVAA as a practical, training-free solution that can be deployed at inference time to improve the factual reliability of LVLMs in open-ended generation tasks. They acknowledge limitations, noting that the method relies on model-specific hyperparameters and yields more modest gains on single-token discriminative tasks compared to open-ended generation. The work suggests that future directions could involve more adaptive calibration strategies and extending the approach to closed-form prediction settings.
