Self-Aug: Query and Entropy Adaptive Decoding for Large Vision-Language Models

This paper introduces Self-Aug, a training-free decoding strategy for Large Vision-Language Models that combines query-dependent self-augmentation prompting and entropy-adaptive thresholding to significantly reduce hallucinations and enhance factual consistency without requiring additional model training.

Eun Woo Im, Muhammad Kashif Ali, Vivek Gupta

Published 2026-03-04

Imagine you have a very smart, well-read friend who loves looking at pictures and describing them. This friend is a Large Vision-Language Model (LVLM). They are incredibly talented, but they have a quirky habit: sometimes, when they aren't 100% sure about a detail in a photo, they confidently make things up. This is called "hallucinating."

For example, if you show them a picture of a cat and ask, "Is the cat wearing a hat?", they might say, "Yes, it's wearing a red beret," even though there is no hat in the picture. They are just guessing based on patterns they've seen before, not what's actually there.

This paper introduces a new method called Self-Aug to fix this. Think of it as a "Reality Check" system for your AI friend. Here is how it works, broken down into two simple steps using everyday analogies.

The Problem: The "Amateur" vs. The "Expert"

To stop the AI from lying, previous methods tried a trick called Contrastive Decoding. Imagine you have two people looking at the same photo:

  1. The Expert: Your smart AI friend.
  2. The Amateur: A slightly confused version of that same friend who is looking at a blurry, noisy, or distorted version of the photo.

The idea is: "If the Expert says 'It's a cat' but the Amateur (looking at a blurry photo) says 'It's a dog,' we should trust the Expert more."

But there was a flaw: The old methods just randomly blurred the photo (like adding static to a TV) without thinking about what you asked. If you asked, "What color is the car?", randomly blurring the whole picture wasn't very helpful. They needed a smarter way to distort the image based on your specific question.
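At the token level, contrastive decoding boils down to comparing the Expert's and the Amateur's next-word distributions. Here is a minimal sketch of that comparison; the weighting rule and the `alpha` value are illustrative stand-ins, not the paper's exact formula:

```python
import numpy as np

def contrastive_logits(expert_logits, amateur_logits, alpha=1.0):
    """Boost tokens the expert favors but the amateur (who sees the
    distorted image) does not; hallucinated tokens tend to score high
    for both, so the subtraction suppresses them."""
    expert = np.asarray(expert_logits, dtype=float)
    amateur = np.asarray(amateur_logits, dtype=float)
    return (1 + alpha) * expert - alpha * amateur

# Toy 4-token vocabulary: ["cat", "dog", "hat", "car"]
expert = np.array([3.0, 1.0, 0.5, 0.2])   # expert strongly prefers "cat"
amateur = np.array([1.5, 1.4, 0.5, 0.2])  # blurry view: "cat" vs "dog" nearly tied
adjusted = contrastive_logits(expert, amateur)
print(adjusted.argmax())  # 0 -> "cat" wins, with an even larger margin
```

The intuition: a token the model produces regardless of what the image shows (a guess from memory) keeps a similar score in both views, so subtracting the Amateur's score cancels it out.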


Solution Part 1: The "Skeptical Detective" (Self-Augmentation)

This is the first big innovation of the paper. Instead of randomly blurring the image, the AI is asked to act like a Skeptical Detective.

The Analogy:
Imagine you are a detective trying to solve a crime. You have a witness (the AI) who says, "The suspect was wearing a blue hat."
To test if the witness is reliable, you don't just ask them to repeat it. You ask them to imagine a scenario where the evidence is most likely to be wrong.

  • If the question is about color, the detective says, "Let's invert the colors of the photo. If the hat was blue, it would look orange now. If the witness still says 'blue,' they are answering from memory, not from the photo."
  • If the question is about left vs. right, the detective says, "Let's flip the photo horizontally. If the suspect was on the left, they are now on the right."

How Self-Aug works:
Before answering your question, the AI looks at the image and the question, then asks itself: "What is the one thing I can do to this picture that would make it hardest for me to answer this specific question correctly?"

  • If you ask about counting people, it might "mask" (cover up) parts of the image to see if the count changes.
  • If you ask about text, it might add "noise" to make the letters unreadable.

By choosing the perfect distortion for the specific question, the AI creates a much stronger "Reality Check." If the AI still gives the same answer after this targeted distortion, it's probably telling the truth.
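In the actual method, the LVLM itself is prompted to choose the distortion. As a rough illustration of the idea, here is a hypothetical keyword heuristic standing in for that choice; the function name, the keyword rules, and the distortion labels are all invented for this sketch:

```python
def pick_distortion(question: str) -> str:
    """Hypothetical stand-in for Self-Aug's query-dependent choice:
    map the question type to the distortion most likely to break a
    memorized (rather than observed) answer."""
    q = question.lower()
    if "color" in q or "colour" in q:
        return "invert_colors"      # a color answer from memory won't flip
    if "left" in q or "right" in q:
        return "horizontal_flip"    # spatial answers should mirror
    if "how many" in q or "count" in q:
        return "mask_regions"       # hide areas to stress the count
    if "text" in q or "written" in q:
        return "add_noise"          # degrade legibility of any text
    return "gaussian_blur"          # generic fallback distortion

print(pick_distortion("What color is the car?"))  # invert_colors
```

The real system replaces this hand-written lookup with the model's own reasoning about the question, which is what makes the distortion "query-dependent" rather than random.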

Solution Part 2: The "Confidence Filter" (Entropy Adaptive Truncation)

The second innovation is about how the AI picks its final words.

The Analogy:
Imagine the AI is a chef preparing a soup. At every step, the chef has a list of possible ingredients to add next (e.g., salt, pepper, sugar, or "unicorn horn").

  • Old Method: The chef just cuts off the bottom 50% of the list based on a fixed rule. "I'll never add anything that isn't in the top 50%." This is risky. If the chef is very unsure (low confidence), they might accidentally throw away the only correct ingredient because it wasn't in the top half.
  • New Method (SAT): The chef checks their own confidence level first.
    • High Confidence (Low Entropy): The chef is sure. "I know I need salt." The list of options is very short. The chef can be strict and only pick from the top few options.
    • Low Confidence (High Entropy): The chef is confused. "Is it salt? Sugar? Maybe cumin?" The list of options is long and messy. The chef realizes, "I can't be too strict here, or I'll miss the right answer." So, they keep a wider list of options to be safe.

This new method, called Sparsity Adaptive Truncation (SAT), dynamically adjusts how picky the AI is based on how confused it feels at that exact moment. It prevents the AI from throwing away good answers when it's unsure, and prevents it from picking random nonsense when it's confident.
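The chef analogy can be sketched numerically: measure the entropy of the next-token distribution, and let the size of the candidate shortlist grow with it. The linear scaling rule below is a plausible stand-in for the paper's actual truncation criterion, not its exact formula:

```python
import numpy as np

def adaptive_truncate(logits, min_keep=1, max_keep=None):
    """Keep few candidates when confident (low entropy), many when
    confused (high entropy). The linear entropy-to-size rule here is
    a hypothetical sketch of entropy-adaptive truncation."""
    logits = np.asarray(logits, dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    vocab = len(probs)
    if max_keep is None:
        max_keep = vocab
    # Normalized entropy in [0, 1]: 0 = fully certain, 1 = uniform guess.
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    h = entropy / np.log(vocab)
    keep = int(round(min_keep + h * (max_keep - min_keep)))
    top = np.argsort(probs)[::-1][:keep]
    return set(top.tolist())

confident = [8.0, 0.1, 0.1, 0.1, 0.1]   # "I know I need salt."
confused  = [1.0, 0.9, 1.1, 1.0, 0.95]  # "Salt? Sugar? Cumin?"
print(len(adaptive_truncate(confident)))  # 1  -> strict shortlist
print(len(adaptive_truncate(confused)))   # 5  -> keep every option open
```

A fixed cutoff (the "old method") would keep the same number of candidates in both cases; the adaptive version is strict only when strictness is safe.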


The Result: A More Honest AI

When the researchers tested this new Self-Aug system on five different AI models and seven different tests (like identifying objects, solving math problems in images, or describing scenes), the results were impressive:

  1. Fewer Lies: The AI hallucinated much less. It stopped confidently inventing facts.
  2. Smarter Distortions: Instead of randomly messing up the image, it knew exactly how to "trick" itself to find the truth.
  3. No Extra Training: The best part? You don't need to re-teach the AI. You just change how it "thinks" while it answers. It's like giving your friend a new set of glasses to wear while they look at the world, rather than sending them back to school for a year.

In Summary:
Self-Aug is like giving your AI a customized magnifying glass and a confidence meter. It looks at your specific question, figures out the best way to "break" the image to test its own knowledge, and then carefully filters its answers based on how sure it feels. The result is an AI that is much more reliable and less likely to make things up.