Imagine you have a very smart, well-read friend who loves looking at pictures and telling you stories about them. This friend is an AI (specifically, a Large Vision-Language Model). They are incredibly talented, but they have a quirky habit: they sometimes lie.
When you show them a picture of a dog, they might confidently say, "That's a dog wearing a red hat!" even though the dog is bare-headed. In the AI world, this is called a hallucination. It's like the AI is daydreaming while it's supposed to be describing reality.
The paper introduces a new system called Kestrel (named after a sharp-eyed bird of prey) to fix this problem. Here is how it works, explained simply:
The Problem: The "Confident Liar"
Most current methods to stop AI from lying are like trying to teach the AI to be honest by making it study harder (retraining). But for massive AI models, this is like trying to rebuild a skyscraper just to fix a cracked window—it's too expensive and slow.
Other "free" methods try to tweak the AI's brain while it's talking, but they often just guess or make the AI overthink, leading to new mistakes.
The Solution: Kestrel's "Detective Team"
Kestrel doesn't retrain the AI. Instead, it acts like a fact-checking editor or a detective that works alongside the AI. It uses a three-step process to catch lies before they become the final answer.
1. Breaking the Story into Clues (Decomposition)
When the AI gives an answer, Kestrel doesn't just accept it. It breaks the answer down into small, checkable facts.
- AI says: "There are three red apples on the table."
- Kestrel breaks it down:
  - Claim 1: Are there apples?
  - Claim 2: Are there three of them?
  - Claim 3: Are they red?
  - Claim 4: Are they on the table?
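In code, the decomposition step above might look something like this. This is an illustrative sketch, not the paper's actual API: the function name `decompose` and the hard-coded claims for the apple example are assumptions (a real system would use a language model to extract atomic claims from arbitrary text).

```python
# Illustrative sketch (hypothetical, not Kestrel's real interface):
# break one answer into small, independently checkable claims.

def decompose(answer: str) -> list[str]:
    """Toy decomposition for the apple example; a real system would
    extract atomic claims from arbitrary text with an LLM."""
    if answer == "There are three red apples on the table.":
        return [
            "There are apples.",
            "There are three of them.",
            "They are red.",
            "They are on the table.",
        ]
    return [answer]  # fallback: treat the whole answer as one claim

claims = decompose("There are three red apples on the table.")
print(len(claims))  # prints 4
```

The point of this step is that each small claim can be checked against the image on its own, instead of accepting or rejecting the whole sentence at once.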
2. Sending in the "Grounding Agent" (The Detective)
This is the magic part. Kestrel sends a specialized tool (called a Grounding Agent, based on a technology called SAM3) to look at the picture specifically for those clues.
- Think of this agent as a forensic photographer. It doesn't just "look" at the image; it zooms in, draws boxes around objects, and takes close-up photos of specific areas.
- It gathers hard evidence: "I see a box around an object. It looks like an apple. I see two of them, not three. The color is green, not red."
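The "hard evidence" the detective hands back can be pictured as a small structured record per claim. The field names below are purely illustrative assumptions, not the paper's data format; they just show the kind of information (supported or not, how clear the evidence is, and a human-readable trace) that the next step needs.

```python
from dataclasses import dataclass

# Hypothetical evidence record a grounding agent (e.g. one built on a
# segmentation model such as SAM3) might return for each claim.
# All names here are illustrative assumptions.

@dataclass
class Evidence:
    claim: str        # the small claim being checked
    supported: bool   # does the cropped/boxed region support it?
    confidence: float # how clear the visual evidence is (0..1)
    note: str         # human-readable trace of what the crop showed

evidence = [
    Evidence("There are three of them.", False, 0.92,
             "Detected 2 apple boxes, not 3."),
    Evidence("They are red.", False, 0.88,
             "Zoomed crops show green apples."),
]
```

Keeping a `note` per claim is also what makes the "paper trail" mentioned later possible: every correction points back to a concrete piece of visual evidence.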
3. The "Evidence-Gated" Debate (Verification & Refinement)
Now, Kestrel brings the AI and the Detective together for a debate.
- The AI says, "I'm sure it's three red apples!"
- The Detective says, "Here is a photo showing only two green apples."
- The Rule: Don't change the answer unless the evidence is overwhelming.
  - If the evidence is weak or blurry, Kestrel trusts the AI's original guess (to avoid over-correcting).
  - If the evidence is clear (like a zoomed-in photo showing green apples), Kestrel forces the AI to change its story.
This happens in rounds. If the AI is still unsure, Kestrel sends the Detective back for a second look, gathering more proof until the answer is rock-solid.
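The evidence-gated rule can be sketched as a tiny decision function. This is a minimal sketch under the assumption that each claim arrives with a (supported, confidence) pair as above; the threshold value and function name are illustrative, not taken from the paper.

```python
# Minimal sketch of an evidence-gated decision rule (illustrative).
THRESHOLD = 0.85  # assumed cutoff: only very clear evidence triggers a change

def gate(original: str, corrected: str,
         supported: bool, confidence: float) -> str:
    """Keep the model's original claim unless clear evidence refutes it."""
    if not supported and confidence >= THRESHOLD:
        return corrected  # strong contradicting evidence: revise the claim
    return original       # weak or ambiguous evidence: stay conservative

# Weak evidence -> keep the original; strong evidence -> accept the fix.
print(gate("three red apples", "two green apples", False, 0.40))  # prints three red apples
print(gate("three red apples", "two green apples", False, 0.95))  # prints two green apples
```

The asymmetry is the whole trick: a correction has to clear a high bar, so the system rarely turns a right answer into a wrong one.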
Why is this better? (The "Conservative" Approach)
Imagine a student taking a test.
- Old methods are like a student who panics and changes every answer they aren't 100% sure about, often turning right answers into wrong ones.
- Kestrel is like a student who only changes an answer if they find a smoking gun (irrefutable proof). If the proof isn't there, they stick with their original thought. This prevents the AI from "over-correcting" and making new mistakes.
The Results
The paper tested Kestrel on many difficult picture quizzes.
- It made the AI significantly more accurate (like going from a B+ to an A+).
- It worked with different types of AI models (it's "backbone-agnostic," meaning it can be bolted onto different underlying models without retraining them).
- Most importantly, it provides a paper trail. You can see exactly why the AI changed its mind: "I changed my answer because the zoomed-in photo showed the object was blue, not red."
In a Nutshell
Kestrel is a training-free system that stops AI from hallucinating by acting as a fact-checking editor. It breaks answers into small claims, sends a "detective" to gather visual proof, and only allows the AI to change its story if the evidence is undeniable. It's like giving the AI a pair of glasses and a magnifying glass so it can see the truth clearly before it speaks.