Seeing Clearly without Training: Mitigating Hallucinations in Multimodal LLMs for Remote Sensing

This paper introduces RSHBench, a benchmark for diagnosing hallucinations in remote sensing visual question-answering, and proposes RADAR, a training-free inference method that leverages intrinsic attention to improve grounding and reduce hallucinations in multimodal large language models.

Yi Liu, Jing Zhang, Di Wang, Xiaoyu Tian, Haonan Guo, Bo Du

Published 2026-03-04

Imagine you have a super-smart robot assistant who loves looking at satellite photos of the Earth. You ask it, "How many red boats are in that tiny harbor in the bottom right corner?"

Sometimes, this robot gets it right. But often, it starts hallucinating. It might confidently say, "There are five blue boats," even though there are none, or it might say, "I see a red boat," when it's actually looking at a red car on the shore.

This paper is about fixing that robot so it can "see clearly" without needing to go back to school for years of training.

Here is the breakdown of the problem and the solution, using some everyday analogies.

The Problem: The "Distracted Detective" and the "Tiny Ant"

The authors found that these AI robots fail in remote sensing (satellite photos) for two main reasons:

  1. The "Distracted Detective" (Can't Find):
    Imagine you are looking for a specific ant in a massive, crowded stadium from a helicopter. If you just look at the whole stadium at once, your eyes get overwhelmed. You might start staring at the concession stand or the crowd in the stands, completely missing the tiny ant you were looking for.

    • In AI terms: The model gets distracted by the huge background and fails to "ground" (focus) on the specific small area the question is asking about.
  2. The "Tiny Ant" (Can't See Clearly):
    Imagine you finally find the right spot in the stadium, but you are still too high up. The ant looks like a blurry speck. You guess, "That's probably a red ant," but it's actually a green one.

    • In AI terms: The model focuses on the right area, but the image is too zoomed out to see fine details (like the color of a ship or the number of cars), so it guesses based on what it thinks it should see, rather than what is actually there.

The Solution: A New Benchmark (RSHBench)

Before fixing the robot, the authors needed a better way to test it. Existing tests just asked, "Did you get the answer right?"

The authors created RSHBench, which is like a medical exam for the robot's brain. Instead of just checking the final answer, they force the robot to show its "work" (its reasoning).

  • If the robot says, "I see a red boat," the examiners check: Did the robot actually look at the boat, or did it just guess?
  • This helps them distinguish between a Factual Hallucination (making up facts) and a Logical Hallucination (using bad math or reasoning).

The Fix: RADAR (The "Smart Zoom" Tool)

The authors propose a clever, training-free method called RADAR. "Training-free" is the key here. Usually, to fix a robot, you have to feed it thousands of new photos and re-teach it (which takes weeks and supercomputers). RADAR doesn't do that. It just changes how the robot looks at the picture during the test.

Think of RADAR as a two-step "Smart Zoom" process:

Step 1: The "Where" Question (Coarse Localization)

Instead of asking the robot, "How many red boats are there?", RADAR first asks it: "Where in this giant picture should I look to find the boats?"

  • The robot uses its internal "attention" (like a flashlight) to scan the image.
  • It ignores the distracting background (the stadium seats) and points the flashlight directly at the harbor.
  • The Magic: It compares the attention the robot pays when answering the "Where" question against the attention it pays to a generic "What's in the whole picture?" prompt. Whatever lights up only for the specific question is the signal; everything else is background noise to be filtered out.
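This attention-contrast idea can be sketched in a few lines of plain Python. This is an illustrative toy, not the paper's actual implementation: we assume we already have a 2-D grid of per-patch attention scores for the question and for a generic prompt, subtract one from the other, and keep the patches whose contrast is close to the peak.

```python
def localize(question_attn, generic_attn, rel_thresh=0.5):
    """Toy attention-contrast localization (illustrative, not RADAR's code).

    question_attn, generic_attn: 2-D lists of per-patch attention scores.
    Returns a (row0, row1, col0, col1) box over every patch whose
    question-minus-generic contrast is at least rel_thresh * the peak.
    """
    h, w = len(question_attn), len(question_attn[0])
    # Patches the question attends to *more* than a generic prompt does.
    contrast = [[max(question_attn[r][c] - generic_attn[r][c], 0.0)
                 for c in range(w)] for r in range(h)]
    peak = max(max(row) for row in contrast)
    hits = [(r, c) for r in range(h) for c in range(w)
            if contrast[r][c] >= rel_thresh * peak]
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return min(rows), max(rows) + 1, min(cols), max(cols) + 1

# Toy 8x8 patch grid: the question's attention spikes in the bottom-right,
# where "the tiny harbor" sits; the generic prompt's attention is flat.
generic = [[0.1] * 8 for _ in range(8)]
question = [row[:] for row in generic]
for r in (6, 7):
    for c in (6, 7):
        question[r][c] = 0.9
print(localize(question, generic))  # → (6, 8, 6, 8): the bottom-right corner
```

The subtraction is what makes the flashlight analogy work: the generic prompt tells you what the model stares at no matter what (the stadium seats), and removing it leaves only the question-specific spotlight.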

Step 2: The "What" Question (Fine-Grained Reasoning)

Once the robot has zoomed in on the harbor, RADAR asks: "Now that we are looking at the harbor, what color are the boats?"

  • Because the robot is now looking at a cropped, zoomed-in version of just the harbor, the "ants" (boats) look much bigger and clearer.
  • It can now count them accurately and see their colors without guessing.
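Putting both steps together, the control flow of the "Smart Zoom" can be sketched as below. Everything here is a stand-in: `StubModel` and `StubImage` are placeholders for a real multimodal model and image object, and the patch-to-pixel arithmetic is an assumption for illustration. The point is the order of operations: localize first, crop, then re-ask the original question on the zoomed view.

```python
def radar_answer(model, image, question, patch_size=32):
    """Two-step 'smart zoom' (illustrative sketch, not the paper's API).

    Step 1: ask 'Where?' -- get a coarse patch box from attention contrast.
    Step 2: ask 'What?'  -- crop to that box and re-ask the question.
    """
    box = model.localize(image, question)            # (row0, row1, col0, col1) in patches
    r0, r1, c0, c1 = (v * patch_size for v in box)   # patch indices -> pixel coords
    crop = image.crop((c0, r0, c1, r1))              # PIL-style (left, top, right, bottom)
    return model.answer(crop, question)

class StubImage:
    """Minimal stand-in for an image object with a PIL-style crop()."""
    def __init__(self, size):
        self.size = size                             # (width, height)
    def crop(self, box):
        left, top, right, bottom = box
        return StubImage((right - left, bottom - top))

class StubModel:
    """Fake model: pretends its attention points at the bottom-right corner."""
    def localize(self, image, question):
        return (6, 8, 6, 8)                          # hypothetical harbor patches
    def answer(self, crop, question):
        return f"answered on a {crop.size[0]}x{crop.size[1]} crop"

print(radar_answer(StubModel(), StubImage((256, 256)), "How many red boats?"))
# → answered on a 64x64 crop
```

The second call sees a 64x64 crop instead of the full 256x256 scene, which is exactly the "ants look bigger" effect: the same model, the same question, but far more pixels per boat.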

Why is this a big deal?

  1. No Re-Training: You don't need to re-teach the robot. You just give it this new "thinking strategy" (the two-step zoom) when you ask it a question.
  2. It Works Everywhere: The authors tested this on many different types of robots (AI models), and it consistently reduced mistakes by about 10% and improved accuracy by 2-4%.
  3. It Stops the "Snowball Effect": Usually, if a robot gets the location wrong, it makes up a story to fit that wrong location. By forcing the robot to find the right spot first, the rest of the story becomes true.

The Bottom Line

The paper teaches us that to make AI smarter at looking at satellite photos, we don't necessarily need bigger brains or more data. Sometimes, we just need to teach the AI how to look.

By giving the AI a "Where to look" step before asking "What do you see?", we stop it from daydreaming about the whole world and help it focus on the tiny, important details right in front of its eyes. It's like telling a distracted student: "Don't just read the whole textbook; find the specific paragraph first, then answer the question."