GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery

Imagine you are trying to find a specific, tiny ant on a massive, high-resolution photograph of a football field.

The Problem: The "Zoom-Obsessed" Robot
Current AI models designed to look at these huge satellite photos are like a robot with a magnifying glass that is stuck in "ON" mode. No matter what question you ask it, the robot immediately zooms in, even if the answer is right there in the wide shot.

The paper calls this "Tool Usage Homogenization." It's like a student who, given a math problem, reaches for a calculator at every single step, even for something as simple as "2 + 2." They end up wasting time, getting confused by too much detail, and missing the big picture. In the world of satellite imagery, this means the AI wastes computing power zooming in on empty terrain when it should be looking at a whole city, or it zooms in once and stops when it actually needs to zoom in three times to count tiny cars.

The Solution: GeoEyes (The Smart Detective)
The researchers built a new AI called GeoEyes. Think of GeoEyes not as a robot with a stuck magnifying glass, but as a smart detective who knows exactly when to use a magnifying glass and when to just look with their naked eyes.

Here is how they trained this detective using a two-step "schooling" process:

Step 1: The "Textbook" Phase (Cold-Start SFT)

Before letting the AI learn by trial and error, the researchers gave it a massive textbook called UHR-CoZ.

What's in the book? It contains thousands of examples of how to solve problems. Some examples show the detective solving a problem without zooming at all. Others show them zooming in once. Some show them zooming in multiple times, step-by-step, like peeling an onion to get to the core.
The Goal: This teaches the AI the concept of "on-demand" focusing. It learns that sometimes you need a microscope, and sometimes a wide-angle lens is enough.

Step 2: The "Video Game" Phase (AdaZoom-GRPO)

Once the AI knows the basics, they put it in a video game-like training environment using a special reward system.

The Rules of the Game:
1. Don't Zoom if you don't have to: If the AI zooms in unnecessarily, it loses points (this stops the "stuck magnifying glass" habit).
2. Zoom if you need to: If the AI is stuck and needs to see a tiny detail to answer correctly, it gets a bonus for zooming in.
3. The "Ladder" Reward: The AI gets extra points for zooming in a logical, step-by-step way (like climbing a ladder), rather than jumping randomly around the image.
4. The "Honesty" Check: If the AI guesses an answer about a tiny object without actually zooming in to look, it gets penalized. It must prove it "saw" the evidence.

The Result: A Master Detective
After this training, GeoEyes became a master at Ultra-High-Resolution (UHR) remote sensing.

It stopped wasting time zooming in on empty fields.
It started zooming in deeply when it needed to count tiny vehicles or spot a specific type of building.
The Score: On a tough test called XLRS-Bench, GeoEyes scored 54.23%. This is impressive because it beat much larger, more powerful AI models (some with 235 billion parameters) while using a much smaller, efficient model (7 billion parameters).

In a Nutshell
The paper solves the problem of AI being "too eager" to zoom in. By teaching the AI to be selective (knowing when not to zoom) and persistent (knowing when to zoom multiple times), they created a system that can actually understand the tiny details hidden in massive satellite images, just like a human expert would.

1. Problem Statement

The paper addresses a critical bottleneck in applying Multimodal Large Language Models (MLLMs) to Ultra-High-Resolution (UHR) Remote Sensing (RS) tasks. While the "thinking-with-images" paradigm (using zoom tools for active visual exploration) has shown promise, existing models suffer from a failure mode termed "Tool Usage Homogenization."

The Phenomenon: Models trained on UHR datasets tend to collapse into a uniform, task-agnostic pattern where they invoke the zoom tool for every query, regardless of necessity.
Root Causes:
1. Task Heterogeneity: UHR tasks vary wildly in difficulty. Some require only a global view (no zoom needed), while others require deep, multi-step focusing. A uniform strategy leads to over-triggering (wasting compute) on simple tasks and under-exploration on complex ones.
2. Low Effective Evidence Density: In massive UHR images (e.g., 8k×8k), relevant cues are tiny and sparse. Standard Reinforcement Learning (RL) relying solely on final answer correctness fails to guide the model through the necessary multi-step search, causing it to get stuck in local optima (e.g., a single, ineffective zoom).
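To make the evidence-density point concrete, here is a small arithmetic sketch. The specific numbers are assumptions chosen for illustration and do not come from the paper: "8k" is taken as 8192 pixels per side, the target is a hypothetical 20×20 pixel object, and the model is assumed to see a 1024×1024 downsampled global view.

```python
def target_footprint(image_side: int, target_side: int, model_side: int) -> dict:
    """Describe how small a square target is, before and after downsampling.

    All three arguments are illustrative assumptions, not values from the paper.
    """
    # Fraction of the full image area covered by the target.
    area_fraction = (target_side ** 2) / (image_side ** 2)
    # How much each side shrinks when the image is resized for the model.
    downsample_factor = image_side / model_side
    return {
        "area_fraction": area_fraction,
        "downsample_factor": downsample_factor,
        "target_side_after_downsample": target_side / downsample_factor,
    }


stats = target_footprint(image_side=8192, target_side=20, model_side=1024)
print(stats)
```

Under these assumptions the target covers well under one hundred-thousandth of the image and shrinks to a few pixels per side in the global view, which is why a single coarse look, or a single poorly placed zoom, rarely surfaces the evidence.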

2. Methodology: GeoEyes Framework

The authors propose GeoEyes, a staged training framework designed to learn on-demand zooming with proper stopping behavior. The approach consists of two main stages:

A. Cold-Start Supervised Fine-Tuning (SFT)

To initialize the model with diverse tool-use behaviors, the authors constructed a new dataset: UHR Chain-of-Zoom (UHR-CoZ).

Source: Derived from HighRS-VQA but augmented with agent-orchestrated reasoning trajectories.
Content: An interleaved image-text Chain-of-Thought (CoT) dataset containing 25,467 samples.
Diversity: It explicitly covers three reasoning regimes:
1. No-tool: Global tasks solvable without zooming.
2. Single-call: Medium-scale targets requiring one zoom.
3. Multi-step progressive: Tiny objects requiring iterative, deep focusing.
Goal: To teach the model when to abstain from using tools and how to execute multi-round focusing, preventing the initial collapse into uniform behavior.
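The three regimes above can be pictured with a minimal data-structure sketch. The schema below is hypothetical: the class names, field names, and box format are assumptions for illustration and are not the actual UHR-CoZ format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class ZoomStep:
    """One tool call in a trajectory: the reasoning text and the region zoomed to."""
    thought: str
    box: Box


@dataclass
class CoZSample:
    """A hypothetical interleaved image-text reasoning sample."""
    question: str
    answer: str
    steps: List[ZoomStep] = field(default_factory=list)


def regime(sample: CoZSample) -> str:
    """Classify a sample by how many zoom calls its trajectory contains."""
    n = len(sample.steps)
    if n == 0:
        return "no-tool"
    if n == 1:
        return "single-call"
    return "multi-step"


global_q = CoZSample("Is the scene mostly urban or rural?", "urban")
count_q = CoZSample(
    "How many cars are parked near the pier?",
    "7",
    steps=[
        ZoomStep("Locate the pier.", (4000, 4000, 6000, 6000)),
        ZoomStep("Focus on the parking area.", (4800, 5200, 5200, 5600)),
    ],
)
print(regime(global_q), regime(count_q))
```

Keeping the zoom steps as an explicit list makes the "abstain" case a first-class example (an empty list) rather than a missing one, which is the behavior the cold-start stage is meant to teach.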

B. Agentic Reinforcement Learning: AdaZoom-GRPO

Following SFT, the model is optimized with AdaZoom-GRPO, a novel RL method built on Group Relative Policy Optimization (GRPO). The core innovation is a redesigned reward function ($R$) that targets UHR-specific challenges:

Adaptive Efficiency Reward ($R_{tool}$):
- Goal: Address Task Heterogeneity.
- Mechanism: Penalizes unnecessary tool usage on easy tasks (those the base model can already solve) while leaving room for the exploration steps that hard tasks require. It uses a dynamic step allowance based on task category and instance difficulty ($P_\alpha$).
Chain-of-Focus Reward ($R_{cof}$):
- Goal: Address Low Effective Evidence Density.
- Mechanism: Enforces a "Coarse-to-Fine" geometric trajectory. It rewards zooming actions that strictly narrow the view ($b_{t+1} \subset b_t$) and penalizes disjoint drift. Crucially, it provides a "safe harbor" (neutral reward) for backtracking (expanding the view to correct errors), so the model can explore without being penalized for correcting course.
Process Verification Reward ($R_{proc}$):
- Goal: Ensure Evidence Grounding.
- Mechanism: A "Necessity-Aware" judge that penalizes the model if it gives a confident answer to a detail-oriented query without performing the corresponding zoom-in action. This prevents hallucinations based on insufficient visual evidence.
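The three reward terms described above can be sketched as follows. This is a simplified illustration and not the paper's formulation: the function names, the bonus and penalty magnitudes, the unweighted sum, and the treatment of partially overlapping boxes as neutral are all assumptions. The "Necessity-Aware" judge is reduced to a boolean flag here.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


def contains(outer: Box, inner: Box) -> bool:
    """True if inner lies within outer and is not the same box (strict narrowing)."""
    inside = (outer[0] <= inner[0] and outer[1] <= inner[1]
              and inner[2] <= outer[2] and inner[3] <= outer[3])
    return inside and inner != outer


def disjoint(a: Box, b: Box) -> bool:
    """True if the two boxes share no interior area."""
    return a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1]


def tool_reward(num_calls: int, allowance: int, per_call_penalty: float = 0.1) -> float:
    """Penalize only the zoom calls beyond a per-instance step allowance (assumed form)."""
    return -per_call_penalty * max(0, num_calls - allowance)


def chain_of_focus_reward(views: Sequence[Box], bonus: float = 0.5,
                          penalty: float = 0.5) -> float:
    """Reward narrowing steps, penalize disjoint jumps, leave everything else neutral."""
    reward = 0.0
    for prev, nxt in zip(views, views[1:]):
        if contains(prev, nxt):
            reward += bonus      # coarse-to-fine step
        elif disjoint(prev, nxt):
            reward -= penalty    # drift to an unrelated region
        # Backtracking (expanding the view) and partial overlap stay neutral.
    return reward


def process_reward(needs_zoom: bool, num_calls: int, penalty: float = 1.0) -> float:
    """Penalize answering a detail-oriented query without ever zooming in."""
    return -penalty if needs_zoom and num_calls == 0 else 0.0


def total_reward(correct: bool, full_view: Box, zoom_boxes: List[Box],
                 allowance: int, needs_zoom: bool) -> float:
    """Unweighted sum of answer correctness and the three shaping terms (assumed)."""
    views = [full_view] + list(zoom_boxes)
    n = len(zoom_boxes)
    return ((1.0 if correct else 0.0)
            + tool_reward(n, allowance)
            + chain_of_focus_reward(views)
            + process_reward(needs_zoom, n))


full = (0.0, 0.0, 100.0, 100.0)
print(total_reward(True, full, [(20, 20, 60, 60), (30, 30, 40, 40)], 2, True))
print(total_reward(True, full, [], 2, True))
```

In this sketch, a correct answer reached through two nested zooms scores higher than a correct answer to the same detail-oriented query given with no zoom at all, which mirrors the intent of combining progression and grounding signals with final-answer correctness.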

3. Key Contributions

Diagnosis of Tool Usage Homogenization: The paper identifies and analyzes the phenomenon where UHR MLLMs collapse into saturated, single-call tool usage patterns, attributing it to task heterogeneity and sparse evidence.
UHR-CoZ Dataset: The creation of the largest cold-start dataset for HR RS that systematically annotates interleaved multi-turn tool-use reasoning trajectories, covering diverse zooming regimes.
GeoEyes Model & AdaZoom-GRPO: A novel training strategy combining SFT initialization with a multi-component RL reward system that explicitly optimizes for evidence gain, adaptive efficiency, and geometric progression.

4. Experimental Results

The model was evaluated on XLRS-Bench, a standard UHR remote sensing benchmark.

Overall Performance: GeoEyes achieved a new State-of-the-Art (SOTA) average accuracy of 54.23%.
Comparisons:
- Outperformed domain-specialized baselines: GeoLLaVA-8K (51.5%) and DeepEyes (50.0%).
- Outperformed massive general-purpose models: Surpassed Qwen3-VL-235B (51.1%) and Qwen2.5-VL-72B (50.2%), despite GeoEyes using a much smaller 7B backbone.
Fine-Grained Gains: The most significant improvements were in fine-grained perception tasks:
- Object Color (OCL): 66.1% (vs. 39.0% for Qwen3-VL-235B).
- Overall Counting (OCC): 59.5% (vs. 44.0% for Qwen3-VL-235B).
Ablation Studies:
- Removing the SFT cold start dropped accuracy to 47.73% (from 54.23% for the full model), indicating that diverse initialization is necessary.
- Replacing the proposed geometric reward with standard IoU dropped performance significantly, confirming the need for scale-aware rewards.
- Removing the "Necessity-Aware" process reward led to hallucinations and lower accuracy.
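The margins implied by the figures reported above can be computed directly. The snippet below uses only the numbers quoted in this section; the variable names are illustrative.

```python
geoeyes_avg = 54.23

# Average XLRS-Bench accuracy of the baselines quoted above.
baselines = {
    "GeoLLaVA-8K": 51.5,
    "DeepEyes": 50.0,
    "Qwen3-VL-235B": 51.1,
    "Qwen2.5-VL-72B": 50.2,
}

# Margin of GeoEyes over each baseline, in percentage points.
margins = {name: round(geoeyes_avg - score, 2) for name, score in baselines.items()}

# Fine-grained tasks: (GeoEyes, Qwen3-VL-235B).
fine_grained = {"OCL": (66.1, 39.0), "OCC": (59.5, 44.0)}
fine_margins = {task: round(ours - theirs, 1) for task, (ours, theirs) in fine_grained.items()}

# Drop when the SFT cold start is removed.
cold_start_drop = round(geoeyes_avg - 47.73, 2)

for name, margin in margins.items():
    print(f"{name}: +{margin} points")
print(fine_margins, cold_start_drop)
```

The average-accuracy margins are a few points each, while the fine-grained gaps over Qwen3-VL-235B are much larger (27.1 points on OCL and 15.5 on OCC), consistent with the claim that the gains concentrate in fine-grained perception.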

5. Significance

This work demonstrates that active, policy-controlled zooming is a more effective solution for UHR remote sensing than simply scaling up model parameters ("brute-force scaling"). By explicitly training models to differentiate between when to abstain, when to iterate, and when to stop, GeoEyes achieves robust evidence-grounded understanding. The proposed framework offers a principled path for developing efficient, high-precision MLLMs for complex visual domains where information is sparse and resolution is extreme.
