Imagine you are a ship's captain, except that instead of an ocean, you are navigating a massive, complex landscape seen only from a satellite in space. Your goal is to guide a hiker from Point A to Point B. But here's the catch: the hiker has very specific rules. They can't walk through deep water, they get tired on steep hills, and they absolutely must avoid dense forests where they might trip.
This is the challenge that the paper "NeSy-Route" tackles. It's a new "test" designed to see whether Artificial Intelligence systems (specifically, the smart computer brains called Multimodal Large Language Models, or MLLMs) are actually good at planning a safe route through these complex scenes, or whether they are just good at looking at pictures.
Here is the breakdown of the paper using simple analogies:
1. The Problem: The AI is a "Know-It-All" but a "Bad Navigator"
Currently, AI models are amazing at describing what they see. If you show them a picture of a forest, they can say, "That's a tree, and that's a road." They are great at perception (seeing) and reasoning (telling you facts).
However, when you ask them, "Okay, given that the hiker hates mud and needs to stay on dry land, draw me the exact path they should take," they often fail.
- Why? Most previous tests for AI in remote sensing were like multiple-choice quizzes. The AI just had to pick the right sentence from a list. Real life isn't a multiple-choice quiz; you have to create the solution from scratch.
- The Gap: There was no big, fair test to see if AI could actually plan a route while following strict rules.
2. The Solution: NeSy-Route (The "Three-Layer Cake" Test)
The authors built a massive new benchmark called NeSy-Route. Think of it as a three-level obstacle course for AI. To pass the whole test, the AI has to succeed at all three levels:
Level 1: The Translator (Text to Logic)
- The Task: You give the AI a story: "The hiker has boots, so they can walk on sand, but they can't swim."
- The Test: Can the AI translate that story into a strict rulebook? (e.g., "Sand = OK, Water = NO"; a toy sketch of such a rulebook follows this list).
- Analogy: It's like asking a translator to turn a casual conversation into a strict legal contract. If they get the rules wrong here, everything else fails.
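To make that concrete, here is a toy sketch (in Python) of the kind of structured rulebook Level 1 asks the model to produce. The terrain names, dictionary format, and pass/fail comparison are invented for illustration; the benchmark's actual logic representation and grading are more involved.

```python
# Hypothetical illustration of a Level 1 "rulebook": the model reads a
# free-form constraint story and must output something structured and
# machine-checkable. Class names and format here are made up.

story = "The hiker has boots, so they can walk on sand, but they can't swim."

# Target output: terrain class -> is the hiker allowed to stand there?
rulebook = {
    "sand": True,    # boots => sand is traversable
    "water": False,  # can't swim => water is forbidden
}

# Grading can then be a simple comparison against a gold rulebook.
gold = {"sand": True, "water": False}
print("Level 1 pass:", rulebook == gold)
```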
Level 2: The Detective (Text to Image)
- The Task: Now, show the AI a satellite map.
- The Test: Can the AI look at the map, find the sand and the water, and apply the rules from Level 1? "Ah, that blue patch is water (Rule: No go), and that brown patch is sand (Rule: Go)." (A toy version of this rule-applying step is sketched after this list.)
- Analogy: This is like a detective looking at a crime scene photo and pointing out exactly which clues match the suspect's description.
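Here is a similarly hypothetical sketch of Level 2, assuming the satellite scene has already been reduced to a small grid of terrain labels (the real benchmark works on imagery, so the grid, labels, and rulebook below are purely illustrative):

```python
# Toy version of "apply the Level 1 rules to the map": turn a grid of
# terrain labels into a True/False traversability mask. All values invented.

terrain = [
    ["sand",  "sand",  "water"],
    ["water", "sand",  "sand"],
    ["sand",  "water", "sand"],
]
rulebook = {"sand": True, "water": False}

# For each region, decide whether the rules allow the hiker to stand there.
traversable = [[rulebook[cell] for cell in row] for row in terrain]

for row in traversable:
    print(row)  # True = "Go", False = "No go"
```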
Level 3: The Navigator (The Actual Route)
- The Task: Draw the path.
- The Test: The AI must generate a list of coordinates (a path) that gets the hiker from start to finish without breaking any of the rules, while keeping the route as short and safe as possible (a toy check of such a path is sketched after this list).
- Analogy: This is the GPS. It can't just say "Go North." It has to draw the exact line on the map that avoids the mud and the trees.
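As a rough idea of how such an answer can be graded, here is a sketch that assumes the path is a list of (row, column) grid coordinates. The grid, path, and checks are illustrative; the benchmark's real scoring is richer (for example, it also compares the path's length against the optimal route).

```python
# Toy grader for a Level 3 answer: the path must start and end in the right
# places, move one cell at a time, and never step on a forbidden cell.
# Everything here is an illustrative assumption, not the paper's exact metric.

def path_is_valid(path, traversable, start, goal):
    if not path or path[0] != start or path[-1] != goal:
        return False
    for r, c in path:
        if not traversable[r][c]:                 # breaks a rule (e.g., water)
            return False
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1:      # must step to an adjacent cell
            return False
    return True

traversable = [[True, True, False],
               [False, True, True],
               [True, False, True]]
path = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]
print(path_is_valid(path, traversable, start=(0, 0), goal=(2, 2)))  # True
```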
3. How They Built It: The "Robot Factory"
Creating a test this big is hard because you need to know the perfect answer to grade the AI. If the AI draws a path, how do you know if it's the best path?
The authors built an automated factory:
- They took real satellite maps.
- They used a "smart robot" (a computer algorithm called A-Star search) to calculate the mathematically perfect path for every single scenario.
- They used another AI to write the questions and rules.
- Result: They created over 10,000 unique test cases where they know the "Gold Standard" answer. This is 10 times bigger than any previous test!
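For the curious, here is a rough sketch of how an A* search can produce those "Gold Standard" paths on a grid of allowed and forbidden cells. The grid encoding, uniform step cost, and 4-way movement are assumptions made for this example; the authors' actual pipeline may differ in detail.

```python
# A* search on a traversability grid: finds the shortest rule-respecting
# path from start to goal, or returns None if no legal route exists.
# This is a generic textbook A*, not the paper's exact implementation.
import heapq

def a_star(traversable, start, goal):
    rows, cols = len(traversable), len(traversable[0])

    def heuristic(cell):  # Manhattan distance to the goal (never overestimates)
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    # Each frontier entry: (estimated total cost f, cost so far g, cell, path)
    frontier = [(heuristic(start), 0, start, [start])]
    best_g = {start: 0}

    while frontier:
        f, g, cell, path = heapq.heappop(frontier)
        if cell == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            r, c = cell[0] + dr, cell[1] + dc
            if 0 <= r < rows and 0 <= c < cols and traversable[r][c]:
                ng = g + 1
                if ng < best_g.get((r, c), float("inf")):
                    best_g[(r, c)] = ng
                    heapq.heappush(
                        frontier,
                        (ng + heuristic((r, c)), ng, (r, c), path + [(r, c)]),
                    )
    return None

grid = [[True, True, False],
        [False, True, True],
        [True, False, True]]
print(a_star(grid, (0, 0), (2, 2)))
# [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]
```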
4. The Results: The AI is Still a Rookie
They tested the world's smartest AI models (like GPT-5, Gemini, and Qwen) on this new course. Here is what they found:
- Level 1 (Rules): The AI is actually pretty good at understanding the rules. If you ask it "Can a boat drive on land?", it says "No."
- Level 2 (Vision): The AI starts to struggle. It often confuses what is what on the map. It might think a river is a road.
- Level 3 (Planning): This is where the AI really crashes. Even when it understands the rules and sees the map, it often draws a path that goes straight through a lake, or takes a route that is 10 times longer than necessary.
- The Big Takeaway: Just because an AI can talk about a problem and see the picture doesn't mean it can solve the problem. There is a huge gap between "knowing" and "doing."
5. Why This Matters
This paper is a wake-up call. It shows that for AI to be truly useful in real-world emergencies (like guiding a rescue team through a flood or helping a farmer plan a path through a field), we need to stop testing them on simple quizzes and start testing them on complex planning.
In a nutshell: NeSy-Route is a new, super-hard driving test for AI. It proves that while our current AI drivers are great at reading the map and knowing the traffic laws, they are still terrible at actually steering the car through a storm without crashing. The authors hope this test will help engineers build better, smarter AI that can actually get the job done.