RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning

This paper introduces RewardMap, a multi-stage reinforcement learning framework that leverages a dense-reward dataset (ReasonMap-Plus) and difficulty-aware reward signals to overcome sparse reward challenges, thereby significantly enhancing the fine-grained visual reasoning and spatial understanding capabilities of multimodal large language models.

Sicheng Feng, Kaiwen Tuo, Song Wang, Lingdong Kong, Jianke Zhu, Huan Wang

Published 2026-02-24

Imagine you are trying to teach a very smart, but slightly myopic, robot how to navigate a giant, colorful subway map of a bustling city.

The Problem: The "All-or-Nothing" Trap
Currently, these AI models (called Multimodal Large Language Models) are great at chatting but terrible at looking closely at complex maps. They often miss small details, like which train line goes where, or they get confused by the colors.

If you try to teach them using standard methods, it's like playing a video game where you only get a "Game Over" or "You Win" message at the very end of a 20-hour journey. If the robot makes a tiny mistake in the first minute (like misreading a station name), it doesn't know what it did wrong until it fails the whole trip. This is called the Sparse Reward problem. The robot gets no feedback along the way, so it struggles to learn.

The Solution: REWARDMAP
The authors of this paper built a new training system called REWARDMAP to fix this. Think of it as a super-strict, but incredibly helpful, coach who doesn't just wait for the finish line to give feedback.

Here is how they did it, broken down into three simple steps:

1. Building a "Training Gym" (REASONMAP-PLUS)

Before they could teach the robot to run a marathon (solve complex route planning), they needed to build a gym with smaller, easier exercises.

  • The Old Way: They only had the marathon route.
  • The New Way: They created REASONMAP-PLUS, a massive dataset of practice questions.
    • Easy Level: "How many red lines are on this map?" (Counting).
    • Medium Level: "Is Station A on the same line as Station B?" (True/False).
    • Hard Level: "Plan a route from A to B." (Complex reasoning).
  • The Analogy: Imagine a music teacher. Instead of just asking a student to play a full concerto immediately, they start with scales, then simple songs, then complex pieces. This dataset gives the AI a "curriculum" to learn step-by-step.
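To make the "curriculum" idea concrete, here is a minimal sketch of how such a dataset could be organized in code. The field names and example questions are illustrative only, not the actual ReasonMap-Plus schema:

```python
# Hypothetical sketch of a difficulty-graded question set, in the spirit of
# ReasonMap-Plus. Field names are invented for illustration.
from dataclasses import dataclass

@dataclass
class MapQuestion:
    difficulty: str  # "easy", "medium", or "hard"
    qtype: str       # "counting", "true_false", or "route_planning"
    prompt: str
    answer: str

dataset = [
    MapQuestion("hard", "route_planning", "Plan a route from A to B.", "Line 2, transfer, Line 5"),
    MapQuestion("easy", "counting", "How many red lines are on this map?", "3"),
    MapQuestion("medium", "true_false", "Is Station A on the same line as Station B?", "True"),
]

# Order the questions so training can start with scales before the concerto.
order = ["easy", "medium", "hard"]
curriculum = sorted(dataset, key=lambda q: order.index(q.difficulty))
print([q.difficulty for q in curriculum])  # ['easy', 'medium', 'hard']
```

The point of the sort is simply that the model always sees perception drills (counting, true/false) before full route planning.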

2. The "Detail-Oriented" Coach (Difficulty-Aware Rewards)

In the old system, if the robot got the final answer wrong, it got zero points. In REWARDMAP, the coach is much more granular.

  • The New Reward System: Even if the robot gets the final route wrong, the coach gives partial credit for the parts it got right.
    • Did it identify the starting station correctly? +1 point.
    • Did it pick the right train line name? +1 point.
    • Did it count the stops correctly? +1 point.
  • The Analogy: Think of a spelling bee. If a student spells "Elephant" as "Elephent," the old system says "Wrong, 0 points." The REWARDMAP system says, "You got 'Eleph' right and the '-nt' ending right, but you swapped the 'a' for an 'e'. Good effort, let's fix that one letter." This constant, small feedback keeps the robot motivated and learning.
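The partial-credit idea above can be sketched as a tiny reward function. This is not the paper's exact scoring rule; the keys and point values are assumptions chosen to mirror the bullet list:

```python
# Illustrative partial-credit reward: each correct sub-detail of a predicted
# route earns points, so a wrong final answer no longer scores zero.
# Keys and weights are hypothetical, not the paper's actual reward design.
def detail_reward(pred: dict, ref: dict) -> float:
    reward = 0.0
    if pred.get("start_station") == ref["start_station"]:
        reward += 1.0  # identified the starting station
    if pred.get("line_name") == ref["line_name"]:
        reward += 1.0  # picked the right train line
    if pred.get("num_stops") == ref["num_stops"]:
        reward += 1.0  # counted the stops correctly
    if pred.get("route") == ref["route"]:
        reward += 3.0  # bonus for a fully correct route
    return reward

ref = {"start_station": "Central", "line_name": "Red", "num_stops": 5,
       "route": "Central -> Park -> Harbor"}
pred = {"start_station": "Central", "line_name": "Red", "num_stops": 6,
        "route": "Central -> Park -> Docks"}
print(detail_reward(pred, ref))  # 2.0 — two details right, not a flat zero
```

Under a sparse reward, this prediction would score 0; here the model still learns that its station and line identification were on the right track.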

3. The "Climbing the Ladder" Strategy (Multi-Stage Learning)

They didn't just throw the robot into the deep end. They used a Multi-Stage approach.

  • Stage 1: The robot practices on the easy "Counting" and "True/False" questions. It learns to see the map clearly.
  • Stage 2: Once it's good at seeing, it moves to the harder "Route Planning" questions.
  • The Analogy: You wouldn't teach a child to drive a Formula 1 car on day one. You start with a tricycle, then a bicycle, then a car in an empty parking lot, and finally, the race track. REWARDMAP ensures the AI masters the "tricycle" (visual perception) before tackling the "race track" (complex reasoning).
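The staged schedule can be sketched as a simple loop over difficulty tiers. The `train_on` function here is a placeholder standing in for an actual RL update (the paper uses reinforcement learning; this sketch only shows the ordering):

```python
# Hedged sketch of a two-stage schedule: perception drills first, then
# route planning. `train_on` is a stand-in for a real RL update step.
def train_on(model: dict, batch: list) -> dict:
    # Placeholder: a policy-gradient update would happen here.
    model["steps"] += len(batch)
    return model

def multi_stage_train(model, dataset, stages=(("easy", "medium"), ("hard",))):
    for difficulties in stages:  # Stage 1: easy+medium, Stage 2: hard
        batch = [q for q in dataset if q["difficulty"] in difficulties]
        model = train_on(model, batch)
    return model

dataset = [{"difficulty": "easy"}, {"difficulty": "medium"}, {"difficulty": "hard"}]
model = multi_stage_train({"steps": 0}, dataset)
print(model["steps"])  # 3 — every question used, easy/medium before hard
```

The design choice being illustrated: rather than mixing all difficulties from the start, the model only ever sees "race track" questions after the "tricycle" stage has finished.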

The Result

When they tested this new method, the results were impressive:

  • Better Navigation: The AI became much better at reading subway maps, spotting small details, and planning routes without getting confused.
  • General Smarts: Because the AI learned to "look closer" and "think step-by-step," it didn't just get better at maps. It also got better at other tasks, like reading charts, understanding diagrams, and solving visual puzzles it had never seen before.

In a Nutshell:
The paper is about teaching AI to stop guessing and start seeing. By breaking big, scary problems into small, manageable steps and giving the AI constant, detailed feedback (like a great coach), they turned a confused robot into a sharp-eyed navigator.
