RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning

This paper introduces RewardMap, a multi-stage reinforcement learning framework that leverages a dense-reward dataset (ReasonMap-Plus) and difficulty-aware reward signals to overcome sparse reward challenges, thereby significantly enhancing the fine-grained visual reasoning and spatial understanding capabilities of multimodal large language models.

Sicheng Feng, Kaiwen Tuo, Song Wang, Lingdong Kong, Jianke Zhu, Huan Wang

Published 2026-02-24

Imagine you are trying to teach a very smart, but slightly myopic, robot how to navigate a giant, colorful subway map of a bustling city.

The Problem: The "All-or-Nothing" Trap
Currently, these AI models (called Multimodal Large Language Models) are great at chatting but terrible at looking closely at complex maps. They often miss small details, like which train line goes where, or they get confused by the colors.

If you try to teach them using standard methods, it's like playing a video game where you only get a "Game Over" or "You Win" message at the very end of a 20-hour journey. If the robot makes a tiny mistake in the first minute (like misreading a station name), it doesn't know what it did wrong until it fails the whole trip. This is called the Sparse Reward problem. The robot gets no feedback along the way, so it struggles to learn.

The Solution: REWARDMAP
The authors of this paper built a new training system called REWARDMAP to fix this. Think of it as a super-strict, but incredibly helpful, coach who doesn't just wait for the finish line to give feedback.

Here is how they did it, broken down into three simple steps:

1. Building a "Training Gym" (REASONMAP-PLUS)

Before they could teach the robot to run a marathon (solve complex route planning), they needed to build a gym with smaller, easier exercises.

  • The Old Way: They only had the marathon route.
  • The New Way: They created REASONMAP-PLUS, a massive dataset of practice questions.
    • Easy Level: "How many red lines are on this map?" (Counting).
    • Medium Level: "Is Station A on the same line as Station B?" (True/False).
    • Hard Level: "Plan a route from A to B." (Complex reasoning).
  • The Analogy: Imagine a music teacher. Instead of just asking a student to play a full concerto immediately, they start with scales, then simple songs, then complex pieces. This dataset gives the AI a "curriculum" to learn step-by-step.
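To make the "curriculum" idea concrete, here is a minimal sketch of how such a dataset could be organized in code. The field names and example questions are illustrative only, not the actual ReasonMap-Plus schema:

```python
# Hypothetical sketch of a difficulty-graded question set, in the spirit of
# ReasonMap-Plus. Field names are invented for illustration.
from dataclasses import dataclass

@dataclass
class MapQuestion:
    difficulty: str  # "easy", "medium", or "hard"
    qtype: str       # "counting", "true_false", or "route_planning"
    prompt: str
    answer: str

dataset = [
    MapQuestion("hard", "route_planning", "Plan a route from A to B.", "Line 2, transfer, Line 5"),
    MapQuestion("easy", "counting", "How many red lines are on this map?", "3"),
    MapQuestion("medium", "true_false", "Is Station A on the same line as Station B?", "True"),
]

# Order the questions so training can start with scales before the concerto.
order = ["easy", "medium", "hard"]
curriculum = sorted(dataset, key=lambda q: order.index(q.difficulty))
print([q.difficulty for q in curriculum])  # ['easy', 'medium', 'hard']
```

The point of the sort is simply that the model always sees perception drills (counting, true/false) before full route planning.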

2. The "Detail-Oriented" Coach (Difficulty-Aware Rewards)

In the old system, if the robot got the final answer wrong, it got zero points. In REWARDMAP, the coach is much more granular.

  • The New Reward System: Even if the robot gets the final route wrong, the coach gives partial credit for the parts it got right.
    • Did it identify the starting station correctly? +1 point.
    • Did it pick the right train line name? +1 point.
    • Did it count the stops correctly? +1 point.
  • The Analogy: Think of a spelling bee. If a student spells "Elephant" as "Elephent," the old system says "Wrong, 0 points." The REWARDMAP system says, "You got 'Eleph' right and the '-nt' ending right, but you swapped the 'a' for an 'e'. Good effort, let's fix that one letter." This constant, small feedback keeps the robot motivated and learning.
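The partial-credit idea above can be sketched as a tiny reward function. This is not the paper's exact scoring rule; the keys and point values are assumptions chosen to mirror the bullet list:

```python
# Illustrative partial-credit reward: each correct sub-detail of a predicted
# route earns points, so a wrong final answer no longer scores zero.
# Keys and weights are hypothetical, not the paper's actual reward design.
def detail_reward(pred: dict, ref: dict) -> float:
    reward = 0.0
    if pred.get("start_station") == ref["start_station"]:
        reward += 1.0  # identified the starting station
    if pred.get("line_name") == ref["line_name"]:
        reward += 1.0  # picked the right train line
    if pred.get("num_stops") == ref["num_stops"]:
        reward += 1.0  # counted the stops correctly
    if pred.get("route") == ref["route"]:
        reward += 3.0  # bonus for a fully correct route
    return reward

ref = {"start_station": "Central", "line_name": "Red", "num_stops": 5,
       "route": "Central -> Park -> Harbor"}
pred = {"start_station": "Central", "line_name": "Red", "num_stops": 6,
        "route": "Central -> Park -> Docks"}
print(detail_reward(pred, ref))  # 2.0 — two details right, not a flat zero
```

Under a sparse reward, this prediction would score 0; here the model still learns that its station and line identification were on the right track.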

3. The "Climbing the Ladder" Strategy (Multi-Stage Learning)

They didn't just throw the robot into the deep end. They used a Multi-Stage approach.

  • Stage 1: The robot practices on the easy "Counting" and "True/False" questions. It learns to see the map clearly.
  • Stage 2: Once it's good at seeing, it moves to the harder "Route Planning" questions.
  • The Analogy: You wouldn't teach a child to drive a Formula 1 car on day one. You start with a tricycle, then a bicycle, then a car in an empty parking lot, and finally, the race track. REWARDMAP ensures the AI masters the "tricycle" (visual perception) before tackling the "race track" (complex reasoning).
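The staged schedule can be sketched as a simple loop over difficulty tiers. The `train_on` function here is a placeholder standing in for an actual RL update (the paper uses reinforcement learning; this sketch only shows the ordering):

```python
# Hedged sketch of a two-stage schedule: perception drills first, then
# route planning. `train_on` is a stand-in for a real RL update step.
def train_on(model: dict, batch: list) -> dict:
    # Placeholder: a policy-gradient update would happen here.
    model["steps"] += len(batch)
    return model

def multi_stage_train(model, dataset, stages=(("easy", "medium"), ("hard",))):
    for difficulties in stages:  # Stage 1: easy+medium, Stage 2: hard
        batch = [q for q in dataset if q["difficulty"] in difficulties]
        model = train_on(model, batch)
    return model

dataset = [{"difficulty": "easy"}, {"difficulty": "medium"}, {"difficulty": "hard"}]
model = multi_stage_train({"steps": 0}, dataset)
print(model["steps"])  # 3 — every question used, easy/medium before hard
```

The design choice being illustrated: rather than mixing all difficulties from the start, the model only ever sees "race track" questions after the "tricycle" stage has finished.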

The Result

When they tested this new method, the results were impressive:

  • Better Navigation: The AI became much better at reading subway maps, spotting small details, and planning routes without getting confused.
  • General Smarts: Because the AI learned to "look closer" and "think step-by-step," it didn't just get better at maps. It also got better at other tasks, like reading charts, understanding diagrams, and solving visual puzzles it had never seen before.

In a Nutshell:
The paper is about teaching AI to stop guessing and start seeing. By breaking big, scary problems into small, manageable steps and giving the AI constant, detailed feedback (like a great coach), they turned a confused robot into a sharp-eyed navigator.
