Imagine you are teaching a brilliant student how to solve a complex puzzle that involves both looking at a picture and doing math.
In the world of Artificial Intelligence, this student is a Multimodal Large Language Model (MLLM). It can see images and read text. The goal of this paper is to teach this student to get better at solving these puzzles using a special training method called Reinforcement Learning with Verifiable Rewards (RLVR). Think of RLVR as a strict coach who only gives a "thumbs up" if the final answer is correct, but doesn't tell the student how to get there.
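The "thumbs up only if the final answer is correct" idea can be sketched in a few lines. This is a minimal illustration, not the paper's code: the helper name and the `\boxed{}` answer convention are assumptions borrowed from common math-benchmark practice.

```python
import re

def extract_final_answer(response):
    """Pull the final answer out of a model response, assuming the common
    convention of wrapping it in \\boxed{...}. Returns None if absent."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return match.group(1).strip() if match else None

def verifiable_reward(response, ground_truth):
    """Binary 'verifiable' reward: 1.0 if the extracted final answer matches
    the ground truth, else 0.0. Note there is no partial credit and no signal
    about HOW the answer was reached -- the coach only checks the scoreboard."""
    answer = extract_final_answer(response)
    return 1.0 if answer == ground_truth.strip() else 0.0
```

Because the reward says nothing about intermediate steps, the model gets no direct feedback on whether its "Eyes" or its "Brain" went wrong, which is exactly the gap the paper addresses.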
The Problem: The "Split Personality" of AI Responses
The authors discovered a fundamental problem with how these AI students currently learn. When the AI looks at a picture and solves a problem, its answer is a mix of two very different types of thoughts:
- The "Eyes" (Perception): These are the words where the AI describes what it sees. "I see a red car," "There are two rows of players," "The jersey says 'MLB'."
- The "Brain" (Reasoning): These are the words where the AI connects the dots to solve the problem. "Therefore, since they are wearing uniforms, it must be a professional game," "So the answer is 42."
The Mistake: Previous training methods treated the AI's answer as a single, undifferentiated block of text. They either trained the "Eyes" to see better or the "Brain" to think harder, but rarely both at the same time in a coordinated way.
The Analogy: Imagine a basketball coach who only practices the team's shooting (Reasoning) but ignores their defense (Perception). The team might score a lot, but they'll lose because they can't stop the other team. Or, imagine a coach who only practices defense but never lets the team shoot. They are great at stopping others but can't score. You need both working together.
The Discovery: They Are Tightly Coupled
The researchers ran an experiment where they tried to train the AI by focusing only on the "Eyes" words or only on the "Brain" words.
- Result: The AI failed either way. Trained only on the "Brain," it produced well-formed logic but described the picture incorrectly (hallucinating details). Trained only on the "Eyes," it described the picture accurately but failed to solve the math problem.
They realized that seeing and thinking are inseparable. You can't have good reasoning without accurate perception, and accurate perception is useless without reasoning to make sense of it.
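The ablation above amounts to masking which tokens receive a training signal. Here is a minimal sketch of that idea, assuming per-token losses and "perception" vs "reasoning" labels are already available; how the paper actually tags tokens is abstracted away.

```python
import numpy as np

def masked_loss(token_losses, token_types, train_on):
    """Average the loss over only the chosen token type ("perception" or
    "reasoning"); the other type contributes no gradient. This mimics the
    'Eyes-only' vs 'Brain-only' training experiment."""
    mask = np.array([t == train_on for t in token_types], dtype=float)
    losses = np.asarray(token_losses, dtype=float)
    # Guard against an empty mask so we never divide by zero.
    return float((losses * mask).sum() / max(mask.sum(), 1.0))
```

Training with `train_on="reasoning"` silently ignores every perception token, which is why the model's descriptions of the image are free to drift into hallucination.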
The Solution: "Token Reweighting" (ToR)
To fix this, the authors proposed a new strategy called Token Reweighting (ToR).
The Metaphor: The Dynamic Scoreboard
Imagine the AI is writing a story, word by word. In traditional training, every word gets the same amount of attention from the coach.
With ToR, the coach puts on special glasses that highlight the most critical words in real-time:
- When the AI is struggling to decide between two logical paths (high uncertainty), the coach highlights those Reasoning words and says, "Pay extra attention here! This is where the logic matters!"
- When the AI is describing a specific visual detail (like a color or a shape), the coach highlights those Perception words and says, "Make sure this matches the picture perfectly!"
The coach then reweights the training. Instead of treating every word equally, the AI gets a "bonus" for getting those critical "Eyes" and "Brain" words right. It forces the AI to learn that seeing correctly is just as important as thinking correctly.
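One way the "dynamic scoreboard" could look in code is sketched below. This is not the paper's formula: it assumes reasoning tokens get a bonus proportional to the model's uncertainty (predictive entropy) and perception tokens get a fixed bonus, with `alpha` and `beta` as made-up hyperparameters.

```python
import numpy as np

def token_weights(entropies, is_perception, alpha=1.0, beta=1.0):
    """Per-token training weights: every token starts at weight 1; uncertain
    reasoning tokens earn an entropy-scaled bonus, perception tokens a fixed
    bonus. Weights are normalized so reweighting shifts emphasis between
    tokens without changing the overall loss scale."""
    entropies = np.asarray(entropies, dtype=float)
    perception = np.asarray(is_perception, dtype=float)
    weights = 1.0 + alpha * entropies * (1.0 - perception) + beta * perception
    return weights * len(weights) / weights.sum()

def reweighted_loss(token_losses, weights):
    """Weighted mean of per-token losses: the 'bonus' for critical tokens."""
    losses = np.asarray(token_losses, dtype=float)
    return float((losses * weights).mean())
```

With all entropies at zero and no perception tokens, every weight collapses back to 1 and training reduces to the ordinary uniform case; the bonuses only kick in where the "coach's glasses" light up.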
Why This Matters
The paper shows that by using this "Token Reweighting" strategy, the AI becomes much better at:
- Seeing accurately: It stops making up details about the image.
- Thinking clearly: It builds better logical chains based on what it actually sees.
- Solving harder problems: It achieves top-tier scores on difficult math and visual reasoning benchmarks.
In a Nutshell
Think of the AI as a detective.
- Old Method: The detective was told to either "Look at the crime scene" or "Solve the mystery," but not both at once.
- New Method (ToR): The detective is trained to realize that looking at the clues (Perception) and connecting them (Reasoning) happen simultaneously. The training system gives extra credit for getting the clues right while solving the mystery.
This simple but powerful tweak allows AI models to become much more reliable, accurate, and "smart" when dealing with the real world, where seeing and thinking are always happening together.