Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought

Imagine you are teaching a brilliant but slightly scattered student (an AI) how to solve a complex puzzle that involves both looking at a picture and doing math.

In the past, when we trained these AI students using "Reinforcement Learning," we acted like a strict teacher who only looked at the final answer.

The Old Way: If the student got the right answer, we gave them a gold star for the whole page. If they got it wrong, we gave them a red X for the whole page.
The Problem: This is unfair. Maybe the student looked at the picture perfectly, understood the shapes, and did the math right, but made one tiny typo at the end. Or maybe they guessed the right answer by accident without actually looking at the picture. The old method couldn't tell the difference between a "good guess" and "real understanding."

The paper you shared, PEPO (Perception-Exploration Policy Optimization), introduces a smarter way to teach. It acts like a super-observant coach who watches every single step the student takes, not just the final result.

Here is how PEPO works, broken down into simple concepts:

1. The Two Superpowers: "Looking" and "Thinking"

The paper argues that solving these puzzles requires two different skills that work together:

Perception (The "Eyes"): This is the ability to actually see the image. Did the student look at the triangle in the picture? Did they notice the red car?
Exploration (The "Brain"): This is the ability to think through different possibilities. "If I do X, then Y happens. But wait, maybe Z is better?" This is the part where the student is unsure and trying to figure things out.

The Analogy: Imagine a detective solving a crime.

Perception is looking at the fingerprint on the glass.
Exploration is wondering, "Could the butler have done it? Or maybe the gardener?"
PEPO realizes you need both. If the detective ignores the fingerprint (Perception), they are guessing blindly. If they only stare at the fingerprint and never think of other suspects (Exploration), they miss the bigger picture.

2. The Magic Trick: The "Smart Weight" System

The core innovation of PEPO is a new way to decide which parts of the student's answer deserve praise and which deserve correction.

Instead of giving one grade for the whole answer, PEPO gives a custom score to every single word the AI writes. It uses a special "Smart Gate" to decide how much weight to give each word based on two things:

Did this word "look" at the picture? (Perception)
- If the AI writes "The triangle is red," PEPO checks: Did the AI actually look at the red triangle in the image? If yes, that word gets a high score. It anchors the reasoning in reality.
Was this word a moment of "thinking hard"? (Exploration)
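- If the AI writes "Hmm, maybe the angle is 45 degrees," PEPO sees that the AI is uncertain and exploring options. That word can also get a higher score because it shows the AI is actively reasoning, not just copying, but only if it is still tied to the picture. Uncertainty about something unrelated to the image does not get boosted.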

The Metaphor: Think of the AI's answer as a chain of paperclips.

In the old method, if the chain broke, you threw away the whole chain.
With PEPO, you look at every single paperclip.
- The paperclips that are holding the picture (Perception) are gold.
- The paperclips that are connecting the logic steps (Exploration) are silver.
- The paperclips that are just filler words or random guesses are plastic.
PEPO tells the AI: "Keep the gold and silver clips! Throw away the plastic ones and try to make better connections next time."

3. Why This is a Big Deal

The researchers tested this on many difficult tasks:

Geometry: Solving math problems with diagrams.
Visual Puzzles: Figuring out patterns in images.
Finding Objects: Pointing to exactly where a "cat" is in a photo.

The Results:

The AI trained with PEPO became much better at not hallucinating (making things up). It actually looked at the picture before answering.
It became better at reasoning through tough problems without getting stuck.
It did all this without needing extra teachers or more data. It just learned to pay attention to the right things on its own.

Summary

PEPO is like upgrading an AI's learning style from "Grade the Final Exam" to "Coach the Practice Session."

It teaches the AI that looking at the evidence (Perception) and thinking through the options (Exploration) are the two most important things to do. By rewarding these specific moments, the AI learns to be more accurate, more logical, and more reliable, just like a student who learns to trust their eyes and their brain equally.

1. Problem Statement

Large Vision-Language Models (LVLMs) have made significant strides in reasoning tasks using Chain-of-Thought (CoT) and Reinforcement Learning with Verifiable Rewards (RLVR) methods such as Group Relative Policy Optimization (GRPO). However, existing RLVR methods optimize at too coarse a granularity:

Sequence-Level Supervision: Current methods apply a single advantage score (based on the final answer's correctness) uniformly to all tokens in a response. This fails to distinguish between tokens that are critical for visual grounding (perception) and those that represent exploratory reasoning steps.
Inadequate Exploration: While some methods use token-level entropy to encourage exploration, they primarily capture textual uncertainty and lack a strong connection to visual semantics.
Computational Overhead: Other perception-aware methods often require auxiliary masking branches or attention mechanisms that are incompatible with efficient acceleration frameworks (e.g., FlashAttention) and introduce significant computational costs.

The core problem is the lack of a mechanism that couples visual perception and reasoning exploration at the token level without adding supervision or heavy computational overhead.
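To make the "sequence-level supervision" point concrete, here is a minimal sketch of a GRPO-style group-relative advantage being copied onto every token of a response. The reward values, token counts, and function names are hypothetical, and real implementations may differ in how they normalize (for example, in the standard-deviation estimator or epsilon they use).

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

def broadcast_to_tokens(advantages, response_lengths):
    """Sequence-level supervision: every token inherits its response's advantage."""
    return [[a] * n for a, n in zip(advantages, response_lengths)]

# Hypothetical group of 4 sampled responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]  # 1.0 = final answer verified correct
lengths = [3, 5, 2, 4]          # number of tokens in each response

seq_adv = group_relative_advantages(rewards)
tok_adv = broadcast_to_tokens(seq_adv, lengths)

for i, row in enumerate(tok_adv):
    print(f"response {i}: {[round(a, 3) for a in row]}")
```

Every token in a correct response receives the same positive value and every token in an incorrect response the same negative value, regardless of whether that token was grounded in the image. That uniformity is the limitation described above.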

2. Methodology: Perception-Exploration Policy Optimization (PEPO)

The authors propose PEPO, a token-level policy optimization framework that derives a "perception prior" from hidden states and integrates it with token entropy via a smooth gating mechanism.

A. Token-Level Analysis

The authors first analyze multimodal reasoning trajectories and identify two complementary dynamics:

Perceptual Grounding: Correct reasoning relies heavily on a compact subset of tokens that are strongly aligned with visual inputs. They quantify this using Visual Similarity (VS), defined as the mean cosine similarity between a response token's hidden state and all vision token hidden states across layers.
Exploratory Uncertainty: Reasoning involves uncertain steps where the model explores alternative paths. This is captured by Token Entropy derived from the output logits.

Key Finding: High visual similarity tokens anchor the reasoning to the image, while high-entropy tokens mark decision points. They are complementary, not redundant.
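The two signals described above can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not the authors' code: hidden states are given as nested lists, the helper names are made up for this example, and entropy is taken over a softmax of the raw logits with natural logarithms.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def visual_similarity(token_states, vision_states):
    """Mean cosine similarity to all vision tokens, averaged over layers.

    token_states[l]  : hidden vector of one response token at layer l
    vision_states[l] : list of vision-token hidden vectors at layer l
    """
    per_layer = [
        sum(cosine(h, v) for v in layer_vision) / len(layer_vision)
        for h, layer_vision in zip(token_states, vision_states)
    ]
    return sum(per_layer) / len(per_layer)

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Toy example: 2 layers, 2 vision tokens per layer, 2-dimensional hidden states.
token_states = [[1.0, 0.0], [0.0, 1.0]]
vision_states = [
    [[1.0, 0.0], [0.0, 1.0]],   # layer 1
    [[0.0, 2.0], [0.0, -2.0]],  # layer 2
]
vs = visual_similarity(token_states, vision_states)
h_uniform = token_entropy([0.0, 0.0, 0.0, 0.0])
h_peaked = token_entropy([10.0, 0.0, 0.0, 0.0])
print(vs, h_uniform, h_peaked)
```

A token whose hidden state points in the same direction as many vision tokens gets a high similarity score, while a token whose next-token distribution is spread out gets a high entropy. The two quantities are computed from different parts of the forward pass, which is consistent with the claim that they carry complementary information.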

B. The PEPO Framework

PEPO modifies the standard GRPO/DAPO objective by reweighting the advantage function at the token level.
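To show where a token-level advantage enters the optimization, the sketch below implements a generic PPO-style clipped surrogate of the kind GRPO-family objectives build on. It is a simplification under assumptions: it uses a single symmetric clip range, omits any KL term, averages over the tokens of one response, and the function name and inputs are hypothetical. It is not the exact objective of GRPO, DAPO, or PEPO.

```python
import math

def clipped_token_loss(logp_new, logp_old, token_adv, clip_eps=0.2):
    """Negative clipped surrogate, averaged over the tokens of one response.

    logp_new / logp_old : per-token log-probabilities under the current
                          and the behavior (old) policy
    token_adv           : per-token advantages (uniform in sequence-level RL)
    """
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, token_adv):
        ratio = math.exp(ln - lo)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic choice between the unclipped and clipped objectives.
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)

logp = [-1.0, -2.0, -0.5]
uniform = clipped_token_loss(logp, logp, [1.0, 1.0, 1.0])
reweighted = clipped_token_loss(logp, logp, [0.5, 2.0, 0.5])
boosted = clipped_token_loss([math.log(2.0)], [0.0], [1.0])
print(uniform, reweighted, boosted)
```

With identical old and new policies, the two advantage vectors give the same loss value because both average to 1. What changes is how the gradient is distributed: the token with advantage 2.0 receives four times the push of its neighbors.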

Perception Modeling:
For each response token $t$ in response $i$ , compute the visual similarity score $VS^{(i)}_t$ using hidden states (no auxiliary branches needed).
$VS^{(i)}_t = \frac{1}{L} \sum_{l=1}^{L} \frac{1}{N} \sum_{n=1}^{N} \frac{\langle h_{l,t}, v_{l,n} \rangle}{\|h_{l,t}\| \|v_{l,n}\|}$
Exploration Modeling:
Compute token-level entropy $H^{(i)}_t$ from the policy model's logits.
Perception-Exploration Fusion (Smooth Gating):
The method fuses these signals to generate a token weight $w^{(i)}_t$ :
- Normalize $VS$ and $H$ to $[0, 1]$ .
- Compute a centered joint score $\hat{g}^{(i)}_t$ .
- Apply a smooth gating function using a hyperparameter $\alpha$ :
  $w^{(i)}_t = T \cdot \text{Softmax}\left( (1 + \alpha \tanh(\hat{g}^{(i)}_t)) \cdot VS^{(i)}_t \right)$
- Crucial Design: The gate is multiplied by $VS^{(i)}_t$ . This ensures that entropy-driven modulation only occurs on tokens that are already visually grounded, preventing the amplification of high-entropy but visually irrelevant tokens.
Token-Level Advantage:
The sequence-level advantage $A^{(i)}$ is refined into a token-level advantage $A^{(i)}_t$ :
$A^{(i)}_t = \left[ (1 - \lambda) + \lambda w^{(i)}_t \right] A^{(i)}$
Here, $\lambda$ is a scheduling parameter that linearly increases from 0 to 1 during training, allowing the model to start with sequence-level optimization and gradually shift to fine-grained token-level optimization.
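The fusion and reweighting steps above can be sketched as follows. Several details are assumptions made only for this illustration, because the summary does not specify them: the centered joint score is taken to be the sum of the normalized signals minus its mean over tokens, the normalized visual similarity is used inside the softmax, and $T$ is taken to be the number of response tokens. All function names and input values are hypothetical.

```python
import math

def min_max(xs, eps=1e-8):
    """Rescale values to approximately [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo + eps) for x in xs]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def pepo_style_weights(vs, entropy, alpha=0.5):
    """Token weights w_t = T * softmax((1 + alpha * tanh(g_hat_t)) * VS_t)."""
    vs_n, h_n = min_max(vs), min_max(entropy)
    # ASSUMPTION: joint score = normalized VS + normalized entropy, mean-centered.
    joint = [v + h for v, h in zip(vs_n, h_n)]
    mean_joint = sum(joint) / len(joint)
    g_hat = [j - mean_joint for j in joint]
    gated = [(1.0 + alpha * math.tanh(g)) * v for g, v in zip(g_hat, vs_n)]
    num_tokens = len(vs)  # ASSUMPTION: T is the response length
    return [num_tokens * p for p in softmax(gated)]

def token_advantages(seq_adv, weights, lam):
    """A_t = [(1 - lam) + lam * w_t] * A."""
    return [((1.0 - lam) + lam * w) * seq_adv for w in weights]

def linear_lambda(step, total_steps):
    """Linear ramp from 0 to 1 over training."""
    return min(1.0, max(0.0, step / total_steps))

# Four hypothetical tokens: (low VS, low H), (high VS, high H),
# (low VS, high H), (high VS, low H).
vs = [0.1, 0.6, 0.2, 0.5]
entropy = [0.2, 1.5, 1.4, 0.1]

weights = pepo_style_weights(vs, entropy)
early = token_advantages(2.0, weights, linear_lambda(0, 100))
late = token_advantages(2.0, weights, linear_lambda(100, 100))
print([round(w, 3) for w in weights])
print(early, [round(a, 3) for a in late])
```

In this toy run the token that is both visually grounded and uncertain gets the largest weight, while the high-entropy but weakly grounded token is weighted below the grounded low-entropy one, which is the behavior the gate-times-similarity design is meant to produce. Under the assumption that $T$ is the token count, the weights average to 1, so the reweighting redistributes credit across tokens without changing the mean advantage of the response.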

3. Key Contributions

First Token-Level Analysis of Multimodal CoT: The authors present what they describe as the first analysis to explicitly reveal the complementary roles of visually grounded tokens (anchoring perception) and high-entropy tokens (driving exploration) in LVLM reasoning.
PEPO Framework: A novel, lightweight policy optimization method that:
- Derives a perception prior directly from hidden-state similarity (no extra supervision).
- Uses a smooth gating mechanism to fuse perception and entropy.
- Integrates seamlessly with existing frameworks like GRPO and DAPO.
Efficiency: PEPO requires no auxiliary branches and adds negligible computational overhead (<1% step-level time), making it compatible with efficient training accelerators.

4. Experimental Results

The authors evaluated PEPO on two models (Qwen2.5-VL-3B and InternVL3-2B) across several benchmark families:

Geometry & Math Reasoning (Geometry3K, MathVista, MathVerse, LogicVista):
- $\text{PEPO}_G$ (PEPO on GRPO) improved average accuracy by +3.67 (Qwen) and +3.51 (InternVL) over standard GRPO.
- $\text{PEPO}_D$ (PEPO on DAPO) showed gains of +0.45 and +5.15 respectively.
Visual Grounding (RefCOCO, LISA):
- Achieved +0.86 IoU@50 improvement on LISA (out-of-domain), demonstrating better alignment between text and visual regions.
- Avoided the "entropy-only collapse" seen in baseline High-Entropy RL methods.
Few-Shot Classification (FGVC Aircraft, Flower102):
- Significant gains in low-data regimes: +5.32 and +1.46 average accuracy improvements over GRPO.
Visual Puzzles (PuzzleVQA, AlgoPuzzleVQA):
- Consistent improvements in both in-domain and out-of-domain settings, indicating better transfer of visuospatial reasoning.
Scalability:
- On the large-scale ViRL39k dataset, PEPO showed consistent gains over GRPO and PAPO, suggesting that the method remains robust at larger data scales.
Efficiency:
- Computational overhead ratio ( $\rho$ ) remained below 1%.
- Training throughput was comparable to or slightly higher than GRPO, partly due to the generation of shorter, more concise responses.

5. Significance

This paper addresses a critical bottleneck in multimodal RL: the misalignment between the coarse nature of reward signals and the fine-grained, multimodal nature of reasoning.

Theoretical Insight: It establishes that successful multimodal reasoning is not just about "thinking" (entropy) or "seeing" (perception) in isolation, but the dynamic coupling of both.
Practical Impact: PEPO provides a plug-and-play solution that significantly boosts reasoning capabilities in LVLMs without requiring complex architectural changes or massive computational resources. It offers a principled path toward more robust, visually grounded, and efficient multimodal agents.