From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

The Big Problem: The Robot is "Blind" to Progress

Imagine you are teaching a robot to bake a cake.

The Old Way (Passive Observer): You show the robot a video of the baking process. The robot is very good at describing what it sees: "I see flour being poured. I see eggs being cracked." But if you ask, "Are we done yet?" or "How much of the cake is actually baked?", the robot gets confused. It might say, "The flour is gone, so the cake must be 100% done!" even though the oven is still cold. It's like a tourist taking photos; they see the scenery, but they don't understand the story or the goal.
The Bottleneck: Current AI models are great at describing events, but terrible at judging progress. They can't tell the difference between a robot that is successfully baking a cake and one that is just making a mess that looks like baking.

The Solution: PRIMO R1 (The "Critic" Chef)

The authors introduce a new system called PRIMO R1. Instead of just being a tourist (Observer), they turn the AI into a strict Food Critic (Active Critic).

Here is how they did it, broken down into three simple steps:

1. The "Before and After" Photo Album

Most robots only look at the video clip of what is happening right now. It's like trying to guess the ending of a movie by only watching the middle scene.

PRIMO's Trick: They force the AI to look at three things at once:
1. The Start: A photo of the kitchen before anything happened.
2. The Middle: The video of the robot working.
3. The Now: A photo of the kitchen right at this exact second.
The Analogy: Imagine you are grading a student's essay. Instead of just reading the middle paragraph, you look at the Prompt (what they were asked to do), the First Draft, and the Current Draft. This helps you see exactly how far they have come.

2. The "Think Aloud" Training (Chain of Thought)

Previously, we just told the AI, "Guess the percentage: 50%." If it was wrong, we just said "Wrong."

PRIMO's Trick: They made the AI talk to itself before giving the answer. It has to write a plan, observe the video, and reason through the steps.
The Analogy: Think of a math student.
- Old Way: The teacher asks "What is 2+2?" The student guesses "5". The teacher says "No." The student learns nothing.
- PRIMO Way: The teacher says, "Show your work." The student writes: "I know 2+2 means adding two groups of two. That makes four." Then they answer "4."
- By forcing the AI to write out its reasoning (Planning → Observation → Reasoning), it learns why a task is 50% done, not just that it is 50% done.

3. The "Taste Test" (Reinforcement Learning)

This is the secret sauce. They didn't just teach the AI with textbooks (Supervised Learning). They used Reinforcement Learning, which is like training a dog with treats.

How it works: The AI generates a reasoning chain and a guess.
- If the guess is close to the truth, it gets a "treat" (a reward).
- If the guess is way off, it gets no treat.
The Magic: The AI realizes that to get the treat, it must write a good reasoning chain. It learns that thinking deeply leads to better answers. It stops guessing and starts "critiquing" the robot's performance like a human expert would.

Why This Matters (The Results)

The paper shows that this new "Critic" AI is amazing:

It's Smarter than Bigger Models: A small 7-billion-parameter model (PRIMO R1) beat massive 72-billion-parameter models (like giant versions of GPT-4) at judging robot tasks. It's like a sharp, focused chef beating a giant, confused food critic.
It Doesn't Get Fooled: If a robot drops a cake and the pieces look like a "finished" cake on the floor, the old AI might say "100% done!" PRIMO R1 looks at the start and end photos, sees the mess, and says, "Wait, the cake is broken. That's a failure, not success."
It Works in the Real World: It can watch a robot in a simulation and then immediately understand a robot in a real factory, even if it's never seen that specific factory before.

Summary

The paper takes a robot brain that was just a passive camera (describing what it sees) and turns it into an active coach (judging how well the robot is doing).

By forcing the AI to look at the start and end points, think out loud, and learn from rewards, they created a system that can accurately tell a robot: "You are halfway there, but you dropped the spoon. Fix it!" This is a huge step toward robots that can learn complex tasks on their own without needing humans to program every single reward.

1. Problem Statement

The paper addresses a critical bottleneck in long-horizon robotic manipulation: the lack of effective, dense reward signals for policy learning.

The Limitation of Current Models: Existing Video Multimodal Large Language Models (Video MLLMs) function primarily as passive "Observers." They excel at describing what is happening (captioning, QA) but fail at rigorous quantitative reasoning regarding how well a task is progressing relative to a final goal.
The "Observer" Deficit: When trained via Supervised Fine-Tuning (SFT) on progress estimation, these models tend to:
- Recognize visual trajectories that resemble success even if the task fails.
- Lack causal reasoning, leading to brittle generalization on unseen objects or environments.
- Fail to align continuous visual trajectories with discrete logical conditions required for task success.
The Goal: Transform the model from a passive observer into an active "Critic" capable of explicit process reasoning, providing accurate progress estimation and failure detection without relying on privileged ground-truth states.

2. Methodology: PRIMO R1

The authors propose PRIMO R1 (Process Reasoning Induced MOnitoring), a 7B parameter framework that shifts the paradigm from direct regression to outcome-based Reinforcement Learning (RL) with explicit Chain-of-Thought (CoT).

A. Structured Temporal Input

To address the loss of detail in continuous dynamic feature spaces, the architecture employs a structural prompting strategy:

Triad Input: Instead of feeding only a video sequence ( $V_{seq}$ $V_{se q}$ ), the model is anchored by three specific inputs:
1. Initial State Image ( $I_{init}$ ): The environment before execution.
2. Process Video Sequence ( $V_{seq}$ ): The temporal transition.
3. Current State Image ( $I_{curr}$ ): The latest observed outcome.
Significance: This explicitly anchors the reasoning task between defined spatial boundaries, transforming generic temporal perception into structured state-alignment verification.

B. Process Reasoning via Reinforcement Learning (GRPO)

The core innovation is replacing SFT with Group Relative Policy Optimization (GRPO) to elicit CoT generation:

Two-Stage Training:
1. SFT Phase: The model is fine-tuned on a dataset with CoT annotations to learn the output format and basic reasoning structure.
2. RL Phase (GRPO): The model is optimized using outcome-based rewards.
Reward Design:
- Format Reward: Enforces a strict structure: <thinking>Planning...Observation...Reasoning...</thinking><answer>Progress %</answer>. This prevents the model from collapsing into direct guessing.
- Accuracy Reward: A bounded linear decay function based on the difference between the predicted progress and the ground truth.
Mechanism: GRPO samples a group of outputs, normalizes rewards against the group distribution, and updates the policy to maximize the relative advantage. This incentivizes the model to self-organize intermediate reasoning steps to accurately align the visual trajectory with the task goal.

C. PRIMO Dataset and Benchmark

PRIMO Dataset: A comprehensive collection aggregating data from real-world (AgiBot) and high-fidelity simulations (BEHAVIOR-1k, RoboTwin). It includes 116k samples for SFT and 182k samples for RL, all annotated with fine-grained progress indicators and CoT paths.
PRIMO Bench: A benchmark designed to evaluate Out-of-Domain (OOD) generalization, including:
- In-Domain: Same tasks, seen environments.
- Cross-Task: Unseen tasks in seen environments.
- Cross-Environment: Real-world transfer to a different humanoid robot (Leju KUAVO-MY) in unstructured settings.

3. Key Contributions

Paradigm Shift: Introduces PRIMO R1, transforming Video MLLMs from passive observers to active, interpretable critics via outcome-based RL.
Structured Temporal Input: Proposes anchoring video sequences between initial and current state images, which is identified as a necessary prerequisite for accurate long-horizon progress estimation.
CoT Elicitation: Demonstrates that optimizing for continuous progress reasoning via GRPO intrinsically constructs the temporal context representations required for discrete failure detection.
New Benchmark: Releases the PRIMO Dataset and PRIMO Bench to systematically evaluate post-training methods in video-based MLLMs.

4. Experimental Results

The model (based on Qwen2.5-VL-7B) was evaluated across diverse environments (AgiBot, BEHAVIOR, RoboTwin, and Real Humanoid).

Progress Estimation Accuracy:
- SOTA Performance: PRIMO R1 achieves an average Mean Relative Accuracy (MRA) of 82.90 and Mean Absolute Error (MAE) of 15.52.
- Comparison: It outperforms the massive Qwen2.5-VL-72B (a 72B parameter model) by +9.10 absolute MRA points.
- Error Reduction: It reduces the MAE of specialized reasoning baselines by 50%.
- Sim-to-Real: In the "Real Humanoid" OOD setting, PRIMO R1 maintains a strong MRA of 72.32, whereas general MLLMs drop significantly (e.g., Qwen2.5-VL-7B drops to 56.46).
Failure Detection (Zero-Shot Generalization):
- On the RoboFail benchmark, PRIMO R1 achieves 67.0% accuracy.
- This surpasses closed-source models like OpenAI o1 (61.0%), GPT-4o (63.0%), and Gemini 2.0 Flash (67.0%), despite PRIMO R1 being a 7B open-source model.
Ablation Studies:
- Input Modalities: Using only the current state ( $I_{curr}$ ) yields high error (MAE ~59.5). The full triad ( $I_{init} + V_{seq} + I_{curr}$ ) is essential for minimizing error, especially in long-horizon tasks.
- RL vs. SFT: While SFT improves performance, the RL phase is crucial for generalization. RL-only training without SFT fails to discover correct reasoning structures, but the combination (SFT + RL) creates a powerful synergy.

5. Significance and Impact

Reward Signal Generation: PRIMO R1 provides a viable method for deriving dense, reliable reward signals directly from visual observations, a prerequisite for training autonomous policies in complex, long-horizon tasks without manual reward engineering.
Efficiency: The 7B model achieves state-of-the-art results with significantly lower computational cost and inference latency compared to 72B+ models or closed-source giants.
Reasoning Capability: The paper establishes that optimizing for continuous progress tracking inherently enables robust zero-shot failure detection. This suggests a unified approach where "process reasoning" serves as the foundation for both evaluating success and identifying failure in embodied AI.
Generalization: The ability to generalize to unseen robots and unstructured real-world environments demonstrates that explicit temporal anchoring and CoT reasoning bridge the sim-to-real gap effectively.