See, Plan, Rewind: Progress-Aware Vision-Language-Action Models for Robust Robotic Manipulation

Imagine you are teaching a very smart, but slightly clumsy, robot to make a sandwich.

Most current robot brains (AI models) work like a person who is given a single, giant instruction: "Make a sandwich." They look at the bread, the knife, and the ham, and they try to jump straight to the final result. If they drop the ham, or if the bread slides off the counter, they get confused. They don't realize they made a mistake until the whole thing is ruined, and then they just keep trying to do the same wrong thing over and over again.

The paper you shared introduces a new way of thinking called SPR (See, Plan, Rewind). It's like giving the robot a checklist, a GPS, and an "Undo" button.

Here is how it works, broken down into three simple steps:

1. See: The "Checklist" Moment

Instead of just looking at the whole messy kitchen, the robot breaks the big job into tiny, manageable steps.

The Analogy: Imagine you are packing for a trip. You don't just think "Pack everything." You make a list: 1. Pack socks, 2. Pack shirts, 3. Pack shoes.
What SPR does: When the robot sees the task "Put the soup in the basket," it doesn't just rush. It says, "Okay, first I need to pick up the soup. Then I need to move to the basket. Then I need to drop it." It creates a mental map of these tiny checkpoints (called spatial subgoals).

2. Plan: The "GPS" Navigation

Once it has the checklist, it plans the exact path to the next checkpoint.

The Analogy: Think of a GPS in your car. It doesn't just tell you "Drive to New York." It says, "Turn left in 50 feet, then merge onto the highway."
What SPR does: The robot plans a specific path for its arm to reach the next item on the list (like the soup can). It draws a line in the air from where its hand is now to where the soup is. This makes the movement much more precise.

3. Rewind: The "Undo" Button

This is the most magical part. If the robot drops the soup, or if it gets stuck because the basket is in a weird spot, it doesn't panic. It realizes, "Hey, I'm not making progress!"

The Analogy: Imagine you are playing a video game and you fall into a pit. Instead of restarting the whole level from the beginning, you hit the "Reload Last Save" button. You go back to the last safe spot and try again.
What SPR does: The robot has a "state recorder." If it sees that it's stuck (its plan hasn't changed for several steps in a row) or sliding backward (the checklist is getting longer instead of shorter), it triggers the Rewind. It physically moves its arm back toward the starting position for a few steps, then tries the plan again from a spot it knows how to handle.

Why is this a big deal?

1. It's like a human, not a machine.
Humans naturally break big tasks into small steps and know when to back up if they mess up. This robot finally does the same thing. It understands progress. It knows, "I finished step 1, now I'm on step 2."

2. It's super tough (Robust).
The researchers tested this robot in a "chaos mode" where they changed the lighting, moved the objects around, or even changed the robot's starting position.

Old robots: When things changed, their success rates dropped by roughly 27% to 38%.
SPR Robot: Its success rate dropped by only about 19%. It was much better at handling surprises because it could "rewind" and try again.

3. It doesn't need a million mistakes to learn.
Usually, to teach a robot how to fix its mistakes, you have to break it a thousand times and record the data. That's expensive and slow.

SPR's trick: It learns how to "rewind" just by watching successful videos of robots doing tasks, played backward. It figures out, "If I go backward, I can try again." It's like learning to back out of a driveway by watching a video of someone pulling in, played in reverse, rather than crashing a thousand times yourself.

The Bottom Line

This paper gives robots a sense of direction and a safety net. Instead of blindly charging forward until they crash, they check their progress, plan their next small move, and if things go wrong, they hit "Rewind" to try again. This makes them much more reliable for real-world jobs, like cleaning a messy room or helping in a kitchen, where things rarely go exactly according to plan.

Here is a detailed technical summary of the paper "See, Plan, Rewind: Progress-Aware Vision-Language-Action Models for Robust Robotic Manipulation."

1. Problem Statement

Robotic manipulation requires agents to operate in dynamic 3D environments where errors (e.g., failed grasps, collisions, misalignment) are inevitable. Existing Vision-Language-Action (VLA) models often lack progress awareness:

Abstract Planning: Many models generate high-level plans or binary success/failure signals that lack spatial grounding, making it difficult to verify intermediate states.
Fragile Recovery: When failures occur, current approaches often rely on extensive failure data collection (costly) or external Large Language Models (LLMs) for reasoning (slow and less adaptable).
Lack of Closed-Loop Correction: Without a mechanism to detect stagnation or regression in task progress, robots often persist in failure modes, leading to task collapse, especially in long-horizon tasks or Out-of-Distribution (OOD) scenarios.

The core challenge is to create a framework that can quantitatively measure task progress against concrete milestones and autonomously recover from failures without requiring additional training data or auxiliary models.

2. Methodology: See, Plan, Rewind (SPR)

The authors propose SPR, a progress-aware VLA framework that operates through a continuous closed-loop cycle: See, Plan, and Rewind.

A. Core Cycle

See (Progress Monitoring):
- The model analyzes the current observation and task instruction.
- It outputs the remaining subtask count ( $n$ ) and a sequence of spatial subgoals ( $s_1, ..., s_n$ ).
- Each subgoal consists of a semantic description and a 2D spatial coordinate (waypoint) representing the completion state of that subtask.
- This transforms abstract task goals into a verifiable sequence of spatial milestones.
Plan (Trajectory Generation):
- Instead of planning directly to the final goal, the model plans a 2D trajectory (up to 5 waypoints) from the current gripper position to the next subtask waypoint.
- This fine-grained planning reduces error accumulation and provides robust guidance for long-horizon tasks where the final goal might be spatially irrelevant until intermediate steps are completed.
Rewind (Error Recovery):
- A State Recorder continuously monitors the predicted subtask count and planned 2D trajectories over recent timesteps.
- Anomaly Detection:
  - Count Anomaly: If the subtask count increases (indicating regression) or fails to decrease monotonically.
  - Stagnation: If the planned 2D trajectory remains identical over 8 timesteps (indicating the robot is stuck).
- Recovery Action: Upon detecting an anomaly, the system triggers a Rewind mechanism. It replaces the current task instruction with a "return to initial position" command for a fixed duration ( $N=3$ steps), allowing the robot to backtrack to a recoverable state before resuming the task.
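
The monitoring and recovery logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Prediction`, `StateRecorder`, `select_instructions`, and `REWIND_INSTRUCTION` are hypothetical, a count anomaly is taken to mean only a strict increase in the remaining subtask count, predictions made during a rewind are assumed to be ignored, and the recorder is assumed to be cleared after each rewind. The window of 8 timesteps and $N=3$ rewind steps come from the summary.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Sequence, Tuple

Waypoint = Tuple[int, int]  # 2D image coordinate (assumed integer pixels)
REWIND_INSTRUCTION = "return to initial position"  # exact wording is an assumption


@dataclass(frozen=True)
class Prediction:
    """Simplified See/Plan output for one timestep (subgoal text omitted)."""
    remaining_subtasks: int            # n, the remaining subtask count
    trajectory: Tuple[Waypoint, ...]   # planned 2D path to the next subgoal


class StateRecorder:
    """Keeps recent predictions and flags count or stagnation anomalies."""

    def __init__(self, stagnation_window: int = 8) -> None:
        self.stagnation_window = stagnation_window
        self._history: Deque[Prediction] = deque(maxlen=stagnation_window)

    def update(self, pred: Prediction) -> bool:
        """Record a prediction; return True if an anomaly is detected."""
        # Count anomaly (assumed definition): the count went up since last step.
        count_anomaly = (
            bool(self._history)
            and pred.remaining_subtasks > self._history[-1].remaining_subtasks
        )
        self._history.append(pred)
        # Stagnation: the planned trajectory is identical across the full window.
        stagnation = len(self._history) == self.stagnation_window and all(
            past.trajectory == pred.trajectory for past in self._history
        )
        return count_anomaly or stagnation

    def reset(self) -> None:
        self._history.clear()


def select_instructions(
    task: str, predictions: Sequence[Prediction], rewind_steps: int = 3
) -> List[str]:
    """Return the instruction issued at each timestep, swapping in the
    rewind command for `rewind_steps` steps whenever an anomaly fires."""
    recorder = StateRecorder()
    remaining_rewind = 0
    issued: List[str] = []
    for pred in predictions:
        if remaining_rewind == 0 and recorder.update(pred):
            recorder.reset()
            remaining_rewind = rewind_steps
        if remaining_rewind > 0:
            issued.append(REWIND_INSTRUCTION)
            remaining_rewind -= 1
        else:
            issued.append(task)
    return issued
```

In this sketch the policy itself is untouched; only the language instruction fed to it changes, which matches the description of recovery as an instruction swap rather than a separate controller.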

B. Data Curation Pipeline

A key innovation is the automatic generation of supervision signals from existing demonstration data, eliminating the need for manual annotation or auxiliary models:

Subtask Segmentation:
- For Pick-and-Place: Boundaries are detected via gripper state transitions (open/close).
- For Other Manipulations (e.g., pushing): A video-language model (Gemini-3) annotates start/end frames and semantic descriptions.
Spatial Extraction:
- DINOv3 and SAM are used to extract precise 2D gripper coordinates without task-specific training.
- Trajectories are smoothed and subtask waypoints are extracted at boundary frames.
Rewind Data Construction:
- Reverse trajectories are synthesized by temporally inverting successful forward demonstrations and negating action tokens (delta movements), creating a "return to start" policy.
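
Two steps of this pipeline are simple enough to sketch directly: boundary detection from gripper state transitions, and rewind data synthesis by temporal inversion with negated deltas. The function names `segment_by_gripper` and `synthesize_rewind` are hypothetical, and the sketch assumes each action is a tuple of delta movements only (how the gripper command or tokenized actions are handled in the real pipeline is not specified here).

```python
from typing import List, Sequence, Tuple

Action = Tuple[float, ...]  # per-step delta movement (assumed layout)


def segment_by_gripper(gripper_closed: Sequence[bool]) -> List[int]:
    """Return the frame indices at which the gripper state flips
    (open -> closed or closed -> open); these mark subtask boundaries."""
    return [
        t
        for t in range(1, len(gripper_closed))
        if gripper_closed[t] != gripper_closed[t - 1]
    ]


def synthesize_rewind(actions: Sequence[Action]) -> List[Action]:
    """Build a reverse trajectory from a successful forward one:
    play the steps in reverse order and negate every delta."""
    return [tuple(-delta for delta in action) for action in reversed(actions)]
```

For example, `segment_by_gripper([False, False, True, True, False])` yields boundaries at frames 2 and 4, and executing a forward trajectory followed by its `synthesize_rewind` output sums to zero net displacement, which is the "return to start" behavior the rewind data is meant to teach.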

3. Key Contributions

Progress Awareness with Spatial Subtasks: Replaces abstract linguistic plans with a sequence of verifiable 2D spatial waypoints. This enables fine-grained, robot-executable progress tracking without auxiliary models.
Progress-Driven Error Recovery: Formulates recovery as an executable policy. By monitoring progress metrics (subtask count and trajectory stability), the system autonomously triggers a rewind to restore the robot to an in-distribution state.
Data-Efficient Training: The framework learns recovery behaviors purely from successful demonstrations (via synthesized reverse trajectories), avoiding the high cost of collecting explicit failure data.
State-of-the-Art Robustness: Demonstrates superior generalization in OOD scenarios, particularly in handling unseen initial states and environmental perturbations.

4. Experimental Results

Benchmarks & Metrics

LIBERO Benchmark: Standard simulation suite for language-instructed manipulation.
LIBERO-Plus: A challenging OOD benchmark with >6,800 test variants (backgrounds, robot states, language, layouts, lighting).
Real-Robot Tasks: Pick-and-place, multi-object tidying, and continuous-contact pushing (Push-T).

Performance Highlights

LIBERO Performance: SPR outperforms the MolmoAct baseline by 5 percentage points (91.8% vs. 86.8% average success rate). The improvement is most significant in the "Long" (complex multi-step) category (+8.2 points).
LIBERO-Plus (OOD Robustness):
- SPR achieves the smallest performance drop (average degradation of 18.8%) compared to OpenVLA-OFT (27.0%) and UniVLA (37.5%).
- It maintains high success rates across language shifts (-12.1%) and lighting changes (-5.6%), suggesting stronger semantic and spatial grounding.
Real-Robot Results:
- Tidy up the Table: SPR achieves 30% success on a complex 3-object task where the baseline (MolmoAct) fails completely (0%).
- Push-T: SPR achieves 40% success on continuous-contact manipulation, while the baseline fails (0%).
- Scaling: As task complexity (object count) increases, SPR degrades gracefully, whereas baseline performance collapses.
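
As a worked check on the LIBERO comparison above, the reported gap is a difference between two success rates, which is not the same as a relative improvement. Using only the two averages quoted (91.8% for SPR, 86.8% for MolmoAct):

$$
\Delta_{\text{abs}} = 91.8 - 86.8 = 5.0 \ \text{percentage points},
\qquad
\Delta_{\text{rel}} = \frac{91.8 - 86.8}{86.8} \approx 5.8\%.
$$

Whether the LIBERO-Plus degradation figures are absolute or relative drops is not stated in this summary, so the same distinction should be kept in mind when reading them.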

Ablation Studies

See-Plan vs. Rewind: Removing the Rewind mechanism reduces performance by ~1%, but removing both the See and Plan capabilities causes a significant drop, indicating that spatial subgoals are the primary driver of robustness.
Rewind Steps ( $N$ ): $N=3$ is empirically optimal. $N<3$ fails to clear obstacles; $N>3$ causes the arm to drift out of the camera's view.
Extended Horizons: SPR continues to improve success rates as episode length increases, whereas baselines plateau, demonstrating effective error recovery over long horizons.

5. Significance and Impact

Paradigm Shift: Moves robotic control from "end-to-end prediction" to "milestone-driven execution," mimicking human cognitive strategies of breaking tasks into verifiable steps.
Robustness without Data Overhead: Shows that robust failure recovery can be achieved by synthesizing data from successful demonstrations, bypassing the expensive and difficult process of collecting failure data.
Real-World Applicability: The framework's success on real robots in unstructured environments (tidying, pushing) suggests it is a viable path toward deploying reliable agents in dynamic, real-world settings where failures are common.
Generalization: The ability to handle unseen instructions, initial states, and object layouts establishes a new state-of-the-art for OOD robustness in VLA models.

In summary, SPR addresses the critical gap in robotic reliability by embedding progress awareness directly into the action generation loop, enabling robots to self-correct and recover from failures autonomously and efficiently.