Critic in the Loop: A Tri-System VLA Framework for Robust Long-Horizon Manipulation

Imagine you are trying to clean up a very messy room. You have two very different parts of your brain working together:

The Planner (The "Brain"): This is the part that thinks, "Okay, first I need to pick up the trash bag, then open it, then put the cups inside." It's smart and understands complex instructions, but it's slow. It takes time to think.
The Doer (The "Cerebellum"): This is the part that actually moves your hands. It's incredibly fast and reactive. If you reach for a cup and it slips, your hands instantly adjust to catch it. But on its own, it doesn't really understand why it's picking up the cup or what to do next if things go wrong.

The Problem with Current Robots
Most robots today are stuck in a bad relationship between these two parts.

Some robots rely too much on the Planner. They think about every single movement before making it. This makes them slow and clumsy; by the time they decide to move, the object has already fallen.
Other robots rely too much on the Doer. They are fast, but if something unexpected happens (like a cup falling over), they just keep trying the same thing over and over again until they get stuck in a loop, unable to figure out a new plan.

The Solution: "Critic in the Loop"
The paper introduces a new robot architecture called Tri-System VLA. Think of it as adding a third character to the team: The Critic.

Here is how the three work together, using a simple analogy:

1. The Brain (System Two) – The Project Manager

Role: High-level planning.
Analogy: Imagine a project manager sitting in an office. They look at the big picture and say, "We need to stack these plates." They don't touch the plates; they just give the instruction.
In the Robot: This is a powerful AI that understands language and complex goals. Because it's slow, we only ask it to speak when absolutely necessary.

2. The Cerebellum (System One) – The Athlete

Role: Fast, physical execution.
Analogy: Imagine a professional athlete. Once the Project Manager says "Pick up the plate," the Athlete grabs it and moves it instantly. They don't stop to think about the physics; they just do it.
In the Robot: This is a fast AI that controls the robot's motors. It runs at high speed (20 times a second) to keep the robot moving smoothly.

3. The Critic (System Three) – The Referee / Spotter

Role: Watching, judging, and deciding when to switch.
Analogy: Imagine a referee or a spotter at a gym. They aren't doing the work, and they aren't making the strategy. They are just watching closely.
- If the Athlete is doing well, the Referee stays quiet and lets them keep going.
- If the Athlete starts to slip, or if they get stuck trying to lift a heavy box that won't move, the Referee blows a whistle!
- The Referee says, "Stop! Something is wrong. Go talk to the Project Manager for a new plan."
In the Robot: This is a lightweight AI that constantly watches the camera feed. It checks: "Is the robot making progress? Did the cup fall? Is the robot stuck in a loop?"

How It Saves the Day: The "Stuck" Scenario

Let's say the robot is trying to pick up a cup, but it's in a weird spot (a scenario the robot has never seen before).

Old Robot: It tries to grab the cup with its right hand. It fails. It tries again. It fails again. It gets stuck in an infinite loop of failure because it doesn't know to stop and think.
Tri-System Robot:
1. The Athlete tries to grab the cup.
2. The Referee watches and sees, "Hey, you've been trying this for 5 seconds and nothing is happening. You're stuck."
3. The Referee blows the whistle (sends a signal).
4. The Athlete freezes.
5. The Project Manager is woken up. The Referee reports, "The right arm is stuck and making no progress."
6. The Project Manager thinks for a second and says, "Switch to the left arm."
7. The Athlete switches hands and successfully grabs the cup.

Why This Is a Big Deal

Speed: The robot doesn't waste time asking the slow "Project Manager" for instructions on every tiny movement. It only asks when things go wrong or a step is finished.
Resilience: If the robot drops something or gets stuck, the "Referee" notices immediately and calls for a new plan, rather than letting the robot spin in circles forever.
Human-Like Rules: The system can be programmed with simple human logic, like "If you've been stuck for 10 seconds, reset your position." This helps the robot handle surprises without needing to be trained on every single possible disaster.

In Summary:
This paper teaches robots how to balance thinking and doing by adding a watchful referee. The referee lets the robot move fast when things are going well, and recognizes when it should stop, call for help, and try a new strategy when things go wrong. In the paper's experiments, this made the robot better at recovering from surprises than the single-system and dual-system baselines it was compared against.

Technical Summary of "Critic in the Loop: A Tri-System VLA Framework for Robust Long-Horizon Manipulation"

1. Problem Statement

Robotic manipulation faces a fundamental trade-off between high-level semantic reasoning and low-level reactive control:

Vision-Language Models (VLMs) excel at complex planning and reasoning but suffer from high inference latency, making them unsuitable for real-time, high-frequency control.
Vision-Language-Action (VLA) models are fast and reactive but often lack the semantic depth required for complex, long-horizon tasks, leading to failures in out-of-distribution (OOD) scenarios or when anomalies occur.
Existing Dual-System Approaches (combining a slow planner and a fast executor) often rely on rigid, fixed-frequency switching or heuristics. This results in wasted computation during smooth execution and sluggish reaction to disturbances. Furthermore, handling rare failures typically requires expensive, task-specific data collection, limiting scalability.

The core challenge is to create an architecture that dynamically decides when to think (replan) and when to act, preserving mutual awareness between planning and execution while minimizing computational overhead.

2. Methodology: The Tri-System Architecture

The authors propose a Tri-System VLA framework driven by an adaptive, event-driven scheduling mechanism. The system decouples cognitive reasoning from continuous control via three distinct components:

System 1: The Cerebellum (Reactive Execution)

Role: High-frequency, closed-loop control.
Mechanism: A flow-matching based action expert that generates continuous action chunks.
Operation: It operates at ~20 Hz, translating semantic subtasks into precise motor commands. It bypasses the autoregressive bottleneck of VLMs, ensuring real-time responsiveness.
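
As a rough illustration of how a flow-matching action expert can turn noise into an action chunk, the sketch below integrates a velocity field with Euler steps. The velocity field here is a hand-written toy stand-in that points straight at a fixed target chunk; in a real system it would be a learned network conditioned on observations and the current subtask. The names `toy_velocity` and `sample_action_chunk`, the chunk shape, the time convention (noise at t=0, actions at t=1), and the step count are all assumptions made for illustration, not details taken from the paper.

```python
import random

def toy_velocity(actions, t, target):
    """Toy stand-in for a learned velocity network (assumption).

    Uses the straight-line conditional velocity (target - a) / (1 - t),
    which moves every value toward `target` along a straight path.
    """
    return [[(tg - a) / (1.0 - t) for a, tg in zip(row, trow)]
            for row, trow in zip(actions, target)]

def sample_action_chunk(target, num_steps=10, seed=0):
    """Integrate da/dt = v(a, t) from t=0 (noise) to t=1 (actions) with Euler steps."""
    rng = random.Random(seed)
    # Start from Gaussian noise with the chunk's shape (horizon x action_dim).
    actions = [[rng.gauss(0.0, 1.0) for _ in row] for row in target]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so (1 - t) is never zero
        vel = toy_velocity(actions, t, target)
        actions = [[a + dt * v for a, v in zip(row, vrow)]
                   for row, vrow in zip(actions, vel)]
    return actions

# Hypothetical chunk: 3 future timesteps of a 2-DoF action.
target_chunk = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
chunk = sample_action_chunk(target_chunk)
```

Because the whole chunk is produced by a handful of integration steps rather than token-by-token decoding, this style of sampler is one way to avoid the autoregressive bottleneck mentioned above.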

System 2: The Brain (Global Reasoning)

Role: Low-frequency, high-level semantic planning.
Mechanism: A pre-trained VLM (e.g., PaliGemma) that generates semantic subgoals.
Operation: It remains dormant during normal execution. It is only triggered when a subtask is completed, an anomaly is detected, or stagnation occurs. It ingests the global instruction, visual observations, and short-term memory context to generate the next subtask (e.g., "pick and place the blue cup").

System 3: The Critic (State Evaluation & Scheduling)

Role: The "Critic in the Loop" that monitors execution and routes control authority.
Mechanism: A lightweight, visually-grounded model (e.g., Florence-2) formulated as a Visual Question Answering (VQA) task.
Functionality:
1. Progress Tracking: Estimates a normalized progress value ( $V_t \in [-1.0, 0.0]$ ) for the current subtask using Monte Carlo estimation.
2. Anomaly Detection: Detects physical failures (e.g., dropped objects) by emitting a discrete high-priority token <aci> (Accident) during critical windows.
3. Stagnation Detection: Tracks whether progress has stalled for a set number of frames ( $N_{stag}$ ).
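
One plausible reading of the Monte Carlo progress value is a normalized "steps remaining" target computed from successful demonstrations, rising from -1.0 at the start of a subtask to 0.0 at completion. The sketch below shows that labeling scheme together with a simple stagnation check over a window of $N_{stag}$ frames. The exact target formula, the helper names `progress_targets` and `is_stagnant`, and the tolerance `eps` are assumptions for illustration; the summary does not specify them.

```python
def progress_targets(num_frames):
    """Monte Carlo style targets for one successful subtask demo (assumed form).

    Frame t gets -(remaining steps) / (total steps), so values rise
    from -1.0 at the first frame to 0.0 at the last frame.
    """
    if num_frames < 2:
        return [0.0] * num_frames
    last = num_frames - 1
    return [-(last - t) / last for t in range(num_frames)]

def is_stagnant(values, n_stag, eps=1e-3):
    """True if the last n_stag values never beat the earlier best by more than eps."""
    if len(values) <= n_stag:
        return False  # not enough history to judge
    best_before = max(values[:-n_stag])
    best_recent = max(values[-n_stag:])
    return best_recent <= best_before + eps

targets = progress_targets(5)

# Hypothetical value traces from the Critic: one stalled, one progressing.
stalled = [-1.0, -0.8, -0.6, -0.6, -0.6, -0.6]
moving = [-1.0, -0.8, -0.6, -0.4, -0.2, 0.0]
```

With these definitions, `is_stagnant(stalled, 3)` flags the first trace while `is_stagnant(moving, 3)` does not.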
Scheduling Logic: The Critic asynchronously evaluates the environment. It triggers System 2 (the Brain) only if one of the following holds:
- An anomaly (<aci>) is detected.
- Subtask completion is confirmed ( $V_t > \tau_{succ}$ ).
- Execution stagnation is detected (triggering a heuristic-based state reset).
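
The three trigger conditions above can be sketched as a small event-driven scheduler. This is a minimal sketch under stated assumptions, not the authors' implementation: the class and field names (`Scheduler`, `CriticOutput`, `Event`), the default thresholds, the priority order (anomaly first), and the "no improvement beyond `eps`" stagnation rule are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class Event(Enum):
    CONTINUE = "continue"          # System 1 keeps running, System 2 stays dormant
    ANOMALY = "anomaly"            # Critic emitted the <aci> token
    SUBTASK_DONE = "subtask_done"  # V_t exceeded tau_succ
    STAGNATION = "stagnation"      # no progress for n_stag frames

@dataclass
class CriticOutput:
    value: float     # progress estimate V_t in [-1.0, 0.0]
    accident: bool   # True if the Critic emitted <aci>

class Scheduler:
    def __init__(self, tau_succ=-0.05, n_stag=30, eps=1e-3):
        self.tau_succ = tau_succ  # assumed success threshold
        self.n_stag = n_stag      # assumed stagnation window, in frames
        self.eps = eps            # minimum improvement that counts as progress
        self._reset()

    def _reset(self):
        self.best_value = -1.0
        self.frames_without_progress = 0

    def step(self, out):
        """Decide, for one Critic reading, whether System 2 must be woken."""
        if out.accident:                      # anomalies take priority
            self._reset()
            return Event.ANOMALY
        if out.value > self.tau_succ:         # subtask completion confirmed
            self._reset()
            return Event.SUBTASK_DONE
        if out.value > self.best_value + self.eps:
            self.best_value = out.value
            self.frames_without_progress = 0
        else:
            self.frames_without_progress += 1
        if self.frames_without_progress >= self.n_stag:
            self._reset()                     # caller would apply the state-reset heuristic
            return Event.STAGNATION
        return Event.CONTINUE

scheduler = Scheduler(n_stag=3)
readings = [CriticOutput(-0.9, False)] * 4 + [CriticOutput(-0.5, True), CriticOutput(-0.01, False)]
events = [scheduler.step(r) for r in readings]
```

In this sketch every event other than `Event.CONTINUE` is a reason to hand control to System 2; on `Event.CONTINUE` the expensive VLM is never queried.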

Automated Subtask Annotation Pipeline

To address the lack of dense semantic annotations for long-horizon tasks, the authors developed an automated pipeline:

Kinematic Heuristics: Uses the Ramer-Douglas-Peucker (RDP) algorithm on end-effector trajectories to identify geometric waypoints.
VLM Retrieval: Pairs these waypoints with visual frames and uses a VLM to retrieve precise semantic labels (e.g., "Pick and place the pink plate").
Result: Generates high-quality, temporally segmented subtask data without manual labeling, enabling robust training from diverse datasets.
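
To make the kinematic step concrete, the following is a minimal pure-Python sketch of the Ramer-Douglas-Peucker algorithm applied to a 3D end-effector path. It returns the indices of the retained waypoints, which could then serve as candidate subtask boundaries to pair with camera frames. The function names, the tolerance `epsilon`, and the toy trajectory are assumptions for illustration; the VLM retrieval step is not shown.

```python
import math

def _point_line_distance(p, a, b):
    """Distance from point p to the line through a and b (any dimension)."""
    ab = [bc - ac for ac, bc in zip(a, b)]
    ap = [pc - ac for ac, pc in zip(a, p)]
    ab_len_sq = sum(c * c for c in ab)
    if ab_len_sq == 0.0:
        # Degenerate case: a and b coincide, so measure distance to that point.
        return math.sqrt(sum(c * c for c in ap))
    t = sum(x * y for x, y in zip(ap, ab)) / ab_len_sq
    closest = [ac + t * c for ac, c in zip(a, ab)]
    return math.dist(p, closest)

def rdp_indices(points, epsilon):
    """Return sorted indices of the points kept by Ramer-Douglas-Peucker."""
    if len(points) < 3:
        return list(range(len(points)))
    keep = {0, len(points) - 1}
    stack = [(0, len(points) - 1)]  # iterative form avoids deep recursion
    while stack:
        start, end = stack.pop()
        max_dist, max_idx = 0.0, None
        for i in range(start + 1, end):
            d = _point_line_distance(points[i], points[start], points[end])
            if d > max_dist:
                max_dist, max_idx = d, i
        # Keep the farthest point only if it deviates more than epsilon,
        # then simplify the two halves on either side of it.
        if max_idx is not None and max_dist > epsilon:
            keep.add(max_idx)
            stack.append((start, max_idx))
            stack.append((max_idx, end))
    return sorted(keep)

# Toy L-shaped end-effector path: move along x, then along y.
trajectory = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0), (2, 2, 0)]
waypoints = rdp_indices(trajectory, epsilon=0.1)
```

On the toy path, the corner of the "L" is the only interior point that survives, which matches the intuition that a change in motion direction often marks the boundary between two subtasks.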

3. Key Contributions

Adaptive Cognitive Switching: Introduces a critic-guided asynchronous scheduling mechanism that dynamically invokes high-level reasoning only when necessary, drastically improving computational efficiency and physical responsiveness compared to fixed-frequency baselines.
Proactive Anomaly Detection & Recovery: Integrates a "Critic" that combines data-driven value estimation with human-inspired heuristic rules (e.g., state resets for stagnation). This allows the system to break infinite retry loops and recover from OOD scenarios without exhaustive emergency data.
Scalable Data Pipeline: Develops an automated subtask extraction tool that eliminates the manual data bottleneck, facilitating robust long-horizon training from raw teleoperation data.

4. Experimental Results

The framework was evaluated on a dual-arm Cobot Magic ALOHA platform with two complex tasks: Arrange the Tableware and Tidy up the Desk.

Baselines: Compared against a Single-System $\pi_{0.5}$ (direct action prediction) and a Dual-System $\pi_{0.5}$ (fixed subtask generation).
Performance:
- OOD Generalization: The Tri-System achieved a 70% success rate on the "Left Cup" scenario (grasping the cup with the left arm, a combination not seen in training), whereas the Single-System failed completely (0%) due to overfitting to the right arm.
- Robustness: In the "Fallen" scenario (cup knocked over), the Tri-System successfully recovered, while baselines failed to re-prioritize the task.
- Deformable Objects: The Tri-System successfully handled the "Tidy up the Desk" task involving plastic bags, whereas the Dual-System frequently oscillated between subtasks and stalled due to continuous visual state changes.
Ablation Insights:
- The system's ability to generalize to the left arm was enabled by training System 1 on left-arm data for other objects (bowls), suggesting shared representation learning.
- The "Stagnation" heuristic was critical; without the Critic detecting the deadlock and triggering a state reset, the robot would remain trapped in a kinematic loop.

5. Significance

This work moves Embodied AI beyond static dual-system pipelines toward a dynamic, event-driven architecture.

Efficiency: It minimizes expensive VLM queries, making long-horizon autonomy computationally feasible.
Robustness: By explicitly modeling "when to think," the system handles real-world unpredictability (anomalies, OOD objects) more effectively than purely reactive or rigidly hierarchical models.
Scalability: The automated annotation pipeline offers a pathway to scale VLA training to massive, diverse datasets without prohibitive human labeling costs.

The paper concludes that the Tri-System VLA effectively bridges the gap between deliberation and control, offering a blueprint for robust, autonomous robots capable of operating in complex, unstructured environments.