SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation

Imagine you are teaching a robot to perform a complex task, like making a sandwich or organizing a messy desk. You show it a video of a human doing it perfectly. This is called Imitation Learning.

For a long time, the best way to teach robots was to show them a short "clip" of what happened just a second ago. But here's the problem: if the task is long and complicated (a "long horizon" task), the robot gets confused. It forgets what it did five minutes ago, or it gets stuck because it thinks it's back at the beginning.

This paper introduces a new robot brain called SeedPolicy. Think of it as giving the robot a "super-memory" and a "smart filter" so it can handle long, complicated jobs without getting lost.

Here is how it works, broken down with simple analogies:

1. The Problem: The "Goldfish" Robot

Imagine a robot that only remembers the last 3 seconds of what it sees.

The Issue: If you ask it to "put the red block in the box, then the blue block," it might put the red block in, forget it did that, and then try to put the red block in again. Or, if the camera shakes or the background changes, the robot panics because it thinks the world has changed completely.
The Result: The longer the task, the worse the robot gets. It's like trying to solve a 1,000-piece puzzle while only being allowed to look at the last three pieces you picked up.

2. The Solution: SeedPolicy's "Smart Notebook"

The authors created a system called SEGA (Self-Evolving Gated Attention). Let's break that down into two parts:

A. The "Living Notebook" (Self-Evolving Latent State)

Instead of just looking at the last few frames, SeedPolicy keeps a living summary of everything that has happened so far.

Analogy: Imagine a detective solving a crime. A bad detective only looks at the crime scene right now. A good detective keeps a notebook where they write down every clue, every suspect they met, and every theory they had. Even if the crime scene changes, the detective can look at their notebook and remember, "Oh right, I already checked the red door."
How it helps: SeedPolicy compresses the history of what the robot has seen into a small, fixed-size "notebook" (a latent state). This allows the robot to remember the whole story of the task, not just the last few seconds.

B. The "Smart Filter" (Gated Attention)

Now, imagine that notebook is getting messy. It has scribbles about the weather, the color of the walls, and a bird flying by. These are distractions.

The Issue: If the robot tries to remember everything, it gets overwhelmed by noise (like a background moving or a shadow).
The Fix: SeedPolicy has a Smart Filter (the "Gate").
Analogy: Think of a bouncer at an exclusive club. The bouncer looks at every piece of information coming in. "Is this important? Did the robot move the cup? Yes, let it in. Is this just a shadow on the wall? No, keep it out."
How it works: The robot uses its own attention mechanism to decide what is "important" and what is "noise." It turns down the distractions before they are written into its memory, keeping the notebook clean and focused only on the task.

3. Why This is a Big Deal

Scaling Up: Previous robots got worse when they were given a longer history to look at. SeedPolicy gets better with a longer history, and its lead over older methods grows as tasks get longer, because its "notebook" gets more useful.
Efficiency: There are other massive AI models (like RDT) that are huge and expensive, like a supercomputer in a backpack. SeedPolicy is like a smartphone: it's much smaller, cheaper, and faster, but it does the job just as well (or better) for these specific robot tasks.
Real-World Success: They tested this on a real robot arm. When the robot had to do a loop (pick up a block, put it down, pick it up again), old robots got stuck in a loop of confusion. SeedPolicy remembered, "I already did that part," and kept moving forward.

The Bottom Line

SeedPolicy is like upgrading a robot from having a short-term memory and a cluttered mind to having a photographic memory with a personal assistant who filters out the noise. It allows robots to finally tackle long, complex, multi-step jobs without getting confused or stuck, all while running on hardware that isn't too expensive or heavy.

Here is a detailed technical summary of the paper "SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation."

1. Problem Statement

The paper addresses a critical limitation in Imitation Learning (IL) for robotic manipulation, specifically within Diffusion Policies (DP). While standard Diffusion Policies excel at capturing multi-modal expert behaviors, they suffer from performance degradation as the observation horizon increases.

The Paradox: Contrary to intuition, increasing the number of observation frames (the "horizon") often leads to a drop in success rates for baseline Diffusion Policies.
Root Causes:
1. Inadequate Temporal Modeling: Treating observations merely as a stack of image frames fails to capture complex long-term temporal dependencies.
2. Computational Bottleneck: Standard attention mechanisms scale quadratically with the sequence length, making long-horizon modeling computationally prohibitive for real-time edge devices.
3. Temporal Sparsity & Noise: Not every frame contributes useful information. Irrelevant background shifts, occlusions, and static scenes introduce noise that pollutes the policy's understanding of the task state, leading to "state aliasing" (where the robot confuses a later state with an earlier one) and execution stagnation.
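State aliasing can be illustrated with a toy example. The sketch below is not from the paper: the observation labels, the action names, and the `aliased_windows` helper are hypothetical, and it only shows the aliasing effect, not the noise problem. It checks whether a window of the last `k` observations ever maps to more than one expert action in a looping pick-and-place demonstration.

```python
from collections import defaultdict


def aliased_windows(observations, actions, k):
    """Return the observation windows of length <= k that map to several expert actions."""
    seen = defaultdict(set)
    for t, action in enumerate(actions):
        # The policy only sees the last k observations up to and including step t.
        window = tuple(observations[max(0, t - k + 1): t + 1])
        seen[window].add(action)
    return {w: a for w, a in seen.items() if len(a) > 1}


# Hypothetical looping task: pick and place the same block twice, then stop.
observations = ["on_table", "in_hand", "on_table", "in_hand", "on_table"]
actions = ["pick", "place", "pick", "place", "stop"]

short = aliased_windows(observations, actions, k=1)
full = aliased_windows(observations, actions, k=len(observations))
print(short)  # the single frame "on_table" is ambiguous between "pick" and "stop"
print(full)   # with the full history, every window maps to exactly one action
```

With `k=1` the policy cannot tell the first visit to `on_table` from the last one, which is the kind of confusion that leads to repeated actions and stagnation.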

2. Methodology: SeedPolicy

The authors propose SeedPolicy, a framework that integrates a novel temporal module called Self-Evolving Gated Attention (SEGA) into the Diffusion Policy architecture.

A. Core Component: Self-Evolving Gated Attention (SEGA)

SEGA is designed to maintain a compact, time-evolving latent state that summarizes historical context while filtering noise. It operates via a dual-stream Transformer design:

State Update Stream (Upper Stream):
- Mechanism: Uses Cross-Attention to extract relevant semantic information from the current observation ( $O_t$ ) to update the historical latent state ( $S_{t-1}$ ).
- Self-Evolving Gate (SEG): Instead of indiscriminately integrating new data, the module computes a global relevance score based on the raw cross-attention maps. A sigmoid gate ( $G_t$ ) dynamically suppresses noisy or irrelevant signals (e.g., background shifts) and modulates the state update.
- Formula: $S_t = G_t \odot S_t^{\text{inter}} + (1 - G_t) \odot S_{t-1}$ , where $S_t^{\text{inter}}$ denotes the intermediate state produced by the cross-attention step.
- Benefit: Ensures the latent state only evolves with semantically relevant information, preventing "pollution" over long horizons.
State Retrieval Stream (Lower Stream):
- Mechanism: Uses the current observation as a Query to retrieve relevant temporal cues from the updated historical state ( $S'_{t-1}$ ).
- Benefit: Generates Enhanced Observation Features ( $EObs_t$ ) that are enriched with long-term context, allowing the policy to "remember" past states even when the current visual input is ambiguous.
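The two streams can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses single-head attention without learned projections, takes the mean of the raw attention scores as the global relevance score, uses scalar `gate_weight` and `gate_bias` as hypothetical stand-ins for learned gate parameters, retrieves from the newly updated state, and adds the retrieved context to the observation tokens residually.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cross_attention(queries, keys_values):
    """Single-head cross-attention; returns the output and the raw score map."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)  # shape (n_queries, n_keys)
    return softmax(scores, axis=-1) @ keys_values, scores


def sega_step(prev_state, obs_tokens, gate_weight=1.0, gate_bias=0.0):
    # State update stream: state slots query the current observation tokens.
    inter_state, scores = cross_attention(prev_state, obs_tokens)
    # Gate: one global relevance score from the raw attention map (assumed: its mean).
    gate = float(sigmoid(gate_weight * scores.mean() + gate_bias))
    # Gated blend of the intermediate state and the previous state.
    new_state = gate * inter_state + (1.0 - gate) * prev_state
    # State retrieval stream: observation tokens query the state.
    retrieved, _ = cross_attention(obs_tokens, new_state)
    enhanced_obs = obs_tokens + retrieved  # assumed residual combination
    return new_state, enhanced_obs, gate


rng = np.random.default_rng(0)
state = rng.normal(size=(4, 8))       # 4 state slots, feature size 8
for _ in range(10):                   # 10 frames of history
    obs = rng.normal(size=(6, 8))     # 6 observation tokens per frame
    state, enhanced, gate = sega_step(state, obs)
print(state.shape, enhanced.shape)
```

However many frames are processed, the state keeps the same shape, which is what lets the history grow without the memory growing with it.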

B. Integration with Diffusion Policy

The enhanced features ( $EObs_t$ ) are fed into a standard Diffusion Action Expert (a Transformer-based denoising model) to predict a sequence of future actions. This allows the system to scale the observation horizon without incurring the quadratic computational cost of full sequence attention, as the history is compressed into a fixed-size latent state.
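The cost argument can be made concrete with a rough count of query-key pairs. The sketch below is a back-of-the-envelope estimate, not a measurement from the paper: the token and slot counts are hypothetical, and it assumes the latent-state design runs one update pass and one retrieval pass per frame.

```python
def full_attention_pairs(num_frames, tokens_per_frame):
    """Query-key pairs when every token attends to every token in the window."""
    n = num_frames * tokens_per_frame
    return n * n


def latent_state_pairs(num_frames, tokens_per_frame, state_slots):
    """Query-key pairs when each frame only exchanges information with a fixed-size state."""
    per_frame = 2 * tokens_per_frame * state_slots  # update pass + retrieval pass
    return num_frames * per_frame


# Hypothetical sizes: 64 tokens per frame, 16 state slots.
for frames in (2, 4, 8, 16):
    print(frames, full_attention_pairs(frames, 64), latent_state_pairs(frames, 64, 16))
```

Doubling the horizon quadruples the first count but only doubles the second, and the work for each new frame stays constant because earlier frames are already folded into the state.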

3. Key Contributions

SEGA Module: The introduction of a temporal module that synergizes attention with a dynamic gating mechanism. It maintains a compact, evolving latent state that captures long-term dependencies while filtering temporal disturbances.
Horizon Scaling: SeedPolicy reverses the performance degradation trend seen in prior Diffusion Policies. It demonstrates that longer observation windows lead to measurable performance gains rather than degradation.
Efficiency vs. Performance: The method achieves state-of-the-art results with far fewer parameters (33M–147M) than large Vision-Language-Action (VLA) foundation models (e.g., RDT with 1.2B parameters), making it suitable for resource-constrained robotic systems.
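The size gap follows directly from the parameter counts quoted above. The short calculation below uses only those figures; the variable names are illustrative.

```python
rdt_params = 1.2e9  # RDT, as quoted above
seed_params = {"smallest": 33e6, "largest": 147e6}  # SeedPolicy range, as quoted above

ratios = {name: rdt_params / count for name, count in seed_params.items()}
for name, ratio in ratios.items():
    print(f"{name} SeedPolicy variant: about {ratio:.1f}x fewer parameters than RDT")
```

This works out to roughly 36x for the smallest variant and roughly 8x for the largest.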

4. Experimental Results

The method was evaluated on the RoboTwin 2.0 benchmark (50 manipulation tasks) and real-world robot experiments (Dexmal Dos W1).

Performance Gains:
- Clean Settings: Achieved a 36.8% relative improvement over standard Diffusion Policy (DP).
- Challenging/Randomized Settings: Achieved a massive 169% relative improvement over DP.
- Long-Horizon Tasks: The performance gap between SeedPolicy and baselines widens significantly as task length increases. For long-length tasks, SeedPolicy improved success rates by 21.9% (CNN backbone) and 16.0% (Transformer backbone) over the baseline.
Robustness:
- SeedPolicy successfully resolved execution stagnation (robots getting stuck in loops due to state aliasing) and spatial precision errors (air grabs/collisions due to lack of depth info) that plagued baseline models.
- It maintained high success rates in "Hard" settings where baseline policies often collapsed to near 0% success.
Ablation Studies:
- Removing the gating mechanism (using only State updates) reduced performance, confirming the necessity of filtering noise.
- Using Cross-Attention maps for gating proved superior to standard MLP-based gating, especially for long horizons.
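The relative improvements listed under Performance Gains are ratios, not percentage-point differences, which matters when the baseline is low. The sketch below shows the arithmetic. The 20% baseline success rate is hypothetical and chosen only for illustration; it is not a figure from the paper.

```python
def relative_improvement(baseline, new):
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


def implied_multiplier(relative_gain_percent):
    """Factor by which the baseline is multiplied for a given relative gain."""
    return 1.0 + relative_gain_percent / 100.0


print(implied_multiplier(36.8))   # clean settings: about 1.37x the baseline
print(implied_multiplier(169.0))  # randomized settings: about 2.69x the baseline

# Hypothetical baseline of 20% success, for illustration only.
hypothetical_baseline = 0.20
hypothetical_new = hypothetical_baseline * implied_multiplier(169.0)
print(round(hypothetical_new, 3))
```

A 169% relative improvement therefore means a success rate about 2.7 times the baseline, not a gain of 169 percentage points.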

5. Significance

Solving the Horizon Bottleneck: SeedPolicy provides a scalable solution for long-horizon robotic manipulation, a domain where previous diffusion-based methods failed due to temporal modeling limitations.
Efficiency: It demonstrates that high-performance manipulation does not strictly require billion-parameter foundation models. By optimizing the temporal architecture, SeedPolicy achieves competitive or superior performance with roughly 8x to 36x fewer parameters (33M–147M versus 1.2B for RDT).
Generalization: The ability to filter visual noise and maintain a coherent latent state allows the robot to generalize better to randomized environments and complex, multi-stage tasks (e.g., sequential picking, looping retrieval) without explicit depth sensors.

In conclusion, SeedPolicy establishes a new state-of-the-art for imitation learning in robotics by effectively bridging the gap between short-term visual perception and long-term task execution through efficient, self-evolving temporal modeling.