Here is an explanation of the paper "Beyond Short-Horizon: VQ-Memory for Robust Long-Horizon Manipulation," translated into simple, everyday language with some creative analogies.
The Big Picture: Why Robots Are Getting Stuck
Imagine you are teaching a robot to open a safe.
- The Old Way: Most robots are trained on simple tasks like "pick up a cup" and "put it on the table." These are like one-step puzzles. The robot looks at the cup, grabs it, and moves. It doesn't need to remember what happened five seconds ago.
- The Real World: Real life is messy. Opening a safe isn't just one step. You might need to:
- Turn a knob.
- Wait for a light to turn green.
- Type a specific code on a keypad.
- Then pull the handle.
If the robot only looks at the camera right now, it gets confused. It sees a handle and thinks, "I should pull it!" But if it doesn't remember that it just turned the knob, it pulls the handle too early, and the safe stays locked. This is called a non-Markovian problem (a fancy way of saying: "You can't solve this just by looking at the present; you need to remember the past").
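The difference between a robot that only sees the present and one that remembers the past can be sketched in a few lines. This toy example is not from the paper; the safe rule, policies, and labels are invented here purely to illustrate the non-Markovian idea:

```python
# Toy non-Markovian task: pulling the handle only works
# if the knob was turned first.
def safe_opens(action_history):
    """The safe opens only for the exact sequence: knob, then handle."""
    return action_history == ["turn_knob", "pull_handle"]

# A memoryless policy sees only the current camera frame
# ("handle visible") and always reacts the same way to it.
def memoryless_policy(observation):
    return "pull_handle"  # looks reasonable, but ignores the past

# A memory-aware policy also consults what it has already done.
def memory_policy(observation, memory):
    if "turn_knob" not in memory:
        return "turn_knob"
    return "pull_handle"

# Roll out both policies for two steps from the same observation.
history = [memoryless_policy("handle visible") for _ in range(2)]
print(safe_opens(history))  # False: it pulls the handle twice

memory, history = [], []
for _ in range(2):
    action = memory_policy("handle visible", memory)
    memory.append(action)
    history.append(action)
print(safe_opens(history))  # True: knob first, then handle
```

Both policies see the identical observation at every step; only the one carrying a memory of its own past actions can pick the right move.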
Part 1: RuleSafe (The New Training Ground)
The authors realized that existing robot training games were too easy. They built a new, harder training ground called RuleSafe.
- The Analogy: Think of previous robot benchmarks as a playground with a slide. You just climb up and slide down. It's fun, but it doesn't teach you how to navigate a maze.
- RuleSafe is a complex escape room. It features safes with different locks:
- Key locks: You have to find the right key and turn it.
- Password locks: You have to press buttons in a specific order (like 1-2-3).
- Logic locks: You have to do things based on rules (e.g., "Turn the knob twice, but only if the handle is down").
To make thousands of these puzzles without human labor, they used a large language model (LLM) to invent the rules. This means the robot has to learn to solve puzzles it has never seen before, requiring it to plan ahead and remember its steps.
Part 2: The Problem with Robot "Memory"
When robots try to solve these escape rooms, they usually fail for two reasons:
- They forget: They only look at the current camera frame.
- They get overwhelmed: If you tell a robot, "Remember every single angle of your arm joints from the last 10 minutes," it gets confused by the noise. It's like trying to remember every single word of a conversation you had last week, including the background noise and your own breathing. You get lost in the details and miss the main point.
Part 3: VQ-Memory (The Robot's "Sticky Note")
This is the paper's main invention: VQ-Memory.
- The Analogy: Imagine you are writing a story.
- Raw Data: Writing down every single letter, punctuation mark, and typo from your draft. (Too much info, hard to read).
- VQ-Memory: Instead of writing the whole draft, you summarize the story into 4 main sticky notes: "The Hero Arrives," "The Villain Appears," "The Fight," "The Victory."
How it works:
- Compression: The system takes the robot's messy, continuous history of arm movements and squashes them into discrete tokens (like the sticky notes).
- Filtering: It throws away the "noise" (tiny, unimportant wobbles in the arm) and keeps the "big picture" (e.g., "I just finished turning the knob").
- Efficiency: Instead of the robot replaying a 10-minute video of its past, the system feeds it just a short list of 4 or 5 memory tokens: "Knob Turned," "Handle Pulled."
This allows the robot to say, "Ah, I see the handle. But my memory says I just turned the knob, so I know I'm in the 'Unlocking' phase, not the 'Opening' phase."
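The compression step above is, at its core, vector quantization: snap each continuous state to the nearest entry in a small learned "codebook," then keep only the sequence of entry labels. Here is a minimal sketch of that idea; the codebook values, labels, and trajectory are made up for illustration, not taken from the paper:

```python
import numpy as np

# Hypothetical codebook: each row is a learned "prototype" arm state,
# paired with a human-readable label for this illustration.
codebook = np.array([
    [0.0, 0.0],   # "idle"
    [1.0, 0.2],   # "knob turned"
    [0.1, 1.0],   # "handle pulled"
])
labels = ["idle", "knob turned", "handle pulled"]

def quantize(state):
    """Snap a continuous state to the index of its nearest codebook entry."""
    dists = np.linalg.norm(codebook - state, axis=1)
    return int(np.argmin(dists))

# A noisy trajectory: tiny wobbles around a few distinct poses.
trajectory = np.array([
    [0.02, -0.01], [0.01, 0.03],   # wobbling near "idle"
    [0.98, 0.22], [1.03, 0.18],    # wobbling near "knob turned"
    [0.12, 0.97],                  # near "handle pulled"
])

tokens = [quantize(s) for s in trajectory]
# Collapse consecutive repeats: the "sticky note" summary of the history.
summary = [labels[t] for i, t in enumerate(tokens)
           if i == 0 or t != tokens[i - 1]]
print(summary)  # ['idle', 'knob turned', 'handle pulled']
```

Note how the wobbles vanish: five noisy measurements collapse into three clean tokens, which is exactly the filtering behavior described above.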
The Results: Why It Matters
The authors tested this on several top-tier robot AI models.
- Without VQ-Memory: The robots were like amnesiacs. They could solve simple tasks but failed miserably at the complex, multi-step safe puzzles.
- With VQ-Memory: The robots became strategic thinkers. They could remember the sequence of events, ignore the tiny jitters in their movements, and successfully solve the complex puzzles.
The Bottom Line
This paper gives robots a better way to "remember" what they just did. By turning messy movement data into clean, simple "memory tokens," robots can finally handle long, complicated tasks that require planning and patience, moving us one step closer to robots that can actually help us in our messy, real-world homes.