Imagine you are teaching a robot to cook a complex meal, like making a sandwich, heating soup, and then cleaning up. To do this well, the robot needs to "see" the kitchen, "understand" your voice commands, and "decide" what to do next.
In the world of robotics, these smart robots are called Vision-Language-Action (VLA) models. They are like brilliant chefs who can read a recipe and look at the ingredients. But right now, these chefs have two big problems:
- They have a terrible short-term memory: They tend to forget what happened a few seconds ago. If you ask them to "put the pot on the stove, wait 5 minutes, then take it off," they might forget they already put the pot there and just keep putting it on the stove over and over again.
- They are incredibly slow: Every time they look at the kitchen, they re-analyze everything from scratch—even the parts that haven't changed at all, like the color of the walls or the pattern on the rug. It's like a chef stopping to re-read the entire recipe book every time they pick up a single spoon.
This paper introduces a new solution called SD-VLA (Static-Dynamic Vision-Language-Action). Think of it as giving the robot a "smart filing system" and a "memory trick."
The Big Idea: Separating the "Boring" from the "Busy"
The authors realized that in any scene, most things don't move. The background is Static (still), while the robot's hand or the object it's holding is Dynamic (moving).
Imagine you are watching a movie.
- The Static parts: The scenery, the sky, the furniture. These stay the same for hours.
- The Dynamic parts: The actors moving, the ball flying, the door opening. These change every second.
Current robots treat every single frame of the movie as if it's brand new, re-calculating the sky and the furniture every time. SD-VLA says, "Wait a minute! Why are we re-calculating the sky? Let's just remember it once!"
How SD-VLA Works (The Analogy)
1. The "Smart Filing Cabinet" (Static-Dynamic Disentanglement)
Instead of shoving the whole kitchen scene into the robot's brain every second, SD-VLA splits the image into two piles:
- The "Still" Pile (Static Tokens): The walls, the floor, the stove. The robot only needs to look at this once. It puts this in a special "cached" folder and says, "I know this part; I don't need to re-read it."
- The "Moving" Pile (Dynamic Tokens): The robot arm, the can of soup. The robot re-reads this every single second because it's changing.
The Result: The robot's "brain" (the context window) stays small and fast because it's not wasting space re-reading the walls. This allows it to remember a much longer history of what happened (long-horizon reasoning) without getting overwhelmed.
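To make the "two piles" idea concrete, here is a minimal sketch in Python. It assumes the image has already been encoded into per-patch feature vectors and uses a simple fixed distance threshold to decide which patches changed; in the actual SD-VLA model the disentanglement is learned, so treat the threshold and the `split_tokens` helper as illustrative assumptions, not the paper's method.

```python
import numpy as np

def split_tokens(prev_feats, curr_feats, threshold=0.05):
    """Return a boolean mask marking which patch tokens are 'dynamic'.

    prev_feats, curr_feats: (num_patches, dim) arrays of patch features.
    A patch whose feature barely moved between frames is treated as
    static: its cached representation is reused instead of re-encoded.
    """
    change = np.linalg.norm(curr_feats - prev_feats, axis=1)
    return change > threshold  # True = dynamic, False = static

# Toy scene: 6 patches; only patch 4 (the robot arm) moved.
prev = np.zeros((6, 8))
curr = prev.copy()
curr[4] += 1.0  # the "moving" pile

mask = split_tokens(prev, curr)
static_cache = prev[~mask]  # kept in the "cached" folder, not re-read
dynamic = curr[mask]        # re-encoded every step
print(mask.sum(), "dynamic patch,", (~mask).sum(), "static patches")
```

Only the one changed patch is re-processed each step; the other five live in the cache, which is why the context stays small over a long task.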
2. The "Smart Gatekeeper" (The Recache Gate)
You might ask, "What if the robot moves the stove? Then the 'Static' pile is wrong!"
SD-VLA has a tiny, smart gatekeeper (a learned gate) that watches the scene.
- If the robot moves the stove, the gatekeeper says, "Oh, the background changed! Let's throw away the old 'Static' file and take a new picture."
- If nothing changed, the gatekeeper says, "All good, keep using the old file."
This gatekeeper is learnable, meaning the robot figures out when to refresh its memory on its own, rather than following a rigid, dumb rule.
Why This Matters: The "Speed" and "Memory" Wins
The paper tested this new robot chef against others on a benchmark called LIBERO-Memory, a suite of tasks designed specifically to trip up robots with bad memories.
- The Test: The robot had to heat a can, wait for a specific time, put it back, and then heat a different can.
- The Old Robots: They got confused. They forgot which can they just heated or how long they waited. They failed the test.
- SD-VLA: Because it kept a clean, efficient memory of the "static" room and only focused on the "moving" actions, it remembered the sequence perfectly.
- Success Rate: It improved success rates by nearly 40% compared to previous methods on memory tasks.
- Speed: It ran 2.26 times faster than the standard model. It's like the robot went from walking to jogging.
The Takeaway
Before this paper, making robots that could handle long, complex tasks was like trying to carry a giant, heavy backpack full of useless information (like re-reading the same page of a book 100 times).
SD-VLA is like giving the robot a highlighter and a bookmark. It highlights the parts of the world that never change so it can ignore them later, and it bookmarks the parts that are moving so it can focus on them. This makes the robot faster, smarter, and capable of remembering long stories to get the job done.