Imagine you are trying to teach a robot to pick up a black bowl from a drawer and put it on a plate. To do this, the robot uses a "brain" (a Vision-Language-Action model) that looks at the room through cameras, reads your instructions, and decides what to do.
The problem is that this brain is too thorough. Every time it looks at the room, it breaks the image into hundreds of tiny puzzle pieces (called "tokens"). It tries to analyze every single piece with the same intense focus, whether it's the bowl you want to grab or the distant wall behind it. This takes a lot of time, making the robot slow, hesitant, and sometimes too late to catch a falling object.
DepthCache is a new trick that makes this robot brain faster without making it "dumb." Here is how it works, using some everyday analogies:
1. The "Foveal Vision" Trick (Like Human Eyes)
Have you ever noticed how when you reach for a cup, your eyes focus sharply on the cup, but the background is a bit blurry? You don't need to see the wallpaper in high definition to grab the cup.
- The Old Way: Previous methods tried to speed things up by just throwing away random pieces of the image or treating the whole room the same. This is like squinting at the whole room equally; you might miss the cup or lose track of where the table is.
- The DepthCache Way: This system uses a depth map (a sensor that knows how far away things are) as a guide. It says: "Hey, the bowl is close (near-field), so let's keep that in high definition. The wall is far away (distant background), so let's blur that out a bit."
- Analogy: Imagine you are packing a suitcase. You carefully fold your expensive jewelry (the near-field objects) and keep them safe. But for the old t-shirts in the back of the closet (the distant background), you just stuff them in loosely. You save space (computation) without losing the important stuff.
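The "keep the near stuff sharp, squash the far stuff" idea can be sketched in a few lines of code. This is a minimal illustration, not the paper's actual algorithm: the threshold, merge factor, and function name are all made-up values for demonstration.

```python
import numpy as np

def depth_guided_keep(tokens, depths, near_thresh=1.0, far_merge=4):
    """Keep near-field tokens verbatim; average-merge distant ones.

    tokens: (N, D) array of image-patch embeddings
    depths: (N,) per-patch distance in meters (from the depth sensor)
    near_thresh and far_merge are illustrative numbers, not from the paper.
    """
    near_mask = depths < near_thresh
    near = tokens[near_mask]            # the bowl: kept in "high definition"
    far = tokens[~near_mask]            # the wall: compressed
    if len(far) == 0:
        return near
    # Merge roughly every `far_merge` distant tokens into one by averaging.
    n_groups = max(1, len(far) // far_merge)
    merged = np.stack([g.mean(axis=0) for g in np.array_split(far, n_groups)])
    return np.concatenate([near, merged], axis=0)
```

Running this on 100 tokens leaves the near-field ones untouched and shrinks the background to a handful of averaged tokens, so the model processes far fewer pieces per frame.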
2. The "Slow Fade" Instead of a "Hard Cut"
Imagine you are watching a movie. If the director suddenly cut the screen from 4K resolution to 144p in the middle of a scene, it would be jarring and confusing.
- The Old Way: Some methods try to compress the image all at once in a single step. This causes the robot to "stutter" or hesitate because its view of the world suddenly changes drastically between one moment and the next.
- The DepthCache Way: Instead of a hard cut, it uses a progressive merge. It slowly reduces the detail over a few seconds (or frames), like a smooth zoom-out effect.
- Analogy: Think of it like a dimmer switch on a light. Instead of snapping the light off and on, you slowly turn it down. The robot's brain gets used to the change gradually, so its movements stay smooth and fluid.
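The "dimmer switch" can be expressed as a simple schedule: instead of jumping from the full token budget to the compressed one in a single step, ramp it down over several control steps. The numbers below are illustrative, not taken from the paper.

```python
def token_budget(step, full=256, compressed=96, ramp_steps=8):
    """Linearly ramp the token budget from `full` down to `compressed`
    over `ramp_steps` control steps, instead of a hard one-step cut.
    All values here are made-up examples.
    """
    if step >= ramp_steps:
        return compressed
    frac = step / ramp_steps
    return round(full + frac * (compressed - full))
```

At step 0 the model still sees all 256 tokens, halfway through the ramp it sees 176, and from step 8 onward it settles at 96, so no single frame differs drastically from the one before it.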
3. The "Motion Sensor" for the Wrist Camera
Robots often have a camera on their wrist (like a GoPro on a human hand).
- The Problem: When the robot's arm is swinging through the air, the wrist camera sees a blurry mess of motion. It's useless. But when the arm stops to grab something, that camera becomes super important.
- The DepthCache Way: It acts like a smart traffic light.
  - Green Light (Moving): When the arm is swinging, the system says, "We don't need to look closely at this blurry mess," and it compresses the data heavily to save time.
  - Red Light (Stopping): The moment the arm slows down to grab the bowl, the system instantly switches to "Full Resolution" mode to ensure the robot doesn't drop the object.
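The traffic-light logic amounts to a velocity gate on the wrist camera. A minimal sketch, assuming we can read the arm's peak joint speed each control step; the threshold and ratios are invented for illustration:

```python
def wrist_compression_ratio(joint_velocity, moving_thresh=0.5, heavy=8, light=1):
    """Choose how many wrist-camera tokens to merge into one, based on
    arm speed. Threshold (rad/s) and ratios are illustrative values.
    """
    if abs(joint_velocity) > moving_thresh:
        # Arm is swinging: the wrist view is motion-blurred, compress hard.
        return heavy
    # Arm has slowed to grasp: keep the wrist view at full resolution.
    return light
```

A fast-moving arm (say 1.2 rad/s) gets its wrist tokens merged 8-to-1, while a nearly stationary arm keeps every token.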
The Results: Faster, Smarter, and Safer
The researchers tested this on three different types of robot brains.
- Speed: The robot brain ran 1.28 times faster, and in real-world trials this translated into significantly quicker task completion.
- Accuracy: Despite seeing "blurry" backgrounds, the robot stayed nearly as accurate as the slow, full-resolution version (less than a 1% drop in success rate).
- No Training Needed: The best part? You don't have to re-teach the robot how to think. You just plug the "DepthCache" filter into an existing robot model, and it works immediately, no retraining required.
In summary: DepthCache is like giving the robot a pair of smart glasses. It tells the robot, "Focus hard on what's right in front of your hand, but relax your gaze on the stuff far away, and only look closely when you're actually doing something." This saves the robot's brain power, making it faster and more responsive, just like a human would be.