OxyGen: Unified KV Cache Management for Vision-Language-Action Models under Multi-Task Parallelism

OxyGen is a unified KV cache management paradigm for Vision-Language-Action Models that enables efficient on-device multi-task parallelism by treating KV cache as a shared resource, thereby eliminating redundant computation and achieving significant speedups in both language and action generation without quality degradation.

Xiangyu Li, Huaizhi Tang, Xin Ding, Weijun Wang, Ting Cao, Yunxin Liu

Published 2026-03-17

Imagine you have a highly intelligent robot assistant named "Robo." Robo is trying to do three things at once:

  1. Move its arm to pick up a cup (Action).
  2. Talk to you about what it's doing (Language).
  3. Remember where it put the cup for later (Memory).

In the past, if you asked Robo to do all these things, it was like hiring three different people to do the same job, but they were all standing in a tiny kitchen trying to use the same stove. They kept bumping into each other, waiting in line, and re-cooking the same ingredients over and over again. The result? Robo was slow, clumsy, and often dropped the cup because it couldn't move its arm fast enough while it was busy talking.

This paper introduces OxyGen, a new way to manage Robo's brain so it can multitask smoothly. Here is how it works, using simple analogies:

The Problem: The "Isolated Kitchen"

Current robot systems treat every task as if it's happening in a separate, isolated kitchen.

  • Redundant Cooking: If Robo needs to look at a picture of a cup to move its arm and to talk about the cup, the old system makes the robot "cook" (process) that picture twice. It's a waste of time and energy.
  • The Traffic Jam: Even if they share the stove, the "Talking" task might take a long time to finish a sentence. While it's talking, the "Moving" task has to wait, even though moving the arm needs to happen instantly (like a reflex). This causes the robot to freeze or move jerkily.

The Solution: OxyGen (The Unified Manager)

OxyGen acts like a super-efficient head chef who manages the entire kitchen as one big, shared space. It uses two main tricks:

1. The "Shared Recipe Book" (Cross-Task KV Sharing)

In AI, the "KV Cache" is like a scratchpad or a working memory where the robot writes down what it has already seen and understood.

  • Old Way: Every time the robot starts a new task (move, talk, remember), it throws away the scratchpad and starts writing from scratch, even if it's looking at the same cup.
  • OxyGen Way: The head chef says, "Hey, we already looked at the cup! Let's just keep that page in the scratchpad."
  • The Result: The robot doesn't waste time re-reading the cup. It instantly shares that understanding between the arm-moving task and the talking task. It's like sharing a single Google Doc instead of emailing three different Word files back and forth.
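
For readers who like to see the idea in code, here is a minimal toy sketch of the shared scratchpad. It is not the paper's implementation: `SharedKVCache`, `expensive_prefill`, `action_task`, and `language_task` are all hypothetical names invented for this illustration, and the "cache" is a plain dictionary rather than real attention keys and values. It only shows the bookkeeping: two tasks look at the same frame, and the expensive step runs once.

```python
from dataclasses import dataclass, field


def expensive_prefill(observation):
    # Hypothetical placeholder for running the model over the image and prompt.
    return [(token, len(token)) for token in observation]


@dataclass
class SharedKVCache:
    """Toy stand-in for a KV cache: maps a frame id to its prefill result."""
    entries: dict = field(default_factory=dict)
    prefill_calls: int = 0

    def get_or_prefill(self, frame_id, observation):
        # Run the expensive step only the first time a frame is seen.
        if frame_id not in self.entries:
            self.prefill_calls += 1
            self.entries[frame_id] = expensive_prefill(observation)
        return self.entries[frame_id]


def action_task(cache, frame_id, observation):
    kv = cache.get_or_prefill(frame_id, observation)
    return f"action from {len(kv)} cached entries"


def language_task(cache, frame_id, observation):
    kv = cache.get_or_prefill(frame_id, observation)
    return f"caption from {len(kv)} cached entries"


observation = ["image:cup", "prompt:pick up the cup"]

# Old way: each task owns its own scratchpad, so the frame is processed twice.
isolated_a, isolated_b = SharedKVCache(), SharedKVCache()
action_task(isolated_a, 0, observation)
language_task(isolated_b, 0, observation)
isolated_calls = isolated_a.prefill_calls + isolated_b.prefill_calls

# Shared way: both tasks read the same scratchpad, so the frame is processed once.
shared = SharedKVCache()
action_task(shared, 0, observation)
language_task(shared, 0, observation)
shared_calls = shared.prefill_calls

print(isolated_calls, shared_calls)  # 2 1
```

The saving grows with the number of tasks that look at the same frame: with three tasks, the isolated version would process the frame three times and the shared version still once.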

2. The "Assembly Line" (Cross-Frame Continuous Batching)

Robots have different deadlines. Moving an arm needs to happen now (every 1/60th of a second). Talking can happen a bit slower, over several seconds.

  • Old Way: The robot stops everything to finish a whole sentence before moving its arm again. This makes the arm stop and start (jerky motion).
  • OxyGen Way: Imagine an assembly line. While the robot is moving its arm for the current moment, it is also quietly working on the next few words of the sentence in the background. It batches (groups) the talking work from several moments together so the computer chip processes it all at once, efficiently.
  • The Result: The arm moves smoothly and quickly (like a human), while the robot is simultaneously churning out a long, detailed story without slowing down.
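
The assembly line can also be sketched as a small scheduling loop. This is a simplified illustration under assumptions that are not taken from the paper: every frame starts one new sentence, every sentence has the same length, and a fixed number of decode steps fits between two arm updates. The names `LanguageRequest` and `run_frames` are hypothetical. What it shows is that the arm acts on every frame, and sentences started on different frames advance together in one batch.

```python
from dataclasses import dataclass, field


@dataclass
class LanguageRequest:
    frame_id: int
    max_tokens: int
    tokens: list = field(default_factory=list)

    def done(self):
        return len(self.tokens) >= self.max_tokens


def run_frames(num_frames, tokens_per_request, decode_steps_per_frame):
    active, finished, actions, batch_sizes = [], [], [], []
    for frame_id in range(num_frames):
        # The action runs every frame and never waits for a sentence to finish.
        actions.append(f"action@{frame_id}")
        # A new sentence joins the ones still in flight from earlier frames.
        active.append(LanguageRequest(frame_id, tokens_per_request))
        for _ in range(decode_steps_per_frame):
            if not active:
                break
            # One batched decode step: every in-flight sentence gains one token.
            batch_sizes.append(len(active))
            for request in active:
                request.tokens.append(f"tok{len(request.tokens)}")
            # Finished sentences leave the batch and free their slot.
            finished.extend(r for r in active if r.done())
            active = [r for r in active if not r.done()]
    return actions, finished, batch_sizes


actions, finished, batch_sizes = run_frames(
    num_frames=4, tokens_per_request=4, decode_steps_per_frame=2
)
print(len(actions))                     # 4
print([r.frame_id for r in finished])   # [0, 1, 2]
print(batch_sizes)                      # [1, 1, 2, 2, 2, 2, 2, 2]
```

In this toy run each sentence needs two frames to finish, yet the arm still gets an action on all four frames, and from the second frame on every decode step serves two sentences at once.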

The Real-World Impact

The researchers tested this on a powerful robot brain (called π0.5) running on a standard gaming graphics card (RTX 4090).

  • Speed: The robot became 3.7 times faster.
  • Smoothness: It could move its arm 70 times per second (super smooth) while still talking very fast (200 words per second).
  • Quality: Crucially, it didn't make mistakes. The robot still picked up the cup correctly; it just did it much faster and without getting "tired" from the traffic jams.
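
To put those two rates side by side, a quick back-of-the-envelope calculation helps. It uses only the two figures quoted above and assumes, purely for illustration, that the talking is spread evenly across arm updates.

```python
action_rate_hz = 70   # arm updates per second, from the results above
language_rate = 200   # "words" per second, from the results above

# Time available for each arm update, in milliseconds.
frame_budget_ms = 1000 / action_rate_hz

# Words produced, on average, during one arm update.
words_per_frame = language_rate / action_rate_hz

print(f"time budget per arm update: {frame_budget_ms:.1f} ms")    # 14.3 ms
print(f"words generated per arm update: {words_per_frame:.1f}")   # 2.9
```

In other words, the robot has roughly 14 milliseconds per arm movement, and in that window it also produces about three words of speech.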

Summary

Think of OxyGen as the difference between a chaotic kitchen where three cooks fight over one stove, versus a well-organized kitchen with one head chef who shares ingredients and runs an assembly line. It allows our robot friends to finally do what humans do best: walk and chew gum at the same time, but with the speed and precision of a machine.
