MEM: Multi-Scale Embodied Memory for Vision Language Action Models

Imagine you are teaching a robot to cook a complex meal or clean a messy kitchen. If you ask a standard robot to do this, it's like asking a person with amnesia to follow a recipe. They can see the ingredients right in front of them, but the moment they turn their back to grab a spice, they forget what they were doing. If they drop a spoon, they don't remember they just dropped it, so they keep trying to pick it up the same way, over and over, failing every time.

This paper introduces a new system called MEM (Multi-Scale Embodied Memory) that gives robots a "brain" capable of remembering things in two very different ways, just like humans do.

Here is the simple breakdown of how it works, using some everyday analogies:

1. The Problem: The Robot's "Short Attention Span"

Most advanced robots today are like goldfish. They can see what is happening right now, but they can't remember what happened 10 seconds ago, let alone 10 minutes ago.

The Issue: If a robot is wiping a counter, it might forget it already wiped the left side. If it's cooking, it might forget it already added the salt.
The Old Way: To fix this, scientists tried to feed the robot every single video frame from the last hour. But this is like trying to read a 500-page book in 1 second. The robot's computer gets overwhelmed, and it slows down to a crawl.

2. The Solution: Two Types of Memory

The authors realized that humans don't remember everything the same way. We have short-term memory (what I just saw) and long-term memory (the plan I'm following). MEM gives the robot these two distinct tools.

A. Short-Term Memory: The "Super-Sharp Eye" (Video Memory)

What it is: A high-speed video camera that remembers the last few seconds in high definition.
The Analogy: Imagine you are trying to pick up a slippery piece of soap. You look at it, reach for it, and it slips.
- Without MEM: The robot forgets it slipped and tries the exact same grip again.
- With MEM: The robot's "eye" remembers, "Hey, I just tried that grip and it failed." It instantly adjusts its hand angle, like you would if you remembered the soap was slippery.
Why it's special: It solves problems like occlusion (when your own arm blocks your view of an object). The robot remembers what was there a split second ago, so it doesn't get confused when it can't see the object for a moment.

B. Long-Term Memory: The "Smart Diary" (Text Memory)

What it is: A compressed text summary of what has happened over the last 15 minutes.
The Analogy: Imagine you are cleaning a huge kitchen. You don't need to remember the exact color of every single plate you washed. You just need to remember the story: "I washed the plates, put them in the rack, and closed the cabinet."
The Magic: Instead of feeding the robot 10,000 video frames of cleaning, the robot writes a tiny note in its "diary": "Done: Plates. Next: Fridge." This is incredibly efficient. It allows the robot to keep track of a 15-minute task (like cooking a full dinner) without getting a "computer headache."

3. How They Work Together: The "Manager and the Worker"

The paper describes the robot's brain as having two parts working in tandem:

The High-Level Manager (The Text Memory): This part looks at the big picture. It reads the "diary" to know, "Okay, we are on step 4 of the recipe. We have the potatoes, but we haven't got the butter yet." It tells the robot what the next sub-task is.
The Low-Level Worker (The Video Memory): This part looks at the immediate action. It sees the butter jar, remembers the last time it tried to open it and slipped, and adjusts its grip to open it successfully.

4. The Results: What Can It Do Now?

Because of this dual-memory system, the robot can now do things that were previously impossible:

Clean a whole kitchen: It can remember which drawers it emptied, which surfaces it wiped, and that it needs to close the fridge door at the end.
Cook a meal: It can follow a recipe for 15 minutes, remembering which ingredients it added and when to flip the sandwich.
Adapt on the fly: If it tries to open a fridge and fails, it remembers the failure, realizes the door opens the other way, and tries again immediately. It doesn't get stuck in a loop of failure.

The Big Takeaway

Think of MEM as giving a robot a photographic memory for the immediate past (to handle tricky physical movements) and a smart summary notebook for the distant past (to keep track of long goals).

Before this, robots were like people with short attention spans who couldn't finish a sentence. Now, they are like a competent assistant who can remember the plan, notice when a mistake happens, and fix it—all while keeping the conversation going for a long time. This is a massive step toward robots that can actually live and work with us in our homes, handling complex chores without needing a human to hold their hand every step of the way.

Here is a detailed technical summary of the paper "MEM: Multi-Scale Embodied Memory for Vision Language Action Models."

1. Problem Statement

Current Vision-Language-Action (VLA) models typically operate without memory or rely on a limited context window of recent observations. This creates significant bottlenecks for long-horizon robotic tasks (spanning tens of minutes) and for tasks performed under partial observability.

The Trade-off: Storing a dense sequence of all past observations (images, proprioception) is computationally intractable for long tasks due to latency constraints. However, simply compressing all history into a single representation loses critical fine-grained details needed for immediate manipulation (e.g., handling occlusions).
The Gap: Existing approaches often use a single modality for memory (e.g., only proprioception, only keyframes, or raw text). This results in a compromise: either the robot loses precise spatial/dynamic information needed to correct a grasp, or it loses the ability to track high-level semantic progress (e.g., "I have already added the butter").
Goal: Develop a memory architecture that can represent past events at multiple levels of granularity—dense short-term visual memory for immediate dynamics and compressed long-term semantic memory for task progress—while maintaining real-time inference speeds.

2. Methodology: Multi-Scale Embodied Memory (MEM)

The authors propose MEM, a mixed-modal memory system that factorizes the policy into two distinct but interacting components to handle different time scales:

A. Architecture Overview

The system decomposes the action prediction policy $\pi$ into:

High-Level Policy ( $\pi_{HL}$ ): Responsible for long-horizon planning and updating a Language Memory ( $m_t$ ).
Low-Level Policy ( $\pi_{LL}$ ): Responsible for immediate action generation, conditioned on a Short-Horizon Video Memory and the current subtask instruction.

The factorization is defined as:
$\pi(a_{t:t+H}, l_{t+1}, m_{t+1} | o_{t-T:t}, m_t, g) \approx \pi_{LL}(a_{t:t+H} | o_{t-K:t}, l_{t+1}, g) \cdot \pi_{HL}(l_{t+1}, m_{t+1} | o_{t}, m_t, g)$
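Here, $a_{t:t+H}$ is the predicted action chunk, $l_{t+1}$ the next subtask instruction, $m_t$ the language memory, $o$ the observations, and $g$ the task goal. The low-level policy sees only the last $K$ observations, with $K \ll T$ (short horizon vs. long horizon).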
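
The factorization can be read as a control loop in which each policy sees only its own inputs. The following is a minimal Python sketch under stated assumptions: `high_level_policy` and `low_level_policy` are hypothetical stand-ins for the learned policies (stubbed here with trivial rules), and the values of `K` and `H` and the string observations are illustrative, not taken from the paper.

```python
from collections import deque
from typing import Deque, List, Tuple

K = 4  # short-horizon window (illustrative; the source only states K << T)
H = 3  # action chunk length (illustrative)


def high_level_policy(obs: str, memory: str, goal: str) -> Tuple[str, str]:
    """Hypothetical stand-in for pi_HL.

    Sees only the current observation o_t, the text memory m_t and the goal g.
    Returns (next subtask l_{t+1}, updated memory m_{t+1}).
    """
    subtask = f"handle {obs}"
    new_memory = (memory + " " if memory else "") + f"Seen: {obs}."
    return subtask, new_memory


def low_level_policy(recent_obs: List[str], subtask: str, goal: str) -> List[str]:
    """Hypothetical stand-in for pi_LL.

    Sees only the recent observations o_{t-K:t}, the subtask and the goal
    (no text memory). Returns an action chunk a_{t:t+H}.
    """
    return [f"{subtask} / step {i} / ctx={len(recent_obs)}" for i in range(H)]


def run_episode(observations: List[str], goal: str) -> Tuple[str, List[List[str]]]:
    # o_{t-K:t} is inclusive on both ends, so the window holds at most K + 1 frames.
    window: Deque[str] = deque(maxlen=K + 1)
    memory = ""  # m_0: empty language memory
    chunks = []
    for obs in observations:
        window.append(obs)
        subtask, memory = high_level_policy(obs, memory, goal)
        chunks.append(low_level_policy(list(window), subtask, goal))
    return memory, chunks
```

The point of the sketch is the information flow: the dense observation window never grows beyond `K + 1` frames, while anything older survives only through the text memory.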

B. Long-Term Memory: Language-Based Mechanism

Concept: Instead of storing raw history, the high-level policy maintains a compressed textual summary ( $m_t$ ) of semantic events (e.g., "Placed plate in cabinet").
Mechanism: The policy is trained to predict an updated summary $m_{t+1}$ based on the previous summary $m_t$ , current observations, and the task goal.
Compression Strategy: The system is explicitly trained to discard irrelevant details (e.g., specific colors of bowls) and retain only necessary state information (e.g., "Three bowls placed in top-right cabinet"). This prevents train-inference distribution shifts caused by repeating failed subtask instructions in the context window.
Training Data: Generated using an off-the-shelf LLM to summarize successful/failed subtask sequences from robot demonstrations, teaching the model to compress information effectively.
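
To make the contrast between compressed and naive text memory concrete, here is a toy Python sketch. It is not the learned update described above: `compressed_memory` is a hand-written rule standing in for the trained summarizer, and the `(instruction, succeeded)` event format is a hypothetical choice made for illustration.

```python
from collections import Counter
from typing import List, Tuple

# Each event is (subtask instruction, succeeded?). This format is hypothetical.
Event = Tuple[str, bool]


def naive_memory(events: List[Event]) -> str:
    """Concatenate every issued instruction, including repeats after failures."""
    return "; ".join(instruction for instruction, _ in events)


def compressed_memory(events: List[Event]) -> str:
    """Keep only completed subtasks and merge repeats into a count.

    A hand-written rule used as a stand-in for the learned summary update.
    """
    done = Counter(instruction for instruction, ok in events if ok)
    return "; ".join(
        f"{instruction} (x{count})" if count > 1 else instruction
        for instruction, count in done.items()
    )


events = [
    ("open cabinet", False),
    ("open cabinet", False),
    ("open cabinet", True),
    ("place bowl in cabinet", True),
    ("place bowl in cabinet", True),
    ("place bowl in cabinet", True),
    ("close cabinet", True),
]

print(naive_memory(events))
print(compressed_memory(events))
```

The naive string repeats "open cabinet" once per attempt, while the compressed one records only the resulting state, which is the kind of text the summary mechanism is trained to produce.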

C. Short-Term Memory: Efficient Video Encoder

Concept: To handle occlusions, dynamics, and fine-grained manipulation, the low-level policy requires a dense sequence of recent observations ( $o_{t-K:t}$ ).
Challenge: Naively passing $K$ frames into a VLA backbone pushes inference latency past the 300 ms real-time budget.
Solution: A specialized Video Encoder based on Vision Transformers (ViT).
- Architecture: It interleaves standard spatial attention layers with causal-temporal attention layers (every 4th layer).
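- Efficiency: Temporal attention is applied only across the same patch position in different frames. With $n$ patches per frame and $K$ frames, this reduces attention cost from $O(n^2K^2)$ for joint space-time attention to $O(Kn^2 + nK^2)$ .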
- Token Reduction: It drops tokens from past timesteps in upper layers, passing only the current timestep's representation to the VLA backbone. This keeps the token count similar to single-frame VLAs while encoding temporal context.
- Initialization: The encoder is initialized from pre-trained ViT weights (e.g., SigLIP) with sinusoidal temporal position embeddings, requiring no new learnable parameters for the base vision model.
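
The efficiency argument and the causal-temporal layer can be illustrated with a small pure-Python sketch. It takes $n$ as the number of patch tokens per frame and $K$ as the number of frames, which is how the complexity expressions above read. The function names, the sizes, and the omission of learned query/key/value projections are simplifying assumptions for illustration, not the authors' implementation.

```python
import math
from typing import List


def joint_pairs(n: int, K: int) -> int:
    """Query-key pairs when all n*K tokens attend to each other: (nK)^2."""
    return (n * K) ** 2


def factorized_pairs(n: int, K: int) -> int:
    """Spatial attention inside each of K frames plus temporal attention
    along each of n patch positions: K*n^2 + n*K^2."""
    return K * n * n + n * K * K


def causal_temporal_attention(track: List[List[float]]) -> List[List[float]]:
    """Single-head attention over one patch position across frames.

    track[t] is that patch's feature vector at frame t. Frame t attends only
    to frames 0..t. Raw features serve as queries, keys and values here;
    learned projections are omitted for brevity.
    """
    dim = len(track[0])
    out = []
    for t, query in enumerate(track):
        scores = [
            sum(q * k for q, k in zip(query, track[past])) / math.sqrt(dim)
            for past in range(t + 1)
        ]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        out.append([
            sum(w * track[past][d] for past, w in enumerate(weights)) / total
            for d in range(dim)
        ])
    return out


n, K = 256, 8  # illustrative sizes, not taken from the paper
print(joint_pairs(n, K), factorized_pairs(n, K))  # 4194304 540672
```

With these illustrative sizes the factorized scheme scores roughly eight times fewer query-key pairs. Because the temporal layer is causal, the output for the latest frame already summarizes the earlier ones, which is what makes it possible to forward only the current timestep's tokens to the backbone.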

D. Integration with $\pi_{0.6}$

The MEM system is integrated into the $\pi_{0.6}$ VLA (a generalist model trained on diverse robot and internet data).

Input: Combines the video-encoded short-term history, proprioceptive state embeddings (projected linearly), and the language memory summary.
Training: Pre-trained on a diverse mixture of teleoperated data, policy rollouts, and video-language tasks. The model learns to update the language memory and utilize the video encoder simultaneously.

3. Key Contributions

Multi-Modal Memory Architecture: Presented as the first system to explicitly combine dense video-based short-term memory (for dynamics/occlusion) with compressed language-based long-term memory (for semantic state) in a single VLA.
Efficient Video Encoder: A novel, parameter-efficient video encoder that allows VLAs to process tens of seconds of video history without violating real-time latency constraints, enabling in-context adaptation.
Semantic Compression: A mechanism for the policy to actively summarize and compress its history, preventing context window overflow and distribution shifts during long-horizon tasks.
State-of-the-Art Performance: Demonstrated that memory-enhanced policies can solve complex, 15-minute tasks (e.g., cleaning a kitchen, cooking a grilled cheese sandwich) that are impossible for memory-less SOTA models.

4. Experimental Results

The authors evaluated MEM on the $\pi_{0.6}$ backbone across various tasks:

Long-Horizon Tasks:
- Recipe Setup & Kitchen Cleanup: MEM achieved high success rates in tasks requiring up to 15 minutes of memory (e.g., tracking ingredients, closing drawers, washing dishes).
- Ablation: Removing either the video memory or the language memory caused significant performance drops. "Naive" text memory (concatenating all instructions without compression) failed due to distribution shifts.
In-Context Adaptation:
- Chopstick & Fridge Tasks: MEM policies could adapt strategies after a failure (e.g., adjusting grasp height for a chopstick or changing the door-opening direction) by observing the failure in their short-term video memory. Memory-less models repeated the same failure.
Core Memory Capabilities:
- Partial Observability & Counting: MEM outperformed baselines like "Pool Memory" (average pooling of past frames) and "Proprio Memory" (state-only history) in tasks like finding hidden objects, counting coffee scoops, and unpacking groceries.
- Pre-training Importance: Models pre-trained with the video encoder on diverse data significantly outperformed models where the memory module was introduced only during post-training.
Generalization: MEM matched the performance of the base $\pi_{0.6}$ on standard manipulation tasks (e.g., folding laundry, box building) where memory is not strictly required, indicating that adding memory does not degrade baseline performance.
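
As a side note on the "Pool Memory" baseline listed above, the following Python sketch shows one plausible reason average pooling loses information that a video memory keeps. The exact form of the baseline is not specified in this summary, so an element-wise mean over per-frame feature vectors is assumed here, and the feature values are made up for illustration.

```python
from typing import List


def pool_memory(frames: List[List[float]]) -> List[float]:
    """Element-wise mean of per-frame feature vectors (assumed form of the baseline)."""
    count = len(frames)
    return [sum(values) / count for values in zip(*frames)]


# Two hypothetical feature sequences: the same frames in opposite order.
reach_then_retreat = [[0.0, 1.0], [1.0, 0.0], [2.0, 0.0]]
retreat_then_reach = list(reversed(reach_then_retreat))

print(pool_memory(reach_then_retreat))
print(pool_memory(retreat_then_reach))
```

Both calls print the same vector, because a mean discards the order of events. This is an illustration of the general limitation, not a result reported by the authors.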

5. Significance

Scalability: MEM demonstrates that robots can effectively manage memory spanning tens of minutes, a critical step toward autonomous agents capable of performing complex, multi-stage daily activities.
Efficiency: By decoupling short-term dense memory from long-term semantic memory, the system addresses the "context window vs. latency" dilemma, making long-horizon control feasible on real hardware.
Robustness: The ability to perform in-context adaptation (learning from immediate past failures within a single episode) makes robots more robust to real-world unpredictability and occlusions.
Future Direction: This work lays the foundation for "continual learning" in robots, where memory could eventually span weeks or months, allowing robots to learn from deployment experiences over time.