Imagine you are trying to teach a robot how to navigate a giant, complex maze to find a hidden treasure. The catch? You can't walk the maze with the robot to show it the way. You only have a dusty, old notebook filled with recordings of other people trying (and sometimes failing) to solve it. This is the challenge of Offline Reinforcement Learning.
The paper introduces a new method called MAGE (Multi-scale Autoregressive Generation) to help the robot learn from this notebook. Here is how it works, explained through simple analogies.
The Problem: The "Step-by-Step Reader" vs. the "All-at-Once Sketcher"
Previous methods tried to learn from the notebook in two main ways, but both had flaws:
- The "Step-by-Step" Reader (Decision Transformers): Imagine reading a book one word at a time. You know what the current word is, but you might lose track of the overall plot. In a long maze, the robot might make a perfect move for the next second but forget it needs to turn left in 50 steps to reach the goal.
- The "All-at-Once" Sketcher (Diffusion Models): Imagine trying to draw a whole landscape in one go by erasing and redrawing until it looks right. While this captures the general vibe, it often gets the details wrong. The robot might sketch a route whose overall shape looks right but that runs straight through a wall, because the fine details got smeared out in all the redrawing.
The Result: In long, difficult tasks where rewards are rare (like finding that hidden treasure), these robots get lost. They make locally smart moves but fail at the big picture.
The Solution: MAGE (The "Architect and the Mason")
MAGE solves this by acting like a master architect working with a mason. It breaks the problem down into multiple scales (levels of detail) and builds the solution from the top down.
1. The Multi-Scale Autoencoder (The "Zoom Lens")
First, MAGE looks at the old notebook of robot attempts. Instead of just seeing a list of moves, it uses a special "Zoom Lens" to compress that history into different layers of detail:
- The Coarse Layer (The Architect's Blueprint): This captures the big picture. Where is the treasure? What is the general path? It ignores the tiny details of how the robot's fingers moved.
- The Fine Layer (The Mason's Brickwork): This captures the tiny details. Exactly how much force to apply to the door handle? Which specific millimeter to turn?
Think of it like looking at a map. The coarse layer is the highway system (getting you to the right city), and the fine layer is the street map (getting you to the specific house).
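The "Zoom Lens" idea can be sketched in a few lines of toy code. The real paper uses a learned autoencoder; here, simple temporal average pooling stands in for it, and the function name `multiscale_encode` and the scale choices are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def multiscale_encode(trajectory, scales=(8, 2, 1)):
    """Compress a trajectory (T x D array of states/actions) into
    progressively coarser summaries by temporal average pooling.
    A hypothetical stand-in for the paper's learned autoencoder."""
    T, D = trajectory.shape
    layers = []
    for s in scales:
        # Pool every `s` consecutive steps into one summary vector.
        n = T // s
        pooled = trajectory[: n * s].reshape(n, s, D).mean(axis=1)
        layers.append(pooled)
    return layers  # coarse ("blueprint") first, fine ("brickwork") last

# A toy 16-step trajectory through a 2-D maze.
traj = np.cumsum(np.random.randn(16, 2), axis=0)
coarse, medium, fine = multiscale_encode(traj)
print(coarse.shape, medium.shape, fine.shape)  # (2, 2) (8, 2) (16, 2)
```

The coarse layer keeps only two summary vectors for the whole trip (the "highway system"), while the fine layer keeps every step (the "street map").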
2. The Multi-Scale Transformer (The "Top-Down Builder")
Now, MAGE generates a new plan for the robot. It doesn't guess every move at once. Instead, it builds the plan from the top down:
- Step 1: It draws the Coarse Blueprint first. "Okay, go North, then East, then South."
- Step 2: It takes that blueprint and fills in the Medium Details. "Go North for 10 steps, turn right."
- Step 3: Finally, it fills in the Fine Details. "Turn the wheel 3 degrees left, press the gas pedal 20%."
This is like writing a story. You first write the outline (Chapter 1, 2, 3), then the scene summaries, and finally, you write the dialogue. This ensures the robot never loses sight of the goal while figuring out the tiny steps.
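The top-down process above can be sketched as a loop that repeatedly refines a coarse plan into a finer one. In MAGE the refinement is done by a learned transformer conditioned on the coarser scale; here, linear interpolation stands in for it, and the `refine` function and the specific waypoints are illustrative assumptions:

```python
import numpy as np

def refine(coarse_plan, factor):
    """Upsample a plan to a finer time resolution by linear
    interpolation. A learned transformer would instead predict
    these fine steps conditioned on the coarse tokens."""
    n, d = coarse_plan.shape
    t_coarse = np.linspace(0.0, 1.0, n)
    t_fine = np.linspace(0.0, 1.0, n * factor)
    return np.stack(
        [np.interp(t_fine, t_coarse, coarse_plan[:, j]) for j in range(d)],
        axis=1,
    )

# Step 1: the coarse blueprint -- four waypoints ("go North, then East").
plan = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 2.0]])
# Steps 2 and 3: fill in medium, then fine detail, each scale
# conditioned on the one above it.
for factor in (4, 4):
    plan = refine(plan, factor)
print(plan.shape)  # (64, 2)
```

Because each finer scale is generated from the coarser one, the 64-step plan still passes through the original four waypoints: the details can never contradict the blueprint.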
3. The Condition-Guided Decoder (The "GPS Correction")
Sometimes, even with a great plan, the robot might drift off course. Maybe the "Coarse Blueprint" says "Go North," but the robot starts slightly to the East.
MAGE has a built-in GPS Correction system. Before the robot starts moving, MAGE checks: "Does this plan actually start where we are right now?" If the plan starts in the wrong spot, MAGE tweaks the details until the plan perfectly aligns with the robot's current reality. This prevents the robot from hallucinating a path that starts in a wall.
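A minimal sketch of this correction, assuming the simplest possible scheme: shift the plan so its first step matches the robot's true state, and fade the correction out over the first few steps. The function name `condition_on_start` and the blending scheme are hypothetical; the paper's decoder learns this alignment rather than hand-coding it:

```python
import numpy as np

def condition_on_start(plan, current_state, blend_steps=2):
    """Hypothetical 'GPS correction': shift the generated plan so it
    starts exactly at the robot's current state, fading the shift out
    over the first `blend_steps` steps so the rest stays untouched."""
    offset = current_state - plan[0]
    weights = np.linspace(1.0, 0.0, blend_steps)
    corrected = plan.copy()
    corrected[:blend_steps] += weights[:, None] * offset
    return corrected

# The blueprint says "go North from (0, 0)", but the robot actually
# stands slightly to the East, at (0.5, 0).
plan = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 2.0], [0.0, 3.0]])
fixed = condition_on_start(plan, np.array([0.5, 0.0]))
print(fixed[0])  # starts at the robot's true position: [0.5 0. ]
```

The corrected plan now begins where the robot actually is instead of where the generator imagined it to be, which is exactly the "path that starts in a wall" failure this step prevents.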
Why is this a Big Deal?
In the real world, many tasks stretch over a long horizon and hand out rewards only rarely.
- Example: A robot arm trying to assemble a piece of furniture. It has to pick up a screw, find the hole, and twist it. If it gets the first step wrong, the whole thing fails, and it gets no "points" (reward) until the very end.
- MAGE's Superpower: Because it plans the "Big Picture" first, it knows why it is picking up that screw. It doesn't just react to the immediate moment; it understands the long-term goal.
The Results
The authors tested MAGE on five different "mazes" (robotic tasks), including:
- Dexterous Hands: Making a robot hand write with a pen or open a door.
- Kitchen Tasks: Making a robot cook a meal by opening a microwave, boiling water, and turning on a light in the right order.
- Navigation: Guiding a robot ant through a giant, complex maze.
The Verdict: MAGE beat 15 existing methods. It was especially strong on the long, hard tasks where other approaches got lost or gave up. It also runs fast enough for real-time use, meaning a robot could actually use this brain to walk around a factory floor without crashing.
Summary
MAGE is like a smart project manager for robots. Instead of trying to figure out every single second of a long journey at once, it:
- Sketches the big route (Coarse scale).
- Fills in the details (Fine scale).
- Double-checks the starting point (Conditioning).
This allows robots to learn from old data and solve complex, long-term problems that they previously couldn't handle.