Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model

Imagine you are trying to teach a robot to navigate a maze or pick up a cup. To do this, the robot needs a "World Model"—a mental simulation of how the world works. It needs to ask itself: "If I move my arm forward, what will the cup look like in the next second?"

For a long time, AI researchers have tried to build these mental models by creating perfect, photorealistic movies in their heads. They want the simulation to look exactly like a high-definition camera recording, capturing every shadow, texture, and speck of dust.

The Problem:
Creating these perfect movies is incredibly slow and expensive. It's like trying to plan your day by watching a 4K movie of every possible future scenario in real time. By the time the robot figures out what to do, it's already too late: the "movie" takes too long to render.

The Solution: CompACT (The "Sketch" Approach)
This paper introduces CompACT, a new way to build these mental models. Instead of trying to remember every single pixel of a photo, CompACT teaches the AI to remember the world using just 8 tiny "tokens" (think of them as 8 digital Lego bricks).

Here is how it works, using a few analogies:

1. The "Sketch vs. Photo" Analogy

Imagine you need to give directions to a friend.

The Old Way (SD-VAE): You send them a 4K photograph of the street. It has the exact color of the bricks, the way the light hits the windows, and the texture of the asphalt. It's beautiful, but it's huge and takes forever to send.
The CompACT Way: You send them a simple stick-figure sketch with 8 lines. It doesn't show the brick texture, but it clearly shows where the wall is, where the door is, and where the obstacle is.

For planning a route, the sketch is actually better. You don't need to know the texture of the wall to know you can't walk through it. You just need to know the wall exists. CompACT strips away the "boring" details (textures, lighting) and keeps only the "important" details (objects, locations, relationships).

2. The "Frozen Brain" Trick

How does the AI know what to keep and what to throw away?
Usually, AI learns to draw by trying to copy a photo perfectly. But CompACT uses a pre-trained "frozen brain" (a model called DINOv3) that already understands the world.

Think of this frozen brain as an expert art teacher who has seen millions of paintings.
CompACT doesn't ask the teacher to redraw the painting. Instead, it asks the teacher: "What are the 8 most important things in this picture?"
Because the teacher already understands "objects" and "space," it ignores the paint texture and says, "It's a cat sitting on a chair."
CompACT saves just that sentence (8 tokens) instead of the whole painting.

3. The "Magic Reconstructor"

You might ask: "If we only have 8 tokens, how do we see the image again?"
This is the cleverest part. CompACT uses a Generative Decoder.

Think of the 8 tokens as a recipe card.
The recipe doesn't contain the cake itself; it just says "Flour, Sugar, Eggs."
When the robot needs to see the image, it hands the recipe (the 8 tokens) to a Magic Baker (a pre-trained generator) that already knows how to bake. The Baker reads the recipe and bakes a cake that looks like the original, filling in the fine details itself, even though the recipe was tiny.
The AI doesn't need to remember the cake; it just needs the recipe to know what to do next.

Why This Changes Everything

The paper tested this on navigation (walking through rooms) and robot arms (picking up objects).

Speed: Because the AI only has to process 8 tokens instead of hundreds, it plans about 40 times faster. It's the difference between reading a novel and reading a headline.
Smarter Decisions: Surprisingly, the robot made better decisions with the sketch than with the photo. Why? Because the photo distracted the robot with irrelevant details (like a shiny floor), while the sketch forced it to focus on the goal (the door).

The Bottom Line

CompACT is like teaching an AI to think in stick figures instead of photographs. By compressing the world into just 8 tiny, meaningful pieces of information, it allows robots to simulate the future in seconds instead of minutes, bringing real-time control and planning within reach for the real world.

It proves that to be a good planner, you don't need to remember everything perfectly; you just need to remember the right things.

Here is a detailed technical summary of the paper "Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model" (CompACT).

1. Problem Statement

World models are powerful frameworks for simulating environment dynamics to enable action planning and policy learning. However, their use for decision-time planning, such as Model Predictive Control (MPC), is currently hindered by computational bottlenecks.

The Bottleneck: Conventional world models rely on tokenizers (like SD-VAE) that encode a single image into hundreds of latent tokens (e.g., 784 tokens).
The Consequence: Since most world models use attention-based architectures (like Transformers), computational cost scales quadratically with the number of tokens. Planning becomes slow and resource-intensive (e.g., taking minutes per episode), which makes these models impractical for real-world control.
The Gap: Existing generative models prioritize photorealistic reconstruction (textures, lighting, shadows), which requires high-frequency details irrelevant to decision-making. The authors hypothesize that planning requires only low-frequency semantic information (object identities, spatial relationships), suggesting that extreme compression is possible without sacrificing planning performance.
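The scaling argument above can be made concrete with a little arithmetic. The sketch below is only a back-of-the-envelope illustration: it assumes the cost of one self-attention pass is proportional to the square of the token count and ignores constants, feed-forward layers, and everything outside attention. The helper name `attention_pairs` is made up for this example.

```python
# Back-of-the-envelope comparison of per-frame self-attention cost.
# Assumption: cost is proportional to the number of query-key pairs,
# i.e. the square of the token count. All other costs are ignored.

def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs in one full self-attention pass."""
    return num_tokens * num_tokens

baseline_tokens = 784      # SD-VAE token count quoted in this summary
compact_tokens = (8, 16)   # CompACT token counts quoted in this summary

for n in compact_tokens:
    token_ratio = baseline_tokens / n
    pair_ratio = attention_pairs(baseline_tokens) / attention_pairs(n)
    print(f"{n:>2} tokens: {token_ratio:.0f}x fewer tokens, "
          f"{pair_ratio:.0f}x fewer attention pairs")
```

End-to-end speedups are smaller than these pair-count ratios, because attention is only one part of the total planning cost.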

2. Methodology: CompACT

The authors propose CompACT, a compact discrete tokenizer that compresses each observation into as few as 8 to 16 discrete tokens (approx. 128–256 bits per image). The architecture consists of three main components:

A. Semantic Encoding (The Encoder)

Instead of training an encoder end-to-end for pixel reconstruction, CompACT leverages frozen pretrained vision foundation models (specifically DINOv3).

Mechanism: The input image is processed by the frozen DINOv3 encoder to extract semantic patch representations.
Latent Resampler: A learnable transformer decoder with a small set of query tokens (8 or 16) attends to the frozen DINO features via cross-attention. This "resampling" process distills only high-level semantic cues (objects, layout) while discarding low-level perceptual details.
Quantization: The output is discretized using Finite Scalar Quantization (FSQ), resulting in a sequence of discrete tokens.
Key Insight: By freezing the vision encoder, the system is forced to preserve semantic structure rather than reconstruction fidelity, as the foundation model already abstracts away textures.
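The encoder path described above (frozen features, query-based resampling, FSQ) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: random arrays stand in for DINOv3 patch features and for learned weights, the resampler is reduced to a single cross-attention layer with one head, the FSQ step is a simplified "tanh then round" variant, and the level set `(8, 8, 8, 5, 5, 5)` is chosen here only because its product (64,000) is close to the 16 bits per token implied by 128 bits for 8 tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def resample(queries, patch_feats, w_k, w_v):
    """Single-head cross-attention: each query pools over all patch features."""
    keys = patch_feats @ w_k
    values = patch_feats @ w_v
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values            # shape: (num_queries, dim)

def fsq_quantize(z, levels):
    """Simplified FSQ: squash each channel with tanh, then round to L levels."""
    levels = np.asarray(levels)
    unit = (np.tanh(z) + 1.0) / 2.0            # each channel mapped into [0, 1]
    digits = np.rint(unit * (levels - 1)).astype(np.int64)   # 0 .. L-1
    # Combine the per-channel digits into one token id (mixed-radix number).
    radix = np.concatenate(([1], np.cumprod(levels[:-1])))
    return digits, (digits * radix).sum(axis=-1)

# Hypothetical sizes for illustration only.
num_patches, dim, num_queries = 196, 32, 8
levels = (8, 8, 8, 5, 5, 5)

patch_feats = rng.normal(size=(num_patches, dim))   # stand-in for frozen features
queries = rng.normal(size=(num_queries, dim))       # stand-in for learned queries
w_k = rng.normal(size=(dim, dim)) / np.sqrt(dim)
w_v = rng.normal(size=(dim, dim)) / np.sqrt(dim)
w_out = rng.normal(size=(dim, len(levels))) / np.sqrt(dim)

pooled = resample(queries, patch_feats, w_k, w_v)
digits, token_ids = fsq_quantize(pooled @ w_out, levels)

codebook_size = int(np.prod(levels))
print("token ids:", token_ids)
print("codebook size:", codebook_size, "bits per token:", np.log2(codebook_size))
```

Because FSQ has no learned codebook, the vocabulary size is simply the product of the per-channel levels, which is what makes the bits-per-image budget easy to reason about.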

B. Generative Decoding

Directly reconstructing pixels from 8 tokens is an ill-posed problem due to the information bottleneck. CompACT solves this via generative decoding:

Intermediate Target: The decoder does not map directly to pixels. Instead, it maps the compact tokens ($z$) to a high-resolution latent space of a pretrained "target tokenizer" (MaskGIT-VQGAN, which uses ~256 tokens).
Masked Generation: The decoder learns to unmask the target tokens conditioned on the compact semantic tokens. This transforms the intractable decompression problem into a tractable conditional generation task, synthesizing fine-grained details consistent with the high-level semantics.
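The unmasking idea can be sketched as an iterative loop in the style of MaskGIT. Everything specific here is an assumption made for illustration: `toy_predictor` is a hypothetical stand-in for the learned decoder, the vocabulary size of 1024 and the 8 refinement steps are arbitrary, and the cosine schedule is the one commonly used in MaskGIT-style decoding rather than a detail confirmed by this summary.

```python
import math
import random

MASK = -1  # sentinel for a target token that has not been generated yet

def toy_predictor(target, compact_tokens, rng):
    """Hypothetical stand-in for the learned decoder.

    Returns {position: (proposed_token, confidence)} for every masked slot,
    conditioned (trivially, here) on the compact tokens.
    """
    proposals = {}
    for i, tok in enumerate(target):
        if tok == MASK:
            guess = (compact_tokens[i % len(compact_tokens)] + i) % 1024
            proposals[i] = (guess, rng.random())
    return proposals

def masked_decode(compact_tokens, num_target=256, steps=8, seed=0):
    """Fill all target tokens over a few rounds, most confident slots first."""
    rng = random.Random(seed)
    target = [MASK] * num_target
    for step in range(1, steps + 1):
        proposals = toy_predictor(target, compact_tokens, rng)
        # Cosine schedule: how many slots should remain masked after this step.
        keep_masked = math.floor(num_target * math.cos(math.pi / 2 * step / steps))
        num_to_fill = max(0, len(proposals) - keep_masked)
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for position, (token, _confidence) in ranked[:num_to_fill]:
            target[position] = token
    return target

decoded = masked_decode([3, 1, 4, 1, 5, 9, 2, 6])
print(len(decoded), "target tokens generated from 8 compact tokens")
```

In a real system the filled-in target tokens would then be passed to the target tokenizer's own decoder to produce pixels.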

C. Latent World Model

The world model is trained directly in this compact discrete latent space ($N \le 16$ tokens).

Training: It uses masked generative modeling to predict the next state $z_{t+1}$ given the current state $z_t$ and action $a_t$.
Planning: During inference, the model performs rollouts in the latent space. Because the token count is drastically reduced, the quadratic attention cost shrinks accordingly, enabling rapid simulation of future trajectories.
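Planning by latent rollout can be illustrated with a toy example. The sketch below uses random shooting, one simple form of MPC; the summary does not say which planner the authors use, so treat that choice as an assumption. `toy_world_model` is a hypothetical deterministic stand-in for the learned model, states are tuples of 8 token ids, and the cost is the Hamming distance between the final predicted state and a goal state.

```python
import random

def toy_world_model(state, action):
    """Hypothetical stand-in for the learned model: next 8-token state."""
    return tuple((tok + action * (i + 1)) % 16 for i, tok in enumerate(state))

def token_distance(a, b):
    """Hamming distance between two token sequences."""
    return sum(x != y for x, y in zip(a, b))

def plan(start, goal, horizon=4, num_candidates=256, actions=(-1, 0, 1), seed=0):
    """Random-shooting planner: sample action sequences, roll each out in
    latent space, and keep the one whose final state is closest to the goal."""
    rng = random.Random(seed)
    best_cost, best_seq = float("inf"), None
    for _ in range(num_candidates):
        seq = [rng.choice(actions) for _ in range(horizon)]
        state = start
        for action in seq:
            state = toy_world_model(state, action)   # one cheap latent step
        cost = token_distance(state, goal)
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq, best_cost

start = (0,) * 8
goal = tuple((i + 1) % 16 for i in range(8))   # reachable with net action +1
best_seq, best_cost = plan(start, goal)
print("best action sequence:", best_seq, "cost:", best_cost)
```

The planner calls the world model `horizon * num_candidates` times per decision, which is why the per-step cost of the model dominates planning latency.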

3. Key Contributions

Extreme Compression for Planning: Demonstrates that world models can operate effectively with 8–16 tokens per frame, roughly 49x to 98x fewer than state-of-the-art baselines (784 tokens), without losing planning-critical information.
Semantic-First Tokenization: Introduces a novel architecture using frozen vision foundation models as encoders. This shifts the inductive bias from "reconstruction" to "semantic abstraction," naturally filtering out irrelevant high-frequency noise.
Generative Decoding Strategy: Proposes a two-stage decoding process (Compact $\to$ High-Res Latent $\to$ Pixel) that bypasses the limitations of deterministic reconstruction from extreme compression.
Efficiency Breakthrough: Achieves order-of-magnitude speedups in planning latency (up to 40x–80x) while maintaining or improving planning accuracy compared to models using hundreds of tokens.

4. Experimental Results

The authors evaluated CompACT on navigation (RECON, SCAND, HuRoN) and robot manipulation (RoboNet) tasks.

Planning Performance (Navigation):
- On the RECON benchmark, the CompACT-based world model achieved planning accuracy (ATE/RPE) comparable to the SD-VAE baseline (784 tokens) while cutting planning latency from ~178s to ~4.8s per trajectory, a ~37x reduction (rounded to ~40x).
- It outperformed FlexTok (a flexible tokenizer) at both 16 and 64 token configurations, suggesting that how tokens are compressed matters more than the token count alone.
Action Relevance (Manipulation):
- An Inverse Dynamics Model (IDM) trained on CompACT tokens (16 tokens) achieved higher accuracy ($R^2 = 0.716$) than one trained on the baseline tokenizer (256 tokens, $R^2 = 0.684$). This suggests that CompACT tokens preserve action-relevant dynamics better than the denser baseline tokens.
Video Prediction:
- On RoboNet, CompACT reduced Action Prediction Error (APE) by 3x compared to the 256-token baseline while generating videos 5.2x faster.
Qualitative Analysis:
- Attention maps show that each compact token focuses on semantically coherent regions (e.g., specific objects or end-effectors) rather than distributing information uniformly across the image.
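For readers who want to check the headline numbers, the short sketch below recomputes the latency ratio from the figures quoted above and shows the standard coefficient of determination ($R^2$) on made-up data. The summary does not say exactly how the authors compute $R^2$ (for example, how it is aggregated over action dimensions), so the textbook definition used here is an assumption, and the sample values are purely illustrative.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Latency figures quoted in the results above (seconds per trajectory).
baseline_latency_s = 178.0
compact_latency_s = 4.8
speedup = baseline_latency_s / compact_latency_s
print(f"planning speedup: {speedup:.1f}x")

# Made-up actions and predictions, only to show how the score behaves.
true_actions = [0.0, 1.0, 2.0, 3.0]
predicted_actions = [0.1, 0.9, 2.2, 2.8]
print(f"toy R^2: {r_squared(true_actions, predicted_actions):.2f}")
```

An $R^2$ of 1 means the predicted actions match the true ones exactly, while 0 means the predictor does no better than always outputting the mean action.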

5. Significance

This work challenges the prevailing paradigm in world modeling that "more tokens = better performance."

Paradigm Shift: It presents evidence that, for planning tasks, semantic abstraction can serve better than photorealistic reconstruction.
Real-World Viability: By reducing planning latency from minutes to seconds, CompACT makes the deployment of world models for real-time robotic control and autonomous navigation practically feasible.
Scalability: The compact latent space allows for scaling up world model capacity (e.g., larger Transformers) without incurring prohibitive computational costs, as the token sequence length remains small.

In summary, CompACT shows that by prioritizing decision-critical semantics over visual fidelity, we can build world models that are not only faster but also more effective at planning in complex environments.