Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model

The paper introduces CompACT, a compact discrete tokenizer that compresses observations into just 8 tokens to enable computationally efficient, real-time decision planning within world models while maintaining competitive performance.

Dongwon Kim, Gawon Seo, Jinsung Lee, Minsu Cho, Suha Kwak

Published 2026-03-06

Imagine you are trying to teach a robot to navigate a maze or pick up a cup. To do this, the robot needs a "World Model": a mental simulation of how the world works. It needs to ask itself: "If I move my arm forward, what will the cup look like in the next second?"

For a long time, AI researchers have tried to build these mental models by creating perfect, photorealistic movies in their heads. They want the simulation to look exactly like a high-definition camera recording, capturing every shadow, texture, and speck of dust.

The Problem:
Creating these perfect movies is incredibly slow and expensive. It's like trying to plan your day by watching a 4K movie of every possible future scenario in real time. By the time the robot figures out what to do, it's already too late. The "movie" takes too long to render.

The Solution: CompACT (The "Sketch" Approach)
This paper introduces CompACT, a new way to build these mental models. Instead of trying to remember every single pixel of a photo, CompACT teaches the AI to remember the world using just 8 tiny "tokens" (think of them as 8 digital Lego bricks).

Here is how it works, using a few analogies:

1. The "Sketch vs. Photo" Analogy

Imagine you need to give directions to a friend.

  • The Old Way (SD-VAE): You send them a 4K photograph of the street. It has the exact color of the bricks, the way the light hits the windows, and the texture of the asphalt. It's beautiful, but it's huge and takes forever to send.
  • The CompACT Way: You send them a simple stick-figure sketch with 8 lines. It doesn't show the brick texture, but it clearly shows where the wall is, where the door is, and where the obstacle is.

For planning a route, the sketch is actually better. You don't need to know the texture of the wall to know you can't walk through it. You just need to know the wall exists. CompACT strips away the "boring" details (textures, lighting) and keeps only the "important" details (objects, locations, relationships).

2. The "Frozen Brain" Trick

How does the AI know what to keep and what to throw away?
Usually, AI learns to draw by trying to copy a photo perfectly. But CompACT uses a pre-trained "frozen brain" (a model called DINOv3) that already understands the world.

  • Think of this frozen brain as an expert art teacher who has seen millions of paintings.
  • CompACT doesn't ask the teacher to redraw the painting. Instead, it asks the teacher: "What are the 8 most important things in this picture?"
  • Because the teacher already understands "objects" and "space," it ignores the paint texture and says, "It's a cat sitting on a chair."
  • CompACT saves just that sentence (8 tokens) instead of the whole painting.
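
To make "saving just 8 tokens" concrete, here is a minimal sketch of one common way to turn continuous vectors into discrete tokens: nearest-neighbor lookup in a codebook (vector quantization). This summary does not spell out how CompACT quantizes, so treat this as an illustration rather than the paper's method. The codebook, its size, and the eight input vectors are made-up toy values.

```python
import math

def quantize(vectors, codebook):
    """Map each continuous vector to the index of its nearest codebook entry."""
    tokens = []
    for v in vectors:
        distances = [math.dist(v, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens

# Hypothetical toy codebook: 4 entries in 2-D (a real codebook would be far larger).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Eight continuous "summary" vectors for one image. These are assumed stand-ins
# for whatever the frozen encoder plus a small learned module would produce.
SUMMARY = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.95, 0.9),
           (0.4, 0.1), (0.6, 0.45), (0.1, 0.6), (0.7, 0.8)]

TOKENS = quantize(SUMMARY, CODEBOOK)
print(TOKENS)  # [0, 1, 2, 3, 0, 1, 2, 3]
```

The whole image is now represented by eight small integers. Everything the codebook cannot express (texture, lighting) is simply gone, which is the point.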

3. The "Magic Reconstructor"

You might ask: "If we only have 8 tokens, how do we see the image again?"
This is the cleverest part. CompACT uses a Generative Decoder.

  • Think of the 8 tokens as a recipe card.
  • The recipe doesn't contain the cake itself; it just says "Flour, Sugar, Eggs."
  • When the robot needs to see the image, it hands the recipe (the 8 tokens) to a Magic Baker (a pre-trained generator) who already knows how to bake. The Baker reads the recipe and bakes a cake that looks like the original, even though the recipe was tiny.
  • The AI doesn't need to remember the cake; it just needs the recipe to know what to do next.
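
The idea that planning only needs the recipe can be sketched as a tiny planner that works entirely on tokens and never decodes an image. Everything here is a hypothetical toy: `predict_next` stands in for a learned world model, the action set is invented, and scoring by the number of mismatched tokens (Hamming distance) is an assumption, not the paper's objective.

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # hypothetical discrete actions: step left, stay, step right

def predict_next(tokens, action):
    """Toy stand-in for a learned world model: token 0 tracks a position 0..9."""
    position = min(9, max(0, tokens[0] + action))
    return (position,) + tokens[1:]

def cost(tokens, goal_tokens):
    """Hamming distance: how many of the 8 tokens differ from the goal."""
    return sum(a != b for a, b in zip(tokens, goal_tokens))

def plan(start, goal, horizon=3):
    """Try every action sequence, imagine its outcome in token space, keep the best."""
    best_seq, best_cost = None, float("inf")
    for seq in product(ACTIONS, repeat=horizon):
        tokens = start
        for action in seq:
            tokens = predict_next(tokens, action)  # no image is ever rendered
        c = cost(tokens, goal)
        if c < best_cost:
            best_seq, best_cost = seq, c
    return best_seq, best_cost

START = (0, 5, 5, 5, 5, 5, 5, 5)  # 8 tokens for the current observation
GOAL = (3, 5, 5, 5, 5, 5, 5, 5)   # 8 tokens for the goal observation
BEST_SEQ, BEST_COST = plan(START, GOAL)
print(BEST_SEQ, BEST_COST)  # (1, 1, 1) 0
```

Each imagined future costs one pass over 8 integers per step, so a planner can afford to try many candidate action sequences. A decoder would only be called if a human wanted to look at the result.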

Why This Changes Everything

The paper tested this on navigation (walking through rooms) and robot arms (picking up objects).

  • Speed: Because the AI only has to process 8 tokens instead of hundreds, it plans 40 times faster. It's the difference between reading a novel and reading a headline.
  • Smarter Decisions: Surprisingly, the robot made better decisions with the sketch than with the photo. Why? Because the photo distracted the robot with irrelevant details (like a shiny floor), while the sketch forced it to focus on the goal (the door).
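
A rough back-of-envelope calculation shows why the token count matters so much. It assumes the world model is a transformer whose self-attention compares every token with every other token. The 256-token baseline and the 4-frame context are hypothetical numbers chosen for illustration, not figures from the paper.

```python
def attention_pairs(tokens_per_frame, context_frames):
    """Number of token-to-token comparisons in one full self-attention pass."""
    seq_len = tokens_per_frame * context_frames
    return seq_len * seq_len

CONTEXT_FRAMES = 4  # assumed number of past frames the model looks at

COMPACT = attention_pairs(8, CONTEXT_FRAMES)     # 32 tokens in the sequence
BASELINE = attention_pairs(256, CONTEXT_FRAMES)  # 1024 tokens in the sequence
RATIO = BASELINE // COMPACT

print(COMPACT, BASELINE, RATIO)  # 1024 1048576 1024
```

This counts only attention comparisons, so it should not be expected to match the measured 40x figure, which is an end-to-end planning speedup that includes other costs. It only shows that shrinking the token count shrinks the work much faster than linearly.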

The Bottom Line

CompACT is like teaching an AI to think in stick figures instead of photographs. By compressing the world into just 8 tiny, meaningful pieces of information, it allows robots to simulate the future fast enough for real-time planning and control while keeping performance competitive.

It proves that to be a good planner, you don't need to remember everything perfectly; you just need to remember the right things.