Imagine you are trying to teach a robot to draw a picture, but instead of giving it a blank canvas and a full set of instructions all at once, you have to tell it what to draw one word at a time, just like reading a sentence from left to right.
This is the challenge of Autoregressive (AR) Vision Models. They want to generate images like language models generate text: step-by-step, causally. But images are messy. They are 2D grids of pixels, not 1D lines of words.
Here is the story of CaTok, a new method that solves this problem, explained through a few simple analogies.
1. The Problem: The "Flat" vs. The "Flow"
Imagine you have a photo of a cat.
- Old Methods (The "Flattened" Approach): To make the robot understand the cat, old methods cut the photo into tiny squares (patches) and lay them out in one long, flat line. But that line's order is arbitrary: the robot might see the tail before the head, or the ear before the eye. It lacks causality (a logical order). It's like trying to read a book whose pages were bound in an order that has nothing to do with the plot.
- The "Flow" Problem: Other methods tried to fix this by only showing the robot the first few squares and guessing the rest. But this is like giving a student only the first sentence of a story and asking them to finish the whole novel. It creates an imbalance: the robot gets too much information about the beginning and not enough about the end, leading to blurry or weird pictures.
2. The Solution: CaTok (The "Time-Traveling" Tokenizer)
The authors created CaTok (Causal Tokenizer). Think of CaTok as a smart time machine for images.
Instead of cutting the image into static squares, CaTok treats the image generation process like a movie playing in reverse.
- The Concept: Imagine a movie where the final frame is a clear picture of a cat, and the first frame is just static noise (snow on an old TV).
- The Innovation: CaTok doesn't just look at the whole movie at once. It picks a specific time interval (say, from minute 10 to minute 20) and says, "Okay, during this specific chunk of time, here are the 1D tokens (the 'words') that describe what happens."
- The "MeanFlow" Magic: They use a math trick called MeanFlow. Instead of trying to guess the exact speed of the movie at one single instant (which is hard and error-prone), they calculate the average speed over that time interval.
- Analogy: If you want to know how fast a car traveled from New York to Boston, you don't need to know its speed at every single second. You just need the total distance and total time. CaTok does this for images. It learns the "average flow" of the image coming into existence.
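The "average speed" idea can be made concrete with a toy numerical sketch. Everything here is made up for illustration (a hypothetical 1-D velocity field, not anything from the paper; a real MeanFlow model *learns* the average velocity from data), but it shows why the average beats the instant: one big step using the speed at a single instant misses the endpoint, while one step using the average speed lands on it exactly.

```python
import numpy as np

# Hypothetical toy 1-D flow: dz/dt = -2tz, whose exact solution is
# z(t) = z(r) * exp(r**2 - t**2). This stands in for the "movie" of an
# image coming into existence; CaTok's setting is high-dimensional.
def v(z, t):
    return -2.0 * t * z          # instantaneous velocity at a single moment

r, t = 0.0, 1.0                  # one time interval of the "movie"
z_r = 1.0                        # state at the start of the interval
z_t_exact = z_r * np.exp(r**2 - t**2)   # where the flow really ends up

# One big step using the INSTANTANEOUS velocity at the start: inaccurate,
# because the speed changes along the way.
z_t_euler = z_r + (t - r) * v(z_r, r)

# The AVERAGE velocity over [r, t] -- MeanFlow's target quantity.
# Like the car trip: total displacement divided by total time.
u = (z_t_exact - z_r) / (t - r)

# One step with the average velocity lands exactly on the endpoint.
# (In MeanFlow, a network is trained to predict u; here we computed it.)
z_t_mean = z_r + (t - r) * u

print(f"exact endpoint    : {z_t_exact:.6f}")
print(f"1-step instant-v  : {z_t_euler:.6f}")
print(f"1-step average-v  : {z_t_mean:.6f}")
```

The point of the sketch: the instantaneous velocity at t=0 happens to be zero here, so the naive one-step jump goes nowhere, while the average-velocity jump crosses the whole interval correctly.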
3. Why This is a Big Deal
Because CaTok learns the "average flow" over specific time intervals, it solves two huge problems at once:
- It's Causal (Logical Order): The tokens naturally follow a sequence. Token 1 describes the earliest phase of the movie (noise turning into rough shapes), Token 2 describes the next phase (rough shapes turning into recognizable parts), and so on. It's like reading a sentence where the words actually make sense in order.
- It's Fast AND High Quality:
- The "One-Step" Superpower: Because it learned the average speed, the robot can sometimes skip the whole movie and jump straight to the end result in one single step. It's like knowing your average speed and travel time: you can work out exactly where you'll end up without driving a single mile in between.
- The "Multi-Step" Quality: If you want a perfect, high-definition picture, it can still play the movie step-by-step (25 steps) to get every detail right.
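The one-step versus 25-step trade-off can be sketched as a toy sampler. The closed-form `u` below is a stand-in for the learned average-velocity network (it is the exact average velocity of a hypothetical 1-D flow dz/dt = -2tz, not anything from the paper), and the whole "image" is simplified to a single number. With an exact `u`, one step and 25 steps land in the same place; with a learned, imperfect `u`, more steps let errors average out, which is the paper's quality knob.

```python
import math

def u(z_r, r, t):
    """Average velocity over [r, t] for the toy flow dz/dt = -2tz,
    given the state z_r at the start of the interval.
    (A trained MeanFlow network would PREDICT this quantity.)"""
    return z_r * (math.exp(r**2 - t**2) - 1.0) / (t - r)

def sample(z0, steps):
    """Play the noise-to-image 'movie' from t=0 to t=1 in `steps` chunks,
    crossing each chunk with a single average-velocity jump."""
    z, r = z0, 0.0
    dt = 1.0 / steps
    for _ in range(steps):
        t = min(r + dt, 1.0)
        z = z + (t - r) * u(z, r, t)   # jump across [r, t] in one move
        r = t
    return z

z0 = 1.0                          # the "static noise" starting frame
one_step  = sample(z0, steps=1)   # teleport straight to the end
many_step = sample(z0, steps=25)  # play the movie scene by scene
# Here both agree exactly, because u is exact; a learned u makes
# the multi-step version the higher-quality (but slower) option.
```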
4. The "Teacher" (REPA-A)
To make sure the robot learns quickly and doesn't get confused, the authors added a special helper called REPA-A.
- Analogy: Imagine a student (CaTok) trying to learn to draw. Instead of just practicing alone, they have a master artist (a pre-trained Vision Foundation Model) sitting next to them.
- Every time the student draws a rough sketch, the master artist whispers, "Hey, that shadow looks a bit off; make it more like a real cat's shadow."
- This "whispering" (alignment) helps the student learn much faster and produce better results without needing to practice for years.
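The "whispering" is, concretely, a feature-alignment loss. REPA-style objectives in the literature typically push the student's intermediate features toward a frozen vision foundation model's features via cosine similarity; the sketch below uses that generic form with random arrays standing in for real features (it is not the paper's exact loss).

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4, 16    # toy sizes: N patch features, each of dimension D

# Frozen VFM features (the "master artist") and the tokenizer's own
# features (the "student", here simulated as teacher + noise).
teacher = rng.normal(size=(N, D))
student = teacher + 0.5 * rng.normal(size=(N, D))

def alignment_loss(s, t):
    """1 minus the mean cosine similarity between feature sets:
    0 when perfectly aligned, 2 when pointing in opposite directions."""
    s = s / np.linalg.norm(s, axis=-1, keepdims=True)
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)
    return 1.0 - np.mean(np.sum(s * t, axis=-1))

loss = alignment_loss(student, teacher)
# Minimizing this loss during training pulls the student's features
# toward the teacher's -- the "whisper" that speeds up learning.
```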
5. The Results: What Can It Do?
The paper shows that CaTok is a game-changer:
- Reconstruction: If you give CaTok an image, it can compress it into a tiny list of numbers (tokens) and rebuild the image almost perfectly. It beat previous records for image quality (PSNR and SSIM).
- Generation: When used to create new images from scratch (like "draw a cat"), it produces images that look just as good as the best current methods, but it does it with a much simpler, more logical structure.
- Flexibility: It can generate an image in a split second (1 step) for speed, or take a little longer (25 steps) for perfection.
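For the curious, the two reconstruction metrics named above are standard. PSNR (Peak Signal-to-Noise Ratio, in dB) measures pixel-level closeness, where higher is better; SSIM measures perceived structural similarity and is available in libraries such as scikit-image. A minimal PSNR implementation on a toy image (the images here are random stand-ins, not the paper's data):

```python
import numpy as np

def psnr(original, reconstruction, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB; higher means a closer reconstruction."""
    diff = original.astype(np.float64) - reconstruction.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)       # "original"
noise = rng.integers(-5, 6, size=img.shape)                      # small error
noisy = np.clip(img.astype(np.int16) + noise, 0, 255).astype(np.uint8)

print(f"PSNR of noisy copy: {psnr(img, noisy):.1f} dB")
```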
Summary
CaTok is like a new way of teaching a computer to "read" and "write" images.
- Old way: Shuffling a deck of cards and hoping the picture makes sense.
- CaTok way: Reading a story where the plot flows naturally from beginning to end, allowing the computer to either read the whole story slowly for detail or skip to the ending instantly for speed.
It bridges the gap between how computers understand text (word by word) and how they understand images, paving the way for future AI that can generate videos and complex scenes as easily as it writes a sentence.