Imagine you are trying to teach a robot how to paint a picture, one brushstroke at a time. The robot is very smart, but it has a very specific way of thinking: it can only look at what it has already painted to decide what to paint next. It cannot see the future. This is how "Autoregressive" models work—they predict the next piece of a puzzle based only on the pieces already placed.
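In code, the "can only look at what it has already painted" rule means generation is a simple loop whose predictor only ever sees the prefix built so far. A toy sketch (the `next_token` rule here is an invented stand-in for a trained model, not anything from the paper):

```python
def next_token(prefix):
    # Toy "model": the next token is the sum of the prefix modulo 10.
    # A real autoregressive model would be a neural network, but the
    # key property is the same: it receives ONLY the prefix.
    return sum(prefix) % 10

def generate(seq_len, start_token=1):
    tokens = [start_token]
    while len(tokens) < seq_len:
        tokens.append(next_token(tokens))  # depends only on the past
    return tokens

print(generate(5))  # [1, 1, 2, 4, 8]
```

No matter how the predictor is implemented, nothing in the loop can reach forward: position `i` is fixed before position `i + 1` is even considered.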
However, there's a problem with how we usually give these robots "puzzle pieces" (which are called tokens in AI).
The Problem: The "Backwards-Reading" Dictionary
In the past, when we turned an image into a sequence of tokens for the robot, we used a method that was like a group project. Every single token in the sequence was allowed to peek at all the other tokens (this is called bidirectional attention), including the ones that hadn't been painted yet.
- The Analogy: Imagine you are writing a story, but every time you write a sentence, you are allowed to read the entire rest of the book to help you decide what to write next.
- The Result: The "dictionary" (the tokenizer) becomes incredibly efficient at describing the picture perfectly. But it creates a huge problem for the robot. When the robot tries to paint the picture, it asks, "What comes next?" and the answer it gets is, "It depends on the part of the picture you haven't painted yet!"
- The Confusion: Since the robot can't see the future, it gets confused. It's like trying to solve a math problem where the answer depends on a number you haven't calculated yet. The robot struggles, the images look blurry, and it takes a long time to figure things out.
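The "group project" tokenizer above corresponds to a bidirectional attention mask, while the robot's forward-only view is a causal (lower-triangular) mask. A minimal NumPy sketch of the two, where `mask[i][j] = 1` means token `i` may look at token `j`:

```python
import numpy as np

def attention_mask(n, causal):
    # mask[i, j] = 1 means token i may attend to token j
    if causal:
        # lower-triangular: each token sees only positions <= its own
        return np.tril(np.ones((n, n), dtype=int))
    # bidirectional: every token sees every other token, future included
    return np.ones((n, n), dtype=int)

# Bidirectional tokenizer: token 0 can "peek" at tokens 1, 2, 3
print(attention_mask(4, causal=False)[0])  # [1 1 1 1]
# Causal rule: token 0 sees only itself
print(attention_mask(4, causal=True)[0])   # [1 0 0 0]
```

The mismatch in the story is exactly this: the tokens were created under the all-ones mask, but the robot can only ever use the triangular one.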
The Solution: AliTok (The Aligned Tokenizer)
The authors of this paper, "Towards Sequence Modeling Alignment Between Tokenizer and Autoregressive Model," invented a new way to prepare these puzzle pieces. They call it AliTok.
Think of AliTok as a strict teacher who forces the group project to change its rules.
- The "Causal" Rule: The teacher tells the tokens: "You can look at everyone before you, but you are forbidden from looking at anyone after you."
- The Training Trick: To make sure the picture still looks perfect (high quality), they use a multi-step training process:
- Step 1: They train the "encoder" (the part that turns the image into tokens) using a strict "no-peeking-forward" rule. This forces the tokens to pack all the necessary information into themselves so that the next token doesn't need to peek ahead to understand the picture.
- Step 2: They add a special "helper" (prefix tokens) for the very first row of the image, which is usually the hardest part to guess because it has no history.
- Step 3: Once the tokens are trained to be "forward-thinking," they swap in a super-powerful decoder just to make sure the final picture looks crystal clear.
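The "helper" in Step 2 can be pictured as extra positions prepended before the image tokens: the mask stays strictly causal, so there is still no peeking forward, but even the very first image token now has some left context to lean on. The prefix count and layout below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def causal_mask_with_prefix(num_prefix, num_image):
    # Prefix tokens occupy the first num_prefix positions; the mask is
    # still lower-triangular, so nothing ever attends to the future.
    n = num_prefix + num_image
    return np.tril(np.ones((n, n), dtype=int))

m = causal_mask_with_prefix(num_prefix=2, num_image=4)
# Without a prefix, the first image token would see only itself (1
# position); with 2 prefix tokens ahead of it, it attends to 3.
print(int(m[2].sum()))  # 3
```

This is why the hardest-to-guess early tokens benefit: they start with borrowed context instead of an empty history.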
The Magic Result
By doing this, the "dictionary" and the "robot" finally speak the same language.
- Before: The robot was trying to guess the next step while the dictionary was whispering secrets about the future. The robot was slow and made mistakes.
- After: The dictionary gives the robot a clear, logical path. "Here is what you have, and here is exactly what comes next."
The Payoff:
The paper shows that with this new method, a relatively small robot (model) can paint pictures that are better than those produced by the massive, slow models behind the most advanced "Diffusion" approaches (the current state of the art).
- Speed: It's 10 times faster. If the old way took 10 seconds to paint a picture, AliTok does it in 1 second.
- Quality: The images are sharper and more detailed.
- Simplicity: They didn't need to invent a complex new robot; they just fixed the dictionary so the old, simple robot could do amazing things.
In a Nutshell
The paper is about alignment. The authors realized that the way we were translating images into data was fighting against the way the AI learns. By changing the translation method to match the AI's natural "forward-only" thinking style, they unlocked a new level of speed and quality, proving that sometimes the best way to build a better AI isn't to make the brain bigger, but to teach it a better language.