Latent Denoising Makes Good Tokenizers

The paper introduces Latent Denoising Tokenizer (l-DeTok), a method that aligns tokenizer embeddings with the denoising objective to produce latent representations robust to corruption, thereby consistently improving image generation quality across various models and benchmarks.

Jiawei Yang, Tianhong Li, Lijie Fan, Yonglong Tian, Yue Wang

Published 2026-02-17

Imagine you are trying to teach a robot artist how to paint beautiful pictures. To do this, you can't just show it raw pixels (millions of tiny colored dots); that's too much information. Instead, you need to give the robot a "compressed language" or a set of tokens—like a shorthand vocabulary—that represents the image in a compact way.

For a long time, the standard way to create this vocabulary was to teach the robot: "Here is a picture. Compress it into a few words. Now, try to draw the exact same picture back from those words."

The problem? The robot got really good at memorizing the picture, but it was fragile. If you gave it a slightly blurry or corrupted version of those "words," it would panic and fail to draw anything good.

The Big Idea: "Latent Denoising" (l-DeTok)

The authors of this paper asked a simple question: What if we taught the robot a different skill?

Instead of just saying, "Reconstruct the picture perfectly," they said: "Here is a picture, but I'm going to smash it, blur it, and hide parts of it. Now, can you still figure out what the original picture was?"

They call this Latent Denoising.

The Analogy: The "Broken Pencil" Test

Think of the tokenizer (the vocabulary creator) as a teacher preparing a student for a difficult exam.

  • The Old Way (Standard Tokenizers): The teacher gives the student a perfect, clean textbook. The student memorizes it. On the exam, if the question is slightly different or the paper is smudged, the student freezes.
  • The New Way (l-DeTok): The teacher takes the textbook, rips out random pages, scribbles over the text with a marker, and mixes in some random noise. Then, they ask the student: "Based on this destroyed version, can you tell me what the original story was?"

The student has to learn the essence of the story, not just memorize the words. They become robust. They learn to ignore the noise and focus on the core meaning.

How It Works in the Paper

  1. The Corruption: During training, the computer takes the "compressed words" (latent embeddings) of an image and deliberately ruins them. It does this in two ways:
    • Interpolative Noise: It blends the clean latents with random Gaussian noise (like mixing more and more static into a clear radio signal until only noise remains).
    • Masking: It hides random chunks of the latent "words" themselves (like putting a hand over parts of a photo).
  2. The Reconstruction: The decoder takes these ruined "words" and tries to draw the original, clean image back.
  3. The Result: Because the tokenizer was trained to handle "disaster," the "words" it produces are incredibly strong and stable.
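The two corruptions above can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's actual code: the noise schedule, the mask ratio, and the use of zeros as a stand-in for a learned mask embedding are all assumptions made here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_latents(z, max_noise=0.7, mask_ratio=0.5):
    """Illustrative sketch of the two corruptions described above."""
    # 1) Interpolative noise: blend each latent with Gaussian noise.
    #    gamma controls how much noise is mixed in (sampled per example).
    gamma = rng.uniform(0.0, max_noise)
    noise = rng.standard_normal(z.shape)
    z_noisy = (1.0 - gamma) * z + gamma * noise
    # 2) Masking: hide a random subset of token embeddings entirely.
    #    (zeros here stand in for a learned [MASK] embedding)
    num_tokens = z.shape[0]
    num_masked = int(mask_ratio * num_tokens)
    masked_idx = rng.choice(num_tokens, size=num_masked, replace=False)
    z_noisy[masked_idx] = 0.0
    return z_noisy

# Toy example: 16 latent tokens, each a 4-dim embedding.
z_clean = rng.standard_normal((16, 4))
z_corrupted = corrupt_latents(z_clean)
```

During training, the decoder would then reconstruct the image from `z_corrupted`, and the loss would compare that reconstruction against the original image, forcing the latents to carry enough "essence" to survive the damage.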

Why This Matters for AI Art

Modern AI art generators (like the ones that make images from text) work by a similar process: they start with random noise and slowly "denoise" it until an image appears.
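That "start from noise and gradually denoise" loop can be sketched as follows. Everything here is a stand-in: the shapes, the step count, the blending schedule, and the zero "clean estimate" (which a real trained generator would actually predict) are fabricated for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Start from pure random noise in the tokenizer's latent space
# (shapes and step count are illustrative, not from the paper).
z = rng.standard_normal((16, 4))

def denoise_step(z_t, step, total_steps):
    """Stand-in for one denoising step of a trained generator:
    nudge the current latents toward a clean estimate."""
    z_clean_estimate = np.zeros_like(z_t)  # a real model would predict this
    alpha = 1.0 / (total_steps - step)     # move progressively closer
    return (1.0 - alpha) * z_t + alpha * z_clean_estimate

total_steps = 10
for step in range(total_steps):
    z = denoise_step(z, step, total_steps)

# After the final step, the latents have collapsed onto the clean estimate;
# the tokenizer's decoder would then turn latents like these into an image.
```

The key point of the paper is that the intermediate `z` values in this loop are always somewhat noisy, so a tokenizer whose latents were trained to survive noise is a much better fit for this process.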

  • Before: The AI generator had to work hard to fix weak, fragile "words" from the tokenizer. It was like trying to build a house on a shaky foundation.
  • Now: Because the tokenizer was trained to survive heavy corruption, the "words" it gives the AI generator are like a solid, reinforced concrete foundation. The generator can build much better, sharper, and more realistic images with less effort.

The Results: A Supercharged Foundation

The paper tested this new method on six different types of AI art generators. The results were impressive:

  • Better Quality: The images looked significantly more realistic, as measured by lower FID (Fréchet Inception Distance) scores, a standard metric for how statistically close generated images are to real photos.
  • No Magic Required: Unlike methods that distill knowledge from massive, pre-trained vision models (a technique called "semantic distillation"), this method trains the tokenizer from scratch with a simple denoising objective. It's a "self-taught" genius.
  • Versatility: It worked for both "Autoregressive" models (which generate an image token by token) and "Diffusion" models (which refine the whole picture at once).
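For readers who want the metric behind "lower FID scores" made precise: FID fits a Gaussian to deep features of real and generated images and measures the distance between the two distributions,

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
```

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of Inception-network features for real and generated images, respectively. Lower is better; identical distributions give an FID of zero.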

The Takeaway

The paper's main insight is simple but profound: To make a good generator, you need a tokenizer that has been "stress-tested."

Just like a firefighter trains in a burning building so they can handle a real fire, this tokenizer trains on "burnt" (noisy) data so it can handle the messy, noisy process of generating new art. By making the tokenizer robust against corruption, the whole AI system becomes better at creating beauty.

In short: They made the AI's "vocabulary" tougher by teaching it to speak clearly even when the microphone is broken. And because the vocabulary is so strong, the AI can now paint masterpieces.
