Imagine you want to teach a robot to paint a masterpiece, like a perfect photo of a cat, from scratch.
For the last few years, the best way to do this has been a two-step process involving a translator.
- The Translator (VAE): First, you teach a translator to turn the real photo of a cat into a compact secret code (a point in a "latent space"). This code is short, efficient, and easy for the painter to work with.
- The Painter (Diffusion Model): Then, you teach the painter to learn from that secret code. Once the painter is done, you use the translator again to turn the code back into a real photo.
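To make the two-step pipeline concrete, here is a minimal sketch in plain NumPy. The "translator" below is not a real learned VAE; it just averages 8x8 blocks of pixels to mimic the compression bottleneck, and the decoder blows the code back up. All function names here are illustrative, not from the paper.

```python
import numpy as np

def encode(image, factor=8):
    """Toy 'VAE' encoder: compress the image by averaging 8x8 pixel blocks.
    A real VAE is a learned network; this just illustrates the bottleneck."""
    h, w = image.shape
    return image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def decode(latent, factor=8):
    """Toy decoder: blow the secret code back up by repeating each value."""
    return np.repeat(np.repeat(latent, factor, axis=0), factor, axis=1)

image = np.random.rand(64, 64)          # stand-in for a 64x64 photo
latent = encode(image)                  # the "secret code": 8x8 instead of 64x64
reconstruction = decode(latent)

print(latent.shape)                     # (8, 8) -- 64x fewer numbers
print(float(np.abs(image - reconstruction).mean()))  # nonzero: detail lost in the round trip
```

Even in this toy version, the round trip through the small code throws information away, which is exactly the weakness the next paragraph describes.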
The Problem: The translator is hard to train. It often makes mistakes, losing details or blurring the image. Worse, the painter can only ever be as good as the translator: if the translator is bad, the painter can't be great. It's like trying to paint a masterpiece from a blurry sketch; you can't fix the blur later.
The Paper's Big Idea: "There is No VAE"
This paper says: "Let's fire the translator." They want to teach the painter to work directly on the high-resolution photo (the "pixel space") without any secret codes.
But here's the catch: Teaching a painter to work directly on a giant, detailed photo is incredibly hard. It's like trying to learn to paint by staring at a million tiny dots of color all at once. It takes forever, and the robot gets confused.
The Solution: A Two-Stage Training Camp
The authors came up with a clever two-stage training method, inspired by how humans learn to recognize objects.
Stage 1: The "Concept Teacher" (Pre-training)
Imagine you are teaching a student to recognize a cat.
- The Old Way: You show them a clear photo of a cat.
- The Paper's Way: You show them a photo of a cat that is covered in heavy static noise (like a broken TV screen).
- The Trick: You ask the student: "Even though this image is full of static, can you tell me what the 'essence' of the cat is? And can you tell me how that essence changes as the static slowly clears up?"
The model learns to ignore the noise and focus on the meaning of the image (the shape, the ears, the tail). It learns that a noisy cat and a clean cat are the same "thing," just at different stages of clarity. This is done with a technique called Self-Supervised Learning: the model teaches itself by comparing different versions of the same image.
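The idea above can be sketched as a tiny self-supervised objective: corrupt the same image at two different noise levels, run both through the encoder, and penalize the gap between the two "concept" vectors. The encoder here is a single random linear map and the loss is a simple squared difference; the real model and objective are far richer, so treat every name below as an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(image, t):
    """Diffusion-style corruption: blend the image with Gaussian static.
    t=0 is the clean photo, t=1 is pure static."""
    return (1 - t) * image + t * rng.standard_normal(image.shape)

def encoder(image, W):
    """Toy 'concept teacher': one linear layer mapping pixels to a concept vector."""
    return W @ image.ravel()

def alignment_loss(image, W, t1=0.3, t2=0.7):
    """Self-supervised objective: the same image at two noise levels
    should map to similar concept vectors."""
    z1 = encoder(add_noise(image, t1), W)
    z2 = encoder(add_noise(image, t2), W)
    return float(np.mean((z1 - z2) ** 2))

image = rng.random((16, 16))
W = rng.standard_normal((32, 256)) * 0.05   # untrained encoder weights
print(alignment_loss(image, W))             # the scalar training would shrink
```

Training would adjust `W` to drive this loss down, which is what forces the encoder to see a noisy cat and a clean cat as the same underlying concept.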
Stage 2: The "Master Painter" (Fine-tuning)
Now that the student has learned the concepts of cats, dogs, and cars from the noisy images, you give them a blank canvas and a brush.
- You pair this "Concept Teacher" (the Encoder) with a brand new "Pixel Generator" (the Decoder).
- You tell them: "Use your knowledge of cat concepts to paint the actual pixels of the cat."
- Because the "Concept Teacher" already knows what a cat looks like, the "Pixel Generator" doesn't have to guess. It just has to fill in the details.
Why This is a Big Deal (The Analogies)
The "Direct Line" vs. The "Detour":
- Old Way (VAE): You have to drive from New York to LA, but you have to stop in a tiny, cramped town (the latent space) to change cars. The town is small, so you have to leave your luggage (details) behind.
- New Way (EPG): You drive straight from New York to LA in a luxury limo. You don't stop, you don't lose luggage, and you get there faster.
The "Efficiency" Miracle:
- The paper claims their new method is 30% cheaper to train than the current best methods (like DiT), yet it produces better pictures.
- It's like finding a way to bake a perfect cake using half the flour and half the time, without needing a special oven.
The "One-Shot" Wonder (Consistency Models):
- Most AI image generators are like a sculptor chipping away stone: they need many, many steps (hundreds of tiny adjustments) to get the shape right.
- This paper trained a model that can generate a high-quality image in one single step. It's like having a sculptor who can look at a block of stone and instantly snap it into a perfect statue. This is the first time this has been done successfully on high-resolution images without using a translator (VAE).
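The sculptor analogy can be made concrete by comparing the two sampling loops. Below, the "diffusion" path takes hundreds of tiny steps toward the image, while the "consistency" path jumps there in one call. The toy consistency model is idealized (it simply returns the target), so this only illustrates the step-count difference, not how such a model is trained.

```python
import numpy as np

rng = np.random.default_rng(2)
target = rng.random((8, 8))             # the "statue" hidden in the stone

def denoise_step(x, strength=0.05):
    """One tiny chisel stroke: nudge the image slightly toward the target.
    (A real diffusion model predicts this direction from data.)"""
    return x + strength * (target - x)

def consistency_model(x):
    """Idealized consistency model: maps any noisy input straight to the answer."""
    return target.copy()

noise = rng.standard_normal((8, 8))

# The sculptor's way: hundreds of small adjustments.
x = noise.copy()
for _ in range(200):
    x = denoise_step(x)

# The one-shot way: a single forward pass.
y = consistency_model(noise)

print(float(np.abs(x - target).mean()))  # tiny after 200 steps
print(float(np.abs(y - target).mean()))  # exactly zero in one step (idealized)
```

The practical payoff is the loop count: 200 model evaluations versus one, which is why one-step generation matters so much for speed.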
The Results
- Quality: The images are sharper and more realistic than previous attempts at direct pixel painting.
- Speed: It generates images much faster because it doesn't need to decode a secret code.
- Cost: It saves a massive amount of computer power (money and energy).
In a Nutshell
The authors realized that instead of forcing AI to learn a "secret language" (VAE) to understand images, we should teach it to understand the meaning of images directly, even when they are messy and noisy. Once it understands the meaning, it can paint the picture perfectly, quickly, and without needing any middlemen.
They call their model EPG (End-to-end Pixel-space Generative model), and it's essentially saying: "We don't need a translator to speak the language of images anymore."