Autoregressive Image Generation with Randomized Parallel Decoding

Imagine you are trying to paint a masterpiece, but you are forced to follow a very strict, boring rule: you must paint the picture one tiny patch at a time, starting from the top-left corner and moving strictly row by row, like a laser scanner. This is how traditional autoregressive AI image generators work. They can be accurate, but they are very slow because they can't skip ahead or paint the sky and the grass at the same time.

This paper introduces a new method called ARPG (Autoregressive Image Generation with Randomized Parallel Decoding). Think of ARPG as giving the artist a set of magic, floating brushes that can paint multiple parts of the picture simultaneously, in any order they choose, as long as they have a map telling them where to paint next.

Here is a breakdown of how it works, using simple analogies:

1. The Problem: The "Strict Line-Cutter"

Traditional AI models are like a strict line-cutter. They have to cut a piece of paper (the image) into tiny strips, one by one.

The Issue: If you want to fix a mistake in the middle of the paper, the strict line-cutter has to cut everything from the beginning again. It's slow.
The Limitation: Because they are so rigid, they can't easily do things like "fill in a missing part of a photo" (inpainting) or "make the photo bigger" (outpainting) without re-doing the whole thing.

2. The Solution: The "Magic Map and Floating Brushes"

ARPG changes the game by separating two jobs that were previously stuck together: Knowing the content and Knowing the location.

Imagine a construction site:

Pass 1 (The Content Crew): This team looks at all the bricks that are already on the ground. They mix the mortar and prepare the bricks, but they don't place them yet. They just organize the "ingredients" (the visual data) and put them in a big, organized toolbox (this is the KV Cache).
Pass 2 (The Location Crew): This team holds a Magic Map. The map doesn't have bricks on it; it just has arrows pointing to empty spots saying, "Put a brick here!" or "Put a flower here!"

The Innovation:
In older models, the two jobs were fused together: the model had to work out what to place and where to place it in the same step, so spots could only be filled one at a time, in a fixed order.
In ARPG, the Location Crew can look at the Magic Map and say, "Okay, I need to fill spots A, B, and C right now." They grab the pre-mixed bricks from the toolbox and place them all at the same time.

3. Why "Randomized" is Better

Most models paint in a fixed order (Top-Left to Bottom-Right). ARPG is Randomized.

Analogy: Imagine a team of painters in a room.
- Old Way: They must paint the wall in one fixed order, row by row from the top-left corner. Each stroke has to wait for the one before it.
- ARPG Way: The painters can jump around the room. One paints the ceiling, another paints the floor, and a third paints the corner, all at the same time. As long as they know where they are supposed to paint (thanks to the Magic Map), the order doesn't matter.

This allows the AI to generate an entire image in 32 steps instead of hundreds, making it about 30 times faster than traditional row-by-row autoregressive models.

4. The "Zero-Shot" Superpower

"Zero-shot" is a fancy way of saying "doing something it was never explicitly taught to do."

The Analogy: Imagine you taught a robot to paint a whole house.
- Old Robot: If you ask it to paint just the front door, it gets confused because it only knows how to paint the whole house in order.
- ARPG Robot: Because it understands "Content" (what a door looks like) and "Location" (where the door goes) separately, you can hand it a photo with a hole in the door and say, "Fill this hole." It instantly knows to grab the "door paint" and put it in the "door hole" without needing a new training class.

5. The Results: Faster, Cheaper, Better

The paper shows that ARPG isn't just fast; it's also high quality.

Speed: It can generate images 30 times faster than the old "strict line" methods.
Memory: It uses 75% less computer memory. Think of it as packing a suitcase so efficiently that you can fit a week's worth of clothes in a backpack, whereas others need a giant trunk.
Quality: The images are sharper and more realistic, scoring better on a standard quality test (FID, where lower is better) than comparable autoregressive models.

Summary

ARPG is like upgrading from a single-lane, one-way street to a multi-lane highway with a GPS.

Old Way: Drive one car at a time, in a straight line, slowly.
ARPG Way: Send 10 cars down the highway at once, in any direction, guided by a GPS that tells them exactly where to go.

This breakthrough allows AI to generate high-quality images much faster, fix photos, and create new art with a level of flexibility and speed that earlier autoregressive models could not offer.

1. Problem Statement

Autoregressive (AR) models have achieved success in language modeling but face significant challenges when applied to image generation:

Sequential Inefficiency: Traditional AR models flatten 2D images into 1D sequences following a rigid, predefined order (e.g., raster-scan). This strictly sequential token prediction is computationally inefficient, especially for high-resolution images.
Limited Zero-Shot Generalization: The causal nature of standard AR models prevents them from handling tasks requiring non-causal dependencies, such as image inpainting, outpainting, or resolution expansion, without specific fine-tuning.
Trade-offs in Existing Solutions:
- Masked Modeling (e.g., MaskGIT): Enables parallel generation but relies on bidirectional attention, preventing the use of KV caches and resulting in high memory/compute overhead.
- Block-wise AR (e.g., VAR): Allows block-level parallelism but is constrained by fixed block orderings and sampling schedules.
- Random Order AR (e.g., RandAR): Supports random orders but doubles sequence length by interleaving position tokens with content tokens, increasing memory and compute costs while suffering from suboptimal fidelity due to coupled learning.
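
The ordering and sequence-length points above can be illustrated with a small sketch. It assumes a 16 x 16 token grid (256 tokens per image) purely for illustration, and it counts causal query-key pairs as a rough proxy for attention cost; the function names are hypothetical and not taken from any of the models mentioned.

```python
import random

def raster_order(height, width):
    """Token positions in row-major (raster-scan) order."""
    return [(r, c) for r in range(height) for c in range(width)]

def random_order(height, width, seed=0):
    """The same positions, visited in a seeded random permutation."""
    order = raster_order(height, width)
    random.Random(seed).shuffle(order)
    return order

def attention_pairs(seq_len):
    """Number of (query, key) pairs under causal attention: n(n+1)/2."""
    return seq_len * (seq_len + 1) // 2

grid = 16                 # assumed token grid size, for illustration only
n = grid * grid           # 256 content tokens

print(raster_order(grid, grid)[:3])   # [(0, 0), (0, 1), (0, 2)]
print(attention_pairs(n))             # 32896: content tokens only
print(attention_pairs(2 * n))         # 131328: position tokens interleaved
```

Doubling the sequence length roughly quadruples the number of attention pairs, which is the overhead attributed above to interleaving position tokens with content tokens.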

2. Methodology: ARPG

The authors propose ARPG, a novel visual AR model that enables Randomized Parallel Generation through a decoupled decoding framework.

Key Insights

Explicit Positional Guidance: Effective random-order modeling requires explicit instructions to determine where the next token should be predicted, rather than relying on the sequence order.
Training Inefficiency of Masked Modeling: Standard masked modeling is inefficient because gradients are only computed for masked tokens, while unmasked tokens (which carry semantic content) receive no direct optimization signal, yet incur full computational costs.
Redundancy of Attention to Mask Tokens: Empirical analysis shows that attention weights are predominantly allocated to unmasked (semantic) tokens; attending to other masked tokens is redundant.
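
One way to make the first insight concrete, in notation assumed here for illustration rather than taken from the paper: let $x_1, \dots, x_N$ be the image tokens and $\tau$ a random permutation of the positions $1, \dots, N$. A raster-order model factorizes the image as

$$p(x) = \prod_{i=1}^{N} p(x_i \mid x_1, \dots, x_{i-1}),$$

where the position of the next token is implied by the index $i$. Under a random order the next position is no longer implied by the sequence, so it has to be supplied as an explicit condition:

$$p(x \mid \tau) = \prod_{i=1}^{N} p\big(x_{\tau_i} \mid x_{\tau_1}, \dots, x_{\tau_{i-1}}, \tau_i\big).$$

The extra conditioning on $\tau_i$ is the explicit positional guidance; each earlier token is assumed to carry its own position as well.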

Architecture: Two-Pass Decoder

ARPG decouples the prediction process into two distinct passes to separate content representation from positional guidance:

Pass-1: Content Representation Learning (Self-Attention)
- Input: A shuffled sequence of known content tokens.
- Mechanism: Uses standard causal self-attention to process the sequence.
- Output: Generates rich, context-aware representations ($h$), which are projected into Key-Value (KV) pairs.
- Goal: To create a comprehensive summary of historical information without directly predicting tokens. This pass learns the semantic content independently.
Pass-2: Position-Guided Decoding (Cross-Attention)
- Input: Data-independent [MASK] tokens infused with positional information (acting as queries).
- Mechanism: Uses causal cross-attention. The queries (derived from [MASK] tokens + position embeddings) attend to the KV pairs generated in Pass-1.
- Output: Predicts the target tokens ($\hat{x}$) for specific positions.
- Goal: The queries act as "target-aware" instructions, directing the model to reconstruct specific tokens based on the global context provided by Pass-1.
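
The two passes can be sketched in a few lines of plain Python. This is a minimal illustration under strong simplifying assumptions, not the authors' implementation: a single attention head, no learned projections (the Pass-1 outputs are used directly as both keys and values), random vectors in place of real embeddings, and hypothetical function names.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))]

def pass1_content(token_embs):
    """Pass 1: causal self-attention over the known (shuffled) tokens.
    Token i only sees tokens 0..i. The outputs stand in for the KV cache."""
    return [attend(token_embs[i], token_embs[:i + 1], token_embs[:i + 1])
            for i in range(len(token_embs))]

def pass2_decode(pos_queries, cache):
    """Pass 2: causal cross-attention. Query i ([MASK] + target position)
    only sees cache entries 0..i, never the token it is predicting."""
    return [attend(q, cache[:i + 1], cache[:i + 1])
            for i, q in enumerate(pos_queries)]

rng = random.Random(0)
dim, n = 4, 5

def rand_vec():
    return [rng.gauss(0, 1) for _ in range(dim)]

known = [rand_vec() for _ in range(n)]     # stand-in: start token + content tokens
queries = [rand_vec() for _ in range(n)]   # stand-in: [MASK] + position embeddings
cache = pass1_content(known)
outputs = pass2_decode(queries, cache)
print(len(outputs), len(outputs[0]))       # 5 4
```

In a real model the outputs would be passed through a prediction head to produce token logits. The point of the sketch is only the data flow: content flows through Pass 1 into the cache, while the queries in Pass 2 carry nothing but the target location.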

Inference Strategy

Parallel Decoding: During inference, the model first computes the KV cache from known tokens using Pass-1. Then, multiple target-aware queries (representing multiple future positions) are generated simultaneously. These queries attend to the shared KV cache in Pass-2, enabling multi-token prediction in a single step.
Attention Generalization: While training uses causal attention, the model generalizes to block-wise attention during inference. This allows the model to utilize local bidirectional context for tokens generated in the previous step, improving fidelity without breaking the causal probability model.
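
The inference strategy can be sketched as a decoding schedule plus an attention mask. Two assumptions are made here that the text does not confirm: the tokens are split into equally sized groups (the actual schedule may differ), and "block-wise attention" is read as letting each token attend to every token in its own group and in all earlier groups. The function names are hypothetical.

```python
import random

def decoding_schedule(num_tokens, num_steps, seed=0):
    """Split a random token order into num_steps groups decoded in parallel."""
    order = list(range(num_tokens))
    random.Random(seed).shuffle(order)
    base, extra = divmod(num_tokens, num_steps)
    groups, start = [], 0
    for step in range(num_steps):
        size = base + (1 if step < extra else 0)
        groups.append(order[start:start + size])
        start += size
    return groups

def block_causal_mask(group_sizes):
    """mask[i][j] is True when token i may attend to token j:
    same group (bidirectional within the group) or any earlier group."""
    block_of = [b for b, size in enumerate(group_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]

groups = decoding_schedule(num_tokens=256, num_steps=32)
print(len(groups), len(groups[0]))   # 32 8

mask = block_causal_mask([2, 2])
for row in mask:
    print(row)
```

With 256 tokens and 32 steps, each step predicts 8 positions at once against the shared KV cache, instead of one position per step.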

3. Key Contributions

Novel Decoupled Framework: Introduced a two-pass decoding mechanism that separates content learning (KV generation) from position-guided prediction (Query generation), enabling fully random-order training and generation.
Efficiency & Scalability: Achieved a 30$\times$ speedup over raster-order AR models and a 3$\times$ speedup over recent parallel AR models, while reducing memory consumption by 75%.
Zero-Shot Generalization: The architecture naturally supports zero-shot inference for complex tasks (inpainting, outpainting, resolution expansion) without task-specific fine-tuning, as the positional queries can be dynamically defined for any missing region.
Controllable Generation: Extended the framework to support controllable generation (e.g., depth maps, Canny edges) by replacing positional queries with conditional input representations.
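
For the zero-shot case, "dynamically defining" the positional queries amounts to splitting the token grid into known and missing positions. The sketch below shows only that bookkeeping, assuming a boolean hole mask over the token grid and row-major flat indices; `inpainting_plan` is a hypothetical helper, not part of any released code.

```python
def inpainting_plan(hole_mask):
    """Return (known, missing) flat token indices for a 2D boolean hole mask.
    Known tokens would feed Pass 1; missing positions define the Pass-2 queries."""
    width = len(hole_mask[0])
    known, missing = [], []
    for r, row in enumerate(hole_mask):
        for c, is_hole in enumerate(row):
            (missing if is_hole else known).append(r * width + c)
    return known, missing

hole = [
    [False, False, False],
    [False, True,  True],
    [False, False, False],
]
known, missing = inpainting_plan(hole)
print(known)     # [0, 1, 2, 3, 6, 7, 8]
print(missing)   # [4, 5]
```

Outpainting and resolution expansion fit the same pattern: only the shape of the mask, and therefore the set of queried positions, changes.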

4. Experimental Results

Experiments were conducted on the ImageNet-1K benchmark at 256$\times$256 and 512$\times$512 resolution.

Quality (FID):
- ARPG-XXL (1.3B params) achieved an FID of 1.83 with only 32 steps.
- This outperforms LlamaGen-XXL (FID 2.62, 576 steps) and VAR-d24 (FID 2.09, 10 steps).
- At 512$\times$512, ARPG-XL achieved an FID of 2.82, surpassing other AR and diffusion-based methods in efficiency.
Efficiency:
- Throughput: ARPG-L achieved 130.14 img/s (32 steps), roughly 30$\times$ higher than LlamaGen-L (4.33 img/s) and slightly above VAR-d16 (123.21 img/s, which requires much more memory).
- Memory: Reduced memory usage by roughly 75% compared to VAR and LlamaGen at similar scales (e.g., 2.78 GB vs. 10.23 GB for LlamaGen-L).
Zero-Shot & Controllable Tasks:
- Demonstrated high-fidelity results in inpainting, outpainting, and resolution expansion without fine-tuning.
- In controllable generation (Canny/Depth), ARPG-L achieved an FID of 7.39/4.06, outperforming ControlVAR and ControlAR.
Text-to-Image: Achieved competitive GenEval scores with a training set of only 4M (vs. 60M for LlamaGen), with a throughput of 30.11 img/s.
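
As a quick sanity check, the headline ratios can be recomputed from the ARPG-L and LlamaGen-L figures quoted above. Only numbers already stated in this summary are used; the variable names are illustrative.

```python
# Figures quoted in the efficiency results above.
arpg_throughput, llamagen_throughput = 130.14, 4.33   # img/s
arpg_memory, llamagen_memory = 2.78, 10.23            # GB

speedup = arpg_throughput / llamagen_throughput
memory_saving = 1 - arpg_memory / llamagen_memory

print(f"{speedup:.1f}x")        # 30.1x
print(f"{memory_saving:.0%}")   # 73%
```

The throughput ratio matches the stated 30$\times$ speedup, and the memory saving for this particular pair comes out close to the quoted 75%.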

5. Significance

ARPG represents a paradigm shift in autoregressive image generation by breaking the dependency on rigid token ordering.

Bridging the Gap: It successfully combines the efficiency of parallel decoding (traditionally associated with masked models) with the computational benefits of causal attention (KV caching), overcoming the memory bottlenecks of bidirectional models.
Practical Deployment: The drastic reduction in inference time (30$\times$) and memory footprint makes high-quality AR image generation feasible for real-time applications and resource-constrained environments.
Flexibility: The ability to perform zero-shot editing and resolution expansion without retraining makes it a versatile foundation model for diverse generative tasks, challenging the dominance of diffusion models in these specific domains.

In summary, ARPG establishes a new benchmark for efficient, high-quality, and flexible autoregressive image generation, proving that random-order modeling can be both computationally efficient and semantically robust.