Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation

Imagine you are trying to paint a massive, detailed mural on a wall.

The Old Way: The Slow, One-Brush Artist

Traditional AI image generators (like the ones used before this paper) work like a very strict, slow artist. They have a rule: "You must paint exactly one square patch at a time, row by row, from top-left to bottom-right."

The Problem: If a 256x256 mural is divided into 256 patches, the artist has to make 256 separate trips to the wall to finish it.
The Bottleneck: Every time the artist starts a new patch, they have to run back to the paint shelf (the computer's memory) to fetch everything they need. Even if the artist is super fast at painting, they spend most of their time just running back and forth to the shelf. This is called being "memory-bound." It's like a race car stuck in traffic; the engine is great, but the road is too narrow.

The New Way: The "Locality-Aware" Team (LPD)

The researchers in this paper (from MIT, NVIDIA, and First Intelligence) invented a new method called Locality-Aware Parallel Decoding (LPD). Instead of one slow artist, they turned the process into a smart construction crew.

Here is how they did it, using two main tricks:

Trick 1: The "Magic Clipboard" (Flexible Parallelized Modeling)

In the old method, the artist couldn't paint the next square until the previous one was dry.
In the new method, the AI uses "Position Query Tokens." Think of these as magic sticky notes.

Instead of waiting for the wall to be painted in order, the AI places sticky notes on the wall where it wants to paint next.
It can say, "Okay, Crew! Paint the top-left corner, the bottom-right corner, and the middle spot all at the same time!"
The Secret Sauce: Usually, if you paint two spots at once, they might not "see" each other, leading to a messy result (like a blue sky next to a green tree that doesn't match). The researchers created a special "visibility rule" (an attention mechanism) that ensures all the painters in the crew can see each other's work instantly, so the colors blend perfectly even though they are working simultaneously.

Trick 2: The "Smart Neighborhood" Rule (Locality-Aware Ordering)

Just because you can paint many spots at once doesn't mean you should pick random spots.

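The Mistake: If you tell the crew to paint spots that are far away from anything already painted, they have no idea what color to use because there is no nearby context to look at. And if two painters work on adjacent spots at the same moment, their choices depend so strongly on each other that the result can clash.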
The Solution: The researchers noticed that in images, things are usually local. A leaf is close to a branch; a cloud is close to the sky.
The Strategy: The AI uses a "Smart Neighborhood" schedule.
1. Stay Close: It picks new spots to paint that are right next to what has already been painted. This gives the painters plenty of context (e.g., "I see a tree branch here, so I'll paint a leaf next to it").
2. Spread Out: It makes sure the spots being painted at the exact same time are far apart from each other. This prevents the painters from stepping on each other's toes or getting confused.

The Result: From a Marathon to a Sprint

By combining these two tricks, the gains are substantial:

Old Way: To paint a 256x256 image, the AI took 256 steps (one for every patch).
New Way (LPD): The AI can do the same image in just 20 steps.
Speed: It has at least 3.4 times lower latency than previous parallel ("fast") autoregressive methods.
Quality: Despite being so much faster, the images look as good as (or better than) the slow ones. The "Smart Neighborhood" rule ensures the image doesn't look glitchy.

Why This Matters

Think of it like upgrading from a single-lane dirt road to a multi-lane highway with smart traffic lights.

Before, the AI was stuck in a single lane, waiting for the car in front to move.
Now, the AI has many lanes open at once, and the "traffic lights" (the locality rules) tell the cars exactly where to go so they don't crash.

This means we can generate high-quality images, edit photos, or even create videos in a fraction of the time it used to take, making AI tools much more practical for everyday use.

1. Problem Statement

Autoregressive (AR) image generation has emerged as a powerful paradigm, particularly for unified multimodal systems. However, traditional AR models face a critical efficiency bottleneck:

Memory-Bound Latency: Standard AR models generate images token-by-token (next-patch prediction) in a fixed raster order. This creates a memory-bound workload where latency scales linearly with the number of tokens (e.g., 256 steps for a $256\times256$ image), as the GPU must repeatedly load model parameters for each step.
Limited Parallelization: Existing attempts to accelerate AR generation by predicting multiple patches simultaneously (multi-patch prediction) have achieved only limited parallelization. They often suffer from:
- Inconsistency: Spatially adjacent tokens have strong mutual dependencies; predicting them independently leads to generation artifacts.
- Rigid Ordering: Many methods rely on fixed generation orders or encoder-decoder architectures that prevent arbitrary generation sequences, limiting flexibility for tasks like inpainting.
- Incompatibility: Some high-speed methods (like next-scale prediction) use multi-scale representations that are incompatible with flat token representations used in foundational vision models (e.g., CLIP, DINO).
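
The linear scaling described under Memory-Bound Latency can be illustrated with a rough back-of-the-envelope sketch. It assumes each decoding step is dominated by reading all model weights from GPU memory once. The function name and every number in it (parameter count, bytes per parameter, memory bandwidth) are hypothetical placeholders, not values from the paper.

```python
def memory_bound_latency_s(num_steps, num_params,
                           bytes_per_param=2, bandwidth_bytes_per_s=1e12):
    """Idealized latency estimate when every step re-reads all weights.

    All defaults are illustrative assumptions, not measurements.
    """
    weight_bytes = num_params * bytes_per_param
    seconds_per_step = weight_bytes / bandwidth_bytes_per_s
    return num_steps * seconds_per_step


# Hypothetical 1B-parameter model in 16-bit precision on a 1 TB/s GPU.
for steps in (256, 64, 20):
    latency_ms = 1e3 * memory_bound_latency_s(steps, num_params=1e9)
    print(f"{steps:>3} steps -> ~{latency_ms:.0f} ms spent loading weights")
```

In this idealized model the estimate depends only on the number of steps, not on how many tokens each step produces, which is why predicting several tokens per step pays off while the workload stays memory-bound. Real latency also includes compute and KV-cache traffic, so measured speedups will differ.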

2. Methodology: Locality-Aware Parallel Decoding (LPD)

The authors propose LPD, a framework designed to maximize parallelization while maintaining generation quality and compatibility with flat token representations. It consists of two core innovations:

A. Flexible Parallelized Autoregressive Modeling

This is a novel architecture that decouples the roles of context provision and token generation, enabling arbitrary generation orders and degrees of parallelization.

Position Query Tokens: Instead of generating the next token in a sequence, the model uses learnable position query tokens corresponding to specific target positions in the image grid.
Decoupled Roles:
- Context: Previously generated image tokens provide the context (hidden states).
- Generation: Position query tokens drive the prediction of new tokens at arbitrary target locations.
Specialized Attention Mechanism:
- Context Attention: Ensures new tokens attend causally to all previously generated tokens.
- Query Attention: Crucially, this ensures mutual visibility among all position query tokens generated within the same parallel step. This allows concurrently generated tokens to condition on each other, preventing the inconsistency issues seen in independent sampling.
KV Cache Optimization: Unlike some encoder-decoder approaches where query tokens must be cached (doubling memory), LPD only caches the generated image tokens, keeping memory overhead low.
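
The two attention rules can be sketched as a boolean mask. This is a minimal illustration under stated assumptions, not the authors' implementation: the sequence layout (each step's position query tokens followed by that step's image tokens), the choice to let image tokens of the same step see each other, the omission of any class or text condition token, and the names `build_attention_mask`, `QUERY`, and `IMAGE` are all hypothetical.

```python
QUERY, IMAGE = "query", "image"


def build_attention_mask(group_sizes):
    """Return (mask, kinds, groups); mask[i][j] is True if token i may attend to token j.

    Assumed layout: for each parallel step g, its position query tokens
    come first, followed by the image tokens generated in that step.
    """
    kinds, groups = [], []
    for g, n in enumerate(group_sizes):
        kinds += [QUERY] * n + [IMAGE] * n
        groups += [g] * (2 * n)

    size = len(kinds)
    mask = [[False] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if kinds[j] == IMAGE:
                # Context attention: image tokens from earlier steps are
                # visible to everything that comes later. Image tokens of
                # the same step see each other (an assumption of this sketch).
                earlier = groups[j] < groups[i]
                same_step_image = kinds[i] == IMAGE and groups[j] == groups[i]
                mask[i][j] = earlier or same_step_image
            else:
                # Query attention: queries are visible only to the other
                # queries of the same step (mutual visibility).
                mask[i][j] = kinds[i] == QUERY and groups[j] == groups[i]
    return mask, kinds, groups


mask, kinds, groups = build_attention_mask([1, 2])
for kind, group, row in zip(kinds, groups, mask):
    print(f"{kind:>5} g{group} ", "".join("x" if v else "." for v in row))
```

Because no token outside a step ever attends to that step's query tokens, their keys and values never need to be kept after the step finishes, which matches the point above that only generated image tokens are cached.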

B. Locality-Aware Generation Ordering

To maximize the effectiveness of parallel decoding, the authors introduce a scheduling algorithm based on the observation that image generation attention exhibits strong spatial locality (tokens attend primarily to nearby regions).

Two Guiding Principles:
1. High Proximity to Context: New tokens should be spatially close to already generated tokens to leverage strong contextual support.
2. Low Proximity within Groups: Tokens generated in the same parallel step should be spatially distant from each other to minimize mutual dependency and reduce generation inconsistency.
Algorithm: The scheduler iteratively selects tokens for the next step by:
1. Calculating proximity scores of unselected tokens to the existing context.
2. Selecting high-proximity tokens that are sufficiently far apart (using a repulsion threshold).
3. If the group size is not met, using Farthest Point Sampling to select remaining tokens from distant regions to ensure low intra-group dependency.
Result: This schedule dynamically adjusts group sizes (starting small and growing) and order, optimizing the trade-off between context availability and parallel independence.
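
The three scheduling steps above can be sketched as follows. This is a simplified illustration under assumptions, not the paper's exact algorithm: proximity is taken to be the Euclidean distance to the nearest context token, the group sizes are supplied by the caller, and the function name, the `repulsion` default, and the random choice of the very first token are hypothetical.

```python
import math
import random


def _dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def locality_aware_schedule(grid, group_sizes, repulsion=2.0, seed=0):
    """Return a list of groups; each group lists the (row, col) positions of one step."""
    rng = random.Random(seed)
    remaining = {(r, c) for r in range(grid) for c in range(grid)}
    if sum(group_sizes) != len(remaining):
        raise ValueError("group_sizes must sum to grid * grid")

    context, schedule = [], []
    for size in group_sizes:
        group = []
        if context:
            # Step 1: rank unselected positions by closeness to the context.
            def nearest_context(p):
                return min(_dist(p, q) for q in context)

            ranked = sorted(sorted(remaining), key=nearest_context)
            # Step 2: take close positions that keep at least `repulsion`
            # distance from everything already chosen for this step.
            for p in ranked:
                if len(group) == size:
                    break
                if all(_dist(p, q) >= repulsion for q in group):
                    group.append(p)
        # Step 3: fill any shortfall with farthest point sampling, i.e. pick
        # the position whose nearest group member is as far away as possible.
        while len(group) < size:
            pool = sorted(remaining - set(group))
            if group:
                pick = max(pool, key=lambda p: min(_dist(p, q) for q in group))
            else:
                pick = rng.choice(pool)  # no context yet: start anywhere
            group.append(pick)
        remaining -= set(group)
        context.extend(group)
        schedule.append(group)
    return schedule


for step, group in enumerate(locality_aware_schedule(grid=4, group_sizes=[1, 2, 4, 9])):
    print(step, group)
```

The growing group sizes in the usage example mirror the description above (small groups first, larger ones once more context exists); the specific sizes are made up for a 4x4 toy grid.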

3. Key Contributions

Novel Architecture: Introduction of a flexible parallelized AR model using position query tokens and specialized attention masks to enable arbitrary generation orders and mutual visibility among parallel tokens.
Locality-Aware Scheduling: A generation order algorithm that explicitly optimizes for spatial locality principles, significantly reducing intra-group dependencies while maximizing contextual support.
Efficiency without Quality Loss: The method reduces generation steps drastically (e.g., from 256 to 20 for $256\times256$) without compromising image quality (FID), achieving a 3.4x to 4.2x reduction in latency compared to previous parallelized AR models.
Versatility: The framework supports zero-shot image editing tasks (inpainting, outpainting, class-conditional editing) due to its ability to generate tokens in arbitrary orders.

4. Experimental Results

The method was evaluated on ImageNet class-conditional generation ($256\times256$ and $512\times512$) and text-to-image generation ($1024\times1024$).

Generation Steps Reduction:
- $256\times256$: Reduced from 256 steps (raster) to 20 steps (LPD).
- $512\times512$: Reduced from 1024 steps to 48 steps.
- $1024\times1024$: Reduced from 4096 steps to 64 steps.
Quality (FID): LPD models achieved competitive or superior FID scores compared to state-of-the-art AR and Diffusion models.
- Example: LPD-XL (752M params) achieved **FID 2.10** in 20 steps, matching the quality of ARPG-XXL (1.3B params) which requires 64 steps, but with 3.4x lower latency.
Efficiency:
- Latency: Achieved at least 3.4x lower latency than previous parallelized AR models (e.g., ARPG, RandAR).
- Throughput: In memory-bound regimes (small batch sizes), LPD showed up to 12x higher throughput than raster-order baselines, scaling linearly with batch size until hitting compute limits.
Ablation Studies:
- Removing the "mutual visibility" mechanism caused significant FID degradation as parallelization increased.
- Using the locality-aware schedule outperformed random, raster, and Halton order schedules, confirming the importance of balancing context proximity and group independence.
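
The raster-order step counts reported in this section follow from the token grid size. The short sketch below reproduces that arithmetic, assuming a tokenizer that maps each 16x16 pixel block to one token; this factor is inferred from the reported counts (256 tokens for a $256\times256$ image) rather than stated in the text, and `token_count` is a hypothetical helper.

```python
def token_count(resolution, patch=16):
    """Number of tokens for a square image, assuming one token per patch x patch block."""
    side = resolution // patch
    return side * side


# Parallel step counts as reported in this section.
reported_steps = {256: 20, 512: 48, 1024: 64}
for resolution, steps in reported_steps.items():
    tokens = token_count(resolution)
    print(f"{resolution}x{resolution}: {tokens} tokens, {steps} steps, "
          f"~{tokens / steps:.1f} tokens per step on average")
```

The average number of tokens per step grows with resolution, so the relative reduction in sequential steps is largest for the biggest images.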

5. Significance

LPD is a notable advance in autoregressive image generation because it substantially eases the trade-off between speed and quality.

Scalability: It makes AR image generation viable for real-time applications by drastically reducing the sequential dependency bottleneck.
Unified Multimodal Potential: By maintaining a flat token representation, LPD remains compatible with existing vision foundation models (like CLIP), facilitating the development of unified multimodal systems that can both understand and generate images efficiently.
Flexibility: The ability to generate in arbitrary orders opens new possibilities for interactive editing and complex image manipulation tasks that were previously difficult with rigid raster-order AR models.

In summary, LPD demonstrates that with the right architectural decoupling and spatially aware scheduling, autoregressive models can achieve the efficiency of non-autoregressive methods while retaining the high-quality generation capabilities of AR models.