Progressive Checkerboards for Autoregressive Multiscale Image Generation

Imagine you are trying to paint a massive, incredibly detailed mural of a city, but you have a strict rule: you can only paint one square at a time, and you must wait for the previous square to dry before painting the next one.

If you paint from top-left to bottom-right (like reading a book), you are painting very slowly. If you try to paint the whole city at once, the colors might clash because you didn't know what the neighbor's house looked like yet.

This paper introduces a clever new way to "paint" (generate) images using Artificial Intelligence. Instead of painting line-by-line or block-by-block, the authors use a "Progressive Checkerboard" strategy.

Here is the breakdown of their idea using simple analogies:

1. The Problem: The "Slow Painter" vs. The "Messy Painter"

The Old Way (Slow Painter): Traditional AI models paint the image pixel by pixel, or row by row. It's very careful, but it takes forever because it has to wait for every single step.
The "Messy" Way: Some newer models try to paint big chunks at once to go faster. But if they paint two neighboring houses at the same time without talking to each other, one might be red and the other blue, even though they should be the same color. This creates "glitches" or weird artifacts.
The "Zoom" Problem: Some models try to fix this by painting a tiny sketch first, then a medium sketch, then the final image. But if they jump from "tiny sketch" to "huge image" too quickly, they miss the details in between, and the picture looks blurry or wrong.

2. The Solution: The "Checkerboard Dance"

The authors propose a method that is like a dance party on a checkerboard.

Instead of painting in a straight line, imagine the image is a giant chessboard.

The First Move: You paint only the white squares.
The Second Move: You paint only the black squares.
The Magic: Because you painted all the white squares first, the black squares now know exactly what their neighbors (the white ones) look like. They can match colors perfectly.
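
For readers who prefer code to chessboards, the two moves can be sketched with a small grid. This is only an illustration of the analogy, not the authors' code; the function names `checkerboard_masks` and `neighbors_all_white` are made up for this example.

```python
import numpy as np

def checkerboard_masks(height, width):
    """Return boolean masks for the 'white' and 'black' squares of a grid."""
    rows, cols = np.indices((height, width))
    white = (rows + cols) % 2 == 0
    return white, ~white

def neighbors_all_white(white):
    """Check that every black square has only white up/down/left/right neighbors."""
    height, width = white.shape
    for r in range(height):
        for c in range(width):
            if white[r, c]:
                continue  # only black squares are checked
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                inside = 0 <= rr < height and 0 <= cc < width
                if inside and not white[rr, cc]:
                    return False
    return True

# First move paints `white`, second move paints `black`.
white, black = checkerboard_masks(4, 4)
all_neighbors_known = neighbors_all_white(white)
```

Once the white squares are painted, every black square is surrounded by finished neighbors, which is what lets the second move match its surroundings.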

But they didn't stop there. They did this progressively:

Level 1: Paint a tiny checkerboard (very blurry, just the big shapes).
Level 2: Paint a slightly bigger checkerboard (adding more detail).
Level 3: Paint the full-size checkerboard (adding the final sharp details).

At every single level, they paint half the board, then the other half. This keeps the "conversation" between neighbors alive without having to wait for the whole image to be finished.

3. The Big Discovery: It Doesn't Matter How You Slice the Cake

One of the most surprising findings in the paper is about speed vs. quality.

Usually, people think: "If I want a high-quality image, I need to take many small, careful steps."
The authors found that it doesn't actually matter how you divide the steps.

Scenario A: Grow the image in a few big jumps (4x larger each time), spending more painting passes at each size.
Scenario B: Grow the image in more, smaller jumps (2x larger each time), spending fewer painting passes at each size.

As long as the total number of painting passes is the same, the final picture looks almost identical. It's like climbing a staircase with landings: you can pause for a while on a few tall landings or briefly on many short ones; what matters is the total number of steps you take to reach the top.

This means the AI can be much faster. Instead of the 88 to 147 steps that some earlier parallel methods need, it can take 17 "checkerboard" steps and still get a comparable high-quality result.

4. Why This Matters

Speed: The AI generates images much faster than previous methods because it paints in parallel (many squares at once) rather than one by one.
Quality: Because the "checkerboard" pattern keeps neighbors talking to each other, the images don't have those weird glitches or mismatched colors.
Efficiency: You don't need to over-complicate the process. Whether you zoom in slowly or quickly, as long as you check in often enough, the result is great.

The Bottom Line

Think of this method as a smart construction crew building a skyscraper.

Old methods built one brick at a time (too slow).
Other methods tried to pour the whole floor at once (too messy).
This method builds each floor in a checkerboard pattern: they pour every other square, then fill in the squares in between, then move up to the next floor. They check their work constantly, ensuring each new square matches its finished neighbors, but they do it in big, efficient batches.

The result? A beautiful, high-quality image built in record time.

1. Problem Statement

Autoregressive (AR) image generation faces a fundamental trade-off between parallelism (efficiency) and dependency modeling (quality).

The Challenge: To generate high-quality images, AR models must model mutual dependencies between pixels. However, sampling several locations in parallel, independently of one another, often leads to "mode mixing": adjacent or nearby pixels receive values that are individually plausible but mutually incompatible.
Limitations of Existing Approaches:
- Scalewise AR (e.g., VAR): Conditions from coarse to fine scales. To avoid mode mixing, these models use very slow scale-up factors (e.g., $\sqrt[3]{2} \approx 1.26$), resulting in many sequential steps and slow inference.
- Parallel AR (e.g., PAR, RandAR): Sample multiple locations at once but often rely on simple partitions or random orders, limiting their ability to model complex spatial dependencies or requiring complex dynamic evolution.
Goal: Develop a method that enables fast scale-up (large scaling factors) while maintaining strong conditioning between scales and within scales, thereby reducing the total number of sequential sampling steps without sacrificing image quality.
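
The mode-mixing problem described above can be illustrated with a toy example. The sketch below is not from the paper: it assumes a made-up joint distribution in which two neighboring pixels are always equal, and compares independent parallel sampling against sequential (conditional) sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Toy joint distribution (an assumption for illustration): two neighboring
# pixels are always equal, both 0 or both 1 with probability 0.5 each.
# Marginally, each pixel is 0 or 1 with probability 0.5.

# Parallel sampling: draw each pixel independently from its marginal.
a_par = rng.integers(0, 2, size=n)
b_par = rng.integers(0, 2, size=n)
mismatch_parallel = np.mean(a_par != b_par)

# Sequential sampling: draw the first pixel, then the second given the first.
a_seq = rng.integers(0, 2, size=n)
b_seq = a_seq.copy()  # p(b | a) puts all of its mass on b == a
mismatch_sequential = np.mean(a_seq != b_seq)
```

Independent sampling produces incompatible pairs about half the time even though each pixel is individually valid, while conditioning the second pixel on the first removes the mismatches at the cost of an extra sequential step.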

2. Methodology

The authors propose a Multiscale Progressive Checkerboard Autoregressive Model. The core innovation is a specific sampling order that balances spatial diversity and conditional dependency.

A. Progressive Checkerboard Ordering

Instead of raster scanning or random sampling, the model uses a divide-and-conquer approach to generate a fixed, spatially balanced ordering:

Algorithm: The 2D grid is recursively subdivided into quadrants. At each recursion level, locations are selected in a "round-robin" fashion across the four quadrants (Top-Left, Bottom-Right, Top-Right, Bottom-Left).
Result: This creates a "checkerboard" pattern where sampled locations are evenly spaced at every scale of the quadtree subdivision.
Benefit: This ordering ensures that the locations sampled together in one block are spread far apart, which reduces their mutual dependence, while the locations already sampled in the same scale form an evenly distributed context.
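
One plausible reading of the recursive ordering described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a square grid whose side is a power of two, and the names `progressive_checkerboard_order` and `split_into_blocks` are hypothetical.

```python
def progressive_checkerboard_order(size):
    """Return all (row, col) locations of a size x size grid in sampling order.

    Assumes `size` is a power of two. Quadrants are visited round-robin in the
    order top-left, bottom-right, top-right, bottom-left.
    """
    if size < 1 or size & (size - 1):
        raise ValueError("size must be a power of two")
    if size == 1:
        return [(0, 0)]
    half = size // 2
    # Ordering of a single quadrant, obtained recursively.
    sub_order = progressive_checkerboard_order(half)
    quadrant_offsets = [(0, 0), (half, half), (0, half), (half, 0)]
    order = []
    # Round-robin: take the k-th location of every quadrant before the (k+1)-th.
    for row, col in sub_order:
        for d_row, d_col in quadrant_offsets:
            order.append((row + d_row, col + d_col))
    return order

def split_into_blocks(order, num_blocks):
    """Cut the ordering into consecutive blocks of equal length."""
    block_size = len(order) // num_blocks
    return [order[i * block_size:(i + 1) * block_size] for i in range(num_blocks)]

order = progressive_checkerboard_order(4)
blocks = split_into_blocks(order, 2)
```

Under this reading, the first four locations of a 4x4 grid are the quadrant corners, and cutting the ordering into two blocks puts exactly the "white" squares of a checkerboard in the first block.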

B. Multiscale Architecture

The model operates on a pyramid of scales:

Upsampling: Latent codes from the previous scale ($s-1$) are upsampled to the current scale ($s$) to serve as a conditioning input ($z_{up}$).
Blockwise Processing: The current scale is divided into $P$ blocks based on the progressive checkerboard order.
Sequential Blocks, Parallel Tokens:
- Blocks are processed sequentially (autoregressively).
- Tokens within a block are processed in parallel.
- Input Composition: For a block $b_i$, the transformer input concatenates:
  - Upsampled latents from the previous scale.
  - Outputs from the previous block ($b_{i-1}$) in the current scale.
  - Position embeddings (both spatial and scale-based).
Tokenization: Uses a VAE-based autoencoder with quantized latent codes (AR-VQ).
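
The control flow described above (scales in sequence, blocks in sequence, tokens within a block in parallel) can be sketched as a sampling loop. Everything model-specific here is an assumption for illustration: `stand_in_model` returns random logits in place of the transformer, `VOCAB_SIZE` is an arbitrary codebook size, token ids are upsampled by nearest-neighbor repetition rather than upsampling latents, and grid sizes are assumed to be powers of two.

```python
import numpy as np

VOCAB_SIZE = 16  # hypothetical codebook size

def checkerboard_order(size):
    """Quadrant round-robin ordering (top-left, bottom-right, top-right, bottom-left)."""
    if size == 1:
        return [(0, 0)]
    half = size // 2
    offsets = [(0, 0), (half, half), (0, half), (half, 0)]
    return [(r + dr, c + dc) for r, c in checkerboard_order(half) for dr, dc in offsets]

def stand_in_model(z_up, tokens, known, block, rng):
    """Placeholder for the transformer: one row of random logits per block location.
    A real model would condition on z_up and on tokens where `known` is True."""
    return rng.normal(size=(len(block), VOCAB_SIZE))

def sample_scale(z_up, num_blocks, rng):
    size = z_up.shape[0]
    order = checkerboard_order(size)
    block_size = max(1, len(order) // num_blocks)
    tokens = np.zeros((size, size), dtype=np.int64)
    known = np.zeros((size, size), dtype=bool)
    steps = 0
    for start in range(0, len(order), block_size):  # blocks: sequential
        block = order[start:start + block_size]
        logits = stand_in_model(z_up, tokens, known, block, rng)
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        # Tokens within a block are drawn independently of one another.
        draws = [rng.choice(VOCAB_SIZE, p=p) for p in probs]
        for (r, c), token in zip(block, draws):
            tokens[r, c] = token
            known[r, c] = True
        steps += 1
    return tokens, steps

def generate(sizes, num_blocks, seed=0):
    rng = np.random.default_rng(seed)
    tokens, total_steps = None, 0
    for size in sizes:  # scales: coarse to fine
        if tokens is None:
            z_up = np.zeros((size, size), dtype=np.int64)  # nothing to condition on yet
        else:
            factor = size // tokens.shape[0]
            z_up = np.kron(tokens, np.ones((factor, factor), dtype=np.int64))
        tokens, steps = sample_scale(z_up, num_blocks, rng)
        total_steps += steps
    return tokens, total_steps

tokens, total_steps = generate(sizes=[1, 2, 4, 8, 16], num_blocks=4)
```

The loop makes the cost structure visible: each block costs one sequential model call regardless of how many tokens it contains, so the number of blocks per scale and the number of scales set the inference latency.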

C. Position Encodings (RoPE Mixing)

To handle the unique sampling order, the authors experimented with mixing Rotary Position Embeddings (RoPE) for attention keys. They found that while mixing coefficients could be learned, the model naturally extracts conditional information from previously sampled locations in the first two layers only. Consequently, simple input concatenation (Eq. 1) is sufficient, and complex RoPE mixing is not strictly necessary for performance.

3. Key Contributions

Novel Sampling Order: Introduction of the Progressive Checkerboard ordering, which maintains spatial balance at all quadtree levels. This allows for effective conditioning both between scales (coarse-to-fine) and within scales (local dependencies).
Decoupling Scale Factor from Performance: The paper demonstrates a counter-intuitive finding: in a spatially balanced setting, the total number of sequential steps is the dominant factor determining performance, not the specific scale-up factor.
- Large scale factors (e.g., 2x, 3x, 4x) yield similar results if the total step count is constant.
- This contrasts with prior work (VAR) which required small scale factors ($\approx 1.26$) to avoid errors.
Efficiency: The method achieves state-of-the-art results with significantly fewer sampling steps (17 steps) compared to other AR methods (e.g., 147 steps for PAR, 88 for RandAR).
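
As a rough check on the 17-step figure, the arithmetic can be worked out under two assumptions that this summary does not state: that the final latent grid is $16 \times 16$, and that the coarsest $1 \times 1$ scale is sampled in a single step. With scale-up factor $r$ and $k$ steps at each later scale, the number of scales is $S = \log_r 16 + 1$ and the total step count is

$$
T = 1 + (S - 1)\,k .
$$

For $r = 2$, $k = 4$: $S = 5$ and $T = 1 + 4 \cdot 4 = 17$. For $r = 4$, $k = 8$: $S = 3$ and $T = 1 + 2 \cdot 8 = 17$. Both configurations land on the same total, which is consistent with the reported step count, though the paper's exact schedule may differ.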

4. Experimental Results

Experiments were conducted on ImageNet 256x256 class-conditional generation.

Performance Metrics:
- FID (Fréchet Inception Distance): The proposed Checkerboard-L model achieved an FID of 2.72 (with 2x scaling, 4 steps/scale) and 2.79 (with 4x scaling, 8 steps/scale).
- IS (Inception Score): Achieved scores of 302.5 and 311.5 respectively.
- Comparison: These results are competitive with or superior to recent SOTA AR models (e.g., ARPG, LPD, PAR) while using fewer steps.
Inference Speed:
- The model generates an image in 0.52 seconds on an A100 GPU.
- This is significantly faster than PAR (3.38s) and RandAR (1.97s).
Ablation Studies:
- Scale Ratio vs. Steps: Models with scale ratios of 2, 3, and 4 performed similarly when the total number of steps was fixed (around 17 steps).
- Steps per Scale: 4 steps per scale was found to be sufficient; increasing to 8 steps yielded diminishing returns.
- Entropy Analysis: Entropy drops significantly as sampling progresses within a scale, with a jump in entropy when moving to a new (higher resolution) scale, confirming the model successfully resolves modes and introduces new details.
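
The entropy tracked in such an analysis is presumably the Shannon entropy of the model's predicted token distribution at each location. A minimal sketch of how that quantity could be computed from logits is shown below; the function name `token_entropy` and the example logits are illustrative and do not come from the paper.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) along the last axis."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(np.exp(log_probs) * log_probs).sum(axis=-1)

# A location with no useful context: all four tokens equally likely.
uncertain = token_entropy([0.0, 0.0, 0.0, 0.0])

# A location whose neighbors have pinned down the answer.
confident = token_entropy([10.0, 0.0, 0.0, 0.0])
```

Averaging this value over the locations in each block would give a curve of the kind described: high when a new scale opens up unresolved detail, and falling as earlier blocks constrain later ones.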

5. Significance

This work challenges a common assumption in autoregressive image generation:

Efficiency: It shows that high-quality image generation does not require the slow, incremental scale-up factors previously thought necessary to prevent mode mixing. By using a spatially balanced checkerboard order, models can "jump" to larger scales (e.g., 4x) without losing coherence.
Simplicity: The method relies on a simple, fixed regular pattern rather than complex dynamic evolution or random orderings, making it easier to implement and train.
Scalability: The finding that total step count drives performance suggests that future AR models can be optimized by focusing on the depth of the conditional chain rather than the granularity of the scale steps, potentially enabling faster generation for video and other modalities.

In summary, the Progressive Checkerboard method offers a highly efficient alternative to diffusion and existing AR models, achieving image quality competitive with SOTA methods while using a fraction of the sampling steps and inference time.