Revisiting Autoregressive Models for Generative Image Classification

This paper proposes a generative image classification method that uses any-order autoregressive models to estimate order-marginalized predictions. This overcomes the limitations of a fixed token ordering and achieves better accuracy and efficiency than both diffusion-based classifiers and state-of-the-art discriminative models.

Ilia Sudakov, Artem Babenko, Dmitry Baranchuk

Published 2026-03-20

The Big Idea: The "One-Size-Fits-All" Problem

Imagine you are trying to identify a mysterious object in a dark room.

  • The Old Way (Standard AI): You are forced to look at the object strictly from left to right, top to bottom, like reading a book. You might see a handle first and guess "cup," but you miss the steam rising from the top that would tell you it's "hot coffee." Your view is limited by the order you are forced to scan.
  • The Competitor (Diffusion Models): Recently, a new type of AI (Diffusion Models) became famous. It's like looking at a blurry photo that slowly sharpens. It's very good at guessing what the object is, but it's incredibly slow. It takes hundreds of "looks" (computations) to get a clear answer.

The Problem: The authors noticed that the "Left-to-Right" AI (called an Autoregressive or AR model) was being unfairly ignored because it seemed less accurate than the slow Diffusion models. They realized the AR model wasn't "dumb"; it was just stuck in a rigid routine.
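The "rigid routine" here is the standard chain-rule factorization: the model scores an image's token sequence in one fixed (raster) order. A minimal sketch of that baseline, assuming a hypothetical `model(prefix, next_token)` interface that returns the probability of the next token given the tokens seen so far (the real model's API will differ):

```python
import math

def raster_order_logprob(model, tokens):
    # Chain rule in one fixed (raster) order:
    # log p(x) = sum_i log p(x_i | x_1, ..., x_{i-1})
    total = 0.0
    for i, tok in enumerate(tokens):
        total += math.log(model(tokens[:i], tok))
    return total

# Toy stand-in: a model that is uniform over a 4-token vocabulary,
# so a 4-token image has log-probability 4 * log(1/4).
uniform_model = lambda prefix, tok: 0.25

print(raster_order_logprob(uniform_model, [0, 1, 2, 3]))  # 4 * log(0.25) ≈ -5.545
```

Because the sum is taken in one fixed order, an unlucky or uninformative prefix can dominate the score; that is exactly the rigidity the paper targets.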

The Solution: The "Group Discussion" Analogy

The authors asked: What if we didn't force the AI to look at the image in just one order?

They introduced a method called Order-Marginalized Classification. Here is how it works using a metaphor:

Imagine you have a puzzle of a picture, but the pieces are scattered.

  1. The Old AR Model: You try to solve the puzzle by picking up pieces strictly from the top-left corner to the bottom-right. If the first few pieces are confusing, you might guess the wrong picture.
  2. The New Method (RandAR): Instead of one person solving the puzzle, you gather a team of 20 people.
    • Person A starts from the top-left.
    • Person B starts from the bottom-right.
    • Person C starts from the center.
    • Person D starts from the edges.
    • ...and so on.

Each person looks at the image in a different random order. Then, you take all their guesses and average them out.

  • Why this works: If Person A gets confused by a weird texture, Person B might see the shape clearly. By combining all these different perspectives, the group gets a much smarter, more complete understanding of the image than any single person could.
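The "team of puzzle solvers" can be sketched as averaging class-conditional log-likelihoods over several random token orders, then picking the best class with Bayes' rule under a uniform prior. The `model(tokens, order, class_label)` interface below is a hypothetical stand-in for an any-order AR model such as RandAR, assumed to return per-token log-probabilities under the given permutation; the actual API differs:

```python
import random

def order_marginalized_score(model, tokens, class_label, num_orders=20):
    # Average the class-conditional log-likelihood over random
    # factorization orders ("Person A", "Person B", ... in the analogy).
    n = len(tokens)
    total = 0.0
    for _ in range(num_orders):
        order = random.sample(range(n), n)               # a fresh random token order
        total += sum(model(tokens, order, class_label))  # sum of per-token log-probs
    return total / num_orders

def classify(model, tokens, num_classes, num_orders=20):
    # Bayes' rule with a uniform prior over classes: the predicted label
    # is the class whose order-averaged generative likelihood is highest.
    scores = [order_marginalized_score(model, tokens, y, num_orders)
              for y in range(num_classes)]
    return max(range(num_classes), key=scores.__getitem__)

# Toy stand-in: class 1 explains every token better than classes 0 and 2,
# whatever the order, so the classifier should pick it.
def toy_model(tokens, order, class_label):
    base = -1.0 if class_label == 1 else -2.0
    return [base] * len(tokens)

print(classify(toy_model, list(range(16)), num_classes=3))  # prints 1
```

Averaging the per-order scores gives a Monte-Carlo estimate of the order-marginalized likelihood; since the orders are independent, they can be evaluated in parallel rather than one after another.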

The Results: Fast, Accurate, and Robust

The paper shows that this "Group Discussion" approach makes the AR model a superstar for three reasons:

  1. It's a Speed Demon:

    • Diffusion Models are like a slow, meticulous detective who needs to re-examine the crime scene 200 times to be sure.
    • The New AR Model is like a team of detectives who can look at the scene from 20 different angles simultaneously and give an answer almost instantly.
    • The Stat: The new method is up to 25 times faster than the diffusion models while being more accurate.
  2. It's Harder to Fool:

    • Standard AI models often get tricked by "shortcuts." For example, if they see a dog in a grassy field, they might guess "dog" based on the grassy background rather than the animal itself.
    • Because the new model looks at the image from so many different angles, it can't rely on just one shortcut. It has to understand the whole picture. This makes it much better at handling weird images (like sketches, paintings, or photos with noise).
  3. It Competes with the Best:

    • Usually, "Generative" models (models that create images) are worse at "Discriminative" tasks (identifying images) than models built specifically for identification.
    • This new method is so good that it actually beats the current state-of-the-art identification models in many difficult scenarios, proving that "creating" and "identifying" are two sides of the same coin.

Summary

The authors took a model that was previously considered "too rigid" because it only looked at images in one specific order (like reading a book). They unlocked its potential by letting it "read" the image in random orders and then combining those views.

The Result: A classifier that is smarter (because it sees more perspectives), faster (because it doesn't need hundreds of slow steps), and tougher (because it can't be fooled by simple visual shortcuts). It's like upgrading from a single person reading a map in the dark to a whole team shining flashlights from every angle at once.
