Revisiting Autoregressive Models for Generative Image Classification

This paper proposes a generative image classification method that uses any-order autoregressive models to estimate order-marginalized predictions. This overcomes the limitations of a fixed token ordering and achieves better accuracy and efficiency than both diffusion-based classifiers and state-of-the-art discriminative models.

Ilia Sudakov, Artem Babenko, Dmitry Baranchuk

Published 2026-03-20

The Big Idea: The "One-Size-Fits-All" Problem

Imagine you are trying to identify a mysterious object in a dark room.

  • The Old Way (Standard AI): You are forced to look at the object strictly from left to right, top to bottom, like reading a book. You might see a handle first and guess "cup," but you miss the steam rising from the top that would tell you it's "hot coffee." Your view is limited by the order you are forced to scan.
  • The Competitor (Diffusion Models): Recently, a new type of AI (Diffusion Models) became famous. It's like looking at a blurry photo that slowly sharpens. It's very good at guessing what the object is, but it's incredibly slow. It takes hundreds of "looks" (computations) to get a clear answer.

The Problem: The authors noticed that the "Left-to-Right" AI (called an Autoregressive or AR model) was being unfairly ignored because it seemed less accurate than the slow Diffusion models. They realized the AR model wasn't "dumb"; it was just stuck in a rigid routine.
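The "rigid routine" here is the standard chain-rule factorization: the model scores an image's token sequence in one fixed (raster) order. A minimal sketch of that baseline, assuming a hypothetical `model(prefix, next_token)` interface that returns the probability of the next token given the tokens seen so far (the real model's API will differ):

```python
import math

def raster_order_logprob(model, tokens):
    # Chain rule in one fixed (raster) order:
    # log p(x) = sum_i log p(x_i | x_1, ..., x_{i-1})
    total = 0.0
    for i, tok in enumerate(tokens):
        total += math.log(model(tokens[:i], tok))
    return total

# Toy stand-in: a model that is uniform over a 4-token vocabulary,
# so a 4-token image has log-probability 4 * log(1/4).
uniform_model = lambda prefix, tok: 0.25

print(raster_order_logprob(uniform_model, [0, 1, 2, 3]))  # 4 * log(0.25) ≈ -5.545
```

Because the sum is taken in one fixed order, an unlucky or uninformative prefix can dominate the score; that is exactly the rigidity the paper targets.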

The Solution: The "Group Discussion" Analogy

The authors asked: What if we didn't force the AI to look at the image in just one order?

They introduced a method called Order-Marginalized Classification. Here is how it works using a metaphor:

Imagine you have a puzzle of a picture, but the pieces are scattered.

  1. The Old AR Model: You try to solve the puzzle by picking up pieces strictly from the top-left corner to the bottom-right. If the first few pieces are confusing, you might guess the wrong picture.
  2. The New Method (RandAR): Instead of one person solving the puzzle, you gather a team of 20 people.
    • Person A starts from the top-left.
    • Person B starts from the bottom-right.
    • Person C starts from the center.
    • Person D starts from the edges.
    • ...and so on.

Each person looks at the image in a different random order. Then, you take all their guesses and average them out.

  • Why this works: If Person A gets confused by a weird texture, Person B might see the shape clearly. By combining all these different perspectives, the group gets a much smarter, more complete understanding of the image than any single person could.
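The "team of puzzle solvers" can be sketched as averaging class-conditional log-likelihoods over several random token orders, then picking the best class with Bayes' rule under a uniform prior. The `model(tokens, order, class_label)` interface below is a hypothetical stand-in for an any-order AR model such as RandAR, assumed to return per-token log-probabilities under the given permutation; the actual API differs:

```python
import random

def order_marginalized_score(model, tokens, class_label, num_orders=20):
    # Average the class-conditional log-likelihood over random
    # factorization orders ("Person A", "Person B", ... in the analogy).
    n = len(tokens)
    total = 0.0
    for _ in range(num_orders):
        order = random.sample(range(n), n)               # a fresh random token order
        total += sum(model(tokens, order, class_label))  # sum of per-token log-probs
    return total / num_orders

def classify(model, tokens, num_classes, num_orders=20):
    # Bayes' rule with a uniform prior over classes: the predicted label
    # is the class whose order-averaged generative likelihood is highest.
    scores = [order_marginalized_score(model, tokens, y, num_orders)
              for y in range(num_classes)]
    return max(range(num_classes), key=scores.__getitem__)

# Toy stand-in: class 1 explains every token better than classes 0 and 2,
# whatever the order, so the classifier should pick it.
def toy_model(tokens, order, class_label):
    base = -1.0 if class_label == 1 else -2.0
    return [base] * len(tokens)

print(classify(toy_model, list(range(16)), num_classes=3))  # prints 1
```

Averaging the per-order scores gives a Monte-Carlo estimate of the order-marginalized likelihood; since the orders are independent, they can be evaluated in parallel rather than one after another.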

The Results: Fast, Accurate, and Robust

The paper shows that this "Group Discussion" approach makes the AR model a superstar for three reasons:

  1. It's a Speed Demon:

    • Diffusion Models are like a slow, meticulous detective who needs to re-examine the crime scene 200 times to be sure.
    • The New AR Model is like a team of detectives who can look at the scene from 20 different angles simultaneously and give an answer almost instantly.
    • The Stat: The new method is up to 25 times faster than the diffusion models while being more accurate.
  2. It's Harder to Fool:

    • Standard AI models often get tricked by "shortcuts." For example, if they see a dog in a grassy field, they might guess "dog" based on the grassy background rather than the animal itself.
    • Because the new model looks at the image from so many different angles, it can't rely on just one shortcut. It has to understand the whole picture. This makes it much better at handling weird images (like sketches, paintings, or photos with noise).
  3. It Competes with the Best:

    • Usually, "Generative" models (models that create images) are worse at "Discriminative" tasks (identifying images) than models built specifically for identification.
    • This new method is so good that it actually beats the current state-of-the-art identification models in many difficult scenarios, proving that "creating" and "identifying" are two sides of the same coin.

Summary

The authors took a model that was previously considered "too rigid" because it only looked at images in one specific order (like reading a book). They unlocked its potential by letting it "read" the image in random orders and then combining those views.

The Result: A classifier that is smarter (because it sees more perspectives), faster (because it doesn't need hundreds of slow steps), and tougher (because it can't be fooled by simple visual shortcuts). It's like upgrading from a single person reading a map in the dark to a whole team shining flashlights from every angle at once.
