Next Visual Granularity Generation

This paper introduces Next Visual Granularity (NVG), an image generation framework that progressively refines an image from global layout to fine detail through a structured sequence of token maps at increasing granularity. On ImageNet, it achieves state-of-the-art performance, with FID scores that surpass the VAR series.

Yikai Wang, Zhouxia Wang, Zhonghua Wu, Qingyi Tao, Kang Liao, Chen Change Loy

Published 2026-03-03

Imagine you are an artist tasked with painting a masterpiece, but you have a very specific rule: you must work from the "big picture" down to the tiny details.

Most AI image generators today work like a frantic scribe writing a story one word at a time. They start with the first word, then the second, then the third. If they make a mistake early on (like writing "a dog" instead of "a cat"), they have to keep writing the rest of the story based on that mistake, leading to errors that pile up.

Other models work like a sculptor chipping away at a block of marble, or a photographer developing a photo in a darkroom, slowly revealing the image from a blur.

"Next Visual Granularity" (NVG) is a new approach that acts like a master architect and a team of specialized painters working together. Here is how it works, broken down into simple concepts:

1. The "Russian Nesting Doll" of Images

Imagine an image isn't just a flat picture, but a set of Russian nesting dolls.

  • The Big Doll (Coarse Granularity): This is the very first layer. It's just a rough sketch. Maybe it's a big blob of green for grass and a big blob of blue for the sky. It doesn't have a dog or a tree yet; it just has the idea of where things go.
  • The Middle Doll (Medium Granularity): Now we open the big doll. Inside, we see the shapes. The green blob is now a field, and the blue is a sky with a cloud. We can see the outline of a dog, but it's still just a silhouette.
  • The Tiny Doll (Fine Granularity): Finally, we get to the smallest doll. This is where the fur on the dog's ear, the texture of the grass, and the reflection in the eye appear.

NVG builds images exactly like this. It doesn't guess pixel by pixel. It builds a hierarchy of detail, starting with the "skeleton" of the image and slowly adding the "flesh" and "skin."
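The nesting-doll hierarchy can be sketched in a few lines. This is only an illustration of the coarse-to-fine idea, not the paper's tokenizer: real NVG levels are discrete token maps produced by a trained model, whereas here each level is just a block-averaged copy of the image.

```python
import numpy as np

def granularity_pyramid(image, scales=(1, 4, 16)):
    """Represent one image at several granularities, coarse to fine.

    Illustrative only: each level block-averages the image over an
    s x s grid, so level 0 is a rough layout ("the big doll") and the
    last level carries the full detail ("the tiny doll").
    """
    h, w = image.shape
    levels = []
    for s in scales:
        # Average each (h//s, w//s) block down to a single "token" ...
        coarse = image.reshape(s, h // s, s, w // s).mean(axis=(1, 3))
        # ... then stretch it back to full size so levels are comparable.
        levels.append(np.repeat(np.repeat(coarse, h // s, 0), w // s, 1))
    return levels

img = np.random.rand(16, 16)
pyramid = granularity_pyramid(img)
print([lvl.shape for lvl in pyramid])  # [(16, 16), (16, 16), (16, 16)]
```

Level 0 is one flat "blob" (the big doll), while the finest level reproduces the image exactly; the interesting part of NVG is learning to generate each level conditioned on the coarser ones.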

2. The Two-Step Dance: The Blueprint and The Paint

The magic of NVG is that it separates the job into two distinct teams that work in a loop at every stage:

  • Team Structure (The Architect): First, the AI draws a binary map (a black-and-white blueprint). It decides, "Okay, in this corner, we are going to have a 'cluster' of pixels that will become a dog. In that corner, we have a 'cluster' for the sky." It's like drawing the floor plan of a house before laying a single brick.
  • Team Content (The Painter): Once the blueprint is drawn, the second team looks at it and says, "Ah, this cluster is for a dog. Let's fill it with the right colors and textures."

Why is this better?
If the Architect makes a mistake (e.g., puts the dog in the sky), it can be corrected at the next, finer layer. More importantly, because the Architect commits to the structure first, the Painter doesn't have to guess where things go. This greatly reduces the "error accumulation" problem of other models.
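The two-team loop can be sketched as below. `predict_structure` and `predict_content` are hypothetical stand-ins for the paper's learned structure and content models: here the "Architect" simply flags high-variation regions and the "Painter" adds noise-like detail there, but the alternating blueprint-then-paint rhythm per stage is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_structure(canvas):
    """Architect: propose a binary blueprint of regions to refine.

    Hypothetical stand-in for the learned structure model -- here we
    simply flag the regions that deviate most from the canvas mean.
    """
    variation = np.abs(canvas - canvas.mean())
    return variation >= np.median(variation)   # True = refine here

def predict_content(canvas, blueprint):
    """Painter: add detail only where the blueprint says to."""
    detail = rng.normal(scale=0.1, size=canvas.shape)
    return np.where(blueprint, canvas + detail, canvas)

canvas = np.full((8, 8), 0.5)         # stage 0: one flat "big blob"
for stage in range(3):                # each stage: blueprint, then paint
    blueprint = predict_structure(canvas)
    canvas = predict_content(canvas, blueprint)
```

Each pass refines the same canvas rather than regenerating it, so an early coarse decision constrains, but does not doom, the later fine ones.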

3. The "Inpainting" Trick

Think of the generation process like filling in a coloring book.

  1. Stage 1: The Architect draws a few big, simple shapes (like a circle for a head and a rectangle for a body). The Painter fills them in with basic colors.
  2. Stage 2: The Architect looks at the result and says, "Okay, inside that 'head' circle, let's split it into two: an eye and an ear." The Painter then adds the details for the eye and ear.
  3. Stage 3: The Architect says, "Now, let's split the 'eye' into a pupil and a highlight." The Painter adds the tiny white dot.

At every single step, the AI is refining the previous image, not starting over. It's like zooming in on a map: first you see the continent, then the country, then the city, then the street.

4. Why This Matters (The "Superpower")

The paper shows that this method is incredibly powerful for two reasons:

  • It's Smarter: Because it understands the structure of an image (the "skeleton"), it creates images that look more realistic and have fewer weird glitches (like a dog with six legs).
  • It's Controllable: Imagine you have a photo of a flamingo. You can take the "Blueprint" (the structure map) of that flamingo and feed it to the AI, but tell it, "Paint a rabbit in this shape."
    • The AI will keep the flamingo's pose and stance (because the blueprint says so) but change the content to be a rabbit.
    • This is like taking a mannequin's pose and dressing it in a completely different outfit.
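The structure-content swap above can be sketched like this. `extract_blueprints` and `paint` are illustrative placeholders, not the paper's API: the blueprints stand in for NVG's binary structure maps, and the "rabbit" content is just a flat two-tone palette filled into the flamingo's fixed structure.

```python
import numpy as np

rng = np.random.default_rng(1)

def extract_blueprints(image):
    """Pull a coarse-to-fine stack of binary structure maps from an image.

    Illustrative stand-in: threshold block-averaged copies of the image
    instead of running a learned structure generator.
    """
    h, w = image.shape
    maps = []
    for s in (2, 4, 8):
        coarse = image.reshape(s, h // s, s, w // s).mean(axis=(1, 3))
        up = np.repeat(np.repeat(coarse, h // s, 0), w // s, 1)
        maps.append(up > up.mean())   # True = "subject" region
    return maps

def paint(blueprints, palette):
    """Fill a fixed structure with entirely new content (a flat palette)."""
    canvas = np.full(blueprints[0].shape, palette["background"])
    for m in blueprints:
        canvas = np.where(m, palette["subject"], canvas)
    return canvas

flamingo = rng.random((8, 8))
maps = extract_blueprints(flamingo)            # keep the flamingo's pose
rabbit = paint(maps, {"background": 0.2, "subject": 0.9})
```

The output inherits the flamingo's spatial layout while every pixel's content comes from the new palette, which is the essence of the pose-preserving swap.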

The Bottom Line

Previous AI models were like trying to write a novel by guessing the next word, hoping the story makes sense.
NVG is like an architect who draws the blueprints first, then builds the foundation, then the walls, and finally the decorations.

The result? Images that are sharper, more diverse, and much easier to control, all while using less computer power than the current "state-of-the-art" models. It's a shift from "guessing the next pixel" to "planning the next level of detail."