Next Visual Granularity Generation

This paper introduces Next Visual Granularity (NVG), an image generation framework that progressively refines an image from global layout to fine detail through a structured sequence of token maps at increasing granularity. On ImageNet, it achieves state-of-the-art performance, with FID scores that surpass the VAR series.

Yikai Wang, Zhouxia Wang, Zhonghua Wu, Qingyi Tao, Kang Liao, Chen Change Loy

Published 2026-03-03

Imagine you are an artist tasked with painting a masterpiece, but you have a very specific rule: you must work from the "big picture" down to the tiny details.

Most AI image generators today work like a frantic scribe writing a story one word at a time. They start with the first word, then the second, then the third. If they make a mistake early on (like writing "a dog" instead of "a cat"), they have to keep writing the rest of the story based on that mistake, leading to errors that pile up.

Other models work like a sculptor chipping away at a block of marble, or a photographer developing a photo in a darkroom, slowly revealing the image from a blur.

"Next Visual Granularity" (NVG) is a new approach that acts like a master architect and a team of specialized painters working together. Here is how it works, broken down into simple concepts:

1. The "Russian Nesting Doll" of Images

Imagine an image isn't just a flat picture, but a set of Russian nesting dolls.

  • The Big Doll (Coarse Granularity): This is the very first layer. It's just a rough sketch. Maybe it's a big blob of green for grass and a big blob of blue for the sky. It doesn't have a dog or a tree yet; it just has the idea of where things go.
  • The Middle Doll (Medium Granularity): Now we open the big doll. Inside, we see the shapes. The green blob is now a field, and the blue is a sky with a cloud. We can see the outline of a dog, but it's still just a silhouette.
  • The Tiny Doll (Fine Granularity): Finally, we get to the smallest doll. This is where the fur on the dog's ear, the texture of the grass, and the reflection in the eye appear.

NVG builds images exactly like this. It doesn't guess pixel by pixel. It builds a hierarchy of detail, starting with the "skeleton" of the image and slowly adding the "flesh" and "skin."
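The nesting-doll hierarchy can be sketched in a few lines. This is only an illustration of the coarse-to-fine idea, not the paper's tokenizer: real NVG levels are discrete token maps produced by a trained model, whereas here each level is just a block-averaged copy of the image.

```python
import numpy as np

def granularity_pyramid(image, scales=(1, 4, 16)):
    """Represent one image at several granularities, coarse to fine.

    Illustrative only: each level block-averages the image over an
    s x s grid, so level 0 is a rough layout ("the big doll") and the
    last level carries the full detail ("the tiny doll").
    """
    h, w = image.shape
    levels = []
    for s in scales:
        # Average each (h//s, w//s) block down to a single "token" ...
        coarse = image.reshape(s, h // s, s, w // s).mean(axis=(1, 3))
        # ... then stretch it back to full size so levels are comparable.
        levels.append(np.repeat(np.repeat(coarse, h // s, 0), w // s, 1))
    return levels

img = np.random.rand(16, 16)
pyramid = granularity_pyramid(img)
print([lvl.shape for lvl in pyramid])  # [(16, 16), (16, 16), (16, 16)]
```

Level 0 is one flat "blob" (the big doll), while the finest level reproduces the image exactly; the interesting part of NVG is learning to generate each level conditioned on the coarser ones.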

2. The Two-Step Dance: The Blueprint and The Paint

The magic of NVG is that it separates the job into two distinct teams that work in a loop at every stage:

  • Team Structure (The Architect): First, the AI draws a binary map (a black-and-white blueprint). It decides, "Okay, in this corner, we are going to have a 'cluster' of pixels that will become a dog. In that corner, we have a 'cluster' for the sky." It's like drawing the floor plan of a house before laying a single brick.
  • Team Content (The Painter): Once the blueprint is drawn, the second team looks at it and says, "Ah, this cluster is for a dog. Let's fill it with the right colors and textures."

Why is this better?
If the Architect makes a mistake (e.g., puts the dog in the sky), it can be corrected at the next, finer layer. More importantly, because the Architect commits to the structure first, the Painter doesn't have to guess where things go. This greatly reduces the "error accumulation" problem of other models.
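The two-team loop can be sketched as below. `predict_structure` and `predict_content` are hypothetical stand-ins for the paper's learned structure and content models: here the "Architect" simply flags high-variation regions and the "Painter" adds noise-like detail there, but the alternating blueprint-then-paint rhythm per stage is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_structure(canvas):
    """Architect: propose a binary blueprint of regions to refine.

    Hypothetical stand-in for the learned structure model -- here we
    simply flag the regions that deviate most from the canvas mean.
    """
    variation = np.abs(canvas - canvas.mean())
    return variation >= np.median(variation)   # True = refine here

def predict_content(canvas, blueprint):
    """Painter: add detail only where the blueprint says to."""
    detail = rng.normal(scale=0.1, size=canvas.shape)
    return np.where(blueprint, canvas + detail, canvas)

canvas = np.full((8, 8), 0.5)         # stage 0: one flat "big blob"
for stage in range(3):                # each stage: blueprint, then paint
    blueprint = predict_structure(canvas)
    canvas = predict_content(canvas, blueprint)
```

Each pass refines the same canvas rather than regenerating it, so an early coarse decision constrains, but does not doom, the later fine ones.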

3. The "Inpainting" Trick

Think of the generation process like filling in a coloring book.

  1. Stage 1: The Architect draws a few big, simple shapes (like a circle for a head and a rectangle for a body). The Painter fills them in with basic colors.
  2. Stage 2: The Architect looks at the result and says, "Okay, inside that 'head' circle, let's split it into two: an eye and an ear." The Painter then adds the details for the eye and ear.
  3. Stage 3: The Architect says, "Now, let's split the 'eye' into a pupil and a highlight." The Painter adds the tiny white dot.

At every single step, the AI is refining the previous image, not starting over. It's like zooming in on a map: first you see the continent, then the country, then the city, then the street.

4. Why This Matters (The "Superpower")

The paper shows that this method is incredibly powerful for two reasons:

  • It's Smarter: Because it understands the structure of an image (the "skeleton"), it creates images that look more realistic and have fewer weird glitches (like a dog with six legs).
  • It's Controllable: Imagine you have a photo of a flamingo. You can take the "Blueprint" (the structure map) of that flamingo and feed it to the AI, but tell it, "Paint a rabbit in this shape."
    • The AI will keep the flamingo's pose and stance (because the blueprint says so) but change the content to be a rabbit.
    • This is like taking a mannequin's pose and dressing it in a completely different outfit.
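The structure-content swap above can be sketched like this. `extract_blueprints` and `paint` are illustrative placeholders, not the paper's API: the blueprints stand in for NVG's binary structure maps, and the "rabbit" content is just a flat two-tone palette filled into the flamingo's fixed structure.

```python
import numpy as np

rng = np.random.default_rng(1)

def extract_blueprints(image):
    """Pull a coarse-to-fine stack of binary structure maps from an image.

    Illustrative stand-in: threshold block-averaged copies of the image
    instead of running a learned structure generator.
    """
    h, w = image.shape
    maps = []
    for s in (2, 4, 8):
        coarse = image.reshape(s, h // s, s, w // s).mean(axis=(1, 3))
        up = np.repeat(np.repeat(coarse, h // s, 0), w // s, 1)
        maps.append(up > up.mean())   # True = "subject" region
    return maps

def paint(blueprints, palette):
    """Fill a fixed structure with entirely new content (a flat palette)."""
    canvas = np.full(blueprints[0].shape, palette["background"])
    for m in blueprints:
        canvas = np.where(m, palette["subject"], canvas)
    return canvas

flamingo = rng.random((8, 8))
maps = extract_blueprints(flamingo)            # keep the flamingo's pose
rabbit = paint(maps, {"background": 0.2, "subject": 0.9})
```

The output inherits the flamingo's spatial layout while every pixel's content comes from the new palette, which is the essence of the pose-preserving swap.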

The Bottom Line

Previous AI models were like trying to write a novel by guessing the next word, hoping the story makes sense.
NVG is like an architect who draws the blueprints first, then builds the foundation, then the walls, and finally the decorations.

The result? Images that are sharper, more diverse, and much easier to control, all while using less computer power than the current "state-of-the-art" models. It's a shift from "guessing the next pixel" to "planning the next level of detail."