NAMI: Efficient Image Generation via Bridged Progressive Rectified Flow Transformers

Imagine you are an artist hired to paint a massive, hyper-realistic mural of a bustling city.

The Old Way (Traditional AI Models):
Most current AI image generators work like a perfectionist painter who insists on painting the entire mural at full resolution from the very first brushstroke. They try to get the tiny details of a single brick, the texture of a leaf, and the reflection in a window all at once.

The Problem: This is incredibly slow and exhausting. It requires a huge team of painters (billions of parameters) working non-stop. Even for a small sketch, they waste time trying to perfect details that haven't even been sketched in yet.

The New Way (NAMI):
The authors of the NAMI paper take a smarter approach. They observe that painting a picture is a progressive process: you don't start with the fine details; you start with a rough sketch, then add layers, and finally polish the details.

Here is how NAMI works, broken down with simple analogies:

1. The "Matryoshka" Strategy (Progressive Resolution)

Instead of painting the whole 1024x1024 pixel image at once, NAMI breaks the job into three distinct stages, like building a house:

Stage 1 (The Blueprint): The AI starts with a tiny, low-resolution sketch (256x256 pixels). It only uses a small, lightweight team of painters to figure out the big picture: "Where is the sky? Where is the building? Is there a tree?" It ignores all the tiny details.
Stage 2 (The Framing): The sketch is blown up to a medium size (512x512 pixels). Now, a medium-sized team joins in to add structure and shapes.
Stage 3 (The Finishing Touches): The image is blown up to full size (1024x1024 pixels). Now, the full, heavy-duty team arrives to add the intricate details, textures, and lighting.

Why this is cool: In the old way, the heavy-duty team was working on the "blueprint" stage, which is a waste of their expensive skills. NAMI saves money and time by using the right-sized team for the right job.

2. The "Bridge" (BridgeFlow)

When you zoom in from a small sketch to a larger one, things can get messy. The lines might get blurry, or the colors might shift weirdly.

The Old Fix: Previous methods would just "guess" or "re-noise" the image when zooming in, which is like trying to fix a blurry photo by squinting at it. It's slow and often inaccurate.
The NAMI Fix: They built a special BridgeFlow module. Think of this as a smart translator or a perfectly fitted adapter. When the image moves from the "Small Team" stage to the "Medium Team" stage, this bridge instantly and smoothly translates the rough sketch into a clean, ready-to-work canvas. It ensures the "blueprint" matches perfectly with the "framing" without any glitches.

3. The "Assembly Line" (Efficiency)

Because NAMI uses fewer layers (painters) for the early stages and only adds more layers as the image gets bigger, it runs much faster.

The Result: They claim to cut the time it takes to generate a high-quality image by 64%. It's like switching from a single person painting the whole mural to an assembly line where specialized workers handle specific parts of the process.

4. The "New Test" (NAMI-1K)

The authors also noticed that the standard tests used to judge AI art were a bit boring and repetitive (like asking the AI to draw "a cat" or "a dog" over and over).

They created their own test called NAMI-1K. Imagine a test that asks the AI to draw "a sad clown eating a taco on a rainy Tuesday" or "a futuristic city made of glass." It tests the AI on complex stories, weird combinations, and human preferences, not just simple objects.

Summary

NAMI is like a smart construction manager for AI art. Instead of throwing a giant, expensive crew at every single task, it:

Starts small: Uses a tiny crew to sketch the layout.
Grows gradually: Adds more workers only when the image gets bigger.
Bridges the gaps: Uses a special tool to make sure the transition between stages is smooth.

The result? You get beautiful, high-quality images much faster and with less computing power, making it easier for everyone to use these powerful tools.

1. Problem Statement

Current state-of-the-art text-to-image (T2I) models, particularly those based on Diffusion Transformers (DiT) and Rectified Flow (e.g., SD3, FLUX), achieve high-quality generation but suffer from high inference latency and computational costs due to their massive parameter sizes.

Inefficiency: Existing methods often perform unified denoising across all sampling stages using the full model capacity, ignoring the fact that early stages only require generating coarse layouts while later stages require fine details.
Redundancy: Using the same number of Transformer layers for low-resolution (layout) and high-resolution (detail) stages leads to significant parameter redundancy.
Benchmark Limitations: Existing evaluation benchmarks (e.g., GenEval, DPG-Benchmark) suffer from limited prompt diversity, distributional biases, and a lack of real-world user scenario representation.

2. Methodology: NAMI

The authors propose NAMI (Bridged Progressive Rectified Flow Transformers), a framework that decomposes the generation process across temporal, spatial, and architectural dimensions to improve efficiency without sacrificing quality.

A. Progressive Rectified Flow Architecture

Instead of a single unified model, NAMI divides the rectified flow generation process into $K$ resolution stages (e.g., 256 $\to$ 512 $\to$ 1024).

Temporal & Spatial Decomposition: The flow is split into $K$ time windows corresponding to different resolutions.
Architectural Scaling:
- Low-Resolution Stages: Use a subset of Transformer layers to generate image layouts and concept contours quickly.
- High-Resolution Stages: Progressively add more Transformer layers as the resolution increases to handle fine-grained details.
- This creates a "spatially cascaded" DiT structure where the model depth grows with resolution.
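
The stage layout described above can be sketched as a small lookup table. This is a minimal illustration, not the authors' implementation: the `Stage` class, the `build_stages` and `stage_for_time` helpers, the equal-width time windows, the layer counts (8, 16, 24), and the convention that $t$ runs from 0 (noise) to 1 (image) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    resolution: int   # side length in pixels for this stage
    t_start: float    # start of the time window (inclusive)
    t_end: float      # end of the time window (exclusive)
    num_layers: int   # how many Transformer layers this stage runs


def build_stages(resolutions, layer_counts):
    """Split [0, 1) into K equal time windows, one per resolution.

    Equal-width windows are an assumption of this sketch; the summary
    above does not say how the windows are sized.
    """
    if len(resolutions) != len(layer_counts):
        raise ValueError("need one layer count per resolution")
    k = len(resolutions)
    return [
        Stage(res, i / k, (i + 1) / k, layers)
        for i, (res, layers) in enumerate(zip(resolutions, layer_counts))
    ]


def stage_for_time(stages, t):
    """Return the stage whose time window contains t."""
    for stage in stages:
        if stage.t_start <= t < stage.t_end:
            return stage
    return stages[-1]  # t == 1.0 falls into the final stage


# Hypothetical layer counts: depth grows with resolution.
stages = build_stages([256, 512, 1024], [8, 16, 24])
for s in stages:
    print(s)
```

Under these assumed windows, `stage_for_time(stages, 0.5)` selects the 512 stage, so a sampler would run only that stage's subset of layers at that step.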

B. BridgeFlow Module

To connect these different stages, the authors introduce a learnable BridgeFlow module.

Function: It aligns the probability distributions between the endpoint of a lower-resolution stage and the starting point of the next higher-resolution stage.
Mechanism: Unlike previous non-parametric methods (like Pyramid Flow) that rely on rescaling and re-noising (which are inefficient and lack adaptation), BridgeFlow uses a learnable linear transformation ($W \cdot \text{Up}(\hat{x}) + B$), where $\text{Up}$ upsamples the lower-resolution stage output $\hat{x}$ and $W$ and $B$ are learned parameters.
Benefit: This ensures continuity in the probabilistic path while maintaining computational efficiency and robustness.
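
The bridging formula $W \cdot \text{Up}(\hat{x}) + B$ can be illustrated on a tiny single-channel grid. This is a sketch under stated assumptions rather than the module from the paper: nearest-neighbour upsampling stands in for $\text{Up}$, `weight` and `bias` are plain scalars here (in NAMI they are learned, and this summary does not give their shapes), and the function names are hypothetical.

```python
def upsample_nearest(grid, factor=2):
    """Nearest-neighbour upsampling of a 2D list (assumed stand-in for Up)."""
    out = []
    for row in grid:
        # Repeat each value horizontally, then repeat the widened row vertically.
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def bridge(grid, weight, bias, factor=2):
    """Apply weight * Up(grid) + bias element-wise.

    Scalar weight and bias are a simplification made for this example.
    """
    upsampled = upsample_nearest(grid, factor)
    return [[weight * value + bias for value in row] for row in upsampled]


low_res = [[1.0, 2.0],
           [3.0, 4.0]]
for row in bridge(low_res, weight=0.5, bias=1.0):
    print(row)
```

The point of the sketch is that the transition is a single cheap affine map on the upsampled output, in contrast to a rescale-and-re-noise step with no trainable parameters.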

C. Multi-Resolution Joint Training

NAMI employs a novel training strategy where data of different resolutions is processed simultaneously:

Simultaneous Optimization: The model is trained on images of varying resolutions (e.g., 256, 512, 1024) within the same batch.
Knowledge Sharing: This allows the model to learn semantics at low resolutions and details at high resolutions concurrently, preventing "catastrophic forgetting" often seen in sequential fine-tuning.
Loss Weighting: Losses from different time windows are dynamically weighted to balance convergence.
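
A rough sketch of how per-stage losses might be combined in one training step follows. It assumes the standard rectified flow setup (a straight-line path between noise and image, with the constant velocity along that path as the regression target) and takes the stage weights as given numbers, because this summary does not specify the dynamic weighting rule. All function names are hypothetical.

```python
def interpolate(noise, image, t):
    """Point on the straight path from noise (t=0) to image (t=1)."""
    return [(1.0 - t) * n + t * x for n, x in zip(noise, image)]


def velocity_target(noise, image):
    """Constant velocity along the straight path: image minus noise."""
    return [x - n for n, x in zip(noise, image)]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def joint_loss(stage_losses, weights):
    """Weighted average of per-stage losses.

    The weights are supplied by the caller; how NAMI updates them
    during training is not described in this summary.
    """
    if len(stage_losses) != len(weights):
        raise ValueError("need one weight per stage loss")
    return sum(w * l for w, l in zip(weights, stage_losses)) / sum(weights)


# Toy values standing in for the 256, 512, and 1024 stage losses.
print(joint_loss([1.0, 2.0, 3.0], [1.0, 1.0, 2.0]))
```

In a real training loop each stage loss would be `mse(model_output, velocity_target(...))` evaluated at a time sampled from that stage's window, on samples at that stage's resolution.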

3. Key Contributions

Bridged Progressive Rectified Flow: A novel architecture that enables multi-resolution training and inference, accelerating model convergence by learning semantics early and details later.
Efficient Inference via Spatial Cascading: By using fewer layers for low-resolution stages and progressively adding layers, NAMI-2B reduces inference time by 64% for 1024 $\times$ 1024 images compared to a baseline FLUX-2B model of the same total parameter size.
BridgeFlow Module: A learnable alignment mechanism that replaces inefficient non-parametric transitions, ensuring smooth flow continuity between stages.
NAMI-1K Benchmark: A new evaluation dataset comprising 1,000 prompts with diverse lengths and topics (short, human-created, and AI-generated long prompts) to mitigate distributional bias and better assess real-world performance.

4. Experimental Results

The authors evaluated NAMI-2B (2 billion parameters) against state-of-the-art models like FLUX-dev (12B), SD3-medium (2B), and SANA.

Inference Speed:
- NAMI-2B generates 1024 $\times$ 1024 images in 2.98 seconds (on A100), compared to 8.47 seconds for the baseline FLUX-based model.
- This represents a 64.82% reduction in inference time.
- The flow piecewise design alone saves 53% of time; the model partitioning saves an additional 11%.
Generation Quality:
- Quantitative Benchmarks: NAMI-2B achieves competitive or leading scores on GenEval and DPG-Benchmark, often outperforming models of similar size (e.g., SD3-medium, also 2B).
- Human Evaluation (NAMI-1K): On the new benchmark, NAMI-2B scores 70.69 overall, narrowly ahead of SD3-medium (69.97), Infinity (69.77), and SANA (67.80), but still well behind the much larger FLUX-dev (85.05).
- Ablation Studies: Experiments confirmed that both the flow piecewise strategy and the model layer partitioning are essential for the performance gains. The BridgeFlow module was shown to be more efficient and effective than MLP or CNN-based alternatives.
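
The inference-speed figures reported above can be checked with a few lines of arithmetic. The helper names are made up for this example; the two timings are the ones quoted in this section.

```python
def reduction_percent(baseline_s, new_s):
    """Percentage of baseline time saved."""
    return 100.0 * (baseline_s - new_s) / baseline_s


def speedup_factor(baseline_s, new_s):
    """How many times faster the new timing is than the baseline."""
    return baseline_s / new_s


baseline_s = 8.47  # baseline timing quoted above, in seconds
nami_s = 2.98      # NAMI-2B timing quoted above, in seconds

print(f"{reduction_percent(baseline_s, nami_s):.2f}% less time")
print(f"{speedup_factor(baseline_s, nami_s):.2f}x faster")
```

This reproduces the 64.82% reduction, which corresponds to roughly a 2.84x speedup, and is consistent with the reported 53% and 11% contributions summing to about 64%.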

5. Significance

Commercial Viability: By drastically reducing inference latency and computational cost while maintaining high-quality output, NAMI makes high-fidelity T2I generation more feasible for commercial deployment and real-time applications.
Paradigm Shift: It challenges the "one-size-fits-all" approach to DiT architectures, demonstrating that adaptive model capacity based on generation stages is a more efficient strategy.
Evaluation Standard: The introduction of NAMI-1K addresses critical gaps in current evaluation metrics, providing a more holistic view of model capabilities regarding prompt diversity and human preference.
Versatility: The framework is shown to be adaptable to other tasks, such as image editing, by reusing early-stage layouts and modifying later-stage instructions.

In conclusion, NAMI represents a significant step forward in efficient generative AI, indicating that strategic architectural decomposition and multi-resolution training can yield faster, high-quality image generation without the need for massive parameter counts.