NNiT: Width-Agnostic Neural Network Generation with Structurally Aligned Weight Spaces

This paper introduces NNiT, a width-agnostic generative model that uses Graph HyperNetworks to structurally align weight spaces and tokenizes them into patches, enabling a single sequence model to generate functional neural networks with unseen architectures and widths for robotics tasks.

Jiwoo Kim, Swarajh Mehta, Hao-Lun Hsu, Hyunwoo Ryu, Yudong Liu, Miroslav Pajic

Published 2026-03-03

Imagine you are trying to teach a robot to pick up a cube. To do this, the robot needs a "brain" (a neural network) with specific instructions (weights) on how to move its arm.

Usually, if you want a robot brain that is slightly bigger or smaller than the one you trained, you have to start from scratch and train it all over again. It's like trying to fit a suit made for a giant onto a child; you can't just stretch it, you have to sew a whole new one.

NNiT is a new invention that solves this problem. It's like a "universal tailor" that can instantly generate a perfect, working brain for a robot of any size, even sizes it has never seen before.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Shuffled Deck" Mess

Imagine a neural network's weights as a deck of cards. The cards themselves matter for the math, but it doesn't matter which specific card sits in which spot: you can swap two neurons (along with their connections) and the brain behaves exactly the same, as long as the overall pattern of the deck is preserved.

  • The Old Way: When computers train these brains, they shuffle the deck randomly every time. One time, the "Ace" is at the top; the next time, it's at the bottom. Because the order is random and messy, a computer trying to learn from these decks gets confused. It can't tell if a new, wider deck is just a bigger version of an old one or a completely different game.
  • The Result: If you try to make a brain wider (add more neurons), the old computer models break because they were trained on a specific, rigid size.

2. The Secret Sauce: The "Graph HyperNetwork" (GHN)

The authors realized that if they used a special tool called a Graph HyperNetwork (GHN) to create the training data, they could fix the mess.

  • The Analogy: Think of the GHN as a strict architect. Instead of letting the robot brain be built randomly, the architect forces every single brain to be built in the exact same logical order.
  • The Result: Suddenly, the "cards" in the deck are no longer shuffled. They are neatly stacked. If you look at the "Ace" in a small brain, it's in the same spot as the "Ace" in a huge brain. This creates a structured map where the computer can see the patterns clearly, regardless of the brain's size.
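In weight-space terms: because the GHN produces every network's weights as the same deterministic function of each neuron's position in the computation graph, a narrow layer lines up as a sub-block of a wider one. A toy sketch of that alignment idea (the deterministic generator here is invented for illustration and is not the paper's GHN):

```python
import numpy as np

def toy_generator(width_out, width_in):
    """Deterministic stand-in for a GHN: weight (i, j) depends only
    on the neuron positions i and j, never on a random seed."""
    i = np.arange(width_out)[:, None]
    j = np.arange(width_in)[None, :]
    return np.sin(0.7 * i + 1.3 * j)

small = toy_generator(4, 3)  # a layer of a narrow network
large = toy_generator(8, 3)  # the same layer in a wider network

# The narrow layer is exactly the top block of the wide one:
# neuron k plays the same role ("the Ace stays in the same spot")
# at every width, so one model can learn a single shared pattern.
assert np.allclose(large[:4], small)
```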

3. The Magic Trick: "Patch Tokenization"

Now that the data is organized, the authors introduced NNiT (Neural Network Diffusion Transformers).

  • The Old Way: Imagine trying to describe a picture by listing every single pixel in a long line. If you want a bigger picture, you have to rewrite the whole list.
  • The NNiT Way: Instead of listing pixels, NNiT cuts the picture into small square patches (like a mosaic).
    • If you want a wider brain, you don't change the rules. You just add more patches to the mosaic.
    • Because the GHN made sure the patches are always organized the same way, the computer knows exactly how to stitch them together. It's like playing with LEGO bricks: whether you build a small house or a skyscraper, you use the same types of bricks; you just use more of them.
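The mosaic idea can be sketched as chopping a flattened weight matrix into fixed-size patches: a wider layer simply yields more patches of the same shape, so the sequence model's "vocabulary" never changes. Toy code, with the patch size chosen arbitrarily:

```python
import numpy as np

PATCH = 4  # fixed token size; the model only ever sees this shape

def tokenize(W, patch=PATCH):
    """Flatten a weight matrix and cut it into fixed-size patches,
    zero-padding the tail so every token has the same length."""
    flat = W.ravel()
    pad = (-len(flat)) % patch
    flat = np.pad(flat, (0, pad))
    return flat.reshape(-1, patch)

narrow = tokenize(np.ones((4, 3)))  # 12 weights -> 3 tokens
wide   = tokenize(np.ones((8, 3)))  # 24 weights -> 6 tokens

# Same token shape at every width; only the sequence length grows.
assert narrow.shape == (3, 4) and wide.shape == (6, 4)
```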

4. The Result: Zero-Shot Magic

The paper tested this on a robot arm in a simulation (ManiSkill3).

  • The Test: They trained the AI on robots with specific brain sizes. Then, they asked it to build a brain for a robot with a completely new size (one it had never seen).
  • The Outcome:
    • Old AI Models: Failed miserably. They tried to stretch their old knowledge and broke.
  • NNiT: Succeeded, with an over-85% success rate. It looked at the new size, grabbed the right "patches" from its memory, and built a working brain instantly.

Summary

NNiT is like a master chef who doesn't just cook one specific meal.

  1. They organize their ingredients perfectly (using the GHN so everything is in the right place).
  2. They chop everything into standard-sized cubes (Patch Tokenization).
  3. When a customer orders a meal for 2 people or 200 people, the chef just adds more cubes to the pot. The recipe remains the same, but the size changes effortlessly.

This allows robots to adapt instantly to new hardware or tasks without needing hours of retraining.
