Imagine you have a magical box that can store a 3D object, like a toy car or a chair. In the past, to store this object, you had to take thousands of photos from every angle and save them all. That takes up a lot of space.
NeRFs (Neural Radiance Fields) changed the game. Instead of saving photos, a NeRF saves the object as a recipe written in the "weights" (the numbers) of a small neural network. If you feed a 3D coordinate into this recipe, it tells you what color that point in space should be and how solid (opaque) it is. It's like having a single, tiny file that can generate an unlimited number of views of your object.
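To make the "recipe" idea concrete, here is a toy sketch of a NeRF-style network in NumPy. The sizes are made up and the weights are random for illustration; a real NeRF is larger and its weights are learned from photos.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "recipe": a 2-layer MLP whose weights ARE the stored object.
# Sizes are hypothetical; a real NeRF's weights are trained, not random.
W1 = rng.standard_normal((3, 64))   # 3D coordinate -> 64 hidden features
b1 = np.zeros(64)
W2 = rng.standard_normal((64, 4))   # hidden -> (R, G, B, density)
b2 = np.zeros(4)

def query(point):
    """Feed a 3D coordinate into the recipe; get back color + opacity."""
    hidden = np.maximum(0.0, point @ W1 + b1)   # ReLU layer
    out = hidden @ W2 + b2
    return out[:3], out[3]                      # (color, density)

color, density = query(np.array([0.1, 0.2, 0.3]))
print(color.shape)   # (3,)
```

The whole object lives in `W1, b1, W2, b2` — that handful of arrays is the "single, tiny file."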
The Problem: The "Language Barrier"
Here's the catch: Scientists have been inventing different ways to write these recipes.
- Recipe A (MLP): Writes the recipe like a long, straight list of instructions.
- Recipe B (Tri-Plane): Writes the recipe like a 3D grid of notes.
- Recipe C (Hash Table): Writes the recipe like a giant, organized phonebook.
The problem? The AI tools we built to understand these recipes were monolingual.
- One tool could only read Recipe A.
- Another tool could only read Recipe B.
- If you handed the "Phonebook" recipe to the "List" reader, it would be confused and fail.
This meant that if a scientist invented a new, better way to write the recipe tomorrow, all our existing tools would become useless. We couldn't compare a "List" car to a "Phonebook" car because they spoke different languages.
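To see why the recipes look so different to a machine, here is a rough sketch of what each one's raw parameters might look like. All sizes below are invented for illustration, not taken from the paper.

```python
import numpy as np

# Illustrative parameter layouts for the three recipe styles.
# Every size here is hypothetical.

# Recipe A (MLP): a straight chain of weight matrices.
mlp = [np.zeros((3, 64)), np.zeros((64, 64)), np.zeros((64, 4))]

# Recipe B (Tri-Plane): three 2D feature grids (XY, XZ, YZ planes).
res, ch = 128, 16
tri_plane = [np.zeros((res, res, ch)) for _ in range(3)]

# Recipe C (Hash Table): one flat table of features, indexed by
# hashing a 3D coordinate.
hash_table = [np.zeros((2**14, 2))]

for name, params in [("MLP", mlp), ("Tri-plane", tri_plane),
                     ("Hash table", hash_table)]:
    print(name, sum(p.size for p in params))  # total parameter count
```

A tool built to walk the chain of matrices in `mlp` has no idea what to do with the grids in `tri_plane` or the lookup table in `hash_table` — that is the language barrier.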
The Solution: The "Universal Translator"
This paper introduces a new framework that acts as a Universal Translator for 3D objects.
1. Turning Recipes into Graphs (The Map)
First, the authors realized that no matter how the recipe is written (List, Grid, or Phonebook), it's all just a bunch of connections between numbers. They figured out how to turn every single recipe type into a map (a graph).
- Think of the weights as cities on a map.
- Think of the connections between them as roads.
- Whether the map looks like a subway system (List), a city grid (Grid), or a highway network (Phonebook), it's still just a map.
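The "cities and roads" conversion can be sketched for the simplest case, an MLP. This is my own minimal construction to show the idea, not the paper's exact graph scheme: each neuron becomes a node and each weight becomes an edge carrying its value.

```python
import numpy as np

# Sketch: turn an MLP's weight matrices into one graph.
# Each neuron is a node ("city"); each weight is an edge ("road").
rng = np.random.default_rng(0)
layers = [rng.standard_normal((3, 4)), rng.standard_normal((4, 2))]

# Give every neuron a global node id, layer by layer.
sizes = [layers[0].shape[0]] + [W.shape[1] for W in layers]  # [3, 4, 2]
offsets = np.cumsum([0] + sizes)                             # [0, 3, 7, 9]
num_nodes = int(offsets[-1])

edges = []  # (source node, destination node, weight value)
for li, W in enumerate(layers):
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            edges.append((offsets[li] + i, offsets[li + 1] + j, W[i, j]))

print(num_nodes, len(edges))   # 9 nodes, 3*4 + 4*2 = 20 edges
```

A tri-plane or hash-table recipe would produce a differently shaped graph, but the output format — nodes plus weighted edges — is the same, and that shared format is what makes one reader possible.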
2. The Graph Meta-Network (The Translator)
They built a special AI called a Graph Meta-Network. Imagine this as a super-smart librarian who doesn't care what language the book is written in. As long as the book is a "map," the librarian can read it.
- The librarian looks at the map of the "List" car and the map of the "Phonebook" car.
- Instead of getting confused by the different layouts, the librarian learns to ignore the style of the map and focus on the content.
- If both maps describe a "yellow pickup truck," the librarian puts them in the same pile, regardless of whether one was written as a list or a phonebook.
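How can one "librarian" read any map? Because a graph network only looks at connectivity, not layout. Below is one toy round of message passing — a stand-in for the idea behind a graph meta-network, not the paper's actual architecture.

```python
import numpy as np

# One toy round of message passing over a weight graph.
# Each node sums messages arriving along its incoming edges; only the
# connectivity and edge values matter, not which recipe produced them.
edges = [(0, 2, 0.5), (1, 2, -1.0), (2, 3, 2.0)]  # (src, dst, weight)
num_nodes = 4
h = np.ones(num_nodes)        # hypothetical starting node features

new_h = h.copy()
for src, dst, w in edges:
    new_h[dst] += w * h[src]  # message = edge weight * source feature

print(new_h)   # node 2 gets 0.5, node 3 gets 3.0
```

Hand this same loop a graph built from a list, a grid, or a phonebook and it runs unchanged — that is the sense in which the reader is universal.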
3. The "Contrastive" Lesson (The Teacher)
How did they teach the librarian to ignore the style? They used a clever training trick called Contrastive Learning.
- Imagine you show the librarian two pictures of a cat: one is a photo, the other is a sketch.
- You say, "These are the same cat! Put them next to each other."
- Then you show a picture of a dog and say, "This is different. Put it far away."
- By doing this thousands of times with different "styles" of NeRFs representing the same object, the librarian learns to create a universal language where "Car" always means "Car," no matter how the recipe was written.
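The photo-and-sketch lesson can be written down as a simplified InfoNCE-style contrastive loss. This is a generic sketch of the technique, not the paper's exact objective; the embeddings below are invented 2D toys.

```python
import numpy as np

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Low when anchor is closest to its positive; high otherwise."""
    def sim(a, b):  # cosine similarity
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([sim(anchor, positive)] +
                      [sim(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()                    # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                  # want the positive on top

car_mlp  = np.array([1.0, 0.1])   # "car" stored as an MLP recipe
car_hash = np.array([0.9, 0.2])   # the same car, stored as a hash table
dog      = np.array([-1.0, 0.8])  # a different object entirely

loss = contrastive_loss(car_mlp, car_hash, [dog])
print(loss < 0.1)   # True: the two car embeddings already sit together
```

Minimizing this loss over many such triplets is what pulls the two "cat" pictures together and pushes the "dog" away, style be damned.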
Why This Matters
This is a huge leap forward for three reasons:
- Future-Proofing: If scientists invent a new way to write NeRF recipes next year, this framework can likely understand it immediately without needing to be rebuilt.
- Mixing and Matching: You can now search a database for "chairs." It doesn't matter if the database has chairs written in Lists, Grids, or Phonebooks. The system finds them all.
- New Capabilities: For the first time, the system can handle "Hash Table" recipes (which are very popular and fast), opening the door to faster and more efficient 3D AI applications.
The Bottom Line
Think of this paper as building a Rosetta Stone for 3D objects. Before, we needed a different dictionary for every type of 3D file. Now, we have one master key that unlocks the meaning of any 3D object, regardless of how it was encoded. This allows us to finally treat 3D data as a unified, searchable, and understandable format, paving the way for smarter AI that truly understands the 3D world.