Implicit-Zoo: A Large-Scale Dataset of Neural Implicit Functions for 2D Images and 3D Scenes

Imagine you have a giant library of photos and 3D objects. Usually, to store a photo, you save it as a grid of pixels (like a mosaic made of tiny colored tiles). To store a 3D object, you save it as a mesh of polygons (like a wireframe model).

The Problem:
This paper argues that this "grid" way of storing data is a bit rigid. It's like trying to describe a smooth, flowing river by only counting the individual water droplets. It works, but it's clunky and hard to work with if you want to zoom in infinitely or understand the shape's true nature.

The Solution: "Implicit-Zoo"
The researchers built a massive new library called Implicit-Zoo. Instead of saving the pixels or the wireframes, they saved the mathematical recipes (functions) that create the images and 3D shapes.

Think of it this way:

Old Way: You save a JPEG file of a cat. If you zoom in too much, it gets blurry and blocky.
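Implicit-Zoo Way: You save a "Cat Generator" app. Zoom in as far as you like and the app simply computes new pixels at the exact spots you ask for, so the image stays smooth instead of turning blocky (although it cannot invent detail that was never in the original photo).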

They spent nearly 1,000 days of supercomputer time (GPU days) to train these "Generators" for over 1.5 million images and 3D objects. They checked them all to make sure the "Cat Generator" actually looks like a cat, not a blob.

Why is this a big deal? (The "Magic" Part)

The paper shows two main ways this new library helps computers learn better:

1. The "Smart Map" Analogy (Learnable Tokenization)

When computers (specifically AI models called Transformers) look at an image, they usually chop it up into fixed squares, like a chessboard. They treat every square the same, whether it's a blank sky or a detailed eye.

The Old Way: Imagine a security guard checking every square inch of a building with a flashlight, regardless of whether it's a wall or a window. It's inefficient.
The Implicit-Zoo Way: Because the data is a smooth mathematical function, the AI can learn to move the squares around. It learns to make the "squares" bigger over the blank sky and smaller (or more numerous) over the complex eye.
The Result: The AI becomes a "smart map" reader. It focuses its attention where the details are and spends less effort on empty space. This made the AI measurably, if modestly, better at recognizing what's in the picture (classification) and labeling every part of it (segmentation).

2. The "Time Machine" Analogy (3D Pose Regression)

Imagine you have a 3D model of a room (the "Generator"). You take a photo of that room from a weird angle. The computer's job is to figure out exactly where the camera was standing when it took the photo.

The Old Way: The computer guesses and checks, often getting lost if the room looks similar from different angles.
The Implicit-Zoo Way: The AI reads features straight out of the smooth 3D "Generator", combines them with features from your photo, and predicts the camera position directly, with an optional refinement step to polish the estimate.
The Result: The AI can estimate the camera's pose in 3D space with a rotation error of about 20°, even for objects it has never seen before.

The "Zoo" Metaphor

Why call it a "Zoo"?
Just like a real zoo has different animals (lions, tigers, bears) that all have different needs but live in the same place, this dataset has different "creatures" of data:

CIFAR-10: Tiny, simple images (like baby animals).
ImageNet: Huge, complex images (like adult elephants).
Cityscapes: Busy street scenes (like a bustling city zoo).
OmniObject3D: 3D objects (like 3D sculptures of animals).

The researchers built a habitat where all these different types of data can live together in a smooth, mathematical format, ready for other scientists to study.

In a Nutshell

The authors built a giant, high-quality library of mathematical "image generators" instead of just saving pictures. They proved that if you let AI learn from these smooth generators, it can teach itself to look at the world more intelligently (focusing on what matters) and understand 3D space better (knowing exactly where a camera is).

It's like upgrading from a library of static paintings to a library of living, breathing holograms that the computer can interact with and learn from.

1. Problem Statement

Implicit Neural Representations (INRs) have emerged as a powerful tool for representing continuous signals (images, 3D scenes) using Multi-Layer Perceptrons (MLPs). They offer advantages such as high-fidelity reconstruction, smooth interpolation, and resolution independence. However, research in this field faces three critical bottlenecks:

Lack of Large-Scale Datasets: Existing INR datasets are limited in scale and diversity, hindering the training of data-hungry architectures like Transformers.
Computational Cost: Generating high-quality INRs requires significant computational resources (nearly 1,000 GPU days for the dataset described here), making it difficult for the community to benchmark and iterate on new methods.
Suboptimal Tokenization: Current Vision Transformers (ViTs) rely on fixed, hand-crafted tokenization (e.g., uniform patching). This approach may not be optimal for continuous implicit functions, where the "best" locations to sample data might vary based on content.

2. Methodology

A. The Implicit-Zoo Dataset

The authors constructed a large-scale dataset comprising over 1.5 million implicit functions, trained over nearly 1,000 GPU days on the ETH Euler cluster.

2D Tasks:
- CIFAR-10: 60,000 images using a 3-layer SIREN (64 width).
- ImageNet-1K: 1.43 million images using a 4-layer SIREN (256 width).
- Cityscapes: 23,473 urban scenes using a 5-layer SIREN (256 width) for semantic segmentation.
3D Tasks:
- OmniObject3D: 5,914 objects (190 categories) using NeRF (4 layers, 128 width).
Quality Control: A rigorous three-phase training framework was employed. Initial training is followed by extended training for samples failing to reach a PSNR of 30 dB (an RGB RMSE of roughly 0.03 on a [0, 1] scale, i.e. an MSE of 0.001). A final check ensures all data meets this high-fidelity threshold.
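
To make the dataset entries and the quality gate above concrete, here is a minimal sketch of a SIREN-style MLP queried at a continuous coordinate, plus a PSNR check against the 30 dB threshold. It is an illustration under assumptions, not the authors' training code: the function names (make_siren, siren_query, psnr, passes_quality_gate), the layer sizes, the omega0 value, the zero biases, and the reading of "3-layer, 64 width" as [2, 64, 64, 3] are all hypothetical choices, and the weights are random rather than fitted to an image.

```python
import math
import random

def make_siren(widths, omega0=30.0, seed=0):
    """Random (untrained) weights for an MLP with layer sizes `widths`.

    Init bounds loosely follow the common SIREN scheme; treat them as an assumption.
    """
    rng = random.Random(seed)
    layers = []
    for i, (fan_in, fan_out) in enumerate(zip(widths[:-1], widths[1:])):
        bound = 1.0 / fan_in if i == 0 else math.sqrt(6.0 / fan_in) / omega0
        W = [[rng.uniform(-bound, bound) for _ in range(fan_in)]
             for _ in range(fan_out)]
        b = [0.0] * fan_out  # simplification: zero biases
        layers.append((W, b))
    return layers

def siren_query(layers, coord, omega0=30.0):
    """Evaluate the network at one real-valued coordinate (sine on hidden layers)."""
    h = list(coord)
    for i, (W, b) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, h)) + bi for row, bi in zip(W, b)]
        is_last = i == len(layers) - 1
        h = z if is_last else [math.sin(omega0 * v) for v in z]
    return h

def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio in dB for two equally long value lists."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

def passes_quality_gate(pred, target, threshold_db=30.0):
    """True if the reconstruction meets the PSNR threshold."""
    return psnr(pred, target) >= threshold_db

# Usage: the network accepts any coordinate, not only pixel centers.
net = make_siren([2, 64, 64, 3])
rgb = siren_query(net, (0.25, -0.5))  # three output channels
```

In a three-phase pipeline of the kind described, a check like passes_quality_gate would decide which samples are sent back for extended training and which are accepted in the final pass.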

B. Learnable Tokenization

Instead of using fixed grid patches, the authors propose Learnable Tokenization.

Concept: The coordinates $x$ at which the INR is queried to form a token are treated as learnable parameters, optimized jointly with the Transformer backbone.
Strategies:
- Learnable Centers (LC): The center of the patch is learnable, but the relative grid structure is preserved.
- Learnable Pixels (LP): Every pixel coordinate within a token is learnable.
- Regularization: To prevent coordinates from collapsing into local minima (clumping), a regularization term penalizes coordinates within the same token that are too close to each other.
Differentiable Augmentation: To enable backpropagation through the tokenizer during data augmentation, the authors implemented geometric transformations (rotation, shear, translation) directly in the weight space of the INR (modifying weights $W$ and bias $b$) rather than the input space, ensuring gradients flow correctly.
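
The Learnable Centers strategy and the spacing regularizer described above can be sketched as follows. This is only an illustration: the helper names (patch_offsets, lc_token_coords, spacing_penalty), the hinge-squared form of the penalty, and the min_dist value are assumptions, not the paper's exact formulation. No autodiff is used here; in a real setup the center (for LC) or every coordinate (for LP) would be a trainable parameter updated by backpropagation.

```python
import math

def patch_offsets(patch=4, pixel=1.0):
    """Fixed relative grid of a patch, centered on (0, 0)."""
    half = (patch - 1) / 2.0
    return [((i - half) * pixel, (j - half) * pixel)
            for i in range(patch) for j in range(patch)]

def lc_token_coords(center, offsets):
    """Learnable Centers: only `center` would be trained; the grid stays rigid.

    Under Learnable Pixels, each returned coordinate would instead be its own
    trainable parameter, initialized from this grid.
    """
    cx, cy = center
    return [(cx + dx, cy + dy) for dx, dy in offsets]

def spacing_penalty(coords, min_dist=0.5):
    """Hinge-squared penalty on pairs within one token closer than min_dist."""
    total = 0.0
    for a in range(len(coords)):
        for b in range(a + 1, len(coords)):
            d = math.dist(coords[a], coords[b])
            total += max(0.0, min_dist - d) ** 2
    return total

# Usage: fractional centers are fine because the INR accepts any coordinate.
offsets = patch_offsets(patch=2)
token = lc_token_coords((10.3, 4.7), offsets)
spread = spacing_penalty(token)                       # grid points are >= 1 apart
clumped = spacing_penalty([(0.0, 0.0), (0.1, 0.0)])   # two points 0.1 apart
```

A regular grid incurs no penalty, while coordinates that drift on top of each other do, which is the behavior the regularization term is meant to encourage.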

C. Benchmark Tasks

The dataset was used to benchmark three specific tasks:

Image Classification: Training ViTs on INR representations of CIFAR-10 and ImageNet-100.
Semantic Segmentation: Pixel-wise prediction on Cityscapes-INRs.
3D Pose Regression: A novel task where a Transformer regresses the 6-DoF camera pose of a 2D image relative to a 3D scene represented by a pre-trained NeRF. This involves extracting volume features from the NeRF and fusing them with 2D image features.

3. Key Contributions

Implicit-Zoo Dataset: The creation of the largest public dataset of implicit functions to date, covering diverse 2D and 3D tasks with verified high quality (PSNR > 30 dB).
Learnable Tokenization Framework: A novel approach allowing networks to learn optimal sampling locations directly from data, outperforming fixed patching strategies.
3D Pose Regression Benchmark: The establishment of the first benchmark for direct 3D pose regression using trained INRs, achieving competitive results (e.g., ~20° rotational error on unseen scenes) without relying on pose priors.
Differentiable Augmentation: A method to apply geometric augmentations to INRs in a differentiable manner, crucial for training learnable tokenizers.
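
One way to see why geometric augmentation can live in weight space: if the first INR layer computes $W x + b$, then feeding it an affinely transformed coordinate $A x + t$ gives $W(Ax + t) + b = (WA)x + (Wt + b)$, so the transform can be folded into new first-layer parameters. The sketch below checks this identity numerically. The function names and the example numbers are hypothetical, and the authors' actual implementation may differ in details such as the direction of the transform.

```python
import math

def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Matrix-matrix product for nested lists (A is n x k, B is k x m)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def rotation(theta):
    """2D rotation matrix for an angle in radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def fold_affine_into_first_layer(W, b, A, t):
    """Return (W', b') with W' x + b' == W (A x + t) + b for every x."""
    W_new = matmul(W, A)
    b_new = [bi + wt for bi, wt in zip(b, matvec(W, t))]
    return W_new, b_new

# Usage: compare transforming the input against transforming the weights.
W = [[0.5, -1.0], [2.0, 0.3], [-0.7, 0.9]]   # toy first layer, 2 inputs -> 3 units
b = [0.1, 0.0, -0.2]
A, t = rotation(0.3), [0.1, -0.2]
x = [0.4, 0.7]

x_aug = [ai + ti for ai, ti in zip(matvec(A, x), t)]
via_input = [zi + bi for zi, bi in zip(matvec(W, x_aug), b)]

W2, b2 = fold_affine_into_first_layer(W, b, A, t)
via_weights = [zi + bi for zi, bi in zip(matvec(W2, x), b2)]
```

Because the two routes agree, the query coordinates themselves are left untouched by the augmentation, which is convenient when those coordinates are the parameters being learned.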

4. Experimental Results

Classification (CIFAR-10):
- Baseline ViT accuracy: 80.82%.
- ViT + Learnable Centers (LC): 81.33% (+0.51 percentage points).
- ViT + Learnable Pixels + Regularization (LP+Reg): 81.57% (+0.75 percentage points).
- Insight: Learnable tokenization allows the model to focus on semantically important regions, improving accuracy. Random initialization of pixels performed poorly, highlighting the need for structured learning.
Segmentation (Cityscapes):
- Using learnable tokenizers with MiT-B0 improved mIoU from 39.95% (baseline) to 40.61% (LP+Reg), a gain of 0.66 percentage points.
- The learnable approach helps resolve misalignment between input pixels and supervised labels by aggregating relevant local information via self-attention.
3D Pose Regression (OmniObject3D):
- Seen Scenes: Achieved a rotational error (RE) of 14.17° (with pre-training and refinement).
- Unseen Scenes: Achieved an RE of 20.02°, with nearly 80% of poses having errors below 30°.
- Insight: Pre-training the volume encoder significantly boosts performance across all grouping strategies.
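
For readers unfamiliar with the metric, a rotational error like the ones reported above is commonly computed as the geodesic angle between the predicted and ground-truth rotation matrices, $\arccos\big((\mathrm{tr}(R_{pred}^\top R_{gt}) - 1)/2\big)$. The sketch below assumes that standard definition; whether the paper uses exactly this formula, and whether the reported figures are means or medians, is not stated in this summary. The helper names are hypothetical.

```python
import math

def rot_z(deg):
    """3x3 rotation about the z-axis by `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_pred^T R_gt) equals the elementwise product summed over all entries.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_angle))

def fraction_below(errors, threshold_deg=30.0):
    """Share of poses whose error is under the threshold."""
    return sum(e < threshold_deg for e in errors) / len(errors)

# Usage with made-up errors, purely to show the bookkeeping.
err = rotation_error_deg(rot_z(0.0), rot_z(20.0))
share = fraction_below([10.0, 25.0, 40.0, 29.0, 31.0])
```

A statement such as "nearly 80% of poses having errors below 30°" corresponds to fraction_below evaluated over all test poses.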

5. Significance and Future Impact

Unlocking Transformer Potential for INRs: The paper demonstrates that INRs are not just a representation tool but a viable data source for training large-scale models like Transformers, bridging the gap between continuous representations and discrete deep learning architectures.
New Research Avenues: The success of learnable tokenization suggests that fixed grid assumptions in computer vision may be suboptimal. This opens a new direction for "learnable sampling" in various continuous domains.
Community Resource: By releasing a massive, high-quality dataset, the authors lower the barrier to entry for INR research, enabling the community to focus on algorithmic improvements rather than data generation costs.
Limitations: The current approach is computationally expensive (small batch sizes due to INR querying) and struggles with symmetric objects in pose regression. Future work may involve symmetry-aware representations and more efficient querying mechanisms.

In conclusion, Implicit-Zoo serves as a foundational resource that supports the utility of large-scale implicit function datasets, and its results suggest that learning where to sample data (tokenization) deserves attention alongside learning what the data represents.