CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering

Imagine you are trying to send a massive, high-definition 3D movie of a city to a friend on a phone with a slow internet connection.

The Problem:
Traditional methods for creating these 3D views (like NeRF or 3D Gaussian Splatting) are like trying to send the entire city's blueprint, every single brick, and every texture map. It's huge, takes forever to download, and if your friend wants to zoom in or look from a new angle, the computer has to crunch through all that heavy data every time. It's like trying to carry a library in your backpack just to read one page.

The Solution: CLiFT (Compressive Light-Field Tokens)
The paper introduces a new way to handle this called CLiFT. Think of CLiFT not as a blueprint, but as a smart, compressed "highlight reel" of the scene.

Here is how it works, using a few simple analogies:

1. The "Light Field" (The Raw Data)

Imagine a scene is a giant jar filled with millions of tiny, glowing fireflies. Each firefly represents a single ray of light traveling from an object to your eye. To see the scene perfectly, you need to know the position, direction, and color of every single firefly. That's the "Light Field." It's beautiful but overwhelming.

2. The "Tokenization" (Taking a Snapshot)

First, the system looks at all the photos you took of the scene. Instead of keeping every single pixel, it turns the photos into a giant list of "tokens." Think of these tokens as postcards. Each postcard describes a tiny patch of the scene (a wall, a tree, a person) and the angle you saw it from.

3. The "Smart Sort" (Latent K-Means)

Now, you have 10,000 postcards. You don't need all of them to remember the scene.

The Old Way: You might just pick postcards randomly. You might end up with 50 postcards of the same boring gray wall (redundancy) and zero postcards of the interesting cat on the roof.
The CLiFT Way: The system uses a smart algorithm (Latent K-Means) to act like a curator. It looks at all the postcards and groups them.
- It says, "These 500 postcards are all of that gray wall; we only need one representative postcard for that."
- It says, "These 10 postcards are of the cat, the window, and the tree; we need all of these because they are unique."
- It picks the postcard closest to each group's "average" to be a Centroid (the leader of the group).

4. The "Condenser" (Compressing the Info)

This is the magic step. The system takes the information from the 500 gray-wall postcards and compresses it into the single "Centroid" postcard. It's like writing a summary of a whole book on a single index card. Now, instead of 10,000 postcards, you have a tiny, efficient stack of maybe 1,000 "Super Postcards" (the CLiFTs) that hold all the essential geometry and color information.

5. The "Adaptive Renderer" (The Flexible Viewer)

This is where CLiFT shines. Imagine you are the viewer.

Scenario A (Slow Internet): You tell the system, "I only have a tiny bit of data allowance." The system grabs just 50 of the closest, most relevant Super Postcards and builds a quick, slightly lower-quality image. It's fast and cheap.
Scenario B (High-Speed Connection): You say, "I want 4K quality!" The system grabs the entire stack of Super Postcards, including the ones from far away, and builds a stunning, highly detailed image.

The best part? It's the same trained brain doing both jobs. You don't need a different model for low quality and high quality. It just adjusts how many "tokens" (postcards) it uses on the fly.

Why is this a big deal?

Storage: It shrinks the data size of a 3D scene by 5 to 7 times compared to recent methods such as MVSplat and DepthSplat, at comparable visual quality.
Speed: Because it can choose to use fewer tokens, it can render scenes much faster on weaker devices (like phones or VR headsets).
Flexibility: It allows for a "trade-off." If you are in a hurry, you get a fast, good-enough view. If you have time, you get the highest-quality view the model can produce.

In Summary:
CLiFT is like a smart travel guide. Instead of giving you the entire encyclopedia of a city, it gives you a condensed list of the most important landmarks (tokens). If you have 5 minutes, it shows you the top 3 spots. If you have 5 hours, it shows you the top 50. It saves space, saves time, and still lets you see the world clearly.

1. Problem Definition

The paper addresses the growing demand for efficient storage and bandwidth in visual media, specifically for Novel View Synthesis (NVS). While existing methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) offer high-quality rendering, they often suffer from:

High Storage Costs: Requiring per-scene optimization or large sets of explicit primitives.
Inflexibility: Most models are trained for a fixed data size or require separate models for different quality/speed trade-offs.
Reconstruction Limitations: Reconstruction-based methods struggle with scene dynamics and fine-grained details, while reconstruction-free methods (like LVSM) often lack control over computational budget during inference.

The goal is to develop a reconstruction-free NVS framework that represents a scene as a compact set of tokens, allowing for adaptive rendering where the user can trade off data size, rendering quality, and speed on the fly using a single trained model.

2. Methodology: CLiFT Framework

The core innovation is Compressive Light-Field Tokens (CLiFTs), a compact set of light-field rays with learned latent embeddings. The framework operates in two phases: Construction (training/encoding) and Rendering (inference).

A. CLiFT Construction (Training Phase)

The process involves three sequential steps to compress multi-view images into a scene representation:

Multi-view Encoding:
- Input: $N_c$ images with camera poses.
- Mechanism: For each pixel, the system concatenates the 6D Plücker coordinates of its ray (representing ray geometry) with the normalized RGB color. These per-pixel features are patchified, projected into tokens, and processed by a Transformer encoder.
- Output: Light-Field Tokens (LiFTs), which capture both geometry and appearance.
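The per-pixel input described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a pinhole camera with intrinsics `K` and a camera-to-world pose, uses the common Plücker convention (d, o × d) with unit direction d, and the function names `plucker_rays` and `patchify` are hypothetical. The learned projection and the Transformer encoder are omitted.

```python
import numpy as np

def plucker_rays(K, cam_to_world, height, width):
    """Return an (H, W, 6) array of Plücker coordinates (d, o x d) per pixel."""
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # (H, W, 3)
    # Back-project to camera-space directions, then rotate into world space.
    dirs_cam = pix @ np.linalg.inv(K).T
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs = dirs_cam @ R.T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)     # unit directions
    origin = np.broadcast_to(t, dirs.shape)                  # camera center per pixel
    moment = np.cross(origin, dirs)                          # o x d
    return np.concatenate([dirs, moment], axis=-1)

def patchify(features, patch):
    """Split (H, W, C) per-pixel features into (num_patches, patch * patch * C)."""
    h, w, c = features.shape
    x = features.reshape(h // patch, patch, w // patch, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

# Toy example (assumed values): one 16x16 image split into 8x8 patches.
K = np.array([[20.0, 0.0, 8.0], [0.0, 20.0, 8.0], [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [1.0, 2.0, 3.0]
rgb = np.random.default_rng(0).random((16, 16, 3))   # colors already in [0, 1]
rays = plucker_rays(K, pose, 16, 16)
per_pixel = np.concatenate([rays, rgb], axis=-1)      # 6 ray + 3 color channels
patch_inputs = patchify(per_pixel, 8)                 # one row per patch
```

Each row of `patch_inputs` is what a learned projection would map to a single token before the encoder runs.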
Latent-space K-means (Ray Selection):
- To avoid redundancy (e.g., spending many tokens on textureless regions, as uniform sampling would), the system performs K-means clustering in the latent space of the LiFTs.
- Centroids: The LiFT nearest to each cluster center is retained as a Centroid LiFT.
- Adaptive Density: This ensures tokens are denser in texture-rich regions and sparser in homogeneous areas, preserving geometric diversity while reducing the token count to $N_s$.
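A minimal sketch of this selection step, assuming plain Lloyd's K-means on the token embeddings (the summary does not say which K-means variant or initialization the paper uses). The function name `select_centroid_tokens` and all sizes are hypothetical.

```python
import numpy as np

def select_centroid_tokens(tokens, num_clusters, num_iters=20, seed=0):
    """Cluster tokens with Lloyd's K-means; return (centroid_indices, assignments)."""
    rng = np.random.default_rng(seed)
    # Initialize centers from randomly chosen tokens.
    centers = tokens[rng.choice(len(tokens), num_clusters, replace=False)].copy()
    for _ in range(num_iters):
        # Assign each token to its nearest center.
        dists = np.linalg.norm(tokens[:, None, :] - centers[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Move each center to the mean of its members (leave empty clusters as-is).
        for k in range(num_clusters):
            members = tokens[assign == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    # Final assignment against the updated centers.
    dists = np.linalg.norm(tokens[:, None, :] - centers[None, :, :], axis=-1)
    assign = dists.argmin(axis=1)
    # Keep the real token nearest to each center, skipping empty clusters.
    centroid_idx = []
    for k in range(num_clusters):
        member_ids = np.flatnonzero(assign == k)
        if member_ids.size:
            centroid_idx.append(member_ids[dists[member_ids, k].argmin()])
    return np.array(centroid_idx), assign

# Toy example: 256 random "LiFTs" of dimension 16, reduced to at most 8 centroids.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(256, 16))
centroid_idx, assign = select_centroid_tokens(tokens, num_clusters=8)
centroid_tokens = tokens[centroid_idx]
```

Retaining an actual token rather than the cluster mean means every kept token still corresponds to a real ray from the input views.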
Neural Condensation:
- A lightweight Transformer "condenser" compresses the information from all original LiFTs into the selected Centroid LiFTs.
- Mechanism: It uses inter-cluster self-attention (to exchange info between clusters) and intra-cluster cross-attention (where the centroid queries its cluster members).
- Residual Connection: A zero-initialized linear layer aggregates features back into the centroid. Because the layer starts at zero, the condenser initially leaves each centroid unchanged, which preserves the pretrained latent space.
- Result: The final output is the CLiFTs, a compressed scene representation.
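The intra-cluster cross-attention and the zero-initialized residual can be illustrated with a small sketch. It assumes a single attention head and omits the inter-cluster self-attention, normalization, and feed-forward layers that a full Transformer block would contain. The function `condense` and the weight names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def condense(tokens, centroid_idx, assign, w_q, w_k, w_v, w_out):
    """Each centroid attends to its own cluster, then receives a residual update."""
    centroids = tokens[centroid_idx]
    dim = w_q.shape[1]
    out = centroids.copy()
    for row, idx in enumerate(centroid_idx):
        members = tokens[assign == assign[idx]]            # cluster members (incl. centroid)
        q = centroids[row] @ w_q                            # query from the centroid
        k, v = members @ w_k, members @ w_v                 # keys/values from members
        attn = softmax(k @ q / np.sqrt(dim))                # attention weights over members
        out[row] = centroids[row] + (attn @ v) @ w_out      # residual connection
    return out

# Toy example: three clusters of four tokens each.
rng = np.random.default_rng(0)
dim = 4
tokens = rng.normal(size=(12, dim))
assign = np.repeat([0, 1, 2], 4)
centroid_idx = np.array([0, 4, 8])
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

# With a zero-initialized output layer the centroids pass through unchanged.
at_init = condense(tokens, centroid_idx, assign, w_q, w_k, w_v, np.zeros((dim, dim)))
# A non-zero output layer (here the identity, standing in for trained weights) changes them.
after_update = condense(tokens, centroid_idx, assign, w_q, w_k, w_v, np.eye(dim))
```

The first call shows why zero initialization is a safe starting point: training begins from the unmodified centroid tokens and only gradually mixes in information from the rest of each cluster.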

B. CLiFT Rendering (Inference Phase)

The system supports compute-adaptive rendering, allowing the user to specify a "compute budget" (number of tokens to use, $N_r$ ).

Token Selection:
- Given a target view, the system divides the view into a grid of patches.
- It casts rays through patch centers and retrieves the $N_r$ closest CLiFTs from the pool of $N_s$ stored tokens, based on ray origin/direction heuristics.
- This ensures spatial coverage of the target view without needing the entire scene representation.
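One way such a retrieval could look is sketched below. The scoring rule is an assumption made for illustration: the summary only says the heuristic uses ray origins and directions, so the sum of origin distance and angular mismatch, the `angle_weight` parameter, and the function name `select_tokens` are all hypothetical. Directions are assumed to be unit vectors.

```python
import numpy as np

def select_tokens(token_origins, token_dirs, target_origin, target_dirs,
                  num_render, angle_weight=1.0):
    """Return indices of the num_render stored tokens scored closest to the target view."""
    # Distance from each token's ray origin to the target camera center.
    origin_dist = np.linalg.norm(token_origins - target_origin, axis=-1)   # (Ns,)
    # Angular mismatch to the best-matching target patch ray: 1 - cos(angle).
    cos = token_dirs @ target_dirs.T                                        # (Ns, P)
    angle_dist = 1.0 - cos.max(axis=1)                                      # (Ns,)
    score = origin_dist + angle_weight * angle_dist                         # assumed heuristic
    num_render = min(num_render, len(score))
    return np.argsort(score)[:num_render]

# Toy example: four stored tokens, one target patch ray looking along +z.
token_origins = np.array([[0.0, 0.0, 0.0],
                          [0.0, 0.0, 0.0],
                          [5.0, 0.0, 0.0],
                          [0.1, 0.0, 0.0]])
token_dirs = np.array([[0.0, 0.0, 1.0],
                       [0.0, 0.0, -1.0],
                       [0.0, 0.0, 1.0],
                       [0.0, 0.0, 1.0]])
target_origin = np.zeros(3)
target_dirs = np.array([[0.0, 0.0, 1.0]])

picked = select_tokens(token_origins, token_dirs, target_origin, target_dirs, num_render=2)
everything = select_tokens(token_origins, token_dirs, target_origin, target_dirs, num_render=10)
```

In this toy setup the two tokens that start near the target camera and point the same way are kept, while the far-away token and the one facing backwards are dropped.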
Neural Renderer:
- A Transformer decoder synthesizes the image.
- Query: The target view's Plücker coordinates.
- Keys/Values: The selected $N_r$ CLiFTs.
- Training Strategy: The renderer is trained with randomly varying token counts ( $N_r$ ), enabling it to learn to handle different computational budgets dynamically.
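The query/key/value roles and the varying token budget can be sketched as a single cross-attention layer. This only illustrates the forward pass: there is no loss or gradient step, the random token choice stands in for the view-based selection, and the weights, sizes, and budget range are assumed values rather than the paper's settings.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, tokens, w_q, w_k, w_v):
    """Target-ray queries attend over however many scene tokens are supplied."""
    q, k, v = queries @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (num_patches, num_render)
    return attn @ v                                    # (num_patches, dim)

rng = np.random.default_rng(0)
dim, num_patches, num_stored = 8, 16, 64
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
stored = rng.normal(size=(num_stored, dim))            # stands in for the stored CLiFTs
queries = rng.normal(size=(num_patches, dim))          # stands in for embedded target rays

budgets, outputs = [], []
for step in range(4):
    # Sample a different token budget at each step, as in training with varying N_r.
    num_render = int(rng.integers(8, num_stored + 1))
    chosen = rng.choice(num_stored, size=num_render, replace=False)  # placeholder selection
    budgets.append(num_render)
    outputs.append(cross_attention(queries, stored[chosen], w_q, w_k, w_v))
```

The output shape depends only on the number of target patches, not on the budget, which is what lets one set of weights serve every value of $N_r$.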

3. Key Contributions

CLiFT Representation: A novel scene representation that compresses light-field rays into a variable-size set of tokens, retaining rich geometric and appearance information.
Compute-Adaptive Rendering: A single trained model capable of rendering novel views at varying quality and speed levels by simply adjusting the number of tokens ( $N_r$ ) used during inference.
Efficient Compression Pipeline: A three-stage process (Encoding $\to$ Latent K-means Selection $\to$ Neural Condensation) that significantly reduces data size while maintaining high fidelity.
Reconstruction-Free Approach: Unlike 3DGS or NeRF, CLiFT does not require per-scene optimization or explicit geometry reconstruction, making it suitable for dynamic scenes and direct feed-forward synthesis.

4. Experimental Results

The method was evaluated on RealEstate10K and DL3DV datasets, comparing against:

Reconstruction-free: LVSM (Large View Synthesis Model).
Reconstruction-based: MVSplat and DepthSplat.

Key Findings:

Data Reduction: CLiFT achieves comparable rendering quality with a 5–7$\times$ smaller data size than MVSplat/DepthSplat and a 1.8$\times$ smaller data size than LVSM.
Quality: It achieves the highest overall PSNR among all methods while using significantly fewer tokens.
Adaptability:
- Quality vs. Speed: As shown in Table 2, reducing the render token count from 4096 to 512 decreases FLOPs by 36% and increases FPS by 66%, at the cost of a PSNR drop of approximately 2.83 dB.
- Ablation Studies: Removing the K-means clustering or the Neural Condenser significantly degrades performance, especially at high compression rates, which supports the need for both adaptive token selection and information condensation.
Qualitative Results: CLiFT preserves sharp details and high-frequency content better than baselines under strong compression.
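Taking the Quality vs. Speed figures quoted under Adaptability at face value, the trade-off can be written out as simple ratios. This is a back-of-the-envelope reading of the reported numbers, not an additional result from the paper:

$$
\frac{N_r^{\text{high}}}{N_r^{\text{low}}} = \frac{4096}{512} = 8, \qquad
\frac{\text{FLOPs}_{512}}{\text{FLOPs}_{4096}} = 1 - 0.36 = 0.64, \qquad
\frac{\text{FPS}_{512}}{\text{FPS}_{4096}} = 1 + 0.66 = 1.66
$$

An 8$\times$ cut in render tokens therefore corresponds to about a 1.56$\times$ reduction in FLOPs ($1 / 0.64 \approx 1.56$), which suggests that a sizable share of the rendering cost does not scale with $N_r$.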

5. Significance and Future Directions

Significance:
CLiFT bridges the gap between high-fidelity neural rendering and practical deployment constraints. By decoupling the scene representation size from the rendering compute budget, it enables dynamic quality adjustment suitable for diverse devices (e.g., VR headsets vs. mobile phones) and network conditions. It represents a shift towards on-demand rendering where resources are allocated only where necessary.

Limitations & Future Work:

Motion Generalization: The model struggles with camera motions that deviate significantly from the training distribution (e.g., complex rotations vs. smooth translations).
Occlusion Handling: In large scenes where target views are not covered by input images, renderings can become blurry.
Future Direction: The authors suggest incorporating generative priors to improve rendering in unseen or occluded areas, potentially combining the efficiency of CLiFT with the hallucination capabilities of generative models.

In summary, CLiFT offers an efficient, flexible, and high-quality approach to novel view synthesis that is well suited to adaptive neural rendering in resource-constrained environments.