Imagine you have a giant, 360-degree movie of a bustling city. It's so high-definition that if you tried to print the whole thing out, it would cover the entire floor of a football stadium. This is what a 6K 360-degree video is like: massive, detailed, and incredibly heavy to store or stream.
Now, imagine you are wearing a Virtual Reality (VR) headset. You can only see a small square window in front of your eyes—maybe the size of a postcard. You look left, right, up, and down, but you never actually see the whole stadium-sized image at once. You only ever need that tiny "postcard" sized view.
The Old Problem: The "Whole Pizza" Approach
Traditional video compression methods (and even the previous best AI methods like HNeRV) work like this:
- They take that massive, stadium-sized video file.
- They try to reconstruct the entire stadium in your computer's memory, pixel by pixel, just to show you that one postcard-sized window.
- Only after the whole stadium is built do they cut out the tiny piece you are looking at.
The Analogy: It's like ordering a delivery of a 100-foot-long pizza just so you can eat one slice. You have to pay for the whole pizza, the delivery truck has to carry the whole thing, and your kitchen (your computer's memory) has to be huge enough to hold it. If you try to do this on a small laptop, the kitchen explodes (the computer crashes), and it takes forever to get your slice.
The New Solution: NeRV360
The authors of this paper, NeRV360, came up with a clever trick. Instead of building the whole stadium, they built a system that only builds the slice you are looking at.
Here is how they did it, using simple metaphors:
1. The "Magic Map" (The Embedding)
Instead of storing the video as a giant image, they compress it into a tiny, dense "magic map" (called an embedding). Think of this map not as a picture, but as a recipe book for the entire city. It doesn't show the buildings; it just contains the instructions on how to build them.
2. The "Smart Chef" (The Viewport Decoder)
In the old method, the chef would read the recipe book, build the whole city, and then hand you a slice.
In NeRV360, the chef is smarter. You tell the chef, "I want to see the view looking North at 2:00 PM."
The chef looks at the recipe book, skips the parts about the South and the East, and only cooks the specific North-facing window you asked for.
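The "smart chef" idea boils down to an access pattern: instead of reconstructing the full frame and cropping, the decoder maps the viewing direction to a small pixel window and computes only those pixels. Here is a toy sketch in plain Python; `decode_pixel` is a hypothetical stand-in for NeRV360's learned neural decoder, and all sizes are made-up illustrative numbers:

```python
import math

# Hypothetical stand-in for a learned decoder: any function that can
# produce the pixel at (row, col, time) directly, without needing the
# rest of the frame to exist in memory first.
def decode_pixel(row, col, t, full_h=3072, full_w=6144):
    # synthetic content, just so the sketch runs
    return math.sin(row / full_h * math.pi) * math.cos(col / full_w * 2 * math.pi + t)

def decode_viewport(yaw_deg, pitch_deg, t, vp_h=96, vp_w=128,
                    full_h=3072, full_w=6144):
    """Map the viewing direction to a window in the equirectangular
    frame, then decode only the pixels inside that window."""
    center_row = int((0.5 - pitch_deg / 180.0) * full_h)
    center_col = int(((yaw_deg % 360.0) / 360.0) * full_w)
    top, left = center_row - vp_h // 2, center_col - vp_w // 2
    return [[decode_pixel(top + r, (left + c) % full_w, t)  # wrap around longitude
             for c in range(vp_w)]
            for r in range(vp_h)]

view = decode_viewport(yaw_deg=0.0, pitch_deg=0.0, t=0.0)
# Only vp_h * vp_w pixels are ever materialized, not full_h * full_w.
```

The memory saving falls out of the loop bounds: the cost scales with the postcard, not the stadium.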
3. The "Special Lens" (The STAT Module)
To make this work, the system needs to know exactly where you are looking (latitude and longitude) and what time it is in the video.
The researchers created a special tool called STAT (Spatio-Temporal-Aware Transform).
- Analogy: Imagine the recipe book has a magical lens attached to it. When you turn the lens to "North," the book automatically rearranges its instructions to only show you how to build the North side. When you turn it to "South," it instantly switches. This happens instantly, without ever building the rest of the city.
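The "magical lens" can be pictured as a conditioning step: the same stored features are reshaped differently depending on where you look and when. The sketch below is purely illustrative; in NeRV360 the scale and shift would come from a learned network, while here they are fixed trigonometric functions of the viewing coordinates, and the function name is hypothetical:

```python
import math

def stat_modulate(features, lat_deg, lon_deg, t):
    """Toy spatio-temporal-aware transform: derive a per-channel scale
    and shift from (latitude, longitude, time) and apply them to the
    feature vector, so one embedding yields direction-specific features."""
    scale = 1.0 + 0.5 * math.sin(math.radians(lat_deg))
    shift = 0.5 * math.cos(math.radians(lon_deg)) + 0.1 * t
    return [f * scale + shift for f in features]

feats = [0.0, 1.0, -1.0]          # the same stored "recipe book" features
north = stat_modulate(feats, lat_deg=0.0, lon_deg=0.0, t=0.0)
south = stat_modulate(feats, lat_deg=0.0, lon_deg=180.0, t=0.0)
# Same embedding, different outputs for different viewing directions.
```

Turning the "lens" is just changing the conditioning inputs; nothing outside the requested view is ever computed.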
4. The "Extra Ingredients" (Channel Expansion)
There was a small snag: when you zoom in on a tiny part of a compressed map, it can get blurry (like zooming in on a low-res photo).
To fix this, NeRV360 adds a "channel expansion layer."
- Analogy: Before the chef starts cooking the specific slice, they take the basic ingredients and multiply them to create a richer, more detailed mix. This ensures that even though they are only cooking a small slice, the flavor (image quality) is just as rich as if they had cooked the whole pizza.
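In code terms, "multiplying the ingredients" means widening the feature vector before decoding. This sketch only shows the shape change; a real channel-expansion layer would use learned weights, and the replication-with-offsets here is an invented placeholder:

```python
def expand_channels(features, k=4):
    """Toy channel-expansion layer: widen a C-channel feature vector to
    k*C channels before the viewport decode, giving the decoder enough
    capacity to keep the small view sharp. (Learned weights in practice;
    replication with small offsets here, purely for illustration.)"""
    return [f + 0.01 * i for f in features for i in range(k)]

x = expand_channels([0.5, -0.5], k=4)
# 2 channels in, 8 channels out.
```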
Why This Matters (The Results)
The paper tested this on huge 6K videos and found amazing results:
- Memory: It uses 7 times less memory. You can now run this on a standard gaming laptop or a consumer graphics card, whereas before you needed a supercomputer.
- Speed: It decodes 2.5 times faster. You can watch the video in real-time without lag.
- Quality: Surprisingly, the image quality is actually better than the old methods because the system focuses all its computing power on the part you are actually seeing.
The Bottom Line
NeRV360 changes the game by realizing that for VR and 360-degree videos, we don't need to see the whole world to enjoy the view. By teaching the AI to only "dream" the part of the video you are looking at, they made high-quality, ultra-high-resolution VR possible on devices that fit in your pocket. It's the difference between trying to carry the whole ocean in a bucket and just scooping out the water you need to drink.