SRNeRV: A Scale-wise Recursive Framework for Neural Video Representation

Imagine you are trying to send a massive, high-definition video of a busy city street to a friend, but your internet connection is very slow. You need to shrink the file size without making the video look blurry or pixelated.

For decades, engineers have compressed video with "hand-crafted" codecs such as H.265. But recently, a new approach called Implicit Neural Representations (INRs) has emerged. Instead of storing a list of pixels, INRs teach a small computer program (a neural network) to "remember" the video. When you want to watch it, the program runs and "draws" the video frame by frame.

The problem? Existing designs often carry far more weights than they need. They learn every detail of the video from scratch at every level of zoom, which is like hiring a different artist to draw the same building at 10 different sizes, even though the building looks the same at all those sizes.

Enter SRNeRV, a new method that fixes this waste. Here is how it works, explained with simple analogies:

1. The Problem: The "Stack of Independent Chefs"

Imagine you are baking a giant, multi-layered cake.

Old Method (Stacked INRs): You hire a different chef for every single layer of the cake. Chef A makes the bottom layer, Chef B makes the middle, and Chef C makes the top. Even though they are all making cake, they each have their own full set of expensive tools and ingredients. It's redundant and expensive.
The Insight: In reality, the logic for making a cake layer is very similar whether it's the bottom or the top. The shape of the layer might change (it gets wider or narrower), but the recipe for mixing the batter is the same.

2. The Solution: The "Smart Recursive Chef" (SRNeRV)

The authors of this paper created a framework called SRNeRV. Instead of hiring new chefs for every layer, they use one master chef who works recursively (repeatedly).

They split the chef's job into two parts:

The "Shape Shifter" (Spatial Mixing): This part handles the specific shape of the current layer. Is it a tiny circle? A wide square? This part is unique for every layer because every layer looks different.
The "Flavor Master" (Channel Mixing): This part handles the complex mixing of ingredients (the "flavor" or data features). This logic is the same whether you are making a tiny layer or a huge one.

The Magic Trick:
SRNeRV hires a different "Shape Shifter" for every layer, but it uses the exact same "Flavor Master" for every single layer.

Think of it like a music producer:

Every song (video scale) needs a unique drum beat (Spatial Mixing) to fit the rhythm.
But the mixing board that balances the vocals and instruments (Channel Mixing) can be the exact same machine for every song.
By reusing the expensive mixing board over and over, you save a massive amount of money (network parameters) without losing quality.

3. How It Works in Practice

Start Small: The system starts with a tiny, blurry sketch of the video.
The Loop: It runs this sketch through the "Flavor Master" (shared) and a "Shape Shifter" (specific) to make it bigger and clearer.
Repeat: It takes that slightly bigger version and runs it through the same "Flavor Master" and a new "Shape Shifter" to make it even bigger.
Result: It keeps doing this until the video is full resolution.

Why Is This a Big Deal?

Tiny File Size: Because they reuse the "Flavor Master" (which contains most of the complex math), the final file size is much smaller. It's like sending one instruction manual for the mixing board instead of 10 different ones.
Better Quality: Because they saved space by reusing the mixing board, they have more "budget" to hire specialized "Shape Shifters" for the tricky parts of the video (like fast-moving cars or text on a screen).
The Sweet Spot: This works especially well for videos with simple backgrounds (like a news anchor talking) or screen content (like a PowerPoint presentation), where the "rules" of the image don't change much as you zoom in.

The Bottom Line

SRNeRV is like realizing that you don't need a new car engine for every gear in your transmission. You just need one great engine (the shared module) and different gears (the specific modules) to handle the speed. This makes the whole system smaller and more parameter-efficient, allowing us to send high-quality videos over the internet with much less data.

Here is a detailed technical summary of the paper "SRNeRV: A Scale-wise Recursive Framework for Neural Video Representation".

1. Problem Statement

Implicit Neural Representations (INRs) have emerged as a promising paradigm for video compression, representing signals as continuous functions parameterized by neural networks rather than discrete pixel grids. However, existing multi-scale INR generators for video typically employ a stacked architecture, where independent processing blocks are instantiated for each resolution scale (from low to high).

Key Issues Identified:

Parameter Redundancy: Stacking independent blocks for every scale leads to a massive number of parameters, many of which are redundant.
Ignored Self-Similarity: These designs overlook the inherent scale self-similarity in the generative process. The logic required to refine features from a lower scale to a higher scale is conceptually repetitive, yet current models treat each scale as a unique, independent task.
Inefficiency: The lack of parameter sharing limits the scalability and efficiency of INR-based video codecs, especially when compared to the potential compactness of the INR paradigm.

2. Methodology: SRNeRV

The authors propose SRNeRV, a novel framework that replaces the stacked design with a scale-wise recursive framework based on a hybrid parameter sharing scheme.

Core Concept: Scale Self-Similarity

Inspired by the Laplacian pyramid and the repetitive nature of generative mapping, SRNeRV posits that the transformation from low-resolution features to high-resolution features follows a similar logic across different scales. Instead of learning distinct weights for every scale, the model should reuse functional blocks.

Architecture Design

The framework decouples the standard refinement block (typically found in models like HiNeRV or ConvNeXt) into two distinct functional modules:

Scale-Specific Spatial Mixing Module ( $f_{SM}$ ):
- Function: Aggregates local spatial information (e.g., via depthwise convolution).
- Sharing Strategy: Not Shared. Parameters ( $\theta_{SM}$ ) are unique for each scale ( $i$ ) and block position ( $j$ ).
- Rationale: Different resolutions require specific spatial filters to capture unique patterns (e.g., edges at low res vs. fine textures at high res).
Scale-Invariant Channel Mixing Module ( $f_{CM}$ ):
- Function: Performs feature transformation and channel mixing (e.g., via a Feedforward Network/FFN).
- Sharing Strategy: Shared. Parameters ( $\theta_{CM}$ ) are shared across all upsampling scales.
- Rationale: The abstract logic of transforming feature channels is reusable regardless of spatial resolution. Since FFNs typically contain the majority of a network's parameters, sharing this module drastically reduces the model size.
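
The effect of the hybrid sharing scheme on model size can be illustrated with a rough parameter count. This is a minimal sketch under assumptions that are not taken from the paper: every scale uses the same channel width (a simplification that makes sharing straightforward), the spatial mixer is a depthwise k x k convolution, the channel mixer is a two-layer FFN with an expansion factor, and one channel mixer per block position is reused at every scale. The function names and the numbers (4 scales, 2 blocks, 64 channels) are hypothetical.

```python
def depthwise_conv_params(channels: int, kernel: int) -> int:
    """Weights plus biases of a depthwise k x k convolution."""
    return channels * kernel * kernel + channels


def ffn_params(channels: int, expansion: int) -> int:
    """Weights plus biases of a two-layer FFN (C -> expansion*C -> C)."""
    hidden = channels * expansion
    return (channels * hidden + hidden) + (hidden * channels + channels)


def stacked_params(scales, blocks, channels, kernel, expansion):
    """Stacked design: every scale and block owns both modules."""
    per_block = depthwise_conv_params(channels, kernel) + ffn_params(channels, expansion)
    return scales * blocks * per_block


def hybrid_params(scales, blocks, channels, kernel, expansion):
    """Hybrid sharing: spatial mixers per (scale, block), channel mixers per block only."""
    spatial = scales * blocks * depthwise_conv_params(channels, kernel)
    channel = blocks * ffn_params(channels, expansion)  # reused at every scale (assumption)
    return spatial + channel


# Hypothetical configuration, chosen only for illustration.
stacked = stacked_params(scales=4, blocks=2, channels=64, kernel=3, expansion=4)
hybrid = hybrid_params(scales=4, blocks=2, channels=64, kernel=3, expansion=4)
print(stacked, hybrid)  # 269824 71296
print(f"hybrid / stacked = {hybrid / stacked:.3f}")
```

In this toy configuration the FFN accounts for 33,088 of the 33,728 parameters in one block, which is why sharing only the channel mixer already removes most of the stacked model's weights.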

Recursive Generation Process

The generation process (Algorithm 1 in the paper) is recursive:

Start with an initial low-scale feature grid.
Upsample the features.
Apply a sequence of SRNeRV-Blocks. Each block applies the unique spatial mixer followed by the shared channel mixer.
The output of one scale becomes the input for the next, repeating the process until the final high-resolution frame is generated.
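
The steps above can be sketched in NumPy. This is an illustrative sketch, not the authors' Algorithm 1: it assumes nearest-neighbour upsampling, a depthwise 3x3 convolution as the spatial mixer, a per-pixel two-layer FFN with ReLU as the channel mixer, one shared channel mixer per block position, and a hypothetical linear `to_rgb` head. Normalization, residual connections, and positional encodings are omitted, and all function and variable names are made up for the example.

```python
import numpy as np


def upsample_nearest(x, factor=2):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def spatial_mix(x, kernels):
    """Depthwise 3x3 convolution with zero padding; kernels has shape (C, 3, 3)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += kernels[:, dy, dx, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out


def channel_mix(x, w1, w2):
    """Per-pixel two-layer FFN with ReLU; w1 is (hidden, C), w2 is (C, hidden)."""
    hidden = np.maximum(np.einsum("hc,cyx->hyx", w1, x), 0.0)
    return np.einsum("ch,hyx->cyx", w2, hidden)


def generate(grid, spatial_params, shared_channel_params, to_rgb):
    """Grow a low-scale feature grid into a frame, one scale at a time."""
    x = grid
    for kernels_at_scale in spatial_params:  # one entry per scale i
        x = upsample_nearest(x)
        # Block j pairs a scale-specific spatial mixer with the channel
        # mixer that is reused at every scale.
        for kernels, (w1, w2) in zip(kernels_at_scale, shared_channel_params):
            x = channel_mix(spatial_mix(x, kernels), w1, w2)
    return np.einsum("oc,cyx->oyx", to_rgb, x)  # hypothetical RGB head


# Random weights stand in for a fitted model.
rng = np.random.default_rng(0)
C, HIDDEN, SCALES, BLOCKS = 8, 16, 3, 2
spatial_params = [[0.1 * rng.normal(size=(C, 3, 3)) for _ in range(BLOCKS)]
                  for _ in range(SCALES)]
shared_channel_params = [(0.1 * rng.normal(size=(HIDDEN, C)),
                          0.1 * rng.normal(size=(C, HIDDEN)))
                         for _ in range(BLOCKS)]
to_rgb = 0.1 * rng.normal(size=(3, C))
grid = rng.normal(size=(C, 4, 6))

frame = generate(grid, spatial_params, shared_channel_params, to_rgb)
print(frame.shape)  # (3, 32, 48): three 2x upsampling steps from a 4x6 grid
```

Note that `spatial_params` grows with the number of scales while `shared_channel_params` does not, which mirrors the hybrid sharing scheme described earlier.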

Compression Pipeline

The implementation follows the standard per-instance fitting paradigm (adapted from HiNeRV):

Training: Fit the network to the specific video sequence.
Quantization: Quantization-Aware Training (QAT) is applied.
Entropy Coding: The quantized weights (both shared and scale-specific) are serialized and compressed using an arithmetic coder.
Bitrate Calculation: The total bitrate is the sum of the entropy costs of the scale-specific spatial parameters and the shared channel parameters. The sharing of the channel parameters significantly reduces the second term.
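
In symbols, the bitrate described above is roughly $R = \sum_{i,j} R(\theta_{SM}^{(i,j)}) + R(\theta_{CM})$, where the summation notation is a paraphrase rather than the paper's own formula. The sketch below estimates both terms for a toy model. It makes several assumptions not found in the source: uniform scalar quantization stands in for the result of QAT, and the empirical zeroth-order entropy of the quantized symbols stands in for the output size of the arithmetic coder. All function names are hypothetical.

```python
import math
from collections import Counter


def quantize(weights, step):
    """Uniform scalar quantization to integer symbols (stand-in for QAT output)."""
    return [round(w / step) for w in weights]


def entropy_bits(symbols):
    """Total empirical zeroth-order entropy of a symbol sequence, in bits."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(n * math.log2(n / total) for n in counts.values())


def bits_per_pixel(spatial_symbols, channel_symbols, frames, height, width):
    """Sum of the two entropy terms, normalized by the number of video pixels."""
    total_bits = entropy_bits(spatial_symbols) + entropy_bits(channel_symbols)
    return total_bits / (frames * height * width)


# Toy weights, purely illustrative.
spatial = quantize([0.26, -0.04, 0.11, 0.32], step=0.1)
channel = quantize([0.02, -0.01, 0.03, 0.01], step=0.1)
print(spatial, channel)
print(bits_per_pixel(spatial, channel, frames=2, height=4, width=4))
```

Because the shared channel parameters are stored once rather than once per scale, the second term covers far fewer symbols than it would in a stacked design.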

3. Key Contributions

Theoretical Insight: First systematic analysis and exploitation of scale self-similarity within the INR generation process, extending the INR principle from coordinate-wise logic to multi-scale generative logic.
Novel Architecture: Introduction of SRNeRV, a highly compact recursive framework utilizing a hybrid sharing scheme that decouples spatial and channel mixing.
Performance Validation: Extensive experiments demonstrating that this design achieves superior rate-distortion performance, particularly in scenarios where INRs naturally excel.

4. Experimental Results

The authors evaluated SRNeRV on diverse datasets, including UVG, HEVC Class B (HD), HEVC Class E (video-conferencing content with largely static backgrounds), and Screen Content Coding (SCC) sequences.

Comparison: SRNeRV was benchmarked against traditional codecs (H.266/VVC) and state-of-the-art INR baselines (HNeRV, Boost-NeRV, HiNeRV).
Metrics: Performance was measured using BDBR (Bjontegaard Delta Bit-Rate), where negative values indicate bitrate savings for the same quality.
Key Findings:
- Overall Gain: SRNeRV consistently outperformed HiNeRV and other baselines across all datasets.
- INR-Friendly Scenarios: The most significant improvements were observed in HEVC Class E and SCC sequences. In these scenarios (static backgrounds or high-frequency text/graphics), the shared channel mixer effectively models the static/repetitive background, freeing up the parameter budget for the scale-specific spatial modules to capture complex foreground details.
- Ablation Study: Comparing SRNeRV against a "Full Share" variant (sharing the entire block) showed that while sharing helps, retaining scale-specific spatial modules is crucial for balancing parameter compactness with high-fidelity reconstruction.
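
For readers unfamiliar with BDBR, the sketch below shows one common way to compute it: fit a cubic polynomial to log-bitrate as a function of PSNR for each codec, integrate both fits over the shared PSNR range, and convert the average log difference to a percentage. The rate-distortion points are synthetic and are not results from the paper. The test curve is constructed as the anchor curve with every bitrate scaled by 0.8, so the expected answer is a 20% saving. The function name `bd_rate` is hypothetical.

```python
import numpy as np


def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta bitrate in percent (negative means the test codec saves bits)."""
    log_anchor = np.log(np.asarray(rate_anchor, dtype=float))
    log_test = np.log(np.asarray(rate_test, dtype=float))

    # Cubic fit of log-rate as a function of quality.
    poly_anchor = np.polyfit(psnr_anchor, log_anchor, 3)
    poly_test = np.polyfit(psnr_test, log_test, 3)

    # Integrate both fits over the PSNR interval covered by both curves.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_anchor = np.polyval(np.polyint(poly_anchor), hi) - np.polyval(np.polyint(poly_anchor), lo)
    int_test = np.polyval(np.polyint(poly_test), hi) - np.polyval(np.polyint(poly_test), lo)

    avg_log_diff = (int_test - int_anchor) / (hi - lo)
    return float((np.exp(avg_log_diff) - 1.0) * 100.0)


# Synthetic curves: same PSNR points, test bitrates are 0.8x the anchor's.
psnr = [30.0, 33.0, 36.0, 39.0]
anchor_rates = [0.05, 0.10, 0.20, 0.40]
test_rates = [0.8 * r for r in anchor_rates]

print(round(bd_rate(anchor_rates, psnr, test_rates, psnr), 3))  # close to -20.0 by construction
```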

5. Significance

Efficiency: SRNeRV demonstrates that INR-based video compression can be made significantly more parameter-efficient without sacrificing reconstruction quality. By sharing the heavy FFN components, the model size is drastically reduced.
Paradigm Shift: It supports the view that the "repetitive logic" principle, common in classical computer vision (e.g., Laplacian pyramids) and generative models (e.g., diffusion models), applies effectively to the spatial scale axis of INRs.
Future Direction: The paper suggests that targeted recursive sharing is a promising direction for future neural representation designs, offering a path toward highly compact, high-performance video codecs that leverage the inherent structure of visual data.