TeCoNeRV: Leveraging Temporal Coherence for Compressible Neural Representations for Videos

Imagine you want to send a 4K movie to a friend, but your internet connection is slow. You need to shrink the file size without making the movie look like a blurry mess.

For decades, we've used "block-based" compression (like ZIP files for video) to do this. But recently, scientists discovered a new way: Implicit Neural Representations (INRs).

Think of an INR not as a file full of pixels, but as a tiny, custom-built recipe. Instead of storing the picture of a cat, you store the instructions (a neural network) that can "draw" the cat from scratch whenever you need it. This is incredibly efficient because the recipe is tiny compared to the image.

However, there's a catch: The "One Video, One Chef" Problem.
In previous methods, if you wanted to compress 100 different videos, you had to train 100 different "chefs" (neural networks) to learn the specific recipe for each video. This took forever (slow encoding) and required a massive kitchen (huge memory) to handle high-definition videos.

Enter TeCoNeRV. The authors of this paper built a system that solves these problems using three clever tricks. Here is how they did it, explained with everyday analogies:

1. The "Lego Brick" Strategy (Patch-Tubelets)

The Problem: Trying to predict the recipe for a whole 1080p movie frame at once is like trying to bake a 10-foot cake in a tiny oven. It crashes the system (runs out of memory).

The TeCoNeRV Solution: Instead of baking the whole cake at once, they break the video into small, manageable Lego bricks (called "patch tubelets").

Imagine the video is a giant wall. Instead of trying to predict the pattern for the whole wall, the AI only looks at one small 320x160 pixel square at a time.
It learns the recipe for that specific square and then slides over to the next one.
The Magic: Because the AI only ever has to think about a small square, it doesn't matter if the final video is 480p, 720p, or 4K. The "oven" stays the same size. You can train the AI on a small 480p video, and it can instantly bake a 1080p cake just by using more Lego bricks.

2. The "Delta" Notebook (Residual Storage)

The Problem: In a video, the frame at 1:00:01 looks almost exactly like the frame at 1:00:00. If you write down the entire recipe for every single second, you are wasting a ton of space repeating the same instructions.

The TeCoNeRV Solution: They use a Delta Notebook.

Step 1: They write down the full recipe for the very first second of the video.
Step 2: For the next second, they don't write a new recipe. They just write a tiny note saying: "Change the color of the sky slightly," or "Move the car 2 pixels to the right."
Step 3: They keep doing this, only storing the differences (residuals) between the current moment and the previous one.
Since videos change slowly, these "difference notes" are tiny. This shrinks the file size dramatically.

3. The "Smooth Jazz" Rule (Temporal Coherence)

The Problem: Even with the Delta Notebook, the AI was sometimes getting jittery. One second, the recipe might say "draw a blue sky," and the next second, it might suddenly say "draw a purple sky" just because the math got a little weird, even if the video didn't actually change color. This creates "noise" in the file.

The TeCoNeRV Solution: They added a rule called Temporal Coherence Regularization.

Think of this as training the AI to play Smooth Jazz.
They tell the AI: "Hey, if the video is moving smoothly, your internal recipe (the weights) must also change smoothly. Don't jump around!"
By forcing the AI to make gentle, gradual changes to its recipe rather than sudden jumps, the "difference notes" (from Trick #2) become even smaller and easier to compress. It's like smoothing out a bumpy road so the car (the data) can drive faster and use less fuel.

The Result?

By combining these three tricks, TeCoNeRV is a game-changer:

Faster: It encodes videos 1.5 to 3 times faster than previous methods.
Smaller: It reduces the file size (bitrate) by about 36%.
Sharper: The video quality is actually better (higher PSNR) than the competition, especially at high resolutions like 720p and 1080p.
Scalable: It's the first method that can handle high-definition videos without needing a supercomputer to train.

In summary: TeCoNeRV stops trying to memorize the whole movie at once. Instead, it learns to paint the movie in small, smooth, connected brushstrokes, only writing down what changed since the last stroke. This makes it the most efficient way yet to shrink high-quality video using AI.

1. Problem Statement

Implicit Neural Representations (INRs) have shown promise for video compression by representing videos as compact neural networks that map coordinates to pixel values. However, existing approaches face three critical limitations when scaling to high-resolution videos:

Encoding Efficiency: Traditional INRs require overfitting a separate network for every video, making encoding prohibitively slow for practical applications.
Memory Scalability: Hypernetwork-based approaches (which predict INR weights for unseen videos) suffer from memory requirements that scale quadratically with video resolution. This makes training and inference at resolutions like 720p or 1080p infeasible due to GPU memory constraints.
Compression Inefficiency: Prior hypernetwork methods often produce large compressed sizes and low quality at higher resolutions. Furthermore, they lack mechanisms to ensure that the predicted network weights evolve smoothly over time, leading to large, uncompressible variations in weights between consecutive frames even when visual changes are minor.

2. Methodology: TeCoNeRV

The authors propose TeCoNeRV, a framework designed to overcome these scaling and efficiency barriers through three core technical innovations:

A. Patch-Tubelet Decomposition (Spatial & Temporal Scaling)

To address the quadratic memory scaling issue, TeCoNeRV decomposes the video compression task:

Mechanism: Instead of predicting weights for an entire video frame at once, the method partitions each video clip into smaller spatio-temporal volumes called "patch tubelets" (e.g., $N \times 3 \times H_p \times W_p$ ).
Benefit: The hypernetwork predicts weights for these fixed-size patches rather than full frames. This decouples memory requirements from the full video resolution, allowing the model to be trained on lower resolutions (e.g., 480p) and perform inference on much higher resolutions (e.g., 1080p) by simply increasing the number of patches.
Implementation: At inference, the model predicts weights for each patch position and reassembles them. To handle non-integer divisions of frame sizes, the authors use strided patches with overlap and crop the edges to ensure smooth transitions.

B. Residual Encoding (Bitstream Compression)

To reduce the size of the compressed bitstream, the authors exploit temporal redundancy between consecutive clips:

Mechanism: Instead of storing the full unique parameters ( $\theta_{uniq}$ ) for every clip, the system stores the full parameters only for the first clip. For all subsequent clips, it stores only the compact residuals (differences) relative to the previous clip.
Decoding: The decoder reconstructs the full weights for a clip by cumulatively adding the residuals to the base parameters.
Result: This significantly reduces the bitstream size, especially for videos with smooth motion where weight transitions are small.

C. Temporal Coherence Regularization

The authors observed that standard hypernetwork training often results in erratic weight transitions between clips, even when the visual content changes smoothly.

Mechanism: They introduce a temporal coherence regularization term ( $L_{temp}$ ) applied during a fine-tuning stage. This loss function penalizes the $L_1$ norm of the differences between the weight spaces of consecutive clips ( $|\theta_i - \theta_{i+1}|$ ).
Effect: This forces the hypernetwork to produce weights that evolve smoothly and synchronously with the video content.
Outcome: This induces sparsity in the weight residuals, making them even smaller and more compressible. The strength of this regularization ( $\lambda_{temp}$ ) acts as a tunable rate control mechanism, allowing users to trade off bitrate against reconstruction quality.

3. Key Contributions

Scalable Hypernetworks: TeCoNeRV is the first hypernetwork-based approach to successfully scale to high-resolution video compression (480p, 720p, and 1080p) by introducing the patch-tubelet decomposition strategy, overcoming the memory bottlenecks of previous methods.
Resolution-Independent Training: The architecture allows models trained on low-resolution data to infer on high-resolution videos, a capability previously impossible for hypernetworks due to memory constraints.
Efficient Residual Storage: A differential encoding scheme that stores only weight differences between clips, drastically reducing bitstream size.
Temporal Coherence Framework: A regularization technique that aligns weight space evolution with video content, reducing the magnitude and variance of residuals and enabling further compression.

4. Experimental Results

The method was evaluated on standard datasets including UVG, HEVC (Classes B, C, E), and MCL-JCV at 480p, 720p, and 1080p resolutions.

Performance Gains:
- On the UVG dataset at 480p, TeCoNeRV achieved a 2.47 dB PSNR improvement over the baseline (NeRV-Enc*) with a 36% reduction in bitrate.
- At 720p, it achieved a 5.35 dB PSNR gain over the baseline.
- It consistently outperformed NeRV-Enc*, NeRV, and HiNeRV in terms of PSNR, SSIM, and bitrate across all tested resolutions.
Speed:
- TeCoNeRV offers 1.5x to 3x faster encoding speeds compared to the baseline NeRV-Enc*, while maintaining comparable decoding speeds.
Memory Efficiency:
- While NeRV-Enc* requires ~~32GB of GPU memory for 720p training (making 1080p infeasible), TeCoNeRV maintains near-constant memory usage (~~2.1–2.9GB) regardless of resolution, enabling training on consumer-grade hardware.
Quality:
- Visual comparisons show TeCoNeRV preserves structural details and object boundaries significantly better than baselines, which often exhibit blurring or artifacts.

5. Significance

TeCoNeRV represents a significant leap forward in neural video compression. By solving the fundamental memory scaling problem of hypernetworks, it makes high-resolution neural compression practical. The combination of patch-tubelet decomposition and temporal coherence regularization establishes a new paradigm where:

Training and Inference are decoupled from resolution, allowing for flexible deployment.
Encoding is fast (comparable to traditional codecs) while decoding remains simple (feed-forward).
Compression efficiency is improved by explicitly modeling the smoothness of the neural representation's weight space over time.

This work bridges the gap between the theoretical promise of INRs and the practical requirements of modern high-resolution video streaming, paving the way for future research in scalable, neural-based media compression.

TeCoNeRV: Leveraging Temporal Coherence for Compressible Neural Representations for Videos

1. The "Lego Brick" Strategy (Patch-Tubelets)

2. The "Delta" Notebook (Residual Storage)

3. The "Smooth Jazz" Rule (Temporal Coherence)

The Result?

1. Problem Statement

2. Methodology: TeCoNeRV

A. Patch-Tubelet Decomposition (Spatial & Temporal Scaling)

B. Residual Encoding (Bitstream Compression)

C. Temporal Coherence Regularization

3. Key Contributions

4. Experimental Results

5. Significance

More like this

Multi-Agent Home Energy Management Assistant

ProCap: Projection-Aware Captioning for Spatial Augmented Reality

Fundamentals of Computing Continuous Dynamic Time Warping in 2D under Different Norms

UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

Efficient Model Repository for Entity Resolution: Construction, Search, and Integration