Imagine you have a massive library of video files. You want to shrink them down to save space, but there's a catch: you cannot lose a single pixel of information. If you compress a medical scan or a movie master, it must come out exactly the same as it went in. No blurry edges, no missing colors, no "close enough."
This is the challenge of Lossless Video Compression. For decades, we've used traditional codecs (like H.264 or H.265, in their lossless modes) to do this, but they are like using a sledgehammer to crack a nut: they work, but they aren't very efficient.
Enter NeuralLVC, a new AI-powered system that acts like a super-smart, time-traveling librarian. Here is how it works, explained simply.
1. The Problem: Why Video is Hard to Shrink
Think of a video as a stack of 30 photos per second.
- Traditional AI (used for lossy compression, like Netflix streaming) looks at a photo and says, "I can guess what the blurry background looks like." It throws away details to save space. This is great for streaming, but terrible for medical records or film archives where every detail matters.
- Old Lossless Tools try to save space by finding patterns, but they are rigid. They are like a person trying to pack a suitcase by just folding clothes neatly, without realizing that the shirt you wore yesterday is almost identical to the one you're wearing today.
2. The Solution: The "Time-Traveling" AI
NeuralLVC uses a clever two-part strategy, similar to how you might explain a story to a friend who already knows the beginning.
Part A: The "Snapshot" (I-Frame)
The first frame of the video is treated like a standalone photo. The AI looks at it and breaks it down into tiny 32x32 puzzle pieces.
- The Magic Trick: Instead of just guessing the picture, it uses a bijective map. Imagine a secret code where every single color pixel is assigned a unique, unchangeable ID number. If the pixel is Red, it becomes ID #50. If it's Blue, it's ID #51. This ensures that when you decode it later, you get exactly Red or Blue back. No guessing, no errors.
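To make the "secret code" idea concrete, here is a toy sketch in Python. This is not NeuralLVC's actual (learned) mapping; the codebook, offset, and variable names are all illustrative. The point is what "bijective" buys you: every pixel value maps to exactly one ID and back, so the round trip is exact.

```python
# Toy sketch of a bijective (invertible) pixel-to-ID map.
# The real system learns its mapping; here we just pair each
# possible 8-bit pixel value with a unique ID and invert it.

def make_codebook():
    # Every pixel value 0..255 gets exactly one ID, and vice versa.
    encode = {value: value + 50 for value in range(256)}  # arbitrary offset
    decode = {symbol: value for value, symbol in encode.items()}
    return encode, decode

encode, decode = make_codebook()

pixels = [12, 200, 12, 7]               # a tiny "patch"
symbols = [encode[p] for p in pixels]   # encoder-side mapping
restored = [decode[s] for s in symbols] # decoder-side mapping

assert restored == pixels               # bit-exact round trip
```

Because the map is one-to-one, there is no rounding and no "nearest match": decoding can never land on the wrong pixel.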
Part B: The "Difference" (P-Frame)
Here is where the magic happens. In a video, the second frame is usually 99% identical to the first.
- The Analogy: Imagine you are describing a movie scene to a friend who just watched the previous scene.
- Old Way: You describe the whole scene again: "There's a blue sky, a green tree, and a man in a red shirt."
- NeuralLVC Way: You say, "Remember that blue sky and green tree? They didn't change. But the man in the red shirt moved two steps to the left."
- How it works: The AI looks at the current frame and the previous frame. It only tries to compress the difference (the movement). It uses a "lightweight reference" (a tiny memory of the previous frame) to help it predict what changed. This is the "Temporal Conditioning" mentioned in the title—it's the AI using time to its advantage.
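The "only describe what changed" idea can be sketched in a few lines. This is a deliberately dumbed-down stand-in: NeuralLVC conditions a neural model on the previous frame rather than subtracting raw pixels, and the frames below are made-up numbers. But it shows why residuals are cheap: a mostly-unchanged frame produces a mostly-zero difference, and adding the difference back reconstructs the frame exactly.

```python
# Toy sketch of "store the difference, not the frame".
# A mostly-static scene yields a mostly-zero residual, which
# a lossless entropy coder can pack far more tightly.

prev_frame = [10, 10, 10, 200, 200, 10]
curr_frame = [10, 10, 10, 10, 200, 200]  # the bright spot moved

# Encoder: keep only the residual (the change).
residual = [c - p for c, p in zip(curr_frame, prev_frame)]

# Decoder: previous frame + residual rebuilds the current frame
# exactly -- no information is lost along the way.
rebuilt = [p + r for p, r in zip(prev_frame, residual)]
assert rebuilt == curr_frame

print(residual)  # mostly zeros
```

The same logic holds at scale: the fewer pixels change between frames, the less there is to store.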
3. The "Masked Diffusion" Engine
How does the AI know what to predict? It uses a technique called Masked Diffusion.
- The Analogy: Imagine a game of "Taboo" or a crossword puzzle.
- The AI takes a puzzle piece (a patch of the image) and covers up roughly half of the "words" (the pixel values) with black squares (masks).
- It looks at the uncovered words around the black squares and tries to guess what the hidden words are.
- Because it can look at the whole picture at once (not just left-to-right like a human reading), it gets a much better understanding of the context.
- Once it guesses the hidden words, it reveals them and covers up a new set, repeating the process until the whole picture is reconstructed perfectly.
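The reveal-a-few-at-a-time schedule can be sketched as a loop. In the real system a neural network does the predicting; below, a hypothetical stand-in `predictor` simply copies the matching pixel from the previous frame, so the example runs without a model. Everything here (the frames, the reveal ratio, the names) is illustrative, not the paper's algorithm.

```python
import random

# Toy sketch of the iterative mask-and-predict loop.
# Start fully masked, then reveal a fraction of the hidden
# positions each round until the whole patch is filled in.

def predictor(position, prev_frame):
    return prev_frame[position]  # stand-in for the neural network

prev_frame = [5, 5, 9, 9, 5, 5, 9, 9]
target     = [5, 5, 9, 9, 5, 5, 9, 9]  # frame to reconstruct

MASK = None
canvas = [MASK] * len(target)           # start fully masked

rng = random.Random(0)
while MASK in canvas:
    hidden = [i for i, v in enumerate(canvas) if v is MASK]
    # Reveal half of the still-masked positions each round.
    for i in rng.sample(hidden, max(1, len(hidden) // 2)):
        canvas[i] = predictor(i, prev_frame)

assert canvas == target
```

Each round the model gets more revealed context to lean on, which is why predicting in several passes beats guessing everything at once.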
4. Why is this a Big Deal?
The researchers tested this on 9 standard video clips.
- The Result: NeuralLVC squeezed the videos down 18% to 19% smaller than the best existing professional tools (H.265).
- The Guarantee: Unlike some "near-lossless" tools that introduce tiny, invisible errors, NeuralLVC is mathematically exact. If you encode a video and then decode it, the output is bit-for-bit identical to the original.
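"Bit-for-bit identical" is something you can actually check. The sketch below uses Python's built-in `zlib` as a stand-in codec (NeuralLVC itself is not publicly callable here); the round-trip-and-compare check is what the lossless guarantee means in practice.

```python
import hashlib
import zlib

# Toy demonstration of the lossless contract: compress,
# decompress, and verify the bytes match exactly.

original = bytes(range(256)) * 100          # stand-in "video" data
compressed = zlib.compress(original, 9)     # zlib stands in for the codec
restored = zlib.decompress(compressed)

assert restored == original                 # bit-for-bit identical
assert hashlib.sha256(restored).digest() == hashlib.sha256(original).digest()
assert len(compressed) < len(original)      # and it actually shrank
```

A "near-lossless" codec would fail the first assertion on some inputs; a lossless one never does, on any input.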
5. The Catch (and the Future)
There is one downside: Speed.
- The Analogy: Traditional codecs are like a fast-food assembly line: quick, but not the most careful packing. NeuralLVC is like a master craftsman hand-folding every origami crane. It takes much longer to process.
- Why it matters: This isn't meant for live streaming on your phone right now. It's designed for archiving. Think of national film libraries, medical hospitals, or space agencies storing terabytes of data. They don't need the video now; they need it to be perfect and take up as little space as possible for the next 50 years.
Summary
NeuralLVC is a new way to shrink videos without losing a single drop of data. It does this by:
- Using a perfect "secret code" for the first frame.
- Only saving the "changes" for the rest of the video, using a smart AI that remembers the previous frame.
- Using a "fill-in-the-blanks" game (masked diffusion) to predict exactly what those changes are.
It's a bit slow, but for saving the world's most important digital memories, it's a game-changer.