Real-Time Neural Video Compression with Unified Intra and Inter Coding

This paper presents a real-time neural video compression framework that unifies intra and inter coding within a single adaptive model and employs a simultaneous two-frame compression design to effectively handle disocclusion, prevent error propagation, and achieve a 12.1% BD-rate reduction over DCVC-RT while maintaining real-time performance.

Hui Xiang, Yifan Bian, Li Li, Jingran Wu, Xianguo Zhang, Dong Liu

Published Wed, 11 Ma

Imagine you are trying to send a long video message to a friend over a slow, spotty internet connection. You want the video to look great, but you also need it to arrive quickly without freezing.

For decades, the standard way to do this (like H.264 or H.266) has been to send the first frame of a scene in high detail, and then for the rest of the scene, just send "instructions" on how the picture changes from the previous one. It's like sending a photo of a landscape, and then just sending a note saying, "The tree moved left, the cloud moved right." This saves a ton of data.

Neural Video Compression (NVC) is the new, AI-driven way of doing this. It uses neural networks to predict the next frame even better than the old hand-crafted rules. But the current best AI methods have a few big problems, which this new paper, UI2C, solves.

Here is the breakdown of the paper using simple analogies:

1. The Problem: The "Amnesia" and the "Glitch"

Current AI video compressors are great at predicting what happens next if the scene is smooth. But they have two major flaws:

  • The "Scene Change" Problem: Imagine you are watching a video of a soccer game, and suddenly it cuts to a cooking show. The AI, which is used to predicting soccer balls, panics. It tries to guess the cooking show based on the soccer game, and the result is a blurry mess. Old systems handle this by forcing a "reset" (sending a full new photo), but that costs a lot of data and creates a sudden spike in internet usage.
  • The "Whispering Game" Problem: In a game of "telephone," the message gets distorted as it passes from person to person. In video compression, if the AI makes a tiny mistake predicting frame 10, that mistake gets carried over to frame 11, then 12, and so on. By frame 100, the video looks terrible. To stop this, old systems periodically force a "hard reset" (like a refresh button), which again causes those annoying data spikes.
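The "whispering game" drift can be illustrated with a toy simulation (the 1% per-frame error and the reset interval are made-up numbers, purely for illustration, not from the paper):

```python
# Toy simulation of prediction-error drift in sequential video coding.
# Each inter-coded frame copies the previous reconstruction and adds a
# small residual error; without a reset, errors compound frame by frame.

def simulate_quality(num_frames, error_per_frame=0.01, reset_every=None):
    qualities = []
    quality = 1.0
    for t in range(num_frames):
        if t == 0 or (reset_every and t % reset_every == 0):
            quality = 1.0                     # intra frame: costly hard reset
        else:
            quality *= (1 - error_per_frame)  # inter frame: error carried forward
        qualities.append(quality)
    return qualities

no_reset = simulate_quality(100)
with_reset = simulate_quality(100, reset_every=30)
print(f"frame 99 quality, no reset:   {no_reset[-1]:.2f}")
print(f"frame 99 quality, with reset: {with_reset[-1]:.2f}")
```

Even a tiny per-frame error compounds: after 100 frames the no-reset stream has drifted badly, while periodic resets keep quality high at the cost of those data spikes the paper is trying to avoid.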

2. The Solution: The "Swiss Army Knife" Model

The authors propose a new system called UI2C (Unified Intra and Inter Coding).

The Old Way: Imagine you have two different workers.

  • Worker A (The Artist): Only works on the very first frame of a scene. They are amazing at drawing from scratch but slow.
  • Worker B (The Editor): Works on all the other frames. They are fast at copying and tweaking, but if the scene changes, they are terrible at drawing from scratch.
  • The Flaw: When the scene changes, you have to fire Worker B and call in Worker A. This switch is slow and expensive.

The New Way (UI2C): You hire one Super-Worker.

  • This worker is trained to be both an Artist and an Editor.
  • If the scene is smooth, they act like an Editor, copying the previous frame with tiny tweaks (fast and efficient).
  • If the scene changes (like the soccer-to-cooking cut), they instantly switch to "Artist mode" and draw the new scene from scratch without needing a manager to tell them to switch.
  • The Result: No more awkward switches, no more data spikes, and the video stays clear even after a scene change.

3. The Secret Sauce: The "Two-Step Dance"

To make this Super-Worker even faster and smarter, the authors introduced a Simultaneous Two-Frame Compression trick.

The Analogy:
Imagine you are trying to describe a dance routine to a friend over the phone.

  • Old Method: You describe step 1, then step 2, then step 3. By the time you get to step 3, you might have forgotten a detail from step 1.
  • UI2C Method: You look at step 1 and step 2 together before you speak. Because you can see the future (step 2), you can explain step 1 much more accurately. You know exactly how the dancer is moving into the next pose.

In technical terms, the AI looks at the current frame and the next frame at the same time. This helps it understand the motion perfectly, even if the previous frames were a bit blurry. It only adds a tiny delay (waiting for one frame), but the quality boost is huge.
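The buffering that produces this one-frame delay can be sketched as follows. This is a hypothetical interface, not the paper's actual encoder: `encode_pair` and `encode_single` are stand-ins for the real neural network.

```python
# Sketch of simultaneous two-frame compression: the encoder buffers one
# frame so each coding step sees the current frame *and* the next one,
# at the cost of a single frame of latency.

def encode_stream(frames, encode_pair, encode_single):
    """Yield compressed units; encode_pair/encode_single are stand-ins
    for the real neural encoder, which this sketch does not implement."""
    buffer = []
    for frame in frames:
        buffer.append(frame)
        if len(buffer) == 2:          # both frames available: code jointly
            yield encode_pair(buffer[0], buffer[1])
            buffer = []
    if buffer:                        # odd frame count: code the last alone
        yield encode_single(buffer[0])

# Toy stand-ins that just record which frames were coded together.
units = list(encode_stream(
    ["f0", "f1", "f2", "f3", "f4"],
    encode_pair=lambda a, b: (a, b),
    encode_single=lambda a: (a,),
))
print(units)
```

The point of the sketch is the latency trade-off: each pair waits for its second frame before being coded, which is the "tiny delay" the paper accepts in exchange for seeing the motion into the next pose.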

4. Training the AI: The "Blindfold" Exercise

How do you teach a computer to switch between "Artist" and "Editor" modes automatically?

The authors used a clever training trick. During training, they sometimes fed the AI a blank screen (like a blank canvas) instead of the previous video frame.

  • This forced the AI to learn how to draw a scene from nothing (Intra-coding).
  • Sometimes they fed it a "noisy" or broken version of the previous frame.
  • This taught the AI: "Hey, if the reference is bad, don't try to copy it! Just draw the new scene yourself."

This means the AI learns to fix its own mistakes on the fly, without needing a human to hit a "refresh" button.
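The augmentation described above can be sketched like this. The probabilities and noise level are illustrative assumptions, not the paper's actual training settings:

```python
import random

# Sketch of the training trick: with some probability, hand the model a
# blank or noise-corrupted reference frame instead of the real previous
# frame, so it learns when to draw from scratch rather than copy.

def corrupt_reference(ref, p_blank=0.1, p_noise=0.2, rng=random):
    """ref is a flat list of pixel values in [0, 1]."""
    r = rng.random()
    if r < p_blank:
        # Blank canvas: forces the model to learn intra coding.
        return [0.0] * len(ref)
    if r < p_blank + p_noise:
        # Noisy reference: teaches the model not to trust a bad reference.
        return [min(1.0, max(0.0, x + rng.gauss(0, 0.1))) for x in ref]
    return ref  # clean reference: normal inter coding
```

During training, the corrupted reference replaces the real one for that step, and the model is still asked to reconstruct the current frame well, so it learns to ignore a useless reference on its own.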

The Bottom Line

The paper shows that this new UI2C system:

  1. Saves Data: It achieves a 12.1% BD-rate reduction (roughly 12% less data at the same quality) over the current best real-time AI compressor, DCVC-RT.
  2. Stays Stable: It doesn't have those annoying spikes in data usage when the scene changes.
  3. Fixes Itself: It stops the "whispering game" errors from ruining the whole video.
  4. Runs Fast: It's still fast enough for real-time video calls and streaming.

In short: They built a video compressor that is smart enough to know when to copy-paste and when to draw from scratch, all while looking ahead to the next frame to make sure everything looks perfect. It's like upgrading from a clumsy typist who needs a spell-checker every few words to a genius writer who knows exactly what to say before they even type it.