Video TokenCom: Textual Intent-Guided Multi-Rate Video Token Communications with UEP-Based Adaptive Source-Channel Coding

This paper proposes Video TokenCom, a novel framework that leverages textual intent to guide multi-rate video tokenization and employs Unequal Error Protection-based adaptive source-channel coding to prioritize semantically important tokens, thereby significantly enhancing perceptual and semantic video quality under bandwidth constraints.

Jingxuan Men, Mahdi Boloursaz Mashhadi, Ning Wang, Yi Ma, Mike Nilsson, Rahim Tafazolli

Published 2026-03-04

Imagine you are trying to send a high-definition video of a bustling city street to a friend, but your internet connection is very slow and unreliable. With conventional compression, you would typically squeeze the entire video more or less equally, making everything blurry so it could fit through the narrow pipe.

This paper proposes a smarter way to do this, called Video TokenCom. Think of it as a "VIP Service" for your video data, where the most important parts get the best treatment, and the boring parts get the bare minimum, all based on what you actually care about seeing.

Here is how it works, broken down into simple concepts:

1. The "Lego" Transformation (Tokenization)

First, the system breaks the video down. Instead of sending millions of tiny colored pixels (like a giant, messy mosaic), it turns the video into a sequence of Lego bricks (called "tokens").

  • The Analogy: Imagine describing a movie not by sending every single frame, but by sending a list of Lego instructions: "Here is a red brick (a car), here is a blue brick (the sky)."
  • Why it helps: This makes the data much smaller and easier to manage, just like sending a list of instructions is easier than sending a pile of loose bricks.
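
To make the "Lego brick" idea concrete, here is a minimal sketch of tokenization by nearest-codebook lookup (vector quantization). The random codebook, the 4x4 patch size, and the name tokenize_frame are assumptions made for this illustration; the paper's actual tokenizer is not reproduced here.

```python
import numpy as np

def tokenize_frame(frame, codebook, patch=4):
    """Map each patch x patch block of a grayscale frame to the index
    of its nearest codebook vector (hypothetical helper)."""
    h, w = frame.shape
    # Split the frame into non-overlapping patches and flatten each one.
    blocks = (frame.reshape(h // patch, patch, w // patch, patch)
                   .swapaxes(1, 2)
                   .reshape(-1, patch * patch))
    # Squared Euclidean distance from every patch to every codebook entry.
    dists = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1).reshape(h // patch, w // patch)

rng = np.random.default_rng(0)
codebook = rng.random((256, 16))   # 256 illustrative "brick" types (assumed)
frame = rng.random((32, 32))       # toy 32x32 grayscale frame
tokens = tokenize_frame(frame, codebook)

raw_bits = frame.size * 8          # 8 bits per pixel
token_bits = tokens.size * 8       # log2(256) = 8 bits per token
```

In this toy setup, a 32x32 frame at 8 bits per pixel (8,192 bits) becomes an 8x8 grid of tokens at 8 bits each (512 bits). A real tokenizer would use a trained codebook so that the bricks actually resemble the picture.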

2. The "Director's Note" (Textual Intent)

This is the magic ingredient. Before sending the video, you type a short note telling the system what you care about.

  • The Scenario: You are watching a video of a woman hitting a man with a phone. You type: "Show me the woman and the phone."
  • The Magic: The system uses a super-smart AI (like a digital director) to scan the video. It draws an invisible spotlight on the woman and the phone. Everything else (the sky, the background buildings, the crowd) is marked as "background noise."
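
One simple way to picture the "spotlight" is as a true/false mask over the tokens. The sketch below assumes that some vision-language model has already produced an embedding for each token and for the text note; the embeddings, the 0.5 threshold, and the name intent_mask are all hypothetical and are not taken from the paper.

```python
import numpy as np

def intent_mask(token_emb, text_emb, threshold=0.5):
    """Return True for tokens whose embedding is close to the text intent,
    measured by cosine similarity (illustrative rule, not the paper's)."""
    token_unit = token_emb / np.linalg.norm(token_emb, axis=1, keepdims=True)
    text_unit = text_emb / np.linalg.norm(text_emb)
    similarity = token_unit @ text_unit
    return similarity >= threshold

# Toy 2-D embeddings: the first two tokens point the same way as the text.
token_emb = np.array([[1.0, 0.0],
                      [0.9, 0.1],
                      [0.0, 1.0],
                      [-1.0, 0.2]])
text_emb = np.array([1.0, 0.0])    # stands in for "the woman and the phone"
mask = intent_mask(token_emb, text_emb)
```

Here the first two tokens land inside the spotlight and the last two are treated as background.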

3. The "Two-Tier" Delivery System (Multi-Rate Coding)

Now, the system treats the video data differently based on that spotlight.

  • The VIPs (Intended Tokens): The woman and the phone get the Full Treatment. They are sent with high precision, like sending a high-resolution photo. They are protected heavily so they don't get damaged.
  • The Background (Non-Intended Tokens): The sky and the crowd get the Economy Treatment. Instead of sending the full picture of the sky, the system just sends a tiny note saying, "The sky is mostly the same as the last frame, just a little different." This saves a massive amount of space.
  • The Analogy: Imagine mailing a package. You wrap the fragile, valuable vase (the woman) in 10 layers of bubble wrap and put it in a reinforced box. You wrap the old newspaper (the sky) in a single sheet of paper. You still send the whole package, but you used way less packing material.
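
The two tiers can be sketched as follows, under the simplifying assumption that background tokens are described only by the positions where they differ from the previous frame. The function names and the lossless difference scheme are illustrative; the paper's actual multi-rate coding may differ.

```python
import numpy as np

def encode_frame(tokens, prev_tokens, mask):
    """Split one frame's tokens into a full stream and a difference stream."""
    intended = tokens[mask]                       # VIP tokens, sent in full
    changed = (~mask) & (tokens != prev_tokens)   # background tokens that moved
    positions = np.flatnonzero(changed)
    return intended, positions, tokens[changed]

def decode_frame(intended, positions, values, prev_tokens, mask):
    """Rebuild the frame from the previous frame plus both streams."""
    out = prev_tokens.copy()
    out[positions] = values    # patch the few background changes
    out[mask] = intended       # overwrite the VIP region with full tokens
    return out

prev_tokens = np.array([5, 5, 5, 5, 5, 5])
tokens      = np.array([7, 8, 5, 5, 6, 5])
mask        = np.array([True, True, False, False, False, False])

intended, positions, values = encode_frame(tokens, prev_tokens, mask)
decoded = decode_frame(intended, positions, values, prev_tokens, mask)
```

In this example two VIP tokens are sent in full, while the four background tokens cost only one small update, because only one of them changed since the last frame.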

4. The "Smart Traffic Cop" (Adaptive Coding)

The internet connection changes constantly. Sometimes it's a wide highway; sometimes it's a narrow dirt road.

  • The Problem: If the road gets narrow (bad connection), you can't send everything.
  • The Solution: The system acts like a smart traffic cop. It looks at the road conditions and instantly decides: "Okay, the road is bad. Let's send the woman in high definition, but let's make the background description even shorter."
  • Unequal Error Protection (UEP): This is the technical term for "protecting the VIPs more." If a packet of data gets lost in the mail, the system is designed so that losing a "background" packet doesn't ruin the video, but losing a "VIP" packet would. So, the VIP packets get extra insurance.
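
A rough sketch of the traffic cop's decision is shown below. It assumes two channel code rates (1/3 for VIP tokens, meaning three coded bits per data bit, and 2/3 for background tokens) and a simple rule that trims the background when the budget is tight. These rates, the budget figures, and the name allocate are illustrative assumptions, not values from the paper.

```python
import math
from fractions import Fraction

def allocate(budget_bits, vip_bits, bg_bits,
             vip_rate=Fraction(1, 3), bg_rate=Fraction(2, 3)):
    """Protect VIP bits strongly first, then fit as much background as the
    remaining budget allows (hypothetical allocation rule)."""
    vip_coded = math.ceil(vip_bits / vip_rate)    # lower rate = more redundancy
    remaining = budget_bits - vip_coded
    if remaining < 0:
        raise ValueError("budget too small even for the VIP tokens")
    bg_kept = min(bg_bits, math.floor(remaining * bg_rate))
    bg_coded = math.ceil(bg_kept / bg_rate)
    return {"vip_coded": vip_coded, "bg_kept": bg_kept, "bg_coded": bg_coded}

good_road = allocate(budget_bits=3000, vip_bits=300, bg_bits=1200)
bad_road = allocate(budget_bits=1500, vip_bits=300, bg_bits=1200)
```

With the wide budget, all 1,200 background bits fit. With the narrow one, the VIP tokens keep the same heavy protection and the background description is cut to 400 bits, which mirrors the "send the woman in high definition, shorten the background" decision described above.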

5. The Result

When your friend receives the data:

  • They see the woman and the phone crystal clear, exactly as you wanted.
  • The background might look a little fuzzy or blocky, but because you didn't care about it, it doesn't matter.
  • The Win: You got a high-quality experience for the parts that matter, using much less data than traditional video compression (like H.265) would require.

Why is this a big deal?

Video apps like Zoom or YouTube generally do not know which parts of the picture matter to you. If your internet is slow, everything gets blurry.

This new method is like having a personal editor for your video stream. It listens to your instructions ("I care about the car, not the trees"), cuts out the fluff, and ensures the important stuff gets through clearly, even on a bad connection.

In short: It's not about sending more data; it's about sending the right data, exactly how you want it, while ignoring the rest.
