Video TokenCom: Textual Intent-Guided Multi-Rate Video Token Communications with UEP-Based Adaptive Source-Channel Coding

Imagine you are trying to send a high-definition video of a bustling city street to a friend, but your internet connection is very slow and unreliable. With conventional compression, you would have to squeeze the entire video equally, making everything blurry so it could fit through the narrow pipe.

This paper proposes a smarter way to do this, called Video TokenCom. Think of it as a "VIP Service" for your video data, where the most important parts get the best treatment, and the boring parts get the bare minimum, all based on what you actually care about seeing.

Here is how it works, broken down into simple concepts:

1. The "Lego" Transformation (Tokenization)

First, the system breaks the video down. Instead of sending millions of tiny colored pixels (like a giant, messy mosaic), it turns the video into a sequence of Lego bricks (called "tokens").

The Analogy: Imagine describing a movie not by sending every single frame, but by sending a list of Lego instructions: "Here is a red brick (a car), here is a blue brick (the sky)."
Why it helps: This makes the data much smaller and easier to manage, just like sending a list of instructions is easier than sending a pile of loose bricks.

2. The "Director's Note" (Textual Intent)

This is the magic ingredient. Before sending the video, you type a short note telling the system what you care about.

The Scenario: You are watching a video of a woman hitting a man with a phone. You type: "Show me the woman and the phone."
The Magic: The system uses a super-smart AI (like a digital director) to scan the video. It draws an invisible spotlight on the woman and the phone. Everything else (the sky, the background buildings, the crowd) is marked as "background noise."

3. The "Two-Tier" Delivery System (Multi-Rate Coding)

Now, the system treats the video data differently based on that spotlight.

The VIPs (Intended Tokens): The woman and the phone get the Full Treatment. They are sent with high precision, like sending a high-resolution photo. They are protected heavily so they don't get damaged.
The Background (Non-Intended Tokens): The sky and the crowd get the Economy Treatment. Instead of sending the full picture of the sky, the system just sends a tiny note saying, "The sky is mostly the same as the last frame, just a little different." This saves a massive amount of space.
The Analogy: Imagine mailing a package. You wrap the fragile, valuable vase (the woman) in 10 layers of bubble wrap and put it in a reinforced box. You wrap the old newspaper (the sky) in a single sheet of paper. You still send the whole package, but you used way less packing material.

4. The "Smart Traffic Cop" (Adaptive Coding)

The internet connection changes constantly. Sometimes it's a wide highway; sometimes it's a narrow dirt road.

The Problem: If the road gets narrow (bad connection), you can't send everything.
The Solution: The system acts like a smart traffic cop. It looks at the road conditions and instantly decides: "Okay, the road is bad. Let's send the woman in high definition, but let's make the background description even shorter."
Unequal Error Protection (UEP): This is the technical term for "protecting the VIPs more." If a packet of data gets lost in the mail, the system is designed so that losing a "background" packet doesn't ruin the video, but losing a "VIP" packet would. So, the VIP packets get extra insurance.

5. The Result

When your friend receives the data:

They see the woman and the phone crystal clear, exactly as you wanted.
The background might look a little fuzzy or blocky, but because you didn't care about it, it doesn't matter.
The Win: You got a high-quality experience for the parts that matter, using much less data than traditional video compression (like H.265) would require.

Why is this a big deal?

Current video apps (like Zoom or YouTube) have no way of knowing which parts of the picture you care about. If your internet is slow, everything gets blurry.
This new method is like having a personal editor for your video stream. It listens to your instructions ("I care about the car, not the trees"), cuts out the fluff, and ensures the important stuff gets through clearly, even on a bad connection.

In short: It's not about sending more data; it's about sending the right data, exactly how you want it, while ignoring the rest.

1. Problem Statement

The paper addresses the limitations of current video communication systems, particularly in the context of future AI-native wireless networks.

Inefficiency of Conventional Methods: Traditional codecs (e.g., H.265) and existing semantic communication (SemCom) schemes often treat all video content uniformly or rely on continuous feature representations. They lack the ability to dynamically prioritize specific semantic regions based on user intent.
Bandwidth Constraints: In restrictive bandwidth environments, transmitting full-resolution video or uniform semantic features leads to significant quality degradation.
Lack of Intent-Awareness: Existing systems do not effectively leverage user-defined textual intents to guide the compression and transmission process, resulting in wasted resources on non-critical content.
Gap in Token-Based Video: While "TokenCom" (using discrete tokens as communication units) has been explored for text and images, its application to video with adaptive source-channel coding remains under-studied.

2. Methodology

The authors propose a Video TokenCom framework that integrates discrete video tokenization, multimodal intent extraction, and Unequal Error Protection (UEP)-based adaptive coding. The system operates in four main stages:

A. Video Tokenization

Discrete Representation: The input video is converted into a grid of discrete tokens using a pre-trained video tokenizer (e.g., Cosmos models).
Mechanism: Instead of raw pixels, the video is mapped to a codebook of spatio-temporal prototypes. This reduces data volume significantly while preserving high-level semantics.
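
The codebook mapping described above can be illustrated with a generic nearest-prototype lookup. This is a minimal sketch and not the tokenizer used in the paper: the function names, the array shapes, and the use of plain Euclidean nearest-neighbor search are all assumptions made for illustration, and a pre-trained tokenizer may quantize its latents quite differently.

```python
import numpy as np

def quantize_to_tokens(latents: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each latent vector to the index of its nearest codebook entry.

    latents:  (T, H, W, D) continuous features (hypothetical encoder output).
    codebook: (K, D) prototype vectors.
    Returns a (T, H, W) integer array of token indices in [0, K).
    """
    flat = latents.reshape(-1, latents.shape[-1])  # (T*H*W, D)
    # Squared Euclidean distance between every latent and every prototype.
    d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1).reshape(latents.shape[:-1])

def dequantize(tokens: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Look up the prototype vector for each token index."""
    return codebook[tokens]
```

Only the integer indices need to be transmitted; a receiver holding the same codebook recovers the prototypes with `dequantize`.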

B. Textual Intent-Guided Token Extraction

Semantic Masking: The system uses a Vision-Language Model (specifically CLIP) to generate a text-conditioned heatmap on the first frame based on a user's textual description (e.g., "a woman hitting a phone").
Temporal Propagation: To maintain consistency across frames, the initial semantic mask is propagated through the video sequence using optical flow.
Token Classification: The pixel-level masks are mapped onto the discrete token grid. Tokens are classified into two disjoint sets:
1. Intended Tokens ( $S$ ): Correspond to regions matching the user's text.
2. Non-Intended Tokens ( $N$ ): Correspond to the rest of the video.
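
The mask-to-token mapping in the last step can be sketched for a single frame as follows. The square patch size, the 50% coverage threshold, and the function name are assumptions for illustration; the source does not state how partially covered tokens are assigned, and a tokenizer that also compresses along time would need the masks pooled across frames as well.

```python
import numpy as np

def classify_tokens(mask: np.ndarray, patch: int, threshold: float = 0.5) -> np.ndarray:
    """Split a token grid into intended / non-intended sets from a pixel mask.

    mask:  (H, W) boolean array, True where a pixel matches the textual intent.
    patch: side length in pixels covered by one token (assumed square).
    Returns a (H // patch, W // patch) boolean array:
    True = intended token (set S), False = non-intended token (set N).
    """
    h, w = mask.shape
    if h % patch or w % patch:
        raise ValueError("mask dimensions must be divisible by the patch size")
    # Group pixels into patch x patch blocks, one block per token.
    blocks = mask.reshape(h // patch, patch, w // patch, patch)
    coverage = blocks.mean(axis=(1, 3))  # fraction of masked pixels per token
    return coverage >= threshold
```

Because every token lands on exactly one side of the threshold, the two sets are disjoint and together cover the whole grid.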

C. Semantic-Aware Multi-Rate Bit Allocation

The framework employs a differential coding strategy based on token classification:

Intended Tokens: Transmitted with full codebook precision ( $B_{full}$ ) to ensure high fidelity.
Non-Intended Tokens: Transmitted using reduced precision differential encoding. The difference between the current token and a reference token is calculated, quantized with fewer bits ( $B_{\Delta}$ ), and clipped to a smaller range. This drastically reduces the bitrate for background or irrelevant content.
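
A minimal sketch of the reduced-precision differential step and the resulting bit count is shown below. It assumes the difference is taken directly on integer token indices and clipped to the signed range that fits in $B_{\Delta}$ bits; the paper may define the difference and the reference token differently, and all function names here are hypothetical.

```python
import numpy as np

def encode_residuals(tokens: np.ndarray, reference: np.ndarray, b_delta: int) -> np.ndarray:
    """Clip token-index differences to the signed range representable in b_delta bits."""
    lo, hi = -(1 << (b_delta - 1)), (1 << (b_delta - 1)) - 1
    diff = tokens.astype(np.int64) - reference.astype(np.int64)
    return np.clip(diff, lo, hi)  # lossy: large jumps saturate at lo or hi

def decode_residuals(residuals: np.ndarray, reference: np.ndarray, codebook_size: int) -> np.ndarray:
    """Rebuild token indices from the reference and keep them inside the codebook."""
    return np.clip(reference.astype(np.int64) + residuals, 0, codebook_size - 1)

def total_bits(num_intended: int, num_non_intended: int, b_full: int, b_delta: int) -> int:
    """Source bits for one token grid under the two-rate scheme."""
    return num_intended * b_full + num_non_intended * b_delta
```

With made-up numbers, 100 intended tokens at 16 bits plus 900 non-intended tokens at 3 bits cost 4300 bits, against 16000 bits if all 1000 tokens were sent at full precision.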

D. UEP-Based Adaptive Source-Channel Coding

To handle varying channel conditions (SNR) and resource constraints, the system formulates a joint optimization problem:

Unequal Error Protection (UEP): Intended and non-intended tokens are assigned different Modulation and Coding Schemes (MCS).
Optimization Objective: The system minimizes a weighted sum of semantic distortion and transmission delay subject to a fixed resource budget (bandwidth/time).
Adaptation: The optimizer dynamically selects the bit-precision ( $B_{\Delta}$ ) and MCS for non-intended tokens based on instantaneous SNR, ensuring reliable transmission of critical semantic data while adapting the background data rate to channel quality.
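
The adaptation loop can be sketched as an exhaustive search over a small configuration space. Everything concrete here is an assumption for illustration: the MCS table and its SNR thresholds, the extra SNR margin used to give intended tokens stronger protection, the toy distortion model, and the normalization of delay by the symbol budget. The source describes adapting the settings for non-intended tokens; this sketch searches the MCS of both classes for simplicity.

```python
from itertools import product

# Hypothetical MCS table: (name, information bits per symbol, minimum SNR in dB).
MCS_TABLE = [
    ("QPSK-1/2", 1.0, 2.0),
    ("16QAM-1/2", 2.0, 8.0),
    ("64QAM-2/3", 4.0, 15.0),
]

def select_config(snr_db, n_intended, n_non_intended, b_full, b_delta_options,
                  symbol_budget, weight=0.5, margin_db=3.0):
    """Pick (cost, MCS for S, MCS for N, B_delta) minimizing a weighted
    distortion-plus-delay cost within the symbol budget, or None if infeasible."""
    best = None
    for mcs_s, mcs_n, b_delta in product(MCS_TABLE, MCS_TABLE, b_delta_options):
        # UEP: intended tokens must clear their SNR threshold with extra margin.
        if snr_db < mcs_s[2] + margin_db or snr_db < mcs_n[2]:
            continue
        symbols = n_intended * b_full / mcs_s[1] + n_non_intended * b_delta / mcs_n[1]
        if symbols > symbol_budget:
            continue  # violates the fixed resource budget
        # Toy distortion: background share of the frame, halved per extra bit.
        share = n_non_intended / (n_intended + n_non_intended)
        distortion = share * 2.0 ** (-b_delta)
        delay = symbols / symbol_budget  # normalized to [0, 1]
        cost = weight * distortion + (1 - weight) * delay
        if best is None or cost < best[0]:
            best = (cost, mcs_s[0], mcs_n[0], b_delta)
    return best
```

Calling `select_config` again whenever the SNR estimate changes gives the adaptive behavior: as the channel degrades, fewer MCS options pass the threshold checks and the search falls back to more robust modulation or a smaller `b_delta`.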

3. Key Contributions

Intent-Relevance Extraction Framework: A novel pipeline combining vision-language modeling and optical flow to convert user text into discrete token classes, enabling fine-grained semantic prioritization.
Multi-Rate Bit-Allocation Strategy: A source coding scheme that assigns full precision to intent-relevant tokens and reduced differential precision to others, significantly improving rate efficiency without sacrificing semantic quality.
Joint Source-Channel Optimization: A UEP-based adaptation scheme that balances distortion and delay. It explicitly separates the transmission configuration for intended vs. non-intended tokens, optimizing for both perceptual quality and reliability under resource constraints.
Scalable Architecture: Unlike end-to-end deep learning models, this framework uses pre-trained tokenizers and follows an OSI-compatible layered design, offering flexibility and scalability.

4. Experimental Results

The framework was evaluated on the MCL-JCV and UVG video datasets, comparing against H.265 (conventional) and VC-DM (diffusion-based SemCom).

Performance Metrics: The proposed method outperformed baselines across PSNR, SSIM, LPIPS (perceptual), FVD (Fréchet Video Distance), and CLIP similarity (semantic).
Key Findings:
- Low Bitrate Efficiency: At an ultra-low bitrate of 0.013 BPP, Video TokenCom achieved higher quality than H.265 and VC-DM, both listed at 0.02 BPP.
- Semantic Fidelity: At an SNR of 6 dB, the framework reduced FVD by nearly 1500 points compared to H.265 (lower FVD is better), indicating much better temporal coherence and semantic preservation.
- Robustness: While H.265 failed to decode more than 85% of frames at low SNRs (shown as "Failed" in the graphs), Video TokenCom maintained stable decodability and reconstruction quality across all SNR levels.
- Intent Control: Visualizations confirmed that changing the textual intent (e.g., from "Sky" to "Car and person") successfully shifted the high-quality reconstruction focus to the specified regions while degrading irrelevant areas, all within a similar total bitrate.

5. Significance

This paper represents a significant step toward AI-native wireless communications. By treating video as a stream of discrete, semantically structured tokens rather than raw pixels, it bridges the gap between Multimodal Large Language Models (MLLMs) and physical layer transmission.

Efficiency: It demonstrates that "semantic-aware" compression can achieve higher quality at lower bitrates than traditional codecs.
User-Centricity: It introduces a paradigm where communication resources are allocated based on what the user cares about (textual intent), rather than uniform quality.
Robustness: The UEP-based adaptation ensures that critical semantic information survives poor channel conditions, making it highly suitable for future 6G networks where reliability and efficiency are paramount.