UniComp: Rethinking Video Compression Through Informational Uniqueness

UniComp is an information uniqueness-driven video compression framework that optimizes visual fidelity under constrained budgets by minimizing conditional entropy through semantic frame grouping, adaptive resource allocation, and fine-grained spatial compression.

Chao Yuan, Shimin Chen, Minliang Lin, Limeng Qiao, Guanglu Wan, Lin Ma

Published 2026-03-06

Imagine you are trying to describe a 2-hour movie to a friend, but you only have 5 minutes to talk.

Most current methods of summarizing videos (like "Attention-based" compression) act like a nervous movie critic. They shout, "Look at this! Look at that! This part is exciting!" They focus on the loudest, most obvious moments. But in doing so, they often miss the quiet, crucial details that actually explain the plot, or they repeat the same "exciting" moment over and over because it grabbed their attention.

UniComp (the method in this paper) takes a completely different approach. Instead of asking "What is loud?", it asks "What is unique?"

Here is how UniComp works, broken down into simple analogies:

1. The Core Idea: The "Unique Ingredient" Rule

Imagine you are making a giant stew (the video).

  • Old Way: You keep adding salt because the chef (the AI) says, "Salt is important!" But you end up with a bowl of just salty water, missing the carrots, the beef, and the herbs.
  • UniComp Way: You look at the ingredients and ask, "Which ones can't be replaced?"
    • If you have 100 identical potatoes, you only need to keep one potato to represent the whole batch. The other 99 are redundant; you can guess what they look like based on the first one.
    • But if you have a single, rare truffle, you must keep it. If you lose it, the whole stew loses its special flavor.

UniComp throws away the "potatoes" (redundant, repetitive parts) and keeps the "truffles" (unique, irreplaceable information).
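The "unique ingredient" rule boils down to a greedy filter: keep an item only if nothing already kept can stand in for it. The paper frames this information-theoretically (minimizing conditional entropy); the sketch below is only a minimal illustration of the idea, where the `similarity` function, the threshold, and the scalar "ingredients" are all assumptions for demonstration, not the paper's implementation.

```python
def keep_unique(items, similarity, threshold=0.9):
    """Greedy uniqueness filter: keep an item only if it is not
    too similar to anything already kept (the 'rare truffle' test)."""
    kept = []
    for item in items:
        if all(similarity(item, k) < threshold for k in kept):
            kept.append(item)
    return kept

# Toy example: "ingredients" are scalars; similarity is closeness in value.
sim = lambda a, b: 1.0 - min(abs(a - b), 1.0)
print(keep_unique([0.50, 0.51, 0.52, 0.95], sim))  # near-duplicates collapse to one
```

With a threshold of 0.9, the three near-identical "potatoes" (0.50, 0.51, 0.52) collapse into one representative, while the lone "truffle" (0.95) survives.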

2. The Three Magic Steps

UniComp uses three specific tools to do this filtering, acting like a smart editor for your video:

Step A: The "Scene Grouping" (Frame Group Fusion)

  • The Problem: In a video, if a car is driving down a street, Frame 1, Frame 2, and Frame 3 look almost exactly the same.
  • The UniComp Fix: Instead of treating every frame as a separate chapter, UniComp says, "These 10 frames are basically the same scene." It fuses them into a single summary frame.
  • The Result: If the scene changes (the car crashes), it instantly creates a new group. It only keeps the frames where the story actually changes.
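The grouping step above can be sketched as a one-pass scan: consecutive frames stay in the same group while they remain similar, and a drop in similarity (the car crash) starts a new group. The frame representation and similarity measure here are placeholder assumptions; real frames would be feature vectors, not scalars.

```python
def group_frames(frames, similarity, threshold=0.8):
    """Split a frame sequence into groups of consecutive,
    near-identical frames; a scene change starts a new group."""
    groups = []
    for frame in frames:
        if groups and similarity(groups[-1][-1], frame) >= threshold:
            groups[-1].append(frame)   # same scene: extend current group
        else:
            groups.append([frame])     # scene changed: open a new group
    return groups

# Toy frames as scalars: a slow pan (small changes), then a hard cut.
sim = lambda a, b: 1.0 - min(abs(a - b), 1.0)
scenes = group_frames([0.1, 0.12, 0.13, 0.9, 0.91], sim)
print(len(scenes))  # 2 groups: the pan and the new scene
```

Each group can then be summarized by a single representative frame, which is what the next step budgets for.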

Step B: The "Budget Manager" (Token Allocation)

  • The Problem: You have a limited amount of "space" (computing power) to describe the video.
  • The UniComp Fix: It looks at the summary frames from Step A.
    • "This boring scene where the camera just pans across a wall? Give it 1 word of description."
    • "This exciting scene where the hero fights a dragon? Give it 100 words of description."
  • The Result: It dynamically shifts the budget to the parts of the video that actually matter, rather than spreading the budget evenly.
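A minimal sketch of this kind of budget manager: split a fixed token budget across frame groups in proportion to an assumed per-group "uniqueness" score. The scores and the proportional rule are illustrative assumptions, not the paper's exact allocation scheme.

```python
def allocate_tokens(uniqueness_scores, total_budget):
    """Split a fixed token budget across frame groups in proportion
    to how much unique information each group carries."""
    total = sum(uniqueness_scores)
    # Every group gets at least 1 token so nothing vanishes entirely.
    # (Rounding means the parts may not sum exactly to the budget;
    # a real implementation would reconcile the remainder.)
    return [max(1, round(total_budget * s / total))
            for s in uniqueness_scores]

# A dull wall-pan vs. a dragon fight, with 100 tokens to spend.
print(allocate_tokens([0.05, 0.95], 100))  # → [5, 95]
```

The boring scene gets a token or two; the action scene gets nearly everything, instead of each getting 50.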

Step C: The "Detail Sifter" (Spatial Dynamic Compression)

  • The Problem: Even inside a single frame, some parts are boring (a blue sky) and some are vital (a person's face).
  • The UniComp Fix: It zooms in on the image and asks, "Which pixels are unique?"
    • It keeps the pixels that make up the face and the text on a sign.
    • It merges the pixels of the blue sky into a single "blue" token because the sky doesn't need to be described pixel-by-pixel.
  • The Result: It creates a highly efficient "sketch" of the video that contains all the important details but uses very few "words" (tokens) to describe it.
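The detail sifter can be sketched the same way, one level down: within a frame, near-identical patches fold into a single representative token, while distinctive patches survive on their own. Again the scalar "patches" and similarity function are toy assumptions standing in for real visual tokens.

```python
def compress_patches(patches, similarity, threshold=0.9):
    """Merge near-identical patches within one frame into a single
    representative token; unique patches (faces, text) survive intact."""
    tokens = []   # representative patch values
    counts = []   # how many raw patches each token stands for
    for p in patches:
        for i, t in enumerate(tokens):
            if similarity(t, p) >= threshold:
                counts[i] += 1           # fold into an existing token
                break
        else:
            tokens.append(p)             # a genuinely new detail
            counts.append(1)
    return tokens, counts

# Toy patches as scalars: lots of "blue sky" plus one "face" patch.
sim = lambda a, b: 1.0 - min(abs(a - b), 1.0)
toks, cnts = compress_patches([0.2, 0.21, 0.2, 0.19, 0.8], sim)
print(toks, cnts)  # sky patches merge into one token; the face stays
```

Four sky patches become one "blue" token (with a count of 4), while the one face-like patch keeps its own token.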

3. Why is this a Big Deal?

The paper shows that UniComp is both more accurate and faster than current state-of-the-art methods.

  • It sees the invisible: In the paper's examples, other methods missed text on a tea box or confused the colors of cups. UniComp kept the unique text and colors because they were "informationally unique," even when the video was compressed to just 5% of its original size.
  • It's plug-and-play: You don't need to retrain the whole AI brain. You can just plug UniComp into existing video models, and they instantly become better at handling long videos.
  • It saves time: Because it throws away the "boring" parts before the AI even starts thinking, it can process long videos 4 times faster without losing accuracy.

The Bottom Line

Think of UniComp as a smart filter that stops the AI from getting overwhelmed by a flood of repetitive data. Instead of trying to remember everything, it remembers only what is different and important.

By focusing on Information Uniqueness rather than just "Attention," UniComp allows AI to watch hours of video, understand the story perfectly, and do it in a fraction of the time and computing power it used to take.