Imagine you are trying to describe a 10-minute movie to a friend, but you are only allowed to use 100 words. If you just pick the 100 most important words from the script, you might miss the story because the words are scattered all over. Or, if you just summarize every single scene separately, you might end up repeating the same words (like "the man walks" and "the man walks again") over and over, wasting your precious word count.
This is exactly the problem computer scientists face with Video AI (specifically Multimodal Large Language Models, or MLLMs). These AI models are like super-smart students who can "watch" videos and answer questions. However, to understand a video, the AI breaks it down into thousands of tiny pieces called tokens (small image patches, a bit like puzzle pieces of each frame). A 10-minute video can generate hundreds of thousands of these tokens. Processing all of them is slow, expensive, and uses up a massive amount of computer memory.
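To get a feel for the scale, here is a back-of-the-envelope calculation. The sampling rate and tokens-per-frame figures below are illustrative assumptions (typical for patch-based vision encoders), not numbers taken from the paper:

```python
# Rough token count for a 10-minute video.
# The sampling rate and patch grid are illustrative assumptions.
minutes = 10
fps_sampled = 2            # assume the model keeps 2 frames per second
tokens_per_frame = 196     # assume a 14x14 grid of image patches per frame

frames = minutes * 60 * fps_sampled        # 1,200 frames
total_tokens = frames * tokens_per_frame   # 235,200 tokens

print(f"{frames} frames -> {total_tokens} tokens")
```

Even at this modest sampling rate, the model faces roughly a quarter of a million tokens for one short video, which is why pruning matters so much.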
Existing methods try to cut down the number of tokens by looking at each video frame (image) individually. They say, "This frame has a face, keep it. This frame is just a wall, throw it away." But this is like editing a movie by looking at one photo at a time. It fails to realize that if a character is standing still for 5 seconds, the AI is wasting time analyzing the exact same face 150 times in a row.
Enter ForestPrune: The "Forest Ranger" of AI
The authors of this paper propose a new method called ForestPrune. Instead of looking at frames one by one, they look at the whole video as a growing forest.
Here is how it works, using a simple analogy:
1. Building the Forest (Spatial-Temporal Modeling)
Imagine the video is a forest.
- The Trees: Instead of treating every frame as a separate island, ForestPrune connects similar things across time. If a person's face appears in Frame 1, Frame 2, and Frame 3, ForestPrune doesn't see three separate faces. It sees one single tree growing through time.
- The Roots and Branches: The "root" of the tree is the first time the object appears. The "branches" are the subsequent frames where it continues to exist.
- The Rules: The AI only connects these branches if they are:
- Semantically similar (it's the same object).
- Spatially close (it's in roughly the same spot on the screen).
- Temporally ordered (it happens in the right time sequence).
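The three rules above can be sketched in code. Everything here is a hypothetical illustration: the token format, the cosine-similarity measure, and the thresholds are my assumptions for clarity, not the paper's actual implementation:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def can_link(tok_a, tok_b, sim_thresh=0.9, dist_thresh=2.0):
    """Decide whether token B (from a later frame) extends token A's 'tree'.

    Thresholds are illustrative, not values from the paper.
    """
    semantically_similar = cosine(tok_a["feat"], tok_b["feat"]) >= sim_thresh
    spatially_close = math.dist(tok_a["pos"], tok_b["pos"]) <= dist_thresh
    temporally_ordered = tok_b["frame"] > tok_a["frame"]
    return semantically_similar and spatially_close and temporally_ordered

# A face that persists across frames vs. an unrelated wall patch.
face_f1 = {"feat": [0.90, 0.10], "pos": (3, 4), "frame": 1}
face_f2 = {"feat": [0.88, 0.12], "pos": (3, 5), "frame": 2}
wall_f2 = {"feat": [0.10, 0.90], "pos": (3, 4), "frame": 2}

print(can_link(face_f1, face_f2))  # True: same object, nearby, later frame
print(can_link(face_f1, wall_f2))  # False: fails the similarity rule
```

All three conditions must hold at once, which is what keeps a "tree" from accidentally spanning two different objects that merely happen to occupy the same spot.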
2. Pruning the Forest (The Compression)
Now that the AI has built this "forest" of connected information, it needs to cut it down to size (to save memory).
- The Old Way (Image-only methods): Imagine a gardener who looks at each tree individually and cuts off the bottom leaves. If there are 100 identical trees, they trim all 100, one by one. This is inefficient: the trees are clones, so 99 of them could have been removed entirely.
- The ForestPrune Way: The gardener looks at the whole forest. They realize, "Hey, these 100 trees are all part of the same family."
- They keep the Roots (the most important, earliest frames that define the object).
- They keep the Trunks (the core structure).
- They ruthlessly cut off the Leaves (the redundant, repetitive frames that add no new information).
Because they understand the relationship between the frames, they can throw away 90% of the data without losing the story. It's like summarizing a movie by saying, "A man walks into a room, sits down, and talks," rather than describing every single step he took.
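The keep-the-roots, drop-the-leaves idea above can be sketched as follows. This assumes tokens have already been grouped into cross-frame "trees"; the grouping, the keep-only-the-root policy, and the budget are all illustrative stand-ins for the paper's actual scoring rules:

```python
def prune_forest(trees, keep_per_tree=1):
    """Keep the earliest token(s) of each cross-frame tree, drop the rest.

    `trees` maps a tree id to a list of (frame_index, token) pairs.
    Keeping only the root of each tree is an illustrative policy,
    not the paper's exact method.
    """
    kept = []
    for members in trees.values():
        members = sorted(members)             # earliest frame first
        kept.extend(members[:keep_per_tree])  # the root (and maybe trunk)
    return kept

# Two "trees": a face and a wall, each redundantly repeated for 10 frames.
trees = {
    "face": [(f, f"face@{f}") for f in range(10)],
    "wall": [(f, f"wall@{f}") for f in range(10)],
}

kept = prune_forest(trees)
total = sum(len(members) for members in trees.values())
print(f"kept {len(kept)} of {total} tokens")  # kept 2 of 20 tokens
```

Because the redundancy lives inside each tree, cutting leaves removes 90% of the tokens here while every distinct object still keeps at least one representative.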
Why is this a Big Deal?
The paper tested this on two powerful AI models (LLaVA-Video and LLaVA-OneVision) and found some amazing results:
- Massive Savings: They were able to cut the amount of data the AI had to process by 90% (keeping only 10% of the tokens).
- No Brain Damage: Even with 90% less data, the AI's ability to answer questions stayed incredibly high (retaining about 95-96% of its original benchmark performance).
- Speed and Efficiency: Because the AI has less data to chew on, it runs much faster and uses less computer memory. In some tests, it was significantly faster than other top methods.
- Better at Long Videos: While other methods got confused and started repeating themselves in long videos, ForestPrune stayed sharp because it understood the timeline.
The Bottom Line
Think of ForestPrune as a smart editor who doesn't just cut sentences randomly. Instead, they understand the flow of the story. They know that if a character is just standing there for a minute, they don't need to describe them 60 times. They describe the character once, and then just note that "the character remains there."
This allows AI to watch longer, more complex videos without getting overwhelmed, making video understanding faster, cheaper, and more efficient for everyone.