Keeping the Evidence Chain: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding

The paper proposes SemVID, a training-free token-pruning framework for Video Temporal Grounding. It allocates token budgets based on query relevance and inter-frame variation, and it preserves critical evidence and cross-frame connectivity through the strategic selection of object, motion, and context tokens, maintaining high accuracy and efficiency without any retraining.

Jiaqi Li, Shuntian Zheng, Yixian Shen, Jia-Hong Huang, Xiaoman Lu, Minzhe Ni, Yu Guan

Published 2026-03-09

Imagine you are trying to find a specific moment in a 2-hour movie where a character runs to a window. You have a super-smart AI assistant (a Video-Language Model) that can watch the whole movie and tell you exactly when that happens.

The Problem:
Watching a 2-hour movie frame-by-frame is like trying to read every single page of a 1,000-page book to find one sentence. It takes forever and uses up a massive amount of computer power. To speed things up, researchers have tried to "prune" (cut out) the boring parts of the video, keeping only the "important" frames.

The Mistake:
Previous methods cut the video the way a standard editor would: they kept the frames with the most action or the most obvious objects. But for pinpointing a specific time (like "when did he run?"), this approach fails.

  • The Analogy: Imagine you are trying to find the exact second a runner crosses the finish line. If you only keep the frames where the runner is standing still (because they are "important" or "salient") and delete the blurry, fast-moving frames in between, you lose the motion. You see the start and the end, but you miss the crossing. The AI gets confused because the "story" of the movement is broken.

The Solution: SemVID (The "Evidence Chain" Keeper)
The authors of this paper, Jiaqi Li and her team, realized that to find a specific moment, you don't just need "important" pictures; you need a continuous chain of evidence. You need to see the runner approaching, the moment they cross, and the moment they stop.

They created a new system called SemVID that acts like a smart film editor with three specific roles for the clips it keeps:

  1. The "Detective" (Object Tokens):

    • What it does: It finds the specific things mentioned in your question (e.g., "the window," "the bag").
    • The Analogy: It's like a detective looking for clues. It makes sure it doesn't just grab 100 pictures of the same window (redundancy) but grabs pictures of the window, the door, and the person, ensuring a diverse set of clues.
  2. The "Bridge Builder" (Motion Tokens):

    • What it does: It specifically keeps the frames where things are changing or moving fast.
    • The Analogy: If the Detective finds the start and end points, the Bridge Builder builds the bridge between them. It keeps the blurry, fast-motion frames that show the action happening. Without these, the AI can't tell when the event happened, only what happened.
  3. The "Anchor" (Context Tokens):

    • What it does: It keeps a few stable background frames so the AI doesn't get lost.
    • The Analogy: Imagine a movie where the camera keeps jumping wildly. You need a few steady shots of the room so you know where you are. These anchors prevent the video from feeling like a chaotic mess of disconnected clips.
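To make the three roles concrete, here is a toy sketch of how one frame's patch tokens might be sorted into them. This is not the paper's actual algorithm: the function name, the similarity measures (cosine similarity to the query, frame-to-frame change, distance to the frame mean), and the token counts are all illustrative assumptions.

```python
import numpy as np

def assign_roles(tokens, query_vec, prev_tokens, k_obj=2, k_mot=2):
    """Toy assignment of one frame's patch tokens to three roles.

    tokens:      (N, D) patch embeddings for the current frame
    prev_tokens: (N, D) the same patches in the previous frame
    query_vec:   (D,) embedding of the text query
    Returns index arrays for object, motion, and context tokens.
    """
    # "Detective": patches most similar to the query (cosine similarity).
    sim = tokens @ query_vec / (
        np.linalg.norm(tokens, axis=1) * np.linalg.norm(query_vec) + 1e-8)
    obj = np.argsort(-sim)[:k_obj]

    # "Bridge Builder": patches that changed most since the last frame.
    change = np.linalg.norm(tokens - prev_tokens, axis=1)
    mot = np.argsort(-change)[:k_mot]
    mot = np.array([i for i in mot if i not in obj])  # avoid duplicates

    # "Anchor": one stable, representative patch (closest to the frame mean).
    mean = tokens.mean(axis=0)
    ctx = np.array([int(np.argmin(np.linalg.norm(tokens - mean, axis=1)))])
    return obj, mot, ctx

# Synthetic demo: 16 patches of dimension 8.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
prev = tokens.copy()
prev[5] += 3.0                                 # patch 5 moved between frames
query = tokens[3] + 0.01 * rng.normal(size=8)  # query "describes" patch 3

obj, mot, ctx = assign_roles(tokens, query, prev)
```

In this demo, the query-matching patch lands in the object set and the fast-changing patch in the motion set, while the anchor is kept regardless of either signal, which is the essential division of labor the paper describes.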

How It Works (The "Budget" System):
Instead of just cutting randomly, SemVID gives every second of the video a "budget" of how many clips it can keep.

  • If a second has the answer to your question, it gets a bigger budget.
  • If a second has a lot of movement (a transition), it also gets a bigger budget to ensure the "bridge" isn't broken.
  • It guarantees that every second gets at least one tiny clip (the Anchor), so the timeline never has a "gap" where the AI goes blind.
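The bullet points above can be sketched as a small allocation routine. Again, this is a hedged illustration, not the paper's implementation: the scoring (relevance plus motion) and the rounding scheme are assumptions made for the example.

```python
import numpy as np

def allocate_budgets(relevance, motion, total_budget):
    """Split a total token budget across video segments.

    relevance: per-segment similarity to the text query
               (higher = more likely to contain the answer).
    motion:    per-segment inter-frame variation
               (higher = a transition or fast action).
    Every segment gets at least 1 token (the "anchor"), so the timeline
    never goes blind; the rest is shared in proportion to the scores.
    """
    relevance = np.asarray(relevance, dtype=float)
    motion = np.asarray(motion, dtype=float)
    n = len(relevance)
    assert total_budget >= n, "need at least one token per segment"

    score = relevance + motion
    weights = score / score.sum() if score.sum() > 0 else np.full(n, 1.0 / n)

    extra = total_budget - n                    # tokens left after the anchors
    budgets = 1 + np.floor(extra * weights).astype(int)

    # Hand any rounding leftovers to the highest-scoring segments.
    leftover = total_budget - budgets.sum()
    for i in np.argsort(-score)[:leftover]:
        budgets[i] += 1
    return budgets

# Example: 6 one-second segments, 20 tokens to spend.
relevance = [0.1, 0.8, 0.9, 0.2, 0.1, 0.1]  # segments 2-3 match the query
motion    = [0.0, 0.5, 0.6, 0.7, 0.0, 0.0]  # segment 4 is a fast transition
budgets = allocate_budgets(relevance, motion, total_budget=20)
print(budgets)  # every entry >= 1; most tokens go to segments 2-4
```

Note how the query-relevant and high-motion seconds absorb most of the budget, while the quiet seconds still keep their single anchor token.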

The Result:
Because it keeps this "Evidence Chain" intact, SemVID can prune aggressively and still stay accurate.

  • Speed: It processes videos 5.8 times faster.
  • Accuracy: Even though it discards 87.5% of the video tokens, it still retains 95% of the accuracy of processing the whole video.

In Summary:
Previous methods tried to find the "best" pictures and throw the rest away. This paper says, "No, we need the story." SemVID is a smart editor that keeps the clues, the action, and the background context in just the right order, allowing the AI to find the exact moment you are looking for without having to watch the whole movie.