VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning

The Big Problem: The "Blindfolded Librarian"

Imagine you have a librarian (an AI) who is incredibly smart but has a very short attention span. You hand them a 2-hour movie and ask, "What was the woman in the red dress doing in the background during the third scene?"

If you ask the librarian to take in the entire movie in one pass, they get overwhelmed. They might start hallucinating (making things up) because they can't hold all the details in their head. This is the current problem with AI trying to understand long videos: too much information, not enough focus.

The Solution: VideoTIR (The "Detective with a Toolkit")

The authors propose a new system called VideoTIR. Instead of forcing the AI to stare at the whole video at once, they turn the AI into a detective with a specialized toolkit.

Here is how it works, step-by-step:

1. The Multi-Turn Conversation (The Detective's Notebook)

Instead of guessing the answer immediately, the AI talks to itself in turns:

Turn 1: "I see the whole room, but I can't see the woman in red clearly. I need to zoom in."
Turn 2: The AI uses a tool to zoom in on that specific part of the video.
Turn 3: "Ah, now I see! She was running on a treadmill."
Turn 4: The AI gives the final answer.

This is like a human watching a movie: you don't memorize every second instantly; you pause, rewind, and zoom in on specific moments when you need to.
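
The turn-taking loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `run_agent`, `toy_policy`, `toy_tools`, and the action dictionary format are all hypothetical, and the policy is a hand-written stub standing in for the AI model.

```python
from typing import Any, Callable, Dict, List

# Hypothetical action format (not from the paper):
#   {"type": "tool", "name": ..., "args": {...}}  or  {"type": "answer", "text": ...}
Action = Dict[str, Any]


def run_agent(
    question: str,
    policy: Callable[[str, List[str]], Action],
    tools: Dict[str, Callable[..., str]],
    max_turns: int = 4,
) -> str:
    """Alternate between reasoning and tool calls until the policy answers."""
    observations: List[str] = []  # the "detective's notebook"
    for _ in range(max_turns):
        action = policy(question, observations)
        if action["type"] == "answer":
            return str(action["text"])
        tool = tools[action["name"]]
        observations.append(tool(**action["args"]))
    return "unknown"  # turn budget exhausted without a final answer


def toy_policy(question: str, observations: List[str]) -> Action:
    """Stub policy: zoom in once, then answer from what the tool returned."""
    if not observations:
        return {"type": "tool", "name": "zoom_in", "args": {"region": "background"}}
    return {"type": "answer", "text": observations[-1]}


toy_tools = {"zoom_in": lambda region: f"woman in red on a treadmill ({region})"}
answer = run_agent("What is the woman in red doing?", toy_policy, toy_tools)
```

The `max_turns` cap is a design choice of this sketch: it guarantees the loop ends even if the policy never commits to an answer.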

2. The Toolkit (The Swiss Army Knife)

The AI doesn't just have one way to look at the video. It has a "hierarchical toolkit" with different tools for different jobs:

The Browsing Tool (The Wide-Angle Lens): If the question is general ("What is this video about?"), this tool scans the whole video quickly at a lower resolution, like looking at a map to see the whole city.
The Segment Retriever (The Time-Traveler): If the question is about a specific time ("What happened between 5:00 and 5:30?"), this tool jumps straight to that clip.
The Zoom-In Tool (The Magnifying Glass): If the question is about a tiny detail ("What color was the car's license plate?"), this tool crops the image to get a high-definition look at just that spot.

3. The "Textual Router" (The Smart Dispatcher)

How does the AI know which tool to use? It has a "Textual Router." Think of this as a smart dispatcher in a call center.

When you ask a question, the dispatcher reads it, figures out what you need, and immediately hands the case to the right specialist (the Zoom tool, the Browsing tool, etc.).
This prevents the AI from wasting time using a magnifying glass when it just needs to look at the whole map.
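
To make the dispatcher idea concrete, here is a toy router in Python. It is only a stand-in: per the technical summary, the real router is the model itself deciding from the question and the current context, whereas this sketch uses hand-written keyword rules. The tool names, the keyword list, and the `route` function are assumptions made for illustration.

```python
import re

# Hypothetical rules standing in for a learned router.
TIMESTAMP = re.compile(r"\b\d{1,2}:\d{2}\b")
DETAIL_WORDS = ("color", "license plate", "logo", "wearing")


def route(question: str) -> str:
    """Pick a tool name from the question text alone."""
    q = question.lower()
    if TIMESTAMP.search(q):
        return "segment_retriever"  # question names a time span
    if any(word in q for word in DETAIL_WORDS):
        return "zoom_in"            # question asks about a small visual detail
    return "browsing"               # default: coarse scan of the whole video
```

Run on the three example questions from the toolkit list, this picks the browsing tool for "What is this video about?", the segment retriever for the 5:00 to 5:30 question, and the zoom-in tool for the license plate question.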

The Secret Sauce: How We Taught the AI to be Efficient

Teaching an AI to use these tools is hard. If you just tell it "use tools," it tends to fail in one of two ways:

The "Overuse" Problem: The AI might keep zooming in even when it already has the answer (like checking your email 50 times when you already know the news).
The "Misuse" Problem: The AI might use the wrong tool (using a magnifying glass to read a billboard from a mile away).

To fix this, the authors invented TAGPO (Toolkit Action Grouped Policy Optimization).

The Analogy: Imagine a video game where you get points for solving a puzzle.
- Old Way: You only get points if you solve the whole puzzle at the end. If you wasted 10 moves doing nothing, you still get the same points, so you don't learn to be efficient.
- TAGPO Way: You get instant feedback for every single move. If you use a tool that was unnecessary, you lose points immediately. If you use the perfect tool to get closer to the answer, you get bonus points.
This teaches the AI to be concise and precise, stopping it from wasting time.
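
The difference between the two scoring schemes can be shown with a toy example. The numbers and function names below are invented for illustration, and the "useful" flags are supplied by hand. In real training, deciding which calls were useful is the job of the reward design.

```python
from typing import List


def terminal_rewards(moves: List[bool], solved: bool) -> List[float]:
    """Old way: only the last move carries any signal."""
    if not moves:
        return []
    return [0.0] * (len(moves) - 1) + [1.0 if solved else 0.0]


def stepwise_rewards(
    moves: List[bool], bonus: float = 0.5, penalty: float = 0.5
) -> List[float]:
    """Stepwise way: every move is scored as soon as it is made."""
    return [bonus if useful else -penalty for useful in moves]


# Each entry marks whether that tool call was useful (True) or unnecessary (False).
efficient = [True]
wasteful = [False, False, True]
```

Under `terminal_rewards`, both players who solve the puzzle get the same total score, so the wasteful one has no reason to change. Under `stepwise_rewards`, the two unnecessary moves cost points and the efficient player comes out ahead.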

The Training Ground: The "Sandbox"

To teach the AI these skills, you need a lot of practice data. But real videos with perfect "tool usage" instructions are rare.

The Solution: The authors built a Sandbox.
The Analogy: Imagine a flight simulator. The authors took real video questions and used a super-smart AI to generate fake "training flights": thousands of scenarios where the AI practices calling the right tools, making mistakes, and learning from them in a safe, simulated environment. Only then is it allowed to fly the real plane (the actual video).

Why This Matters

Accuracy: It stops the AI from making things up (hallucinations) because it can actually go look for the evidence.
Efficiency: It doesn't waste computer power processing the whole video at high quality. It only zooms in where it matters.
Scalability: It works well even on very long videos (hours long) where other AI models usually fail.

In a nutshell: VideoTIR turns a confused AI into a smart, efficient detective that knows exactly when to scan the whole crime scene and when to pull out a magnifying glass, all while being trained in a virtual gym to avoid wasting time.

1. Problem Statement

Long Video Understanding (LVU) remains a significant challenge for Multimodal Large Language Models (MLLMs). Existing approaches suffer from two primary issues:

Hallucinations and Token Imbalance: MLLMs often hallucinate when processing long videos due to the imbalance between textual and visual tokens. To mitigate this, current methods often rely on aggressive frame subsampling, which loses critical temporal and spatial details.
Inefficiency in Tool-Integrated Reasoning (TIR): While recent TIR methods attempt to use retrieval tools to fetch relevant video segments, they face limitations:
- External Dependencies: Heavy external toolkits require complex pipelines and fixed workflows, limiting generalization.
- Internal Limitations: Lightweight internal methods (e.g., prompting MLLMs to output timestamps) struggle because base models lack fine-grained spatiotemporal pre-training, leading to poor localization.
- RL Training Issues: Standard Reinforcement Learning (RL) strategies (like GRPO) often lead to tool misuse (calling tools but still getting wrong answers) and tool overuse (retrieving excessively fine-grained data when coarse data suffices), causing redundant exploration and slow convergence.
- Data Scarcity: High-quality, fine-grained tool-calling trajectories for SFT (Supervised Fine-Tuning) are difficult to obtain manually.

2. Methodology: VideoTIR

The authors propose VideoTIR, a framework that combines a multi-turn, multi-tool agent with a novel RL optimization strategy.

A. Multi-Turn, Multi-Tool Agent Framework

Unlike single-turn MLLMs, VideoTIR adopts a coarse-to-fine, multi-turn interaction loop:

Textual Router: An internal mechanism that analyzes the user question and current context to decide if the existing visual information is sufficient. If not, it determines whether to invoke global or local tools.
Hierarchical Internal Toolkit: The system utilizes an endogenous toolkit designed to mimic human browsing behaviors:
- Global Tools (Browsing Tool): For holistic understanding, this tool progressively increases video resolution and sampling rate (FPS) to acquire coarse-to-fine visual evidence.
- Local Tools (Temporal-Spatial Grounding Chain): For specific queries, this chain includes:
  - Segment Retriever: Locates relevant video segments based on cosine similarity between text queries and video tokens.
  - Frame Retriever: Extracts keyframes from the selected segment.
  - Zoom-in Retriever: Crops specific high-resolution regions within a frame.
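
The Segment Retriever's ranking step can be sketched as follows. This is a minimal illustration under stated assumptions: each segment is already represented by a single embedding vector (how video tokens are pooled into that vector is not specified here), and `cosine` and `top_segments` are hypothetical helper names.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_segments(
    query: Sequence[float], segments: Sequence[Sequence[float]], k: int = 1
) -> List[int]:
    """Return indices of the k segments most similar to the query,
    sorted back into chronological order."""
    scores = [cosine(query, seg) for seg in segments]
    ranked = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


query = [1.0, 0.0]
segments = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
best = top_segments(query, segments, k=1)
```

Returning indices in chronological order is a choice of this sketch, made so that a downstream frame retriever would see the selected clips in the order they occur.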

B. Toolkit Action Grouped Policy Optimization (TAGPO)

To address the inefficiencies of standard RL (misuse and overuse), the authors propose TAGPO, a novel policy optimization algorithm:

Stepwise Reward Assignment: Instead of assigning rewards only at the episode level (final answer), TAGPO assigns rewards to individual tool invocations.
Advantage Estimation:
- Penalizing Redundancy: In successful trajectories, earlier or redundant tool calls receive lower rewards (via a decay coefficient $\gamma$), discouraging overuse.
- Encouraging Exploration: In failed trajectories, repeated attempts are not penalized as harshly if they lead to new information, preventing the model from getting stuck in local optima.
- Grouped Normalization: Advantages are calculated per tool type to ensure fair comparison and prevent one tool from dominating the policy update.
Composite Objective: The final policy update combines the standard GRPO advantage with the TAGPO advantage to balance answer accuracy and tool efficiency.
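
The two ingredients described above, decayed stepwise rewards and per-tool normalization, can be sketched as below. This is not the paper's formula. The exact decay schedule, the zero reward for failed trajectories, the use of mean and standard deviation for normalization, and the names `step_rewards` and `grouped_advantages` are all assumptions of this sketch. The combination with the GRPO advantage is not shown.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def step_rewards(num_calls: int, success: bool, gamma: float = 0.9) -> List[float]:
    """One reward per tool call.

    Assumed schedule: in a successful trajectory the final call gets full
    credit and each earlier call is discounted by gamma per remaining step.
    Failed trajectories get a flat 0.0 here for simplicity.
    """
    if not success:
        return [0.0] * num_calls
    return [gamma ** (num_calls - 1 - t) for t in range(num_calls)]


def grouped_advantages(
    calls: List[Tuple[str, float]], eps: float = 1e-8
) -> List[float]:
    """Normalize each call's reward against other calls of the same tool type."""
    by_tool: Dict[str, List[float]] = defaultdict(list)
    for name, reward in calls:
        by_tool[name].append(reward)
    stats = {name: (mean(rs), pstdev(rs)) for name, rs in by_tool.items()}
    return [(r - stats[name][0]) / (stats[name][1] + eps) for name, r in calls]


calls = [("browse", 1.0), ("browse", 0.0), ("zoom", 0.5)]
advantages = grouped_advantages(calls)
```

Because each tool type is centered on its own mean, a tool that happens to receive uniformly high rewards does not crowd out the others in the update, which is the intuition behind grouped normalization.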

C. Sandbox-Based Trajectory Synthesis

To overcome the lack of high-quality training data for tool usage:

The authors developed a synthesis framework using an external strong MLLM (e.g., GLM-4.5V).
Process:
1. Filter video-text grounding data to identify questions requiring tools.
2. Predict plausible tool invocation orders.
3. Rewrite system prompts to diversify syntax.
4. Simulate trajectories in a sandbox to generate intermediate reasoning steps, tool calls, and environment feedback.
5. Adjudication: A judge MLLM filters trajectories for rationality and conciseness, creating a high-quality dataset for SFT cold-starting and RL pre-training.
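
The control flow of this pipeline can be sketched as a filter, plan, simulate, judge loop. The sketch is only a skeleton: each stage is passed in as a plain function, whereas in the described framework those stages are carried out by a strong MLLM and a judge MLLM. Step 3 (prompt rewriting) is omitted, and `Trajectory`, `synthesize`, and the toy lambdas are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    question: str
    tool_plan: List[str]
    steps: List[str] = field(default_factory=list)


def synthesize(
    questions: List[str],
    needs_tools: Callable[[str], bool],
    plan_tools: Callable[[str], List[str]],
    simulate: Callable[[Trajectory], List[str]],
    judge: Callable[[Trajectory], bool],
) -> List[Trajectory]:
    """Keep only trajectories that pass every stage."""
    kept: List[Trajectory] = []
    for question in questions:
        if not needs_tools(question):        # step 1: filter
            continue
        traj = Trajectory(question, plan_tools(question))  # step 2: plan
        traj.steps = simulate(traj)          # step 4: sandbox rollout
        if judge(traj):                      # step 5: adjudication
            kept.append(traj)
    return kept


# Toy stand-ins for the model-driven stages.
kept = synthesize(
    questions=["What is the video about?", "What color is the plate?"],
    needs_tools=lambda q: "plate" in q,
    plan_tools=lambda q: ["segment_retriever", "zoom_in"],
    simulate=lambda t: [f"call {name}" for name in t.tool_plan],
    judge=lambda t: len(t.steps) <= 3,
)
```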

3. Key Contributions

VideoTIR Framework: A novel multi-turn, multi-tool agent that leverages internal hierarchical tools (Global Browsing + Local Retrieval) to efficiently handle long videos without relying on heavy external pipelines.
TAGPO Algorithm: A new RL optimization method that explicitly mitigates tool misuse and overuse through stepwise reward assignment and grouped advantage estimation, significantly accelerating convergence.
Trajectory Synthesis Framework: A sandbox-based system that automatically generates high-quality, instruction-following tool-calling trajectories, enabling effective SFT cold-starting for RL agents.
Comprehensive Evaluation: Extensive experiments demonstrating state-of-the-art performance across short, medium, and long video benchmarks.

4. Experimental Results

The method was evaluated on three major benchmarks: MVBench, Video-MME, and LongVideoBench, using Qwen2.5-VL (3B and 7B) as the base model.

Performance: VideoTIR with TAGPO outperforms the base Qwen2.5-VL-7B and other SOTA methods (e.g., Video-R1, LongVT-RL), even with low-resolution input and only 8 frames, whereas competitors require high-resolution inputs.
- Example: On Video-MME, VideoTIR (7B, Low Res) achieved 65.1, surpassing the base model (63.1) and methods using high-res inputs.
Efficiency:
- TAGPO vs. GRPO: TAGPO accelerates learning, reaching a valid tool reward of 0.45 approximately 50% faster than standard GRPO.
- Tool Usage: The model learns to call tools only when necessary, reducing redundant calls while maintaining high accuracy.
Ablation Studies:
- SFT Impact: SFT with synthesized trajectories is crucial for enabling the 3B model to follow complex tool-calling formats (format quality jumped from ~1% to ~90%).
- Random Noising SFT: Surprisingly, training with random-noised visual inputs (masking vision) yielded better validation accuracy than standard SFT, suggesting robustness in instruction following.
- Router Effectiveness: The textual router correctly selects "Browsing" tools for global tasks and "Retriever" tools for local tasks in >90% of cases.

5. Significance

VideoTIR is a step toward making MLLMs viable for long-form video analysis. By shifting from static frame selection to dynamic, tool-integrated reasoning, it eases the "context window vs. detail" trade-off. TAGPO offers a potentially generalizable remedy for the tool misuse and overuse that are common in multi-tool RL agents. The synthesis framework, in turn, offers a scalable path to generating the high-quality training data that agentic video models require, reducing reliance on expensive human annotations.