Think with Grounding: Curriculum Reinforced Reasoning with Video Grounding for Long Video Understanding

🎬 The Big Problem: The "Needle in a Haystack"

Imagine you are given a 3-hour movie and asked a very specific question, like: "What color was the pneumatic air gun the man was holding at the 25-minute mark?"

Current AI models (Video LLMs) try to solve this by reasoning in text while looking at only a few blurry snapshots sampled from the whole film. Because the movie is so long, the AI gets overwhelmed. It guesses based on the general vibe, or it gets confused by all the irrelevant scenes. This leads to hallucinations: the AI confidently says, "It was orange!" when it was actually blue, simply because it didn't look closely enough at the right moment.

🛠️ The Solution: Video-TwG (The "Detective" Approach)

The authors propose a new system called Video-TwG. Instead of passively staring at the whole movie, this AI acts like a smart detective.

The "Think" Phase: The AI reads the question and thinks, "Hmm, I need to find a specific tool. The wide shots are too blurry to see the color."
The "Grounding" Phase: Instead of guessing, the AI says, "Wait, let me zoom in." It actively selects a specific 4-second clip (the "grounding" action) where the tool appears and looks at it in high definition.
The Answer: Now that it has the high-definition evidence, it answers correctly: "It's blue."

The Analogy:

Old AI: Like a student taking a test who tries to memorize the entire textbook cover-to-cover but forgets the specific page number, so they guess the answer.
Video-TwG: Like a student who knows they don't know the answer, so they open the book, use the index to find the exact page, read the specific paragraph, and then write the answer.

🎓 How They Taught the AI: The "Curriculum" Strategy

You can't just throw a complex detective task at a beginner AI. The authors used a Two-Stage Curriculum (like a school system):

Stage 1: Kindergarten (Short Videos):
They started by training the AI on short, 20-second clips where the "answer" and the "clue" were clearly marked. It was easy. The AI learned the basic habit: "When I'm unsure, I should zoom in."
Stage 2: University (Long, Complex Videos):
Once the AI mastered the habit, they gave it thousands of long, messy videos (movies, news, vlogs) where the clues weren't labeled. The AI had to figure out when to zoom in on its own. This taught it to generalize and handle real-world complexity.

🏆 The Secret Sauce: The "Self-Confidence" Reward System

To teach the AI to be smart about when to zoom in, they invented a special scoring system (an algorithm called TwG-GRPO).

Imagine the AI is playing a video game where it earns points.

The Trap: If the AI zooms in too much, it wastes time and energy. If it zooms in too little, it gets the answer wrong.
The "Self-Confirmed" Trick: The AI is given a challenge: "You just zoomed in on this clip. Can you answer the question using only this zoomed-in clip?"
- If the AI can answer correctly using only the zoomed-in part, it gets a bonus point.
- If it can't, it gets a penalty.

This teaches the AI to be efficient. It learns: "I shouldn't zoom in unless I'm sure that specific clip will actually help me solve the puzzle." This stops the AI from wasting time looking at irrelevant scenes.

📊 The Results: Why It Matters

When they tested Video-TwG on famous benchmarks (like Video-MME and LongVideoBench):

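It beat the competition: On Video-MME in the low-resolution setting it reached 53.6% accuracy, 7.0 points above the model it was built on (Qwen2.5-VL-7B), and it also outperformed other strong video AI systems.
It cut waste: The self-confirmed reward reduced redundant zoom-ins by 15.7% while keeping or improving accuracy, so the AI spends its high-definition budget only where it helps.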
It reduced mistakes: It stopped "making things up" (hallucinating) because it actually looked at the evidence before answering.

💡 The Takeaway

Video-TwG changes the game by teaching AI to pause, think, and zoom in only when it's truly necessary. It turns a passive observer into an active investigator, making it much better at understanding long, complex videos without getting lost in the details.

1. Problem Statement

Long Video Understanding (LVU) is a critical yet challenging task for Video Large Language Models (Video-LLMs). The core difficulties include:

Temporal Redundancy: Long videos contain vast amounts of irrelevant information, making it difficult to locate sparse, crucial clues needed to answer specific questions.
Hallucinations in Fixed Context: Existing reasoning methods often rely on text-only reasoning based on a fixed, statically sampled video context. Due to limited context lengths, these models frequently miss crucial details, leading to hallucinations where they confidently generate incorrect answers based on incomplete information.
Inefficiency: Current approaches either process the entire video (computationally expensive) or rely on complex, multi-stage agent pipelines that suffer from error accumulation.

The paper argues that models need an active, on-demand mechanism to dynamically retrieve and zoom into relevant video segments during the reasoning process, rather than passively relying on a pre-sampled context.

2. Methodology: Video-TwG Framework

The authors propose Video-TwG, a curriculum-reinforced framework that enables Video-LLMs to perform "Think-with-Grounding." This paradigm allows the model to iteratively decide when to perform grounding actions (zooming into specific video clips) during a multi-turn reasoning process.

A. Core Paradigm: Think-with-Grounding

Instead of a single-pass inference, the model engages in a multi-turn dialogue:

Thinking ($t$): The model analyzes the current context.
Action: It decides to either:
- Ground ($g$): Select a specific start and end frame to "zoom in" on a relevant clip. This clip is then processed with fine-grained representation (higher token density) and added to the context.
- Answer ($a$): Provide the final answer if sufficient information is gathered.
Representation: The system uses multi-grained video representation:
- Initial Input: Coarse-grained (fewer frames, fewer tokens) to capture global context.
- Grounded Clips: Fine-grained (more frames, more tokens) to capture detailed evidence.
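
The loop described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Ground` and `Answer` types, the `policy` and `encode_clip` callables, and the `max_turns` cap are hypothetical names introduced here, and the context is modeled as a plain list of strings rather than visual tokens.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Ground:
    start: int  # first frame of the clip to zoom into (hypothetical field)
    end: int    # last frame of the clip (hypothetical field)


@dataclass
class Answer:
    text: str


Action = Union[Ground, Answer]


def think_with_grounding(
    question: str,
    coarse_context: List[str],
    policy: Callable[[str, List[str]], Action],
    encode_clip: Callable[[int, int], str],
    max_turns: int = 4,  # assumed turn budget, not taken from the paper
) -> str:
    """Multi-turn loop: keep grounding clips until the policy answers."""
    context = list(coarse_context)  # start from the coarse, global view
    for _ in range(max_turns):
        action = policy(question, context)
        if isinstance(action, Answer):
            return action.text
        # Grounding: append a fine-grained encoding of the chosen clip.
        context.append(encode_clip(action.start, action.end))
    # Turn budget exhausted: request an answer from what was gathered.
    final = policy(question, context)
    return final.text if isinstance(final, Answer) else ""


# Toy stand-ins for the model and the fine-grained encoder.
def toy_policy(question: str, context: List[str]) -> Action:
    if any(item.startswith("fine:") for item in context):
        return Answer("blue")
    return Ground(start=1500, end=1600)


def toy_encoder(start: int, end: int) -> str:
    return f"fine:{start}-{end}"


result = think_with_grounding(
    "What color is the tool?", ["coarse:whole-video"], toy_policy, toy_encoder
)
print(result)  # blue
```

The toy policy grounds exactly once and then answers, which mirrors the "zoom in, then answer" pattern; a real policy would be the Video-LLM itself emitting either action at each turn.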

B. TwG-51K Dataset

To train this framework, the authors constructed TwG-51K, a dataset containing 50,744 video QA samples:

Labeled Subset (8,195 samples): Derived from NExT-GQA and CG-Bench, containing explicit ground-truth video grounding annotations (start/end timestamps).
Unlabeled Subset (42,549 samples): Derived from LLaVA-Video-178K, containing general video QA data without grounding labels, used to improve generalization.
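
As a quick check on the composition, the two subset sizes can be summed and expressed as shares of the whole. The snippet uses only the counts stated above; the variable names are illustrative.

```python
labeled = 8_195     # samples with start/end grounding annotations
unlabeled = 42_549  # general video QA samples without grounding labels
total = labeled + unlabeled

labeled_share = labeled / total
unlabeled_share = unlabeled / total

print(total)                     # 50744
print(f"{labeled_share:.1%}")    # 16.1%
print(f"{unlabeled_share:.1%}")  # 83.9%
```

Roughly one sample in six carries grounding annotations, which is why a reward that works without labels matters for the bulk of the data.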

C. Two-Stage Reinforced Curriculum Strategy

To address the difficulty of training a model to perform grounding from scratch, a two-stage curriculum is employed:

Stage 1 (Cold Start): The model is trained on the labeled short-video GQA data. This teaches the model the fundamental behavior of "thinking with grounding" using explicit supervision.
Stage 2 (Generalization): The model is trained on the entire TwG-51K dataset (mostly unlabeled). This encourages the model to generalize the grounding ability to diverse domains and video lengths without explicit grounding labels.
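
One way to picture the curriculum is as a data-selection rule per stage. The sketch below is a simplification under stated assumptions: the `Sample` type, `stage_data`, and `grounding_signal` are hypothetical names, and Stage 1 is approximated as "every sample that has an annotated span", whereas the paper restricts it to labeled short-video GQA data.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Sample:
    question: str
    # (start_s, end_s) of the annotated evidence clip; None if unlabeled.
    span: Optional[Tuple[float, float]] = None


def stage_data(samples: List[Sample], stage: int) -> List[Sample]:
    """Stage 1: labeled samples only. Stage 2: the full mixed pool."""
    if stage == 1:
        return [s for s in samples if s.span is not None]
    if stage == 2:
        return list(samples)
    raise ValueError(f"unknown stage: {stage}")


def grounding_signal(sample: Sample) -> str:
    """Choose the grounding supervision that is available for a sample."""
    return "annotated_span" if sample.span is not None else "self_confirmed"


pool = [Sample("q1", (2.0, 6.0)), Sample("q2"), Sample("q3")]
stage1 = stage_data(pool, 1)
stage2 = stage_data(pool, 2)
print(len(stage1), len(stage2))                # 1 3
print([grounding_signal(s) for s in stage2])
```

In Stage 2 the labeled and unlabeled samples are mixed, so each sample is routed to whichever grounding signal it supports.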

D. TwG-GRPO Algorithm

The training utilizes a modified Group Relative Policy Optimization (GRPO) algorithm with specific reward mechanisms to handle the complexity of reasoning and grounding:

Accuracy & Format Rewards: Standard rewards for correct answers and adherence to the output format.
Fine-grained Grounding Reward ($R_{grounding}$): Applied to labeled data. It combines a soft reward (IoU-based) and a hard reward (binary success) to guide the model toward the correct clip.
Self-Confirmed Pseudo Reward ($R_{pseudo}$): Crucial for unlabeled data. Since ground truth grounding is unavailable, the model acts as its own judge. It attempts to answer the question solely using the last grounded clip. If it succeeds, the grounding is rewarded; if it fails, a penalty is applied. This encourages high-quality, necessary grounding.
Accuracy-Gated Mechanism: Grounding-related rewards are only applied if the final answer is correct. This prevents the model from optimizing for grounding actions that lead to wrong answers, resolving the conflict between accuracy and grounding quality.
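
The reward components listed above can be combined as in the following sketch. Several details are assumptions made for illustration and are not taken from the paper: every component has unit weight, the hard reward fires when temporal IoU reaches a threshold of 0.5, and the pseudo reward is +1 or -1. The final function shows the group-relative normalization used by standard GRPO (each rollout's reward is standardized against the other rollouts for the same question); how TwG-GRPO modifies that step is not covered here.

```python
from statistics import mean, pstdev
from typing import List, Optional, Tuple

Span = Tuple[float, float]  # (start_s, end_s)


def temporal_iou(pred: Span, gold: Span) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred: Span, gold: Span, thresh: float = 0.5) -> float:
    """Soft IoU term plus a hard binary term (threshold is an assumption)."""
    iou = temporal_iou(pred, gold)
    return iou + (1.0 if iou >= thresh else 0.0)


def pseudo_reward(clip_only_correct: bool) -> float:
    """Bonus if the last grounded clip alone suffices, else a penalty."""
    return 1.0 if clip_only_correct else -1.0  # assumed magnitudes


def total_reward(
    answer_correct: bool,
    format_ok: bool,
    pred: Optional[Span] = None,
    gold: Optional[Span] = None,
    clip_only_correct: Optional[bool] = None,
) -> float:
    reward = float(answer_correct) + float(format_ok)
    if not answer_correct:
        return reward  # accuracy gate: no grounding-related reward
    if pred is not None and gold is not None:
        reward += grounding_reward(pred, gold)      # labeled sample
    elif clip_only_correct is not None:
        reward += pseudo_reward(clip_only_correct)  # unlabeled sample
    return reward


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of rollouts (standard GRPO)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


print(total_reward(True, True, pred=(10, 14), gold=(10, 14)))   # 4.0
print(total_reward(False, True, pred=(10, 14), gold=(10, 14)))  # 1.0
print(total_reward(True, True, clip_only_correct=False))        # 1.0
```

The second call shows the gate at work: a perfectly grounded clip earns nothing extra when the final answer is wrong, so the model cannot collect grounding reward at the expense of accuracy.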

3. Key Contributions

Video-TwG Framework: A novel paradigm enabling Video-LLMs to actively decide when to ground video clips during reasoning, effectively mimicking a Retrieval-Augmented Generation (RAG) approach for video.
Two-Stage Curriculum & TwG-GRPO: A training strategy that starts with labeled short videos and scales to unlabeled general data, supported by a reward algorithm featuring self-confirmed pseudo-rewards and an accuracy-gated mechanism.
TwG-51K Dataset: A new training dataset combining labeled grounding data and large-scale unlabeled QA data to facilitate this training paradigm.
State-of-the-Art Performance: Demonstrated consistent improvements over strong baselines across multiple benchmarks without relying on expensive supervised reasoning traces.

4. Experimental Results

The model was evaluated on three mainstream benchmarks: Video-MME, LongVideoBench, and MLVU.

Performance: Video-TwG consistently outperformed strong baselines, including general Video-LLMs, long-context models, and existing video reasoning/agent systems.
- Low Resolution: On Video-MME, Video-TwG achieved 53.6% overall accuracy, 7.0 percentage points above its backbone (Qwen2.5-VL-7B).
- High Resolution: On Video-MME, it achieved 59.7%, 2.5 percentage points above the backbone.
Efficiency: The model reduced redundant grounding actions by 15.7% (via the Self-Confirmed Pseudo Reward) while maintaining or improving QA performance.
Ablation Studies:
- Removing the two-stage curriculum caused significant performance drops, supporting the need for the cold-start phase.
- Text-only reasoning (without grounding) performed significantly worse, especially on long videos, confirming the need for dynamic visual retrieval.
- Simply increasing input context length (scaling) was less effective than intelligent, dynamic grounding.
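
The Video-MME figures above also imply the backbone's own scores. The short calculation below assumes the reported gains are absolute differences in percentage points (an assumption; the summary does not say so explicitly) and derives the relative gain from them. The dictionary keys are illustrative labels.

```python
# Reported Video-TwG accuracy (%) and gain over the backbone, per setting.
reported = {
    "Video-MME (low resolution)": (53.6, 7.0),
    "Video-MME (high resolution)": (59.7, 2.5),
}

# Backbone accuracy implied by subtracting the gain from the reported score.
implied_backbone = {
    setting: round(score - gain, 1)
    for setting, (score, gain) in reported.items()
}

# Gain expressed relative to the implied backbone accuracy.
relative_gain = {
    setting: round(100 * gain / implied_backbone[setting], 1)
    for setting, (score, gain) in reported.items()
}

for setting in reported:
    print(setting, implied_backbone[setting], relative_gain[setting])
# Video-MME (low resolution) 46.6 15.0
# Video-MME (high resolution) 57.2 4.4
```

Under that assumption, the improvement is much larger in the low-resolution setting, where the coarse input leaves more detail to be recovered by grounding.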

5. Significance

This work addresses a fundamental limitation in current Video-LLMs: the inability to handle the "needle-in-a-haystack" problem in long videos effectively.

Paradigm Shift: It moves away from passive, fixed-context reasoning to active, iterative perception, where the model dynamically seeks evidence.
Scalability: By utilizing self-confirmed pseudo-rewards, the method can leverage vast amounts of unlabeled data, making it highly scalable and cost-effective compared to methods requiring heavy manual annotation of reasoning traces.
Robustness: The approach reduces hallucinations by training the model to verify crucial details through fine-grained grounding before committing to an answer, making it more reliable for real-world applications involving long-form content.