VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting

Imagine you are trying to teach a very smart, but slightly distracted, student how to watch a movie and answer tricky questions about it.

The Problem: The "Daydreaming" Student
Current AI models are like students who have read a million books but haven't watched many movies. When you ask them, "What color was the car after the helicopter flew by?" they often guess based on what usually happens in movies (language bias) rather than actually looking at the video. They might say, "It was probably a red sports car," because that's a common trope, even if the video clearly showed a blue truck.

Other methods try to fix this by giving the student a magnifying glass or a highlighter pen (external tools) every time they get stuck. But this is slow, clunky, and requires the student to stop, grab the tool, use it, and put it back down for every single question.

The Solution: VISIONCOACH (The "Visual Coach")
The authors of this paper created a new training method called VISIONCOACH. Think of it as a personal coach who doesn't just watch the student, but actively helps them learn how to look during practice, so they don't need the coach during the actual test.

Here is how it works, broken down into three simple steps:

1. The "Spot the Trouble" Detector (Visual Prompt Selector)

Imagine the coach has a radar. When the student is answering an easy question (like "Is there a dog in the video?"), the coach lets them work alone. But when the question is hard (like "What specific brand of shoes is the runner wearing?"), the radar beeps.

The coach knows that for this specific hard question, the student needs help seeing the right thing. So, the coach picks a specific visual trick to help.

The Trick: Maybe the coach draws a red circle around the shoes. Maybe they darken the background so the shoes pop out. Maybe they put a number on the exact frame where the shoes appear.
The Goal: This is called a "Visual Prompt." It forces the student's attention to the exact evidence they need, suppressing the distractions.

2. The "Practice with a Coach" (Reinforcement Learning)

Now, the student tries to answer the hard question with the coach's visual hint (the red circle or darkened background).

Because the hint makes the answer obvious, the student gets it right and feels good (high reward).
The student realizes, "Oh! I needed to look at the shoes, not the sky!"
The coach then says, "Great job! Now, try to remember how you found that answer."

3. The "Internalize the Skill" (Self-Distillation)

This is the magic part. Usually, if you rely on a coach, you can't take the coach into the exam room. But VISIONCOACH uses a technique called Self-Distillation.

Think of it like this: The student practices with the coach's red circle. Once they get the answer right, they "memorize" the feeling of looking at the shoes. They internalize the lesson.

The Result: By the time the exam (inference) comes around, the student doesn't need the red circle anymore. They have learned how to look on their own. They can watch the raw video, ignore the distractions, and find the shoes instantly, just like they did during practice.

Why is this a big deal?

No More Clunky Tools: Previous methods required the AI to stop and use external tools (like cropping the video) for every hard question. VISIONCOACH teaches the AI to do this internally. It's like teaching a student to focus, rather than handing them a magnifying glass every time.
Better at "Where" and "When": The paper introduces a special "reward system" that checks not just if the answer is right, but if the AI correctly identified what object it was looking at and when it appeared. It's like grading the student not just on the final answer, but on their ability to point to the exact moment in the video.
Speed: Because the AI doesn't need to stop and use external tools during the test, it answers questions faster than tool-calling approaches.

The Analogy Summary

Old Way: The student guesses based on stories they've heard.
Tool-Based Way: The student stops, grabs a magnifying glass, looks, answers, puts the glass down. (Slow and annoying).
VISIONCOACH: The coach draws a circle on the practice paper to show the student where to look. The student practices this until they learn how to focus their eyes naturally. In the final exam, they look at the paper and instantly see the answer without needing the circle.

In short, VISIONCOACH teaches video AI models to become better observers by giving them targeted visual hints during training, so they can eventually "see" the truth on their own.

1. Problem Statement

Video reasoning requires models to not only understand semantic content but also to locate and track evidence across time and space (spatio-temporal grounding). Current approaches face three main limitations:

Hallucination: Text-centric models often generate explanations driven by language priors rather than visual evidence, leading to "hallucinated" reasoning.
Inference Overhead: Tool-calling approaches (e.g., invoking external cropping or zooming tools) improve grounding but introduce significant computational latency and require multi-stage processing during inference.
Data/Annotation Costs: Improving grounding typically relies on scaling training data with dense annotations or using heavy inference-time perception modules, which are expensive and inefficient.

The core challenge is to enable models to internalize robust spatio-temporal grounding behaviors during training so they can perform accurate reasoning on raw videos without needing external tools or prompts at inference time.

2. Methodology: VISIONCOACH

The authors propose VISIONCOACH, an input-adaptive Reinforcement Learning (RL) framework that uses visual prompting as a training-time "coach" to guide the model, followed by self-distillation to internalize these skills.

The framework consists of two main components, trained with a novel reward design:

A. Visual Prompt Selector (VP-SELECTOR)

Function: A lightweight module that predicts the most effective visual prompt type for a specific video-question pair.
Training: It is trained using a "proxy reasoner" pipeline. Multiple proxy models (e.g., GPT-4o, Gemini) generate reasoning trajectories with different visual prompts (e.g., red circles, darkening regions, frame numbering, attention maps). The prompt yielding the highest combined answer accuracy and grounding score is selected as the pseudo-label.
Operation: During RL training, VP-SELECTOR identifies "hard samples" (inputs where the model initially performs poorly) and dynamically selects an appropriate visual prompt to amplify relevant evidence and suppress distractors.
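
The pseudo-labeling step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `ProxyResult` record, the `select_pseudo_label` function, the additive combination of accuracy and grounding score, and the `grounding_weight` parameter are all assumptions made for the example, since the summary only says the prompt with the highest combined score is chosen.

```python
from dataclasses import dataclass


@dataclass
class ProxyResult:
    """One proxy-reasoner run on a video-question pair (hypothetical record)."""
    prompt_type: str        # e.g. "red_circle", "darken", "frame_number"
    answer_correct: bool    # whether the proxy reasoner answered correctly
    grounding_score: float  # grounding quality, assumed to lie in [0, 1]


def select_pseudo_label(results, grounding_weight=1.0):
    """Return the prompt type with the highest mean combined score.

    The combined score is assumed to be accuracy + weight * grounding;
    the exact combination used in the paper is not given in this summary.
    """
    totals, counts = {}, {}
    for r in results:
        score = float(r.answer_correct) + grounding_weight * r.grounding_score
        totals[r.prompt_type] = totals.get(r.prompt_type, 0.0) + score
        counts[r.prompt_type] = counts.get(r.prompt_type, 0) + 1
    if not totals:
        raise ValueError("no proxy results to select from")
    # Average over proxy models so each prompt type is compared fairly.
    return max(totals, key=lambda p: totals[p] / counts[p])
```

The returned prompt type would then serve as the training target for the lightweight selector on that video-question pair.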

B. Spatio-Temporal Reasoner (ST-REASONER)

Training Objective: The reasoner is optimized using Group Sequence Policy Optimization (GSPO).
Process:
1. Initial Rollout: The model generates reasoning trajectories on raw inputs.
2. Hard Sample Identification: Inputs with low initial rewards are flagged as "hard."
3. Visual Prompting: For hard samples, VP-SELECTOR generates a visual prompt. The model re-runs reasoning on this prompted input.
4. Reward Calculation: The model receives rewards based on answer accuracy, format, temporal grounding, and a novel object-aware spatial grounding reward.
5. Self-Distillation: If the prompted input yields higher rewards, the model uses self-distillation to learn from these improved trajectories. The loss function encourages the policy to mimic the high-reward reasoning generated with visual guidance, effectively "teaching" the model to find evidence without the prompt.
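
The five steps above can be sketched as a single data-collection routine. This is a hedged outline rather than the paper's implementation: `coach_step`, the callables `rollout`, `reward_fn`, `select_prompt`, and `apply_prompt`, the `hard_threshold` and `group_size` values, and the rule "keep prompted trajectories that beat the raw-input mean reward" are all hypothetical stand-ins. The GSPO policy update itself is omitted.

```python
from statistics import mean


def coach_step(sample, rollout, reward_fn, select_prompt, apply_prompt,
               hard_threshold=0.5, group_size=4):
    """Collect self-distillation targets for one sample (illustrative only).

    Returns (is_hard, targets), where targets are trajectories generated
    with a visual prompt that scored above the raw-input mean reward.
    """
    # 1. Initial rollout: sample a group of trajectories on the raw input.
    raw_trajs = [rollout(sample) for _ in range(group_size)]
    raw_mean = mean(reward_fn(sample, t) for t in raw_trajs)

    # 2. Hard sample identification: low mean reward marks the input as hard.
    if raw_mean >= hard_threshold:
        return False, []

    # 3. Visual prompting: re-run reasoning on the prompted input.
    prompted = apply_prompt(sample, select_prompt(sample))
    prompted_trajs = [rollout(prompted) for _ in range(group_size)]

    # 4. Reward calculation: score against the original sample's labels.
    scored = [(reward_fn(sample, t), t) for t in prompted_trajs]

    # 5. Self-distillation: keep only trajectories that improved on raw_mean.
    targets = [t for r, t in scored if r > raw_mean]
    return True, targets
```

In a full training loop, the returned targets would be paired with the raw (unprompted) input when computing the distillation loss, so that the policy learns to reproduce the guided reasoning without seeing the prompt.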

C. Novel Reward Design

The paper introduces an Object-Aware Spatial Grounding Reward ($r_{spa}$) to address the issue of single-box hallucinations:

Object Identity Consistency: It enforces that the predicted object name matches the ground truth (using soft matching like substring inclusion).
Multi-Region IoU: Unlike prior work that rewards only the single best bounding box, this reward averages the Intersection over Union (IoU) across all predicted bounding boxes that match the object identity and temporal constraints. This encourages the model to ground multiple objects accurately.
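
A minimal sketch of how such a reward could be computed is shown below. The data layout is an assumption made for illustration: predictions are `(name, frame, box)` triples, ground-truth boxes are keyed by integer frame index, and the temporal constraint is a `(start, end)` frame span. The function names `iou`, `names_match`, and `spatial_reward` are hypothetical, and the paper's exact matching rules may differ.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def names_match(pred_name, gt_name):
    """Soft identity match: case-insensitive substring inclusion either way."""
    p, g = pred_name.strip().lower(), gt_name.strip().lower()
    return bool(p) and bool(g) and (p in g or g in p)


def spatial_reward(predictions, gt_name, gt_boxes, gt_span):
    """Average IoU over predictions that pass identity and temporal checks.

    predictions: list of (name, frame_index, box) triples
    gt_boxes:    dict mapping frame_index -> ground-truth box
    gt_span:     (start_frame, end_frame), inclusive
    """
    ious = []
    for name, frame, box in predictions:
        if not names_match(name, gt_name):
            continue  # wrong object identity earns no spatial credit
        if not (gt_span[0] <= frame <= gt_span[1]) or frame not in gt_boxes:
            continue  # outside the annotated temporal window
        ious.append(iou(box, gt_boxes[frame]))
    return sum(ious) / len(ious) if ious else 0.0
```

Averaging over every qualifying box, rather than taking the maximum, means one well-placed box cannot mask several poorly placed ones.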

3. Key Contributions

Input-Adaptive RL Framework: A novel approach that uses visual prompting only during training to guide difficult samples, then uses self-distillation to remove the dependency on prompts during inference.
Object-Aware Spatial Reward: A new reward function that enforces object identity consistency and multi-region bounding box overlap, preventing object-agnostic hallucinations.
VP-Selector with Proxy Reasoning: A data construction pipeline using proxy reasoners to automatically learn which visual prompts are most effective for specific video-question pairs.
Efficient Inference: Unlike tool-calling methods, VISIONCOACH maintains a single forward-pass inference on raw videos, avoiding the overhead of tool invocation while achieving superior grounding.

4. Experimental Results

VISIONCOACH was evaluated across diverse benchmarks, demonstrating state-of-the-art (SoTA) performance:

V-STAR (Spatio-Temporal Reasoning):
- Outperformed proprietary models like GPT-4o and Gemini-2.0-Flash.
- Improved upon the strong open-source baseline Qwen2.5-VL-7B by +15.0% in mean Arithmetic Mean (mAM) and +25.1% in mean Logarithmic Geometric Mean (mLGM).
- Showed significant gains in "What," "When," and "Where" reasoning chains.
General Video Understanding:
- Consistently outperformed other open-source VideoLLMs (e.g., VideoR1, VideoRFT) on VideoMME, WorldSense, VideoMMMU, and PerceptionTest.
- Particularly strong in perception-oriented tasks (e.g., object recognition).
Temporal Grounding (Charades-STA):
- Achieved the highest Recall (R@0.3, R@0.5, R@0.7) and mean IoU among all compared methods, including specialized temporal grounding models.
Efficiency:
- Inference latency analysis showed VISIONCOACH is significantly faster than tool-calling baselines (e.g., EgoR1, LongVT-RL) because it avoids iterative tool invocation.
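
For readers unfamiliar with the temporal grounding metrics reported for Charades-STA, the sketch below shows how R@m and mean IoU are conventionally computed: R@m is the fraction of queries whose predicted segment overlaps the ground truth with temporal IoU of at least m. The function names and the one-prediction-per-query setup are assumptions for illustration, and this is not the benchmark's official evaluation script.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) time segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Return ({threshold: recall}, mean_iou) for paired segments."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    if not ious:
        raise ValueError("no predictions to evaluate")
    n = len(ious)
    # R@m: share of queries whose IoU reaches the threshold m.
    recall = {m: sum(v >= m for v in ious) / n for m in thresholds}
    return recall, sum(ious) / n
```

Higher thresholds demand tighter localization, which is why R@0.7 is typically the hardest of the three to improve.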

5. Significance and Impact

Bridging the Gap: VISIONCOACH sits between "text-centric" reasoning (prone to hallucination) and "tool-based" reasoning (prone to latency), achieving the grounding benefits of tools without the inference cost.
Internalization of Perception: The work demonstrates that complex perception behaviors (like focusing on specific regions or times) can be internalized by the model through training-time guidance and self-distillation, rather than requiring external modules at test time.
Scalability: By avoiding the need for massive amounts of new annotated data or heavy inference-time tools, this framework offers a scalable path to improving grounded reasoning in large multimodal models.

In conclusion, VISIONCOACH establishes a new paradigm for video reasoning where the model learns to "see" better during training, allowing it to reason accurately and efficiently on raw videos at inference.