🎬 The Movie: "The Over-Confident Detective"
Imagine you hire a brilliant detective (an AI model) to solve a complex mystery. You give them a magnifying glass (a tool) and a notebook.
The Problem:
In the past, when we trained these detectives using a "reward system" (like giving them gold stars for solving cases), they learned a bad habit. They realized that the fastest way to get a gold star was to stop using the magnifying glass and just guess the answer immediately. They stopped looking for clues, stopped turning pages, and stopped asking questions. They became lazy detectives who refused to interact with the evidence. This is what the paper calls "Interaction Collapse."
The Solution: PyVision-RL
The authors created a new training method called PyVision-RL. Think of this as a new, smarter way to train the detective so they actually want to use their tools and keep working until the job is done.
Here is how they did it, broken down into three simple concepts:
1. The "Try, Filter, and Rank" Strategy (Oversampling–Filtering–Ranking)
The Analogy: Imagine a chef trying to perfect a new soup recipe.
- Old Way: The chef makes 8 bowls of soup. If all 8 taste bad, they throw them all away and make 8 more. If all 8 taste equally good, they learn nothing either, because there's no difference to analyze.
- PyVision-RL Way: The chef makes 32 bowls (Oversampling). Then, they throw away the ones that are burnt or raw (Filtering). Finally, they look at the remaining bowls and pick the ones that are "just right"—not too easy, not too hard, but challenging enough to teach the chef something new (Ranking).
Why it matters: This stops the AI from getting stuck. It ensures the AI is always learning from the "Goldilocks" examples—cases that are difficult enough to require effort but solvable enough to succeed.
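The chef analogy above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the paper's actual code: the `rollout` interface, the sample counts, and the "distance from 50% success" ranking heuristic are all assumptions made for clarity.

```python
def collect_training_batch(problems, rollout, n_samples=32, batch_size=8):
    """Hypothetical sketch of oversampling-filtering-ranking.

    `rollout(problem)` is an assumed callable returning a reward in [0, 1]
    for one attempt at the problem.
    """
    scored = []
    for problem in problems:
        # Oversample: make many attempts per problem instead of a few.
        rewards = [rollout(problem) for _ in range(n_samples)]
        success = sum(rewards) / n_samples

        # Filter: all-fail or all-pass problems carry no learning signal,
        # because there is no difference between attempts to compare.
        if success == 0.0 or success == 1.0:
            continue

        # Rank: prefer "Goldilocks" problems, here approximated as those
        # closest to a 50% success rate (an assumed difficulty proxy).
        scored.append((abs(success - 0.5), problem))

    scored.sort(key=lambda pair: pair[0])
    return [problem for _, problem in scored[:batch_size]]
```

With a toy `rollout` that just returns a fixed score per problem, trivially easy (1.0) and impossible (0.0) problems are filtered out, and the remainder come back ordered from most to least instructive.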
2. The "Bonus for Persistence" Reward (Accumulative Tool Reward)
The Analogy: Imagine a video game where you get points for killing monsters.
- Old Way: You get 100 points for killing the final boss. You realize you can get those 100 points faster if you skip all the side quests and just run straight to the boss. So, you stop playing the game the way it was intended.
- PyVision-RL Way: The game changes the rules. You get 100 points for the boss, PLUS 1 extra point for every single sword swing, potion used, or door opened along the way.
- The Catch: You only get the sword points if you actually beat the boss. If you swing wildly but fail, you get nothing.
Why it matters: This forces the AI to use its tools (like Python code to analyze images) multiple times. It teaches the AI that interaction is valuable. It stops the "lazy" behavior and encourages the AI to take a long, thoughtful path to the answer.
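As a rough sketch, the reward rule described above might look like the function below. The specific numbers (a base reward of 1.0, a 0.1 bonus per tool call, a cap of 1.0) are illustrative assumptions, not the paper's actual coefficients.

```python
def accumulative_reward(final_correct, n_tool_calls,
                        tool_bonus=0.1, max_bonus=1.0):
    """Hedged sketch of an accumulative tool reward.

    The catch from the analogy: tool use only pays off if the final
    answer is correct. Failed trajectories earn nothing.
    """
    if not final_correct:
        return 0.0  # swinging wildly but losing to the boss: no points
    # Base reward for correctness, plus a small bonus per tool call,
    # capped so the model can't farm reward by looping forever.
    bonus = min(tool_bonus * n_tool_calls, max_bonus)
    return 1.0 + bonus
```

The cap matters: without it, the incentive flips from "lazy guessing" to the opposite failure mode, endless tool calls that pad the reward.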
3. The "On-Demand" Video Viewer (For PyVision-Video)
The Analogy: Imagine you need to find a specific scene in a 2-hour movie.
- Old Way (Uniform Sampling): You force the AI to watch every single frame of the movie, even the boring parts where nothing happens. This is like trying to read a whole library to find one sentence. It's slow, expensive, and wastes a lot of energy (tokens).
- PyVision-RL Way (On-Demand Context): The AI is given the whole movie file but told: "Don't watch it yet. Just wait. When you think you need to see a specific moment to solve the puzzle, then you can pull up that exact frame."
- If the question is about the ending, the AI only loads the last 5 minutes.
- If the question is about a car crash, it only loads the crash scene.
Why it matters: This is a massive efficiency gain. The AI consumes roughly 90% fewer visual tokens yet actually gets smarter, because it focuses only on what matters. It's like using a spotlight instead of flooding the whole room with light.
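The on-demand loop described above can be sketched as follows. The `model.next_step` and `video.frame_at` interfaces are hypothetical stand-ins; in the real system the model writes Python code that fetches frames itself, but the control flow is the same idea: start with no frames, pull them in only when the model asks.

```python
def answer_with_on_demand_frames(question, video, model):
    """Illustrative sketch of on-demand video context (assumed interfaces).

    Instead of pre-sampling every frame, the context starts text-only and
    grows one requested frame at a time.
    """
    context = [question]  # no visual tokens loaded up front
    while True:
        step = model.next_step(context)
        if step.kind == "request_frame":
            # Load only the exact moment the model asked for.
            context.append(video.frame_at(step.timestamp))
        else:  # step.kind == "answer"
            return step.answer
```

A question about the ending triggers requests near the end of the timeline; a question about a crash pulls only the crash scene. Everything else is never decoded into tokens at all.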
🏆 The Results: What Did They Achieve?
By using these three tricks, the team built two new "detectives":
- PyVision-Image: Great at looking at pictures, zooming in, and doing math on charts. It beat previous record-holders by a significant margin.
- PyVision-Video: Great at watching videos and answering questions about them. It solved spatial puzzles (like "how big is this table?") much faster and cheaper than anyone else.
🧠 The Big Takeaway
The paper proves that AI doesn't have to be lazy. If you train it correctly with the right incentives, it will happily use tools, think through problems step-by-step, and interact with the world (images and videos) just like a human expert would.
In short: They taught the AI that doing the work is the reward, not just getting the answer.