Original authors: Chaoyang Wang, Zeyu Zhang, Meng Meng, Xu Zhou, Haiyun Jiang

Published 2026-05-07

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to teach a very smart student (the Policy Model) how to solve complex visual puzzles, like counting objects in a 3D scene or figuring out how shapes move.

The Problem: The "Echo Chamber" Trap

Traditionally, when we train these students using a method called Reinforcement Learning (RL), we ask them to generate a bunch of answers on their own. We then reward the best ones and punish the bad ones.

The paper argues that this is like asking a student to study for a test by only reading their own notes. They might get better at repeating what they already know, but they hit a "ceiling." They can't discover new, clever ways to solve problems because they are stuck in an echo chamber, only sampling ideas from their own limited brain. This makes training slow and limits how smart they can eventually become.

The Solution: Vision-EKIPL (The "Expert Panel" Approach)

The authors propose a new framework called Vision-EKIPL. Think of this as hiring a panel of external experts (other AI models such as GPT-4o or Gemini-1.5-Pro) to help the student study.

Here is how it works, step-by-step:

1. The Group Study Session: Instead of just the student generating answers, the system gathers a group of potential answers from two sources:
- The student's own attempts.
- High-quality answers generated by the "external experts" (the auxiliary models).
2. The Grading: A teacher (the reward function) grades all these answers. Did the student get the right number of spheres? Did they use the correct format?
3. The Selection: The system picks the top-performing answers from this mixed group. Crucially, if an expert model found a clever solution the student missed, that solution is included in the "top picks."
4. The Lesson: The student learns from this "best of the best" group. They aren't just learning from their own mistakes; they are being infused with the knowledge and reasoning strategies of the experts.

The Analogy: The Chess Player

Imagine a chess player trying to get better.

Old Way (Standard RL): The player plays 100 games against themselves, wins a few, and tries to figure out what they did right. They might get stuck playing the same safe, boring moves because they never saw a brilliant, risky move that could have won.
Vision-EKIPL: The player plays 10 games against themselves, but also gets to see 10 games played by Grandmasters. The system picks the best moves from both the player and the Grandmasters. The player then studies these top moves. They learn not just their own style, but the brilliant, complex strategies of the experts, pushing their own skill ceiling much higher.

What the Paper Found

The authors tested this method on visual reasoning tasks (like counting objects, understanding geometry, and tracking moving shapes). They found:

Smarter Results: The model trained with Vision-EKIPL solved puzzles better than models trained with standard methods, beating the previous "state-of-the-art" by up to 5%.
Faster Learning: Because the model gets to "see" good answers from experts right away, it doesn't have to waste time guessing. It learns faster and converges (finishes training) more efficiently.
Breaking the Ceiling: The most important finding is that this method actually expanded the model's reasoning boundary. While standard RL methods sometimes made models less creative (sticking only to safe, known paths), Vision-EKIPL helped the model discover new, complex ways to solve problems that it couldn't find on its own.

In a Nutshell

Vision-EKIPL is a training method that stops AI models from learning in a vacuum. By mixing their own ideas with high-quality ideas from "expert" AI models, it teaches them to think deeper, solve harder visual puzzles, and learn much faster.

Technical Summary: Vision-EKIPL

Problem Statement

Visual reasoning is a critical cognitive capability for Artificial General Intelligence, essential for applications ranging from autonomous navigation to scene understanding. While Multimodal Large Language Models (MLLMs) have advanced through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), current RL-based approaches (e.g., Group Relative Policy Optimization or GRPO) face significant limitations. Specifically, these methods sample action groups solely from the policy model itself. This self-referential sampling restricts the model's exploration space to its own inherent capabilities, potentially leading to a "reasoning boundary" that does not expand beyond the pre-trained foundation model. Furthermore, existing RL methods often suffer from slow convergence rates and low training efficiency, as they rely on the model to discover complex reasoning paths without external guidance.

Methodology: Vision-EKIPL

To address these limitations, the authors propose Vision-EKIPL (External Knowledge-Infused Policy Learning), a novel RL framework that integrates high-quality actions from external auxiliary models into the policy optimization process.

Core Framework

The framework operates on the principle of infusing external knowledge during the RL training loop to guide the policy model ($\pi_\theta$). The process involves the following steps:

1. Multi-Source Action Sampling:
Unlike traditional GRPO, which samples actions only from the current policy, Vision-EKIPL samples a combined action group $O$ from:
- The current policy model $\pi_\theta$.
- $M$ external auxiliary models (e.g., GPT-4o, Gemini-1.5-Pro), denoted as $\pi_{\phi_j}$.
For a given state $s = (x, q)$ (visual encoding and question), the system samples $N$ actions from the policy and $N$ actions from each of the $M$ auxiliary models, so the combined pool holds $N(M+1)$ candidate actions.
2. Reward Calculation:
All sampled actions are evaluated using a reward function $R(o_i)$ comprising two components:
- Format Reward ($R_{format}$): Ensures the output adheres to a structured format (e.g., reasoning within <thought> tags and answers within <answer> tags).
- Accuracy Reward ($R_{acc}$): Assigns a positive reward if the final answer is correct, otherwise zero.
3. High-Quality Action Selection:
The combined pool of actions is sorted by reward value. The top-$G$ actions are selected to form a high-quality action group $T$. This group serves as the "expert" knowledge base for the current training step.
4. Policy Optimization (GRPO):
The policy model is updated using the Group Relative Policy Optimization (GRPO) algorithm. The advantage score $A_i$ for each action in the selected group is computed relative to the mean and standard deviation of the group's rewards. The policy parameters $\theta$ are updated to maximize the expected advantage, subject to a KL divergence penalty that keeps the policy close to a reference distribution.
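The reward, selection, and advantage steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The `Action` class, the function names, the source labels `aux_model_a` and `aux_model_b`, and the 0/1 reward values are hypothetical. The tag names follow the summary above and may differ from the paper's exact template. Requiring a parsable `<answer>` tag before granting the accuracy reward is a further simplification. The actual policy update (the GRPO loss with its KL penalty) is omitted.

```python
import re
import statistics
from dataclasses import dataclass


@dataclass
class Action:
    source: str  # "policy" or a hypothetical auxiliary-model label
    text: str    # raw model output


# Assumed output template: reasoning in <thought>, final answer in <answer>.
FORMAT_RE = re.compile(
    r"^<thought>.*</thought>\s*<answer>(.*)</answer>$", re.DOTALL
)


def reward(action, gold_answer):
    """Format reward plus accuracy reward (each 0 or 1 here, an assumption)."""
    match = FORMAT_RE.match(action.text.strip())
    r_format = 1.0 if match else 0.0
    # Assumption: accuracy is only checked when the answer tag can be parsed.
    r_acc = 1.0 if match and match.group(1).strip() == gold_answer else 0.0
    return r_format + r_acc


def select_top_g(actions, rewards, g):
    """Sort the combined pool by reward and keep the top-G actions.

    Python's sort is stable, so ties keep their original pool order;
    that tie-breaking rule is an assumption of this sketch.
    """
    order = sorted(range(len(actions)), key=lambda i: rewards[i], reverse=True)
    keep = order[:g]
    return [actions[i] for i in keep], [rewards[i] for i in keep]


def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each action relative to the group's mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy pool: two policy samples and one sample from each of two auxiliary models.
pool = [
    Action("policy", "<thought>I see two cubes.</thought><answer>2</answer>"),
    Action("policy", "The answer is 3"),
    Action("aux_model_a", "<thought>Two cubes, one sphere.</thought><answer>3</answer>"),
    Action("aux_model_b", "<thought>Three objects.</thought><answer>3</answer>"),
]
rewards = [reward(a, "3") for a in pool]
selected, selected_rewards = select_top_g(pool, rewards, g=3)
advantages = group_relative_advantages(selected_rewards)

for action, adv in zip(selected, advantages):
    print(f"{action.source:12s} advantage={adv:+.3f}")
```

In this toy pool the two auxiliary answers receive positive advantages and the well-formatted but wrong policy answer receives a negative one. If every selected action had the same reward, all advantages in this sketch would be zero and the group would carry no learning signal.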

Dynamic Evolution

The paper notes a dynamic shift in the source of actions during training. Initially, the policy model relies heavily on actions from external models because its own reasoning capabilities are weaker. As training progresses and the policy model improves, the proportion of actions selected from the policy model itself increases, allowing it to leverage its own learned strategies while still benefiting from the initial external guidance.
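One way to observe this shift during training is to log, at each step, what fraction of the selected group came from each source. The sketch below assumes each selected action carries a source label. The function name, the labels, and the two example steps are made up for illustration and are not data from the paper.

```python
from collections import Counter


def source_shares(selected_sources):
    """Fraction of the selected group contributed by each source."""
    total = len(selected_sources)
    if total == 0:
        return {}
    counts = Counter(selected_sources)
    return {source: n / total for source, n in counts.items()}


# Illustrative labels only: an early step dominated by auxiliary models,
# and a later step where most selected actions come from the policy.
early_step = ["aux_model_a", "aux_model_b", "aux_model_a", "policy"]
late_step = ["policy", "policy", "aux_model_a", "policy"]

early = source_shares(early_step)
late = source_shares(late_step)
print("policy share early:", early["policy"])
print("policy share late: ", late["policy"])
```

Plotting the policy share against the training step would make the trend described above visible as a rising curve.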

Key Contributions

Novel RL Framework: The proposal of Vision-EKIPL, which the authors describe as the first method to leverage high-quality actions from external auxiliary models to guide policy-model optimization in visual reasoning.
Expanded Reasoning Boundary: Demonstration that incorporating external actions effectively broadens the policy model's action exploration space, allowing it to discover reasoning paths that a single policy model might overlook, thereby expanding the reasoning boundary.
Enhanced Efficiency and Performance: Validation through extensive experiments that the framework significantly accelerates training convergence and improves overall reasoning performance compared to state-of-the-art baselines.

Experimental Results

The authors evaluated Vision-EKIPL on the Reason-RFT-CoT Benchmark, covering three task categories: Visual Counting, Structure Perception, and Spatial Transformation.

Performance Gains: Vision-EKIPL achieved up to a 5% performance improvement over the state-of-the-art (SOTA) methods on the benchmark.
In-Domain (ID) Results:
- On Visual Counting, RL-based methods (including Vision-EKIPL) consistently outperformed SFT-based methods and baseline models across 2B and 7B parameter scales.
- On Spatial Transformation, Vision-EKIPL achieved the highest performance among the compared models; on specific out-of-domain generalization metrics it also surpassed proprietary models such as GPT-4o and Gemini-1.5-Pro.
Out-of-Domain (OOD) Generalization:
- Vision-EKIPL demonstrated superior generalization compared to SFT methods. For instance, on the 7B model, it outperformed ANS-SFT by 19% on Visual Counting.
- In Spatial Transformation, the 2B Vision-EKIPL model exceeded GPT-4o by 34% and Gemini-1.5-Pro by 47% on OOD tasks.
Training Efficiency:
- Vision-EKIPL demonstrated high data efficiency, matching the performance of Reason-RFT with only 12% to 25% of the training data that baseline required.
Reasoning Boundary Analysis:
- Pass@K analysis revealed that while traditional RL (Reason-RFT) eventually fell behind the base model as $K$ increased (suggesting over-exploitation of known paths), Vision-EKIPL maintained superior performance across all $K$ values, confirming its ability to push the reasoning frontier.
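For readers unfamiliar with Pass@K: it estimates the probability that at least one of $K$ sampled answers to a problem is correct. The summary does not say how the paper computed it. The sketch below uses the unbiased estimator that is common in code-generation evaluation, $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ are correct, as an assumption. The function name is hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    k samples drawn without replacement are all incorrect.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples for one problem, 2 of them correct.
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(10, 2, k):.3f}")
```

A benchmark-level score would average this value over all problems. Comparing such curves across $K$ is what lets the analysis above say whether a training method widened or narrowed the set of problems the model can solve at all.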

Significance and Claims

The paper claims that Vision-EKIPL offers a new, effective paradigm for visual reasoning research by overcoming the inherent limitations of traditional RL methods. By treating external models as "experts" that provide distilled supervision targets, the framework enables the policy model to learn diverse and novel reasoning strategies that would otherwise be inaccessible through self-sampling alone.

The authors position this approach as a hybrid paradigm combining elements of supervised fine-tuning (via knowledge distillation from external actions) and reinforcement learning. They assert that this method not only significantly enhances the visual reasoning performance of MLLMs but also substantially accelerates training convergence. While the primary validation is on visual reasoning tasks, the authors suggest the framework possesses generality and could theoretically be applied to linguistic and other multimodal tasks.

Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning