Imagine you have a brilliant math tutor who is a world-class expert at solving complex problems, but they are blind. They can't see pictures, diagrams, or geometry shapes. They only understand text.

Now, imagine you have a student who is sighted (they can see images) but isn't very good at solving hard logic puzzles yet. Your goal is to teach the sighted student how to solve difficult visual math problems (like "What is the angle of this triangle in the picture?") using only the blind tutor's help.

The problem? The blind tutor has never seen a triangle, so they can't just "show" the student how to look at a picture. And there are very few textbooks that have both pictures and step-by-step logic written out.

This paper introduces a new method called VOLD to solve this exact problem. Here is how it works, broken down into simple steps:

1. The Problem: The "Blind Tutor" vs. The "Sighted Student"

Usually, to teach a computer to solve visual puzzles, you need thousands of examples where a human wrote out the solution while looking at the picture. These are rare and expensive to make.
However, we have millions of text-only math problems where the "blind tutor" (a powerful AI) has already written out detailed, step-by-step solutions.
The challenge is: How do you teach the sighted student to use the blind tutor's logic when the tutor has never seen an image?

2. The Solution: VOLD (The Two-Stage Training Camp)

The authors created a two-step training process to bridge this gap.

Stage 1: The "Shadowing" Phase (Cold-Start Alignment)

Before the student starts solving problems on their own, they must first learn to think exactly like the tutor.

The Analogy: Imagine the student sitting next to the blind tutor, reading the tutor's written solutions to text-only math problems. The student doesn't just memorize the answers; they try to mimic the tutor's style of thinking.
Why this is crucial: If the student thinks differently than the tutor, the tutor's advice later on will be confusing. The student needs to speak the same "language" of logic before they can learn from the tutor's guidance. The paper shows that if you skip this step, the tutor's coaching in Stage 2 stops helping.

Stage 2: The "Coach and Player" Phase (On-Policy Distillation)

Now, the student starts practicing. This is where the magic happens.

The Setup: The student tries to solve a problem. As they think, they generate a "trail of thought" (a series of steps).
The Coach's Role: The blind tutor reads the student's own trail of thought and, at every step, indicates what it would have written next.
- If the student's final answer is right: The coach's feedback is set aside, so the student keeps whatever approach worked, even if it differs from the tutor's.
- If the student's final answer is wrong: The coach's feedback is applied: "Here is how I would have thought about each of those steps."
The "Reward" System: The system gives the student a "high five" (a reward) only if they get the final answer right.
The Secret Sauce: The student learns from two things at once:
1. The Reward: "Did I get the answer right?" (This pushes them to find the solution).
2. The Coach's Feedback: "Here is how a genius thinks about this specific step." (This guides them on how to think).

3. The Result: A Super-Student

The paper tested this method on four different types of difficult visual reasoning tests (like geometry, logic, and complex math).

The Surprise: The student model was trained only on text data (using the blind tutor's text solutions). No images were used at any point in VOLD training; the student's ability to see comes from the vision-language model it started as.
The Outcome: When tested on visual problems (images), this student performed better than other models that were trained directly on thousands of image-text pairs.
Why it works: The student learned the logic of solving problems so well from the text that it could apply that logic to images automatically. It's like learning the rules of chess so well that you can play a game on a board you've never seen before, just by understanding the strategy.

Key Takeaways

Don't skip the basics: You can't just dump the tutor's advice on the student. You must first align their thinking styles (Stage 1).
Guidance is better than just rewards: Teaching the student how to think (distillation) while they practice is much better than just telling them "Good job" or "Bad job" at the end.
Text is powerful: You don't need expensive, hard-to-make image datasets to teach visual reasoning. You can use the abundant, high-quality text reasoning data we already have, provided you use the right training method.

In short, VOLD is a clever way to take a text-only genius and use it to train a visual genius, by making the visual student "shadow" the text genius first, and then having the text genius coach them in real-time as they practice.

Technical Summary: VOLD – Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation

Problem Statement

Training Vision-Language Models (VLMs) for complex, multi-step reasoning remains a significant challenge, primarily due to the scarcity of high-quality image-text reasoning datasets. Text-based reasoning resources (e.g., for mathematics and programming) are abundant and scalable, and have facilitated the success of models like DeepSeek-R1 and QwQ, whereas curating visual reasoning data is labor-intensive and difficult to automate. Existing approaches either rely on synthetic visual traces (which suffer from a modality gap) or collect challenging samples from benchmarks (which risks data contamination and does not scale). Furthermore, prior methods attempting text-to-vision transfer often fail to fully leverage the teacher models that generate the reasoning traces, missing opportunities for continuous guidance during training.

Methodology: The VOLD Framework

The authors propose VOLD, a framework designed to transfer reasoning capabilities from text-only teacher models to VLM student models using purely text-based training data. The framework operates on the insight that effective cross-modal transfer requires initial policy alignment and combines Reinforcement Learning (RL) with on-policy knowledge distillation.

1. Two-Stage Training Pipeline

The training process consists of two sequential stages:

Stage 1: Policy Alignment via Supervised Fine-Tuning (SFT)
- Objective: To reduce the initial policy divergence between the student VLM and the text-only teacher LLM.
- Mechanism: The student is fine-tuned on a synthetic dataset of reasoning traces generated by the teacher (using the "Mixture-of-Thoughts" dataset prompts).
- Rationale: On-policy distillation requires the student and teacher to have overlapping output distributions. Without this "cold-start" alignment, the teacher's guidance on the student's generated prefixes becomes diffuse and uninformative (state-distribution shift), leading to high-variance gradients and training instability. During this stage, the vision encoder remains frozen to preserve visual capabilities while aligning the language modeling components.
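
To make the Stage 1 objective concrete, here is a minimal sketch of supervised fine-tuning as token-level negative log-likelihood of the teacher's trace under the student. It is an illustration under assumptions, not the authors' implementation: the function names, the three-token vocabulary, and all logits are made up, and the frozen vision encoder is not modeled.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_loss(student_logits, teacher_tokens):
    """Mean negative log-likelihood of the teacher's tokens under the student.

    student_logits: one list of vocabulary scores per position.
    teacher_tokens: the token id the teacher wrote at each position.
    """
    nll = 0.0
    for logits, token in zip(student_logits, teacher_tokens):
        nll -= math.log(softmax(logits)[token])
    return nll / len(teacher_tokens)

# Hypothetical 3-token vocabulary and a 2-token teacher trace.
teacher_trace = [2, 0]
unaligned = [[2.0, 0.0, -1.0], [-1.0, 0.5, 2.0]]  # student prefers other tokens
aligned = [[-1.0, 0.0, 2.0], [2.0, 0.5, -1.0]]    # student agrees with teacher

print("loss before alignment:", sft_loss(unaligned, teacher_trace))
print("loss after alignment: ", sft_loss(aligned, teacher_trace))
```

Driving this loss down is what moves the student's output distribution toward the teacher's, which is the overlap that the rationale above says on-policy distillation depends on.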

Stage 2: Unified RL and On-Policy Distillation
- Objective: To enhance reasoning capabilities by combining Group Relative Policy Optimization (GRPO) with teacher guidance.
- Mechanism: The framework employs a unified loss function that reuses the same on-policy rollouts for both objectives:
  1. GRPO Component: Optimizes for trajectory-level binary rewards (correct/incorrect answers) on verifiable text-only tasks.
  2. On-Policy Distillation Component: Minimizes the reverse KL divergence between the student's distribution and the teacher's distribution at each step of the student's own generated trajectories.
- Reward-Guided KL Masking: To prevent the distillation signal from penalizing the student when it discovers a correct reasoning path that differs from the teacher's, the authors introduce a masking mechanism. The distillation loss is applied only to incorrect responses ($r=0$), allowing the student to freely explore and retain successful strategies ($r=1$) without teacher interference.
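
The two Stage 2 signals can be sketched on toy data as follows. This is a simplified illustration, not the paper's code: the advantage uses the commonly cited GRPO form (reward standardized within a rollout group), the rollout dictionary layout and function names are hypothetical, and how the two terms are weighted against each other is left out because the summary above does not specify it.

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward standardized within its rollout group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) at one token position.

    Assumes the teacher gives nonzero probability wherever the student does.
    """
    return sum(
        s * math.log(s / t)
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )

def masked_distillation_loss(rollouts):
    """Average per-token reverse KL over incorrect rollouts only (r = 0)."""
    total, count = 0.0, 0
    for rollout in rollouts:
        if rollout["reward"] == 1:
            continue  # correct answer: no teacher penalty is applied
        for s, t in zip(rollout["student"], rollout["teacher"]):
            total += reverse_kl(s, t)
            count += 1
    return total / count if count else 0.0

# Two hypothetical one-token rollouts for the same prompt (2-token vocabulary).
rollouts = [
    {"reward": 1, "student": [[0.9, 0.1]], "teacher": [[0.2, 0.8]]},
    {"reward": 0, "student": [[0.5, 0.5]], "teacher": [[0.9, 0.1]]},
]

print(group_relative_advantages([r["reward"] for r in rollouts]))
print(masked_distillation_loss(rollouts))
```

In this toy group the first rollout disagrees strongly with the teacher but is correct, so it contributes nothing to the distillation term. Only the incorrect rollout is pulled toward the teacher, while both rollouts still feed the reward-based advantage.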

2. Technical Implementation

Models: The student is Qwen2.5-VL-3B, and the teacher is Qwen3-8B. Both share the same tokenizer, a prerequisite for meaningful KL divergence computation.
Data: Training utilizes the "MoT-Teacher-8B" dataset (SFT) and the "orz-57k" text-only math dataset (RL).
Evaluation: The final model is evaluated in a zero-shot setting on visual reasoning benchmarks without any further fine-tuning on image-text data.
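
The shared-tokenizer requirement noted under Models can be illustrated with a small check: per-token KL compares probabilities index by index, so index i must refer to the same token in both models. The sketch below uses hypothetical toy vocabularies (plain token-to-id dictionaries) and helper names of our own; it does not load or describe the real Qwen tokenizers.

```python
def mismatched_tokens(student_vocab, teacher_vocab):
    """List tokens whose ids differ, or that exist in only one vocabulary."""
    tokens = set(student_vocab) | set(teacher_vocab)
    return sorted(
        tok for tok in tokens
        if student_vocab.get(tok) != teacher_vocab.get(tok)
    )

def kl_is_meaningful(student_vocab, teacher_vocab):
    """Token-level KL is only well defined when both id mappings agree."""
    return not mismatched_tokens(student_vocab, teacher_vocab)

# Hypothetical vocabularies for illustration only.
student = {"angle": 0, "triangle": 1, "=": 2}
teacher_same = {"angle": 0, "triangle": 1, "=": 2}
teacher_shuffled = {"angle": 1, "triangle": 0, "=": 2}

print(kl_is_meaningful(student, teacher_same))
print(mismatched_tokens(student, teacher_shuffled))
```

If the mappings disagreed, position i of the student's distribution and position i of the teacher's would describe different tokens, and the divergence between them would carry no useful signal.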

Key Contributions

VOLD Framework: A novel approach for transferring reasoning from text-only teachers to VLM students using a unified objective of RL and on-policy distillation, eliminating the need for vision-based reasoning data during training.
Necessity of Policy Alignment: The paper demonstrates that a cold-start alignment via SFT is a critical prerequisite for effective text-to-vision transfer. Without aligning the student's distribution with the teacher's, on-policy distillation fails to provide meaningful guidance.
Unified Training Objective: The authors show that combining RL with teacher distillation significantly outperforms standalone GRPO training, achieving state-of-the-art results despite using only text-based training data.

Experimental Results

The authors evaluated VOLD on four main benchmarks (MMMU-Pro, MathVision, MathVista, and LogicVista) and on additional datasets such as MMStar and DynaMath.

Performance Gains: VOLD outperforms the base model (Qwen2.5-VL-3B) and existing baselines. Notably, on MathVision, VOLD achieves 28.0%, surpassing VLAA-Thinker (24.4%) and the base model (21.9%). On LogicVista, it reaches 45.0%, outperforming VLM-R1 (40.5%) and the base model (40.3%).
Comparison to SOTA: VOLD outperforms methods that train directly on image-text data (e.g., VLAA-Thinker, VLM-R1), despite VOLD training exclusively on text.
Ablation Studies:
- Alignment: Removing the specific teacher-aligned SFT (using original MoT data instead) results in a failure to benefit from on-policy distillation, confirming the necessity of distributional alignment.
- Components: The full VOLD pipeline (SFT + RL + Distillation) consistently outperforms variants using only SFT or SFT+RL, demonstrating that RL and distillation each contribute on top of SFT.
- Learning Dynamics: Visualization shows VOLD converges to higher training rewards and validation accuracy on visual geometry tasks (Geo3K) compared to vanilla GRPO, indicating successful text-to-vision knowledge transfer.
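
The margins implied by the scores quoted under Performance Gains can be tallied with a few lines of arithmetic. The numbers are copied from this summary; the dictionary layout and the `margins` helper are our own.

```python
# Accuracy (%) as quoted in the Performance Gains paragraph.
scores = {
    "MathVision": {"VOLD": 28.0, "VLAA-Thinker": 24.4, "base": 21.9},
    "LogicVista": {"VOLD": 45.0, "VLM-R1": 40.5, "base": 40.3},
}

def margins(benchmark_scores, model="VOLD"):
    """Percentage-point lead of `model` over every other entry."""
    ours = benchmark_scores[model]
    return {
        name: round(ours - score, 1)
        for name, score in benchmark_scores.items()
        if name != model
    }

for benchmark, benchmark_scores in scores.items():
    print(benchmark, margins(benchmark_scores))
```

On these figures VOLD leads the base model by 6.1 points on MathVision and 4.7 points on LogicVista, and leads the quoted image-trained baselines by 3.6 and 4.5 points respectively.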

Significance and Claims

The paper claims that VOLD represents a significant step toward scalable VLM reasoning training. By leveraging abundant text-based reasoning resources rather than relying on scarce visual data, the framework offers a practical solution to the data bottleneck in multimodal reasoning. The authors emphasize that their approach is orthogonal and complementary to existing RL advancements, meaning the unified framework can be seamlessly integrated with improved RL algorithms beyond GRPO. The work establishes that successful text-to-vision reasoning transfer is feasible and effective, provided that the student model is first aligned with the teacher's reasoning distribution.
