The Big Problem: The "Out-of-Date Map" Issue
Imagine you are training a robot chef (the AI) to cook the perfect meal.
- The Trainer: This is the head chef giving instructions.
- The Inference Engine: This is the robot actually cooking in the kitchen.
In the old way of doing things (on-policy training, e.g. GRPO), the head chef and the robot chef had to be perfectly synchronized. Every time the robot cooked a dish, the head chef had to taste it immediately and give feedback based on the exact same recipe the robot just used.
The Problem: In the real world, this is impossible.
- The Lag: The robot is cooking fast, but the head chef is slow. By the time the chef gives feedback, the robot has already updated its recipe book 50 times.
- The Mismatch: Even if they have the same recipe book, the robot's kitchen tools (hardware) might measure "saltiness" slightly differently than the chef's tongue.
Because of this lag, the data the chef uses to teach the robot is off-policy. It's like the chef trying to correct a robot using a map from yesterday, while the robot is driving on a road that changed this morning. To fix this, previous methods tried to "force" the map to match the road by using complex math corrections (Importance Sampling), which often made the training unstable or slow.
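To see why those importance-sampling corrections are fragile, here is a toy NumPy sketch (all numbers synthetic, not from the paper): when the current policy has drifted away from the lagged one that generated the data, the correction ratios become wildly uneven, which makes the training gradient noisy.

```python
import numpy as np

# Toy illustration of importance-sampling ratios under policy drift.
# old_probs: probability the lagged (data-generating) policy gave each action
# new_probs: probability the current (training) policy gives the same actions
rng = np.random.default_rng(0)
old_probs = rng.uniform(0.05, 0.5, size=1000)
# The current policy has drifted, so its probabilities differ by up to 5x.
new_probs = old_probs * rng.uniform(0.2, 5.0, size=1000)

ratios = new_probs / old_probs  # the importance weights
print("mean ratio:", ratios.mean())
print("max ratio:", ratios.max())  # a few huge ratios dominate the update
```

The wider the gap between the old and new policy, the more extreme the largest ratios get, and the update ends up dominated by a handful of samples. That high variance is the instability the paragraph above describes.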
The Solution: OAPL (The "Lagged Inference" Approach)
The authors of this paper say: "Stop fighting the lag. Embrace it."
They propose a new method called OAPL (Optimal Advantage-based Policy Optimization with a Lagged Inference Policy). Instead of trying to keep the robot and the chef perfectly synchronized, they accept that the robot is always a few steps out of sync and design a training system that works with that delay.
The Analogy: The "Video Game Speedrun" Coach
Imagine a video game speedrunner (the AI) trying to beat a record.
- Old Method (GRPO): The coach watches the runner, stops the game, analyzes the move, and then the runner tries again. If the runner changes their strategy too fast, the coach gets confused and the training fails.
- New Method (OAPL): The coach records the runner's gameplay. Even if the runner is now 400 levels ahead, the coach looks at the old recording and says, "Hey, in that specific situation, you could have done better."
The coach doesn't need the runner to be in the exact same state. The coach uses a special formula (a "squared regression loss") that essentially asks: "If you had taken this path, how much better would you have done compared to the average of all the paths you took?"
This allows the coach to train the runner using old data without needing to stop the game or use complex math to "correct" the differences.
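The coach's question can be sketched in a few lines of Python. This is a toy illustration of the idea described above, not the paper's exact loss: each old recording is scored against the average of the group (a group-relative "advantage"), and the model's score for it is fit to that target with a plain squared error, with no importance-sampling ratio anywhere.

```python
import numpy as np

# Outcomes of 5 old rollouts (1.0 = success, 0.0 = failure).
rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0])
# "How much better than the average path was this one?"
advantages = rewards - rewards.mean()

# The model's current per-rollout estimates (made-up numbers).
model_scores = np.array([0.2, -0.1, 0.0, 0.3, 0.1])

# Squared regression loss: fit the scores to the advantages directly.
loss = np.mean((model_scores - advantages) ** 2)
print(round(loss, 4))  # -> 0.15
```

Because this is a regression target rather than a probability ratio, it does not matter that the recordings came from an older version of the runner; the target is well-defined regardless of how stale the data is.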
Why is OAPL Better?
The paper shows three major wins for this new approach:
1. It's Faster and Cheaper (Sample Efficiency)
- The Analogy: Imagine you are learning to drive.
- GRPO requires you to drive a new route, stop, get feedback, and drive again. You burn a lot of gas (computing power) just to get a little better.
- OAPL lets you drive 100 miles, then look back at the whole trip and learn from it all at once.
- The Result: In coding tests, OAPL matched a top-tier model (DeepCoder) while using only one-third as many training samples. It's like getting a PhD with 1/3 of the homework.
2. It's More Stable (No "Entropy Collapse")
- The Analogy: When you force a student to memorize one specific answer too hard, they stop thinking creatively and just repeat that one answer (this is called "entropy collapse").
- The Result: Because OAPL doesn't force the AI to stay perfectly aligned with the "old" version of itself, the AI keeps its creativity. It explores more options, which leads to better problem-solving skills in math and coding.
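"Entropy" here is just a number measuring how spread out the model's choices are. A small sketch (with made-up probability values) shows what collapse looks like: a policy that still explores has high entropy, while one that has memorized a single answer has entropy near zero.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    p = np.asarray(p)
    p = p[p > 0]  # convention: 0 * log(0) = 0
    return float(-(p * np.log(p)).sum())

# A policy that still explores vs. one that has collapsed onto one answer.
exploring = [0.25, 0.25, 0.25, 0.25]   # four options kept alive
collapsed = [0.97, 0.01, 0.01, 0.01]   # one memorized answer dominates

print(round(entropy(exploring), 3))  # log(4) ~ 1.386
print(round(entropy(collapsed), 3))  # close to zero
```

"Entropy collapse" is the second distribution: the training signal has squeezed almost all probability onto one response, so the model stops exploring alternatives.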
3. It Scales Better (The "Pass@k" Superpower)
- The Analogy: If you ask a student to solve a math problem once, they might get it right 50% of the time. If you ask them to try 100 different ways, a smart student will eventually find the right answer.
- The Result: OAPL-trained models get significantly better as you let them try more times (Pass@k). While other models plateau, OAPL models keep getting smarter the more "guesses" they are allowed to make.
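Pass@k has a standard unbiased estimator, widely used in code-generation evaluation: given n sampled answers of which c are correct, it computes the probability that at least one of k random draws is correct. A sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of
    k draws from n samples (c of them correct) is correct."""
    if n - c < k:
        return 1.0  # not enough wrong answers to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# A model that solves a problem in 50 of 100 samples:
print(round(pass_at_k(100, 50, 1), 3))   # 0.5 at one try
print(round(pass_at_k(100, 50, 10), 3))  # nearly 1.0 with 10 tries
```

The paper's claim is about the shape of this curve: for OAPL-trained models, pass@k keeps climbing as k grows instead of flattening out early.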
The Bottom Line
The paper proves that you don't need perfect synchronization to train smart AI.
By accepting that the training data will always be slightly "out of date" (off-policy) and building a system that handles that naturally, we can train Large Language Models to reason better, faster, and with less computing power. It's a shift from trying to force the world to match our training rules, to building training rules that work with the messy reality of the world.