Imagine you are teaching a brilliant but slightly rigid student (a Large Language Model) how to solve complex puzzles, like coding a new app or solving advanced math problems. You want them to become a master problem-solver.
To do this, you use a method called Reinforcement Learning (RL). Think of it like a game where the student tries many different solutions. If a solution works, they get a "gold star" (reward). If it fails, they get a "thumbs down." Over time, they learn to repeat the gold stars and avoid the thumbs down.
However, the paper argues that current training methods have a fatal flaw: they make the student too confident too quickly.
The Problem: The "Echo Chamber" Effect
Imagine the student finds one way to solve a puzzle that works. Because they are so eager to please, they immediately stop trying anything else. They think, "I found the answer! I will only ever do this one thing from now on."
In technical terms, this is called Entropy Collapse.
- Entropy is a fancy word for "variety" or "surprise." High entropy means the student is exploring many different paths. Low entropy means they are stuck in a rut, repeating the same few paths.
- When entropy collapses, the student stops exploring. They might get really good at the one specific way they found, but if that way doesn't work for a slightly different puzzle, they are completely lost. They lose their creativity and ability to adapt.
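For readers who like to see the numbers, here is a toy sketch of what "entropy" actually measures: the standard Shannon entropy formula applied to a student's probabilities over solution paths. (Illustrative only; the probabilities are made up.)

```python
import math

def entropy(probs):
    """Shannon entropy in nats: H(p) = -sum_i p_i * log(p_i)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A "curious" student: probability spread over many solution paths.
exploring = [0.25, 0.25, 0.25, 0.25]

# A "stuck" student: almost all probability on one favorite path.
collapsed = [0.97, 0.01, 0.01, 0.01]

print(entropy(exploring))  # ~1.39 (high: many paths still in play)
print(entropy(collapsed))  # ~0.17 (low: one path dominates)
```

When training drives the second kind of distribution, that drop in the number is exactly the "entropy collapse" described above.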
The paper's message, in short: it's not just about the destination; it's about the journey. If you rush the student to the finish line too fast, they never learn the full map.
The Culprits: Why does this happen?
The authors found two main reasons why students get stuck in this "echo chamber":
- The "Clipping" Trap: Some training methods try to be careful and say, "Don't change your mind too much at once." But they do this in a way that accidentally punishes the student for trying new things. It's like a teacher who says, "Great job on that one answer, but if you try a different approach, I'll ignore your effort."
- The "Blurry Glasses" Problem (Numerical Precision): To save memory, computers often do their math with "blurry" low-precision numbers (a format called BF16) instead of "sharper" ones (like FP16). The paper discovered that these blurry numbers create a tiny, invisible bias. It's like wearing glasses that make the "safe" answers look slightly brighter and the "risky" answers look slightly dimmer. The student subconsciously avoids the risky answers, leading to a lack of variety.
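To see the "blurriness" concretely, here is a small sketch that mimics BF16 by truncating a 32-bit float's low mantissa bits (a standard bit-masking trick; the value 0.1 is an arbitrary example, not a number from the paper):

```python
import numpy as np

def to_bf16(x):
    """Simulate BF16 by zeroing the low 16 bits of a float32.
    BF16 keeps float32's exponent range but only 7 mantissa bits."""
    bits = np.array([x], dtype=np.float32).view(np.uint32)
    return float((bits & 0xFFFF0000).view(np.float32)[0])

def to_fp16(x):
    """Round through IEEE half precision (10 mantissa bits)."""
    return float(np.float16(x))

p = 0.1  # some probability the model works with
err_bf16 = abs(to_bf16(p) - p)
err_fp16 = abs(to_fp16(p) - p)
print(err_bf16 > err_fp16)  # True: BF16 is the "blurrier" format
```

Each tiny rounding error is harmless on its own; the paper's point is that, accumulated over billions of operations, the blur becomes a systematic nudge away from risky answers.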
The Solution: Keeping the Student Curious
The authors propose new methods to keep the student's "entropy" (curiosity/variety) high throughout the training process. They call this Entropy-Preserving Reinforcement Learning.
Here are their two main tools, explained simply:
1. REPO (The "Encouragement Coach")
Instead of just saying "Good job" or "Bad job," this method adds a special rule: "If you try something rare and it works, I'll give you a massive bonus."
- Analogy: Imagine a treasure hunt. Usually, you only get points for finding the treasure. REPO says, "If you take a weird, unexplored path and still find the treasure, you get double points!"
- Result: The student is motivated to keep exploring new paths because the reward for being unique is high. This prevents them from getting stuck in one routine.
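The "double points for weird paths" idea can be sketched in a few lines. (The function name, the "-log p" surprise measure, and the bonus weight below are illustrative assumptions, not the paper's exact REPO objective.)

```python
import math

def shaped_reward(base_reward, prob_of_path, bonus_weight=0.1):
    """Hypothetical rarity bonus: correct answers reached via unlikely
    paths earn extra reward proportional to their "surprise" (-log p).
    Illustrative sketch only, not the paper's exact formula."""
    if base_reward <= 0:
        return base_reward            # no bonus for wrong answers
    rarity = -math.log(prob_of_path)  # rarer path => bigger surprise
    return base_reward + bonus_weight * rarity

# Two rollouts that both find the treasure:
common = shaped_reward(1.0, prob_of_path=0.5)   # well-trodden path
weird  = shaped_reward(1.0, prob_of_path=0.01)  # unexplored path
print(weird > common)  # True: the weird-but-correct path pays more
```

Because the bonus only applies when the answer is actually correct, the student is rewarded for useful exploration rather than randomness for its own sake.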
2. ADAPO (The "Flexible Rulebook")
Some methods use a "clipping" rule to stop the student from changing their mind too wildly. But the old rulebook was too strict on one side and too loose on the other.
- Analogy: Imagine a parent telling a child, "You can't run faster than 5mph, but you can run as slow as you want." With a speed limit on one side and nothing on the other, the child gradually drifts slower and slower.
- The Fix: ADAPO changes the rule to: "You can't run faster than 5mph, but if you are running too slow (stuck in a rut), we will gently nudge you to speed up and try new things." It dynamically adjusts the rules based on how curious the student is being. If they get too bored (low entropy), the rules loosen to encourage exploration.
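A toy sketch of the flexible rulebook, in PPO-style "clip range" terms. (The threshold, the widen-the-ceiling rule, and all the constants are illustrative assumptions, not the paper's exact ADAPO update.)

```python
def clip_range(entropy, target, eps=0.2, boost=0.1):
    """Hypothetical adaptive clipping: keep the usual bounds on how
    much the student may change its mind per step, but when entropy
    falls below target -- the student is stuck in a rut -- widen the
    upper bound so rare, exploratory moves can be boosted more.
    Illustrative sketch only, not the paper's exact rule."""
    hi = 1.0 + eps + (boost if entropy < target else 0.0)
    return 1.0 - eps, hi

print(clip_range(entropy=1.2, target=0.8))  # curious: standard range
print(clip_range(entropy=0.3, target=0.8))  # bored: ceiling raised
```

The key design choice is that the rulebook reacts to the student's measured curiosity: the bounds are a function of entropy, not fixed constants.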
The Results: Why Does This Matter?
The paper tested these ideas on two very different challenges:
- AppWorld: A complex task where the AI has to use tools to manage apps (like a digital assistant).
- AIME: Hard math problems.
The findings were clear:
- Old Methods: The students got good quickly but then stopped improving. They became "one-trick ponies." If you asked them to learn a new skill later, they couldn't do it because they had forgotten how to explore.
- New Methods (REPO & ADAPO): The students stayed curious. They explored more paths.
- They solved more problems overall.
- They were better at handling tricky, new situations.
- Most importantly, they remained trainable. Even after weeks of training, they could still learn new things because they hadn't forgotten how to explore.
The Bottom Line
This paper teaches us that in training AI, variety is just as important as correctness.
If you force an AI to be perfect too quickly, it becomes rigid and fragile. But if you actively protect its curiosity (entropy) and encourage it to try weird, new things—even if they might fail initially—it becomes a smarter, more adaptable, and ultimately more powerful problem solver.
In short: Don't just teach the AI the answer; teach it how to keep asking questions.