From $\boldsymbol{\log\pi}$ to $\boldsymbol{\pi}$: Taming Divergence in Soft Clipping via Bilateral Decoupled Decay of Probability Gradient Weight

🚗 The Big Picture: Teaching a Car to Drive Without Crashing

Imagine you are teaching a self-driving car (an AI) how to navigate a complex city (solving math problems). You want the car to learn from its mistakes and get better over time.

In the world of AI, this learning process is called Reinforcement Learning (RL). The car tries different routes, gets a "score" (reward) if it arrives safely, and adjusts its driving style to get a higher score next time.

However, there's a tricky problem: The car is too scared to try new things.

1. The Problem: The "Hard Stop" vs. The "Wild Swing"

Current methods (like GRPO) act like a strict driving instructor who says: "If you drift even slightly off the recommended lane, I will immediately cut off your ability to learn from that moment."

The Issue: This "Hard Clipping" stops the car from exploring. If the car tries a risky but potentially brilliant shortcut, the instructor ignores it. The car gets stuck in a boring, safe loop and never learns to be truly great.

Recently, researchers tried a "Soft Clipping" approach. They said: "Okay, if you drift, we won't stop you completely, but we'll just give you a tiny nudge."

The New Issue: This is like giving the car a steering wheel made of jelly. When the car drifts too far in one direction, the "nudge" becomes so weak (or mathematically, so wildly unstable) that the car spins out of control. The AI starts learning from noise, leading to training collapse (the car crashes and stops learning entirely).

2. The Insight: Changing the Map

The authors of this paper realized that the instructors were using the wrong map.

Old Map (Log-Probability): They were looking at the logarithm of the probability. Imagine trying to measure the distance of a star by looking at its reflection in a funhouse mirror. As the star gets fainter (probability drops), the reflection gets distorted and infinitely huge. This causes the "jelly steering wheel" to break.
New Map (Probability): The authors say, "Let's just look at the actual probability." It's like looking at the star directly. It's stable, bounded, and makes sense.

3. The Solution: DGPO (The Smart Cruise Control)

The authors propose a new algorithm called DGPO (Decoupled Gradient Policy Optimization). Think of it as a Smart Cruise Control that handles the edges of the road differently depending on which side you are on.

Imagine the "Safe Zone" is a highway lane.

The Left Side (Too Slow/Too Safe): If the car is driving too slowly or sticking to the rules too rigidly (low probability), the old methods would either ignore it or panic.
- DGPO's Fix: It gently slows down the learning signal here. It says, "Okay, you're being too safe. We'll reduce your learning speed slightly so you don't crash, but we won't ignore you." This prevents the car from getting stuck in a rut.
The Right Side (Too Fast/Too Risky): If the car is driving recklessly (high probability of a weird action), the old methods would either stop it or let it spin out.
- DGPO's Fix: It applies a "soft brake" that gets stronger the faster you go, but it never cuts the engine. It says, "Whoa, that's risky! Slow down your learning a bit, but keep exploring." This keeps the car stable but still adventurous.

The "Bilateral Decoupled Decay" is just a fancy way of saying: "We treat the left side of the road and the right side of the road with two different, custom-tuned rules to keep the car safe but moving forward."

🧩 Why This Matters (The Results)

The authors tested this on some very smart AI models (DeepSeek-R1) trying to solve hard math problems (like the AIME and Olympiad exams).

The Result: The old methods (GRPO, CISPO, etc.) often got stuck or crashed when the math got hard.
The Winner: The DGPO car drove smoothly. It explored more, learned faster, and solved significantly more math problems correctly across different model sizes (from small 1.5B to large 14B models).

🏁 Summary in One Sentence

The paper fixes a broken learning rule for AI by switching from a distorted "mirror view" of probability to a clear "direct view," and then applying a custom "smart brake" that keeps the AI stable enough to learn without being too scared to try new, brilliant ideas.

The takeaway: To make AI smarter, we need to let it explore the edges of the map without letting it fall off the cliff. DGPO builds a safety net that lets the AI fly higher.

1. Problem Statement

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved Large Language Model (LLM) reasoning, particularly in mathematics. However, optimization dynamics remain fragile due to the conflict between exploration and stability.

Hard Clipping Limitations: Standard algorithms like GRPO (Group Relative Policy Optimization) use "hard clipping" on the Importance Sampling (IS) ratio. This discards gradients for tokens outside the trust region, causing "entropy collapse" and premature convergence by stifling exploration.
Soft Clipping Instability: Recent "soft clipping" methods (e.g., CISPO, GPPO) attempt to preserve gradients for out-of-bound tokens. However, they rely on the log-probability gradient ( $\nabla_\theta \log \pi_\theta$ $\nabla_{θ} lo g π_{θ}$ ).
- The Core Issue: As the probability $\pi_\theta$ approaches zero (left boundary), the log-probability gradient weight grows divergently (towards infinity). This causes catastrophic instability and training collapse, disproportionately penalizing low-probability tokens.
- The Gap: Existing methods fail to balance the need for sustained exploration (keeping gradients for clipped tokens) with the need for stable, convergent weights.

2. Methodology: DGPO (Decoupled Gradient Policy Optimization)

The authors propose a paradigm shift: moving from optimizing log-probability to optimizing probability directly.

A. Theoretical Shift: From $\log \pi$ to $\pi$

Optimization Primitive: The paper argues that while Supervised Fine-Tuning (SFT) operates on log-probabilities, RL objectives inherently operate on probabilities ( $\pi_\theta$ ).
Geometric Symmetry: Probability space is bounded and symmetric $(0, 1)$ , whereas log-probability space is unbounded and asymmetric $(-\infty, 0)$ . Optimizing in probability space allows for symmetric, stable gradient designs that are mathematically impossible in log-space without divergence.

B. The DGPO Algorithm

DGPO replaces hard clipping with a decoupled decay mechanism applied to the probability gradient weight ( $W_{i,t}$ ). It treats the Left Boundary (low IS ratio) and Right Boundary (high IS ratio) differently to ensure stability and exploration.

The weighting function $W_{i,t}^{DGPO}$ is defined based on the IS ratio $w_{i,t}$ and advantage $\hat{A}_i$ :

Left Boundary (Low Ratio, Negative Advantage):
- Goal: Stability. Prevent the gradient weight from exploding as probability vanishes.
- Mechanism: Applies a polynomial decay based on the current probability $\pi_\theta$ .
- Formula: $W \propto \pi_\theta^n$ (where $n$ is a hyperparameter). As $\pi_\theta \to 0$ , the weight decays to zero, ensuring convergence.
Right Boundary (High Ratio, Positive Advantage):
- Goal: Exploration. Encourage the model to explore high-probability tokens that were previously clipped.
- Mechanism: Applies a reciprocal radical decay.
- Formula: $W \propto \pi_\theta^{-1/m}$ (where $m$ is a hyperparameter). This allows for a wider distribution of updates, fostering exploration without divergence.
Continuity: The method introduces constants ( $C_{left}, C_{right}$ ) to ensure the gradient weights are continuous at the clipping boundaries, preventing abrupt jumps in optimization dynamics.

C. Theoretical Guarantees

Gradient Continuity: Unlike previous soft clipping methods that suffer from divergent weights, DGPO guarantees convergent weights at both boundaries.
Bias Reduction: Theoretical analysis shows DGPO minimizes the bias relative to the true policy gradient, particularly at the left boundary, outperforming CISPO and GPPO which suffer from high bias or divergence.

3. Key Contributions

Paradigm Shift: Establishes probability gradient ( $\nabla_\theta \pi_\theta$ ) as the superior optimization primitive for RLVR, correcting the misalignment of using log-probability gradients which leads to divergence.
Novel Algorithm (DGPO): Introduces a decoupled decay mechanism that applies asymmetric, continuous decay to boundary tokens. It solves the exploration-stability conflict by "slowing down" exploration at the left boundary (for stability) and "slowing down gently" at the right boundary (for sustained exploration).
Comprehensive Evaluation: Extensive experiments across multiple model scales (1.5B, 7B, 14B) and diverse mathematical benchmarks demonstrate that DGPO is robust, scalable, and theoretically sound.

4. Experimental Results

The authors evaluated DGPO on DeepSeek-R1-Distill-Qwen models (1.5B, 7B, 14B) using the DAPO-Math-17K dataset and benchmarks including AIME24/25, AMC23, MATH500, Minerva, and OlympiadBench.

Performance Gains:
- 1.5B Model: DGPO outperformed the best baseline (CE-GPPO) by +3.5% in average accuracy (Avg@32) and surpassed vanilla GRPO by +4.3%.
- 7B Model: DGPO improved over GRPO by +3.1% and CISPO by +2.7%.
- 14B Model: Consistent improvements were observed, with a +3.1% gain in Avg@32 over GRPO.
Training Dynamics:
- Stability: Methods with divergent weights (CISPO, GPPO) exhibited training collapse. DGPO maintained stable entropy reduction, avoiding premature convergence (GRPO) or over-exploration (ASPO).
- Exploration: DGPO achieved higher Pass@K scores, indicating a better ability to find correct solutions within a larger search space compared to baselines.
Hyperparameter Sensitivity:
- The decay parameters $n$ (left) and $m$ (right) control the trade-off.
- Recommendation: $n=1, m=2$ was found to be a robust configuration for larger models (7B/14B), while $n=2, m=2$ worked well for smaller models (1.5B).

5. Significance

Solving the Divergence Problem: DGPO provides the first theoretically grounded solution to the divergence issue in soft clipping for LLMs, enabling the safe use of gradients from out-of-bound tokens.
Scalability: The method scales effectively from 1.5B to 14B parameters, proving that the probability-based optimization primitive is essential for large-scale RLVR.
Practical Impact: By resolving the exploration-stability conflict, DGPO unlocks higher reasoning capabilities in LLMs, offering a more robust alternative to standard PPO/GRPO for domains requiring complex logical reasoning (e.g., mathematics, science).
Open Source: The authors have released the code and implementation, facilitating reproducibility and further research in the RLVR community.

In summary, DGPO represents a fundamental correction to the optimization dynamics of RLVR, shifting the focus from log-probability to probability to ensure mathematical stability while maximizing the model's reasoning potential.

From log⁡π\boldsymbol{\log\pi}logπ to π\boldsymbol{\pi}π: Taming Divergence in Soft Clipping via Bilateral Decoupled Decay of Probability Gradient Weight

🚗 The Big Picture: Teaching a Car to Drive Without Crashing

1. The Problem: The "Hard Stop" vs. The "Wild Swing"

2. The Insight: Changing the Map

3. The Solution: DGPO (The Smart Cruise Control)

🧩 Why This Matters (The Results)

🏁 Summary in One Sentence

1. Problem Statement

2. Methodology: DGPO (Decoupled Gradient Policy Optimization)

A. Theoretical Shift: From log⁡π\log \pilogπ to π\piπ

B. The DGPO Algorithm

C. Theoretical Guarantees

3. Key Contributions

4. Experimental Results

5. Significance

More like this

Uncertainty Quantification in CNN Through the Bootstrap of Convex Neural Networks

Schema-Adaptive Tabular Representation Learning with LLMs for Generalizable Multimodal Clinical Reasoning

A Layer-wise Analysis of Supervised Fine-Tuning

When Reasoning Models Hurt Behavioral Simulation: A Solver-Sampler Mismatch in Multi-Agent LLM Negotiation

Polynomial Expansion Rank Adaptation: Enhancing Low-Rank Fine-Tuning with High-Order Interactions

From $\boldsymbol{\log\pi}$ to $\boldsymbol{\pi}$ : Taming Divergence in Soft Clipping via Bilateral Decoupled Decay of Probability Gradient Weight

A. Theoretical Shift: From $\log \pi$ to $\pi$