The Big Problem: The "Jittery" AI
Imagine you are teaching a very smart but nervous student (the AI) how to solve math problems.
- Supervised Fine-Tuning (SFT) is like giving the student a textbook with the correct answers. The student learns steadily, like walking on a flat, paved road. It's boring but very stable.
- Reinforcement Learning (RL) is like putting the student in a video game where they get points for good answers and lose points for bad ones. This is exciting and can lead to super-smart behavior, but it's unstable. The student might suddenly panic, run in circles, or crash into a wall because the "points" system is confusing.
The researchers found that while the "textbook" method (SFT) is smooth, the "video game" method (RL) often causes the AI to have explosive mood swings (known mathematically as "exploding gradients"). This makes the AI forget what it learned or stop learning entirely.
The Secret Ingredient: "Logits Convexity"
The paper asks: Why is the textbook method so calm, while the video game method is so chaotic?
They discovered a hidden geometric property called Logits Convexity.
- The Analogy: Imagine the learning process is a hiker trying to find the bottom of a valley (the perfect answer).
- SFT (The Textbook): The valley is shaped like a perfect bowl. No matter where the hiker drops a ball, it rolls smoothly straight to the bottom. There are no hidden holes or cliffs. This is "convex."
- PPO (The Standard RL): The valley is shaped like a cratered, rocky landscape with hidden pits and steep cliffs. The hiker might take a step, hit a cliff, and fall backward, or get stuck in a small hole that isn't the bottom. This is "non-convex."
The researchers proved that the standard RL method (PPO) loses this "bowl shape," causing the AI to take wild, erratic steps.
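To make the "bowl vs. rocky landscape" picture concrete, here is a small self-contained sketch (not taken from the paper) that checks midpoint convexity numerically: the cross-entropy loss used in SFT, viewed as a function of the logits, passes the check, while a toy two-action PPO-style clipped surrogate fails it. The specific numbers (old probability 0.5, clip range 0.2, the probe points) are arbitrary illustrations, not values from the paper.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def ce_loss(z, y):
    # SFT cross-entropy as a function of the logits:
    # L(z) = logsumexp(z) - z[y], which is convex in z (a "bowl").
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return lse - z[y]

def ppo_loss(z, y, p_old, adv=1.0, eps=0.2):
    # Toy PPO clipped surrogate as a function of the *current* logits,
    # with the old policy probability p_old held fixed.
    r = softmax(z)[y] / p_old
    r_clip = min(max(r, 1 - eps), 1 + eps)
    return -min(r * adv, r_clip * adv)

def midpoint_gap(loss, a, b):
    # Convexity requires loss(midpoint) <= average of the endpoint losses,
    # i.e. a non-positive gap.
    mid = [(x + y) / 2 for x, y in zip(a, b)]
    return loss(mid) - (loss(a) + loss(b)) / 2

a, b = [-3.0, 0.0], [-1.0, 0.0]
print(midpoint_gap(lambda z: ce_loss(z, 0), a, b))        # <= 0: convex
print(midpoint_gap(lambda z: ppo_loss(z, 0, 0.5), a, b))  # > 0: convexity fails here
```

A positive gap at even one pair of points is enough to show the landscape is not a bowl, which is the geometric issue the paper attributes to PPO.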
The Solution: LCO (Logits Convex Optimization)
The team invented a new training method called LCO. Instead of letting the AI guess its way through the rocky landscape, LCO forces the landscape to look like a smooth bowl again.
Here is how it works, using a simple metaphor:
- The Target: In standard RL, the AI tries to guess what the "best" move is based on trial and error. It's like trying to hit a moving target in the dark.
- The LCO Trick: LCO calculates exactly where the "perfect" target should be (based on the math of the game) and tells the AI: "Don't guess. Just aim directly at this specific spot."
- The Result: Because the AI is now just trying to match a specific target (like fitting a key into a lock), the path becomes smooth again. The "bowl" is restored.
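The paper's exact target construction is not spelled out here, but the general "aim at a fixed target" idea can be sketched as follows. We build a fixed target distribution (the reward-reweighted old policy below is a hypothetical stand-in, a common choice in the RL-as-inference literature, and not necessarily LCO's formula) and then minimize cross-entropy against it. Because the target no longer moves, the loss is convex in the logits, the same "bowl" shape as SFT.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def make_target(old_probs, rewards, beta=1.0):
    # Hypothetical target (an assumption for illustration): reweight the
    # old policy by exp(reward / beta) and renormalize, so probability
    # mass shifts toward high-reward actions.
    w = [p * math.exp(r / beta) for p, r in zip(old_probs, rewards)]
    s = sum(w)
    return [v / s for v in w]

def target_matching_loss(z, target):
    # Cross-entropy against a *fixed* target distribution:
    # L(z) = logsumexp(z) - sum_y target[y] * z[y], convex in z.
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return lse - sum(t * zi for t, zi in zip(target, z))

old = softmax([0.0, 0.0, 0.0])              # uniform old policy
target = make_target(old, [1.0, 0.0, -1.0])  # action 0 earned the most reward
print(target)  # mass shifts toward the high-reward action
```

The key structural point is that `target` is computed once and then held fixed during the update, so the AI is "fitting a key into a lock" rather than chasing a moving estimate.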
Why is this better?
The paper tested LCO on three types of tasks:
- Math Reasoning: Solving complex equations.
- Reading Comprehension: Answering questions about text.
- Instruction Following: Doing what the user asks.
The Results:
- Stability: The AI didn't crash or go crazy. It learned steadily, like the student on the paved road.
- Performance: Because it didn't waste time falling into "cliffs" or "holes," it actually learned better and faster than the old methods. It beat the previous champions (like PPO and GRPO) in almost every test.
- Efficiency: It needed fewer examples to learn. If PPO needed 100 practice problems to get good, LCO might only need 30.
Summary
Think of the old RL method as trying to drive a car on a bumpy, pothole-filled road at night. You might crash.
The new LCO method is like paving that road, turning on the headlights, and giving the driver a GPS that says, "Just drive straight to this exact coordinate."
The result is a smoother ride, a faster arrival, and a much happier driver (the AI). This paper provides the mathematical proof for why paving the road works and gives us the blueprint to do it.