Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning

This paper introduces Curvature-Aware Policy Optimization (CAPO), a framework that leverages second-order curvature information to identify and mask unstable training samples, achieving up to a 30× improvement in sample efficiency and stable convergence on LLM reasoning tasks compared to standard policy gradient methods.

Luckeciano C. Melo, Alessandro Abate, Yarin Gal

Published 2026-03-03

Imagine you are teaching a brilliant but slightly chaotic student (a Large Language Model) how to solve complex math problems. You want them to get better through trial and error, a process called Reinforcement Learning.

Currently, the standard way to teach this student is like a strict teacher who is terrified of the student making a mistake. Because the student is so smart but also so unpredictable, the teacher is forced to be extremely cautious:

  • They give tiny, tiny lessons (low learning rate).
  • They review thousands of practice problems before making any changes (huge batch sizes).
  • They are afraid to push the student too hard.

The Problem: This "cautious" approach is incredibly slow and expensive. It's like trying to fill a swimming pool with a teaspoon. The student could learn faster if they were pushed harder, but if you push them too hard, they might panic, forget everything they knew, and start making random guesses. This is called policy collapse.

The Solution: CAPO (Curvature-Aware Policy Optimization)

The authors of this paper, Luckeciano Melo and colleagues, built a new teaching assistant named CAPO. Instead of just watching the student, CAPO has a special "sixth sense" that predicts how the student will react to a lesson before it happens.

Here is how CAPO works, using simple analogies:

1. The "Bumpy Road" Analogy (The Optimization Landscape)

Imagine the student's knowledge is a hiker trying to climb a mountain to reach the peak (the best solution).

  • Standard RL (GRPO): The hiker just looks at the ground immediately under their feet and takes a step. If the ground is slippery or the slope changes suddenly, they might slip and slide all the way back down the mountain. To avoid this, they take tiny, slow steps.
  • CAPO: CAPO is like a hiker with a drone and a topographical map. It doesn't just look at the ground; it looks at the curvature of the mountain. It can see if a step forward is on a smooth slope or if it's about to hit a cliff edge.

2. The "Curvature" (The Sixth Sense)

In math terms, CAPO looks at the Hessian (how bumpy the objective function's landscape is) and the Fisher Information Matrix (how sharply the model's output distribution shifts when its parameters change).

  • Simple version: CAPO asks, "If I make this specific change to the student's brain, will it cause a gentle improvement or a catastrophic explosion?"
  • It calculates this without needing to do impossible math on the whole billion-parameter brain. It focuses on the "last layer" of the student's thinking (the final decision-making part), which is enough to predict the danger.
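The "last layer" trick can be illustrated with a toy calculation. For a softmax output layer, the Fisher information with respect to the final weight matrix has a closed form: it is the Kronecker product of (diag(p) − p pᵀ) with (h hᵀ), so its trace factorises into two cheap scalar terms. The sketch below (plain NumPy; the function name and setup are illustrative, not the paper's actual implementation) uses that trace as a per-sample "danger" score:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def last_layer_fisher_trace(h, W):
    """Trace of the Fisher information of a softmax output layer
    with respect to its weight matrix W, for one sample with
    last-layer features h.

    For softmax, the Fisher w.r.t. the logits is diag(p) - p p^T;
    w.r.t. W it is (diag(p) - p p^T) kron (h h^T). Using
    trace(A kron B) = trace(A) * trace(B), the trace reduces to
    (1 - sum(p^2)) * ||h||^2 -- no billion-parameter matrix needed.
    """
    p = softmax(W @ h)
    return (1.0 - np.sum(p ** 2)) * np.dot(h, h)
```

A confident (peaked) output distribution drives the score toward zero, while an uncertain one drives it toward ‖h‖², matching the intuition that curvature is highest where the model's decisions are most volatile.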

3. The "Traffic Cop" (Data Selection)

This is the magic trick. CAPO doesn't stop the student from learning; it just filters the practice problems.

  • Imagine the student is practicing math problems. Some problems are easy and help them learn. Some are so weird or difficult that trying to solve them would make the student's brain "glitch" and forget how to count.
  • CAPO acts as a Traffic Cop. It looks at the incoming practice problems.
    • "This problem looks safe? Green light. Let the student try it."
    • "This problem looks like it will cause a brain explosion? Red light. Skip this one."
  • It rejects fewer than 8% of the problems. It's not stopping the student from working hard; it's just removing the few "poisonous" apples that would ruin the whole basket.
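The traffic-cop step itself is just a threshold on those per-sample scores. Here is a minimal sketch (NumPy; `curvature_mask` and its interface are illustrative stand-ins, though the roughly 8% rejection budget comes from the paper's reported figure):

```python
import numpy as np

def curvature_mask(scores, reject_frac=0.08):
    """Keep samples whose curvature score is at or below the
    (1 - reject_frac) quantile of the batch; drop the rest.

    Returns a boolean mask: True = green light (train on it),
    False = red light (skip it this batch).
    """
    threshold = np.quantile(scores, 1.0 - reject_frac)
    return scores <= threshold

# Ten practice problems; two have suspiciously high curvature scores.
scores = np.array([0.2, 0.5, 0.1, 9.3, 0.4, 0.3, 0.6, 0.2, 0.5, 12.7])
mask = curvature_mask(scores)  # with the 8% budget, only the worst outlier is dropped
```

Averaging gradients only over the green-lit samples leaves the update direction almost untouched while discarding the few steps most likely to trigger collapse.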

The Results: Why is this a big deal?

The paper tested this on math benchmarks (like the MATH dataset).

  • The Old Way (GRPO): When they tried to speed up training (aggressive learning), the student panicked and performance crashed to zero.
  • The CAPO Way: They used the same aggressive speed, but CAPO filtered out the dangerous steps.
    • Result: The student learned 30 times faster (sample efficiency) than the standard method.
    • Stability: The student never panicked. They kept climbing the mountain smoothly.

Summary

Think of CAPO as a smart safety net.

  • Before: Your only safety net was caution itself (tiny steps, huge batches), and carrying it slowed everything down.
  • Now: CAPO is a lightweight, invisible net that catches you only when you are about to take a step that would kill your progress. This allows you to run fast (aggressive learning) without falling off the cliff.

The paper proves that by understanding the "shape" of the learning process (the curvature), we can teach AI much faster, cheaper, and more reliably than before.
