CoRPO: Adding a Correctness Bias to GRPO Improves Generalization

The paper proposes Correctness-Relative Policy Optimization (CoRPO), a modification to Group-Relative Policy Optimization (GRPO). CoRPO clips the advantage baseline to a minimum correctness threshold so that incorrect solutions are never reinforced, which significantly improves the model's generalization and cross-domain reasoning.

Anisha Garg, Claire Zhang, Nishit Neema, David Bick, Ganesh Venkatesh, Joel Hestness

Published 2026-03-06

The Big Picture: Teaching a Student to Think

Imagine you are training a brilliant but inexperienced student (an AI) to solve math problems and write code. You want them to learn how to think correctly, not just memorize answers.

To do this, you use a method called Reinforcement Learning. You give the student a bunch of practice problems, let them try to solve them, and then give them a score. If they get it right, they get a "thumbs up" (positive reward). If they get it wrong, they get a "thumbs down" (negative reward).

The paper introduces a new way to give these scores that fixes a major flaw in the current standard method.


The Problem: The "Class Average" Trap (GRPO)

The current standard method is called GRPO (Group-Relative Policy Optimization). Here is how it works:

  1. You give the student one problem and ask for 8 attempts at it.
  2. The student produces 8 different solutions to that same problem.
  3. Instead of comparing each attempt to a "perfect" answer key, the teacher looks at the average score of the 8 attempts the student just gave.
  4. The Rule: If a specific attempt scores better than the group's average, the student gets a "thumbs up." If it scores worse than the average, they get a "thumbs down."
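As a rough sketch, the grading rule above looks like this (simplified: full GRPO also normalizes by the group's standard deviation, which is omitted here; the reward values are illustrative):

```python
def grpo_advantages(rewards):
    """Score each attempt relative to the group's average reward."""
    baseline = sum(rewards) / len(rewards)  # the "class average"
    return [r - baseline for r in rewards]

# A "terrible day": 7 attempts are fully wrong (reward 0.0) and one is
# only "kind of" wrong (0.2). The group average is a tiny 0.025, so the
# kind-of-wrong attempt still lands above the baseline.
rewards = [0.0] * 7 + [0.2]
advantages = grpo_advantages(rewards)
print(advantages[-1] > 0)  # True: the wrong answer gets a "thumbs up"
```

This is exactly the failure mode discussed next: the sign of the advantage depends only on the group, not on whether the answer is actually correct.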

The Flaw:
Imagine the student is having a terrible day. They get 7 answers completely wrong and 1 answer that is "kind of" wrong but slightly better than the others.

  • The average of the group is very low (very bad).
  • The "kind of wrong" answer is actually above the average.
  • The Result: The teacher gives a "thumbs up" to the "kind of wrong" answer!

The Metaphor:
Think of a classroom where everyone fails a test. The average score is 20%. One student gets a 25%. Under GRPO, the teacher says, "Great job! You beat the class average!" and gives them a gold star.
The student learns: "I don't need to get the right answer; I just need to be less wrong than my friends." This reinforces bad habits and prevents them from ever learning the actual truth.


The Solution: The "Pass/Fail" Line (CoRPO)

The authors propose a new method called CoRPO (Correctness-Relative Policy Optimization).

They add a simple rule: "No matter how bad the group is, if your answer is fundamentally wrong, you get a thumbs down."

They set a Correctness Threshold (a minimum passing line).

  • If the group average is terrible: The teacher ignores the average. Instead, they compare the student's answer to the Passing Line.
    • Did you pass the line? Yes = Thumbs up.
    • Did you pass the line? No = Thumbs down (even if you were the "best" of a bad bunch).
  • If the group average is good: The teacher goes back to comparing answers to the group average to see who is the best among the good ones.
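In code, this amounts to clipping the baseline up to the passing line. A minimal sketch (the threshold value 0.5 is an illustrative assumption, not the paper's exact setting):

```python
def corpo_advantages(rewards, threshold=0.5):
    """Score each attempt against max(group average, correctness threshold)."""
    group_mean = sum(rewards) / len(rewards)
    baseline = max(group_mean, threshold)  # the baseline never drops below the passing line
    return [r - baseline for r in rewards]

# The same terrible day: the group mean (0.025) is far below the passing
# line, so the line takes over as the baseline.
bad_day = [0.0] * 7 + [0.2]
print(corpo_advantages(bad_day)[-1] < 0)  # True: "best of a bad bunch" still fails

# A good day: the group mean (0.75) exceeds the line, so grading reverts
# to the usual relative comparison within the group.
good_day = [1.0, 1.0, 1.0, 0.0]
print(corpo_advantages(good_day)[0] > 0)  # True: correct answers are rewarded
```

The single `max(...)` is the whole change: when the group is good it is a no-op and CoRPO behaves like GRPO, but when the group is bad it stops the grader from handing out gold stars for being "less wrong."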

The Metaphor:
Imagine a driving test.

  • GRPO: If everyone in the driving class crashes their cars, and you only hit a fence instead of a tree, you pass because you did better than the average.
  • CoRPO: There is a rule: "If you hit anything, you fail." Even if everyone else hit a tree, hitting a fence is still a fail. You only get a pass if you actually drive safely.

Why This Matters: Generalization (The "Superpower")

The paper shows that this simple change makes the AI much smarter in the long run.

  1. Avoiding "Fake" Learning: By not rewarding "less wrong" answers, the AI stops trying to game the system. It learns to aim for the actual truth, not just to be better than its peers.
  2. Cross-Domain Transfer: This is the "magic" part.
    • The researchers trained the AI on Coding tasks using CoRPO.
    • Then, they tested it on Math tasks (which it had never seen before).
    • Result: The CoRPO-trained AI was better at Math than the GRPO-trained AI, even though it was trained on Code!

The Metaphor:

  • GRPO AI: Learned to be a "good coder" by memorizing patterns that worked specifically for the coding problems it saw. It's like a student who memorized the answers to last year's math test. When given a new type of problem, they fail.
  • CoRPO AI: Learned the logic of solving problems correctly. It learned, "I must find the truth." Because it learned the principle of correctness, it can apply that logic to Math, Coding, or anything else. It's like a student who learned how to think, so they can solve any new problem.

Summary of the "Secret Sauce"

  1. The Old Way (GRPO): "Be better than the group." (Risk: You can be the best of a bad bunch and still be wrong).
  2. The New Way (CoRPO): "Be better than the minimum standard of correctness." (Benefit: You are forced to learn the truth).
  3. The Outcome: The AI becomes more robust, less likely to overfit (memorize), and surprisingly good at solving problems it was never explicitly trained on.

In short, CoRPO fixes the teacher's grading system so that the student learns to be right, not just relatively less wrong.
