CoRPO: Adding a Correctness Bias to GRPO Improves Generalization
The paper proposes Correctness-Relative Policy Optimization (CoRPO), a modification of Group-Relative Policy Optimization (GRPO). CoRPO clips the advantage baseline to a correctness threshold so that incorrect solutions are not reinforced, which the authors report improves the model's generalization and cross-domain reasoning.
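The baseline clipping described above can be illustrated with a minimal sketch. It assumes that "clipping" means clamping the group-mean baseline from below at a fixed correctness threshold, and it omits the standard-deviation normalization that GRPO commonly applies. The function names, the `correctness_threshold` parameter, and the reward values are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from statistics import mean


def grpo_advantages(rewards):
    """Group-relative advantages: each reward minus the group mean.

    Simplification: the usual division by the group standard deviation is omitted.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def clipped_baseline_advantages(rewards, correctness_threshold):
    """Assumed CoRPO-style variant: the baseline never drops below the threshold.

    A sample with reward below `correctness_threshold` therefore can never
    receive a positive advantage, even if it beats the rest of its group.
    """
    baseline = max(mean(rewards), correctness_threshold)
    return [r - baseline for r in rewards]


# Hypothetical group in which every sample is below the threshold (all "incorrect").
weak_group = [0.0, 0.25, 0.5, 0.25]
weak_grpo = grpo_advantages(weak_group)
weak_clipped = clipped_baseline_advantages(weak_group, correctness_threshold=0.75)

# Hypothetical group whose mean already exceeds the threshold.
strong_group = [1.0, 1.0, 0.0, 1.0]
strong_grpo = grpo_advantages(strong_group)
strong_clipped = clipped_baseline_advantages(strong_group, correctness_threshold=0.5)

print("weak group, mean baseline:   ", weak_grpo)
print("weak group, clipped baseline:", weak_clipped)
print("strong group unchanged:      ", strong_grpo == strong_clipped)
```

In the weak group, the plain mean baseline gives the best of the incorrect samples a positive advantage, whereas the clamped baseline makes every advantage negative. In the strong group the mean already exceeds the threshold, so under this assumption the two computations coincide.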