CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR
The paper introduces CLIPO, a method that integrates contrastive learning into policy optimization to generalize Reinforcement Learning with Verifiable Rewards (RLVR). By capturing the invariant structure shared across correct reasoning paths, CLIPO mitigates hallucinations and improves the generalization and robustness of Large Language Models.
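
To make the idea of contrasting reasoning paths concrete, the following is a minimal sketch of an InfoNCE-style loss over a group of rollouts sampled for one prompt. It is an illustration under stated assumptions, not the paper's implementation: the function names, the use of cosine similarity, the temperature value, and the choice to treat verified-correct rollouts as mutual positives and all other rollouts as negatives are all hypothetical. How such a term would be combined with the policy-optimization objective is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Guard against zero-norm vectors.
    return dot / max(norm_u * norm_v, 1e-12)


def contrastive_loss(embeddings, is_correct, temperature=0.1):
    """InfoNCE-style loss over rollouts of a single prompt (hypothetical sketch).

    embeddings: one vector per rollout (assumed to summarize its reasoning path).
    is_correct: verifier outcome per rollout (True if the answer was verified).
    Each correct rollout is an anchor; other correct rollouts are its positives,
    and every other rollout in the group appears in the softmax denominator.
    """
    correct = [i for i, ok in enumerate(is_correct) if ok]
    if len(correct) < 2:
        # No positive pair exists, so there is nothing to contrast.
        return 0.0

    total, pairs = 0.0, 0
    for i in correct:
        # Temperature-scaled similarities from anchor i to every other rollout.
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(len(embeddings))
            if j != i
        }
        log_denominator = math.log(sum(math.exp(s) for s in logits.values()))
        for j in correct:
            if j != i:
                # Negative log-probability of picking positive j for anchor i.
                total += log_denominator - logits[j]
                pairs += 1
    return total / pairs


# Correct rollouts share a direction; the incorrect one points elsewhere.
aligned = contrastive_loss([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [True, True, False])
# One correct rollout sits next to the incorrect one instead.
scattered = contrastive_loss([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], [True, True, False])
```

In this sketch the loss is lower when correct rollouts lie close to one another and away from incorrect ones (`aligned` is smaller than `scattered`), which is one way to express "invariant structure across correct reasoning paths" as a training signal.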