How Far Can Unsupervised RLVR Scale LLM Training?
This paper provides a comprehensive theoretical and empirical analysis of unsupervised reinforcement learning with verifiable rewards (RLVR). It shows that intrinsic-reward methods are fundamentally limited by a confidence-correctness alignment ceiling that leads to model collapse, and it argues that external rewards grounded in computational asymmetries may offer a scalable alternative.