Ranking Reasoning LLMs under Test-Time Scaling
This paper introduces Scorio, an open-source library that formalizes and implements statistical methods for reliably ranking reasoning LLMs under test-time scaling. It shows that most of these methods agree closely with a Bayesian gold standard across multiple Olympiad-style math benchmarks.