Revisiting the (Sub)Optimality of Best-of-N for Inference-Time Alignment
This paper challenges prior claims that Best-of-N sampling is suboptimal for inference-time alignment. It shows that, under practical assumptions and when performance is measured by win rate rather than expected reward, a properly tuned Best-of-N is both statistically and computationally optimal. The authors also propose a simple variant that eliminates reward hacking without sacrificing performance.
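For context, Best-of-N is the simple procedure of drawing N candidate responses from a base policy and returning the one scored highest by a reward model. A minimal sketch, using toy stand-in functions (`toy_generate`, `toy_reward` are hypothetical placeholders, not the paper's models):

```python
import random

def best_of_n(prompt, generate, reward, n=4, seed=0):
    """Best-of-N: draw n candidate responses from the base policy
    and return the one with the highest reward-model score."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=reward)

# Toy stand-ins for a language model and a reward model (hypothetical).
def toy_generate(prompt, rng):
    # A real generator samples a full response; here, a tagged string.
    return f"{prompt}-{rng.randint(0, 99)}"

def toy_reward(response):
    # Score by the numeric suffix; a real reward model scores quality.
    return int(response.rsplit("-", 1)[1])

print(best_of_n("hello", toy_generate, toy_reward, n=8))
```

Increasing `n` trades extra inference compute for a higher-reward selection; the paper's analysis concerns how to tune this trade-off so the resulting policy is optimal under win-rate evaluation.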