Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity
This paper proposes a training framework that explicitly filters out incorrect answers and uses the α-divergence family to control the precision-diversity trade-off. The approach aims to overcome the loss of diversity inherent in standard reinforcement learning, and the paper reports state-of-the-art performance on the Lean theorem-proving benchmark.
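To make the filtering idea concrete, here is a minimal sketch on a toy discrete answer distribution. It is an illustration under stated assumptions, not the paper's implementation: the function names `filtered_target` and `alpha_divergence` are hypothetical, the divergence uses Amari's parametrisation (the paper's own convention may differ), and alpha is restricted to the open interval (0, 1) so that the value stays finite when the filtered target assigns zero probability to some answers.

```python
def filtered_target(base_probs, is_correct):
    """Zero out incorrect answers and renormalise what remains.

    Hypothetical helper: `base_probs` are the base model's probabilities
    over a small set of candidate answers, `is_correct` the verifier flags.
    """
    kept = [p if ok else 0.0 for p, ok in zip(base_probs, is_correct)]
    total = sum(kept)
    if total == 0.0:
        raise ValueError("no correct answer has non-zero probability")
    return [k / total for k in kept]


def alpha_divergence(target, policy, alpha):
    """Amari alpha-divergence D_alpha(target || policy) for 0 < alpha < 1.

    Assumed parametrisation: (sum_i t_i^alpha * q_i^(1 - alpha) - 1) / (alpha * (alpha - 1)).
    As alpha approaches 1 this tends to the mass-covering KL(target || policy);
    as alpha approaches 0 it tends to the mode-seeking KL(policy || target).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("this sketch only handles 0 < alpha < 1")
    s = sum((t ** alpha) * (q ** (1.0 - alpha)) for t, q in zip(target, policy))
    return (s - 1.0) / (alpha * (alpha - 1.0))


# Toy example: three candidate answers, the second one is wrong.
base = [0.5, 0.3, 0.2]
target = filtered_target(base, [True, False, True])
gap = alpha_divergence(target, base, alpha=0.5)
```

In this toy setting, `gap` measures how far the unfiltered base distribution is from the filtered target; a training objective in this spirit would adjust the policy to shrink that quantity, with the choice of alpha governing how strongly the policy is pushed to cover all correct answers rather than concentrate on a few.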