Bradley-Terry Policy Optimization for Generative Preference Modeling
This paper introduces Bradley-Terry Policy Optimization (BTPO), a framework that derives a consistent Monte Carlo gradient estimator for training large language models with chain-of-thought reasoning on non-verifiable pairwise preference tasks. By grounding the training objective in the Bradley-Terry preference model, BTPO overcomes the limitations of existing heuristic RL approaches.
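As background, a minimal sketch of the Bradley-Terry preference model that the framework builds on (the function names are illustrative, not from the paper): the probability that response A is preferred over response B is the sigmoid of the reward difference, and the log-likelihood has a simple closed-form gradient.

```python
import math

def bt_preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that A is preferred over B:
    P(A > B) = sigmoid(r_A - r_B)."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def bt_nll_grad_a(reward_a: float, reward_b: float) -> float:
    """Gradient of the negative log-likelihood -log P(A > B)
    with respect to reward_a; equals -(1 - P(A > B))."""
    return -(1.0 - bt_preference_prob(reward_a, reward_b))

# Equal rewards give a 50/50 preference.
p = bt_preference_prob(1.0, 1.0)  # 0.5
```

BTPO's contribution, per the abstract, is extending this objective to a consistent Monte Carlo gradient estimator when the rewards are produced by a generative model with sampled chain-of-thought reasoning; the sketch above covers only the underlying pairwise model.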