Regularized Online RLHF with Generalized Bilinear Preferences
This paper proposes a regularized online RLHF framework that uses Generalized Bilinear Preference Models to identify Nash equilibria. It establishes the first statistically efficient, dimension-free regret bounds for high-dimensional settings, achieved by two simple algorithms that exploit strong convexity and low-rank structure.