Safe RLHF Beyond Expectation: Stochastic Dominance for Universal Spectral Risk Control
This paper proposes Risk-sensitive Alignment via Dominance (RAD), a Safe RLHF framework that replaces the traditional expected-cost constraint with a First-Order Stochastic Dominance constraint, formulated within an Optimal Transport framework. Because a dominance constraint compares entire cost distributions rather than only their means, it controls all spectral risk measures simultaneously. The result is improved robustness against tail risks and out-of-distribution failures while maintaining helpfulness.
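
To make the link between first-order dominance and spectral risk concrete, the following is a minimal sketch on empirical cost samples, not RAD's actual implementation. It assumes equal-size samples, in which case dominance reduces to comparing sorted costs (the same sorted matching that gives the one-dimensional optimal transport coupling), and it treats a spectral risk measure as a weighted sum of sorted costs with non-negative, non-decreasing weights. The function names `dominates_in_cost`, `spectral_risk`, and `cvar_weights` and all numbers are hypothetical.

```python
import numpy as np

def dominates_in_cost(policy_costs, reference_costs):
    """Empirical first-order dominance check (assumes equal sample sizes).

    Returns True when every order statistic of the policy's cost is no
    larger than the matching order statistic of the reference's cost.
    """
    p = np.sort(np.asarray(policy_costs, dtype=float))
    r = np.sort(np.asarray(reference_costs, dtype=float))
    if p.shape != r.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return bool(np.all(p <= r))

def spectral_risk(costs, weights):
    """Weighted sum of sorted costs; weights must form a valid spectrum."""
    x = np.sort(np.asarray(costs, dtype=float))
    w = np.asarray(weights, dtype=float)
    valid = (
        w.shape == x.shape
        and np.all(w >= 0)
        and np.all(np.diff(w) >= 0)      # more weight on worse outcomes
        and np.isclose(w.sum(), 1.0)
    )
    if not valid:
        raise ValueError("weights must be non-negative, non-decreasing, sum to 1")
    return float(np.dot(w, x))

def cvar_weights(n, alpha):
    """Spectrum for CVaR at level alpha: uniform weight on the worst tail."""
    k = max(1, int(round((1.0 - alpha) * n)))  # number of tail samples
    w = np.zeros(n)
    w[-k:] = 1.0 / k
    return w

# Hypothetical cost samples for illustration only.
reference = [0.1, 0.4, 0.2, 0.9]
policy = [0.3, 0.1, 0.05, 0.6]
heavy_tail = [0.0, 0.0, 0.0, 1.0]     # low mean, bad worst case
flat = [0.3, 0.3, 0.3, 0.3]

mean_w = np.full(4, 0.25)             # expectation is the flat spectrum
tail_w = cvar_weights(4, 0.5)         # average of the worst half

dominated = dominates_in_cost(policy, reference)
mean_ok = np.mean(heavy_tail) <= np.mean(flat)
tail_dominated = dominates_in_cost(heavy_tail, flat)
```

In this toy data, `policy` is dominated by `reference`, so its risk is no larger under both the flat spectrum (the mean) and the CVaR spectrum. By contrast, `heavy_tail` satisfies an expected-cost comparison against `flat` yet fails the dominance check, which is the kind of tail behavior an expectation constraint cannot see.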