Why Is RLHF Alignment Shallow? A Gradient Analysis
This paper proves that standard RLHF alignment is inherently shallow: the gradient signal vanishes at positions after a sequence's harmfulness has been determined. It proposes a recovery-penalty objective under which alignment gradients persist throughout the entire generation process.
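The vanishing-gradient claim can be illustrated on a toy model. The sketch below is not the paper's construction. It assumes a two-token sequence over a binary vocabulary, a tabular softmax policy, and a hypothetical harm function that depends only on the first token. All names (`harm`, `expected_harm_gradients`, `theta1`, `theta2`) are illustrative. The gradient of expected harm is computed exactly by enumerating all four sequences and applying the score-function identity.

```python
import math
from itertools import product


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def harm(y1, y2):
    # Assumed toy harm: fully determined by the first token.
    return 1.0 if y1 == 1 else 0.0


def expected_harm_gradients(theta1, theta2, harm_fn):
    """Exact gradient of E[harm] w.r.t. the logits at each position.

    theta1: logits for the first token (length 2).
    theta2: logits for the second token, one row per first token (2 x 2).
    Uses d/d(theta_k) log softmax(theta)[y] = 1[y == k] - p_k.
    """
    p1 = softmax(theta1)
    p2 = [softmax(row) for row in theta2]
    grad1 = [0.0, 0.0]
    grad2 = [[0.0, 0.0], [0.0, 0.0]]
    for y1, y2 in product(range(2), repeat=2):
        # Probability of the sequence times its harm.
        weight = p1[y1] * p2[y1][y2] * harm_fn(y1, y2)
        for k in range(2):
            grad1[k] += weight * ((1.0 if y1 == k else 0.0) - p1[k])
            grad2[y1][k] += weight * ((1.0 if y2 == k else 0.0) - p2[y1][k])
    return grad1, grad2


theta1 = [0.3, -0.2]
theta2 = [[0.1, 0.4], [-0.5, 0.2]]
grad1, grad2 = expected_harm_gradients(theta1, theta2, harm)
print("gradient at position 1:", grad1)
print("gradient at position 2:", grad2)
```

In this toy setting the second-position gradient is zero up to floating-point rounding, because the expected score at a position is zero whenever the harm no longer depends on that position's token. Passing a harm function that depends on the second token, for example `lambda y1, y2: float(y2 == 1)`, makes the second-position gradient nonzero.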