Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling
This work proposes Progressive Diverse Population Sampling (PDPS), a method that uncovers hidden vulnerabilities by shifting the focus from rephrasing inputs to systematically and efficiently sampling high-quality, diverse model outputs. It demonstrates that PDPS can expose a broad range of suppressed unsafe knowledge in LLMs at a significantly lower computational cost than traditional methods.
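
To make the idea of sampling a diverse, high-quality population of outputs more concrete, the following is a minimal illustrative sketch. It is not the PDPS algorithm itself, whose details are not given here. It assumes a generic loop in which a population of candidate responses is repeatedly extended and then pruned back to a small set that is both high-scoring and mutually dissimilar. All names (`jaccard_distance`, `select_diverse`, `progressive_sample`, `extend`, `score`) are hypothetical, and token-set Jaccard distance is only a crude stand-in for whatever diversity measure a real system would use.

```python
from typing import Callable, List, Sequence


def jaccard_distance(a: str, b: str) -> float:
    """Token-set Jaccard distance in [0, 1] (assumed, simplistic diversity metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def select_diverse(responses: Sequence[str], scores: Sequence[float], k: int,
                   min_score: float = 0.0) -> List[str]:
    """Greedy max-min selection: keep up to k responses that pass the quality
    filter, each chosen to be as far as possible from those already kept."""
    pool = [(s, r) for r, s in zip(responses, scores) if s >= min_score]
    if not pool or k <= 0:
        return []
    pool.sort(key=lambda p: p[0], reverse=True)  # stable: ties keep input order
    kept = [pool[0][1]]                          # seed with the best-scoring response
    rest = [r for _, r in pool[1:]]
    while rest and len(kept) < k:
        # Pick the candidate whose nearest kept neighbour is farthest away.
        best = max(rest, key=lambda r: min(jaccard_distance(r, c) for c in kept))
        kept.append(best)
        rest.remove(best)
    return kept


def progressive_sample(extend: Callable[[str], List[str]],
                       score: Callable[[str], float],
                       prompt: str, rounds: int, keep: int) -> List[str]:
    """Each round, extend every survivor into several candidates, then prune the
    population back to `keep` diverse, high-scoring candidates.

    `extend` and `score` are placeholders for a model sampling call and a
    quality scorer; both are assumptions of this sketch.
    """
    population = [prompt]
    for _ in range(rounds):
        candidates = [c for p in population for c in extend(p)]
        if not candidates:
            break
        population = select_diverse(candidates, [score(c) for c in candidates], keep)
    return population
```

In this sketch the computational saving would come from pruning early: only a small, diverse subset of candidates is carried into each subsequent round instead of fully generating every sample. Whether PDPS achieves its efficiency this way is an assumption of the example, not a claim from the source.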