Can Safety Emerge from Weak Supervision? A Systematic Analysis of Small Language Models
This paper introduces Self-MOA, a fully automated framework that aligns small language models using weak supervision from automated evaluators. The framework achieves significant safety improvements with minimal training data while preserving helpfulness.