Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer
This paper investigates subliminal learning in language model distillation. It finds that under hard distillation, hidden biases transfer not through global entanglement but through a small set of critical "divergence tokens" processed in early layers, which makes the effect both mechanistically specific and fragile to prompt variations.
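To make the notion of "divergence tokens" concrete, here is a minimal sketch. It rests on an assumption the summary above does not spell out: that a divergence token is a position where two teachers with different hidden biases would emit different tokens under greedy decoding. The function names, the masking step, and the toy token ids are all hypothetical and serve only as illustration; they are not the paper's implementation.

```python
from typing import List, Sequence


def divergence_positions(preds_a: Sequence[int], preds_b: Sequence[int]) -> List[int]:
    """Return indices where two teachers' greedy next-token predictions differ.

    Assumption: this is what "divergence token" means; the summary does not define it.
    """
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(preds_a, preds_b)) if a != b]


def loss_mask(num_positions: int, divergent: Sequence[int]) -> List[int]:
    """Build a 0/1 mask that drops the divergent positions from a token-level loss."""
    dropped = set(divergent)
    return [0 if i in dropped else 1 for i in range(num_positions)]


# Toy token ids, invented for illustration only.
teacher_biased = [12, 7, 7, 93, 4, 18]
teacher_neutral = [12, 7, 7, 41, 4, 18]

divergent = divergence_positions(teacher_biased, teacher_neutral)  # [3]
mask = loss_mask(len(teacher_biased), divergent)  # [1, 1, 1, 0, 1, 1]
```

In this toy case only one of six positions differs between the two teachers, which mirrors the claim that the relevant tokens form a small set. Masking such positions out of the student's hard-distillation loss is one plausible way to probe whether they carry the bias, again under the stated assumption.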