You Didn't Have to Say It like That: Subliminal Learning from Faithful Paraphrases
This paper demonstrates that language models can covertly acquire behavioral traits from a teacher model through "subliminal learning" on faithful paraphrases. The student adopts the teacher's preferences even when the paraphrased content is semantically unrelated to those preferences or explicitly contradicts them, which renders content-based inspection ineffective.