Steering Awareness: Models Can Be Trained to Detect Activation Steering
This paper demonstrates that language models can be fine-tuned to reliably detect and identify activation steering interventions, revealing that such steering is not inherently undetectable and that models trained to recognize it may paradoxically become more susceptible to behavioral manipulation.