On the Non-Identifiability of Steering Vectors in Large Language Models
This paper demonstrates that steering vectors in large language models are fundamentally non-identifiable: many distinct interventions, including mutually orthogonal perturbations, produce behaviorally indistinguishable results. This reveals inherent limits on interpreting steering vectors as unique internal representations unless additional structural constraints are imposed.
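The core phenomenon can be illustrated with a toy linear model (this is an illustrative sketch, not the paper's actual construction): if hidden states are read out through a matrix W, any perturbation lying in the null space of W changes the intervention vector without changing the output at all, so distinct steering vectors are behaviorally indistinguishable. The readout matrix W, hidden state h, and steering vector v below are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a hidden state h is mapped to output logits by a readout matrix W.
d_hidden, d_out = 8, 3
W = rng.normal(size=(d_out, d_hidden))
h = rng.normal(size=d_hidden)

# A candidate "steering vector" v shifts the hidden state: h -> h + v.
v = rng.normal(size=d_hidden)

# Build a perturbation delta in the null space of W (orthogonal to W's rows),
# so W @ delta == 0 and v versus v + delta cannot be told apart from the logits.
_, _, Vt = np.linalg.svd(W)
null_basis = Vt[d_out:]  # rows spanning the null space of W
delta = null_basis.T @ rng.normal(size=d_hidden - d_out)

logits_a = W @ (h + v)
logits_b = W @ (h + v + delta)

print(np.allclose(logits_a, logits_b))  # True: behaviorally indistinguishable
print(np.allclose(v, v + delta))        # False: the interventions are distinct
```

In a deep nonlinear network the null space of a single readout is replaced by a much less tractable equivalence class of interventions, which is what makes the identifiability question substantive.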