Stochastic Parroting in Temporal Attention -- Regulating the Diagonal Sink
This paper investigates the "diagonal sink" phenomenon in temporal attention mechanisms, in which information degeneration biases attention toward initial tokens. It derives theoretical sensitivity bounds and proposes regularization methods that mitigate the effect.