Exclusive Self Attention
The paper introduces Exclusive Self Attention (XSA), a modification that constrains each token's attention output to information orthogonal to that token's own value vector, which is reported to improve Transformer performance on language modeling tasks, particularly as sequence length increases.
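A minimal sketch of one plausible reading of this constraint, assuming it is enforced by projecting each token's attended output onto the subspace orthogonal to that token's own value vector; the function name `xsa_attention` and the projection step are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def xsa_attention(q, k, v):
    """Single-head scaled dot-product attention whose output is projected
    orthogonal to each token's own value vector (illustrative sketch).

    q, k, v: (batch, seq_len, d) tensors.
    """
    d = q.size(-1)
    # Standard scaled dot-product attention weights.
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    attn = F.softmax(scores, dim=-1)
    out = attn @ v  # (batch, seq_len, d)

    # Assumed orthogonality step: remove the component of the output that
    # lies along the token's own value vector,
    # out_i <- out_i - ((out_i . v_i) / ||v_i||^2) * v_i
    v_norm_sq = (v * v).sum(dim=-1, keepdim=True).clamp_min(1e-8)
    coeff = (out * v).sum(dim=-1, keepdim=True) / v_norm_sq
    return out - coeff * v

if __name__ == "__main__":
    q = torch.randn(2, 5, 16)
    k = torch.randn(2, 5, 16)
    v = torch.randn(2, 5, 16)
    y = xsa_attention(q, k, v)
    # Each output vector has (near-)zero dot product with its own value vector.
    print((y * v).sum(dim=-1).abs().max())
```

Under this reading, the projection guarantees that whatever a token aggregates from the sequence carries no component along information it already holds in its own value vector, which is one way to interpret "exclusive" self attention.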