Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds
This paper presents a first-order analysis showing that cross-entropy training in transformers induces a coupled specialization of attention routing and value updates, a dynamic that functions as a two-timescale EM procedure. This procedure sculpts low-dimensional Bayesian manifolds, explaining how gradient-based optimization enables precise probabilistic reasoning.
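The EM analogy in the abstract can be made concrete with a toy sketch. The construction below (a single-query softmax attention head feeding a cross-entropy loss, with all names, shapes, and dimensions invented purely for illustration and not taken from the paper) shows that the gradient on each value row is scaled by that slot's attention weight: attention weights act like E-step responsibilities, and the responsibility-weighted value update resembles an M-step.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, vocab = 4, 3, 5          # embedding dim, number of slots, vocab size (illustrative)

q = rng.normal(size=d)         # query vector
K = rng.normal(size=(n, d))    # keys, one row per slot
V = rng.normal(size=(n, vocab))  # values: each row contributes per-token logits
target = 2                     # target token index

def softmax(x):
    x = x - x.max()            # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

# E-step analogue: attention weights act like posterior responsibilities.
attn = softmax(K @ q)          # shape (n,), sums to 1

# Forward pass: attended logits and cross-entropy loss on the target token.
logits = attn @ V              # shape (vocab,)
probs = softmax(logits)
loss = -np.log(probs[target])

# Backward pass: d loss / d logits is the usual (probs - one_hot) error.
dlogits = probs.copy()
dlogits[target] -= 1.0

# M-step analogue: the gradient on value row i is the output error
# weighted by that slot's responsibility attn[i].
dV = np.outer(attn, dlogits)   # d loss / d V, shape (n, vocab)
```

Each row of `dV` equals `attn[i] * dlogits`, so slots that attend more strongly receive proportionally larger value updates, which is the responsibility-weighted structure the two-timescale EM reading relies on.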