The Geometric Inductive Bias of Grokking: Bypassing Phase Transitions via Architectural Topology
Using interventions on the Transformer architecture, this paper shows that enforcing a spherical topology and uniform attention routing eliminates grokking, the delayed generalization observed on modular addition tasks, provided these architectural priors align with the task's intrinsic symmetries.
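To make the two priors concrete, the following is a minimal sketch and not the paper's implementation. It assumes that "spherical topology" can be read as rescaling representations to unit L2 norm and that "uniform attention routing" can be read as replacing learned attention weights with equal weights over all positions. The function names `modular_addition_pairs`, `project_to_sphere`, and `uniform_attention` are hypothetical and chosen for illustration only.

```python
import math


def modular_addition_pairs(p):
    """Enumerate every (a, b, (a + b) mod p) example for modulus p."""
    return [(a, b, (a + b) % p) for a in range(p) for b in range(p)]


def project_to_sphere(vec, eps=1e-12):
    """Assumed reading of 'spherical topology': rescale to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def uniform_attention(values):
    """Assumed reading of 'uniform attention routing': equal weights on
    every position, so the output is the mean of the value vectors."""
    n = len(values)
    dim = len(values[0])
    return [sum(v[d] for v in values) / n for d in range(dim)]


# Toy token representations for the two operands (illustrative values).
emb_a = project_to_sphere([3.0, 4.0])
emb_b = project_to_sphere([1.0, 0.0])

# Uniform routing ignores operand order, mirroring a + b = b + a.
mixed_ab = uniform_attention([emb_a, emb_b])
mixed_ba = uniform_attention([emb_b, emb_a])
order_invariant = all(
    math.isclose(x, y) for x, y in zip(mixed_ab, mixed_ba)
)
```

Under these assumptions the mixed representation is the same for (a, b) and (b, a), which is one way an architectural prior can match a symmetry of the task, here the commutativity of modular addition.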