Mousse: Rectifying the Geometry of Muon with Curvature-Aware Preconditioning
Mousse is a novel optimizer that improves upon the Muon algorithm by integrating Shampoo's Kronecker-factored preconditioning to adaptively handle the heavy-tailed curvature of deep neural networks, thereby achieving faster training convergence with negligible computational overhead.