When Scaling Fails: Network and Fabric Effects on Distributed GPU Training Performance
This paper presents an empirical study showing that network topology, congestion dynamics, and GPU locality frequently cause unpredictable scaling failures in distributed GPU training, and argues that system builders should adopt concrete diagnostic principles to address these often-overlooked fabric-level bottlenecks.