Uber's Failover Architecture: Reconciling Reliability and Efficiency in Hyperscale Microservice Infrastructure
Uber's Failover Architecture (UFA) replaces the company's costly uniform 2x capacity model with a differentiated, criticality-based approach that opportunistically shares resources and preempts non-critical services during peak failovers. This reduces steady-state provisioning from 2x to 1.3x and eliminates over one million CPU cores while maintaining 99.97% availability.
Mayank Bansal, Milind Chabbi, Kenneth Bogh, Srikanth Prodduturi, Kevin Xu, Amit Kumar, David Bell, Ranjib Dey, Yufei Ren, Sachin Sharma, Juan Marcano, Shriniket Kale, Subhav Pradhan, Ivan Beschastnikh, Miguel Covarrubias, Chien-Chih Liao, Sandeep Koushik Sheshadri, Wen Luo, Kai Song, Ashish Samant, Sahil Rihan, Nimish Sheth, Uday Kiran Medisetty
Tue, 10 Ma · cs
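The criticality-based preemption described in the summary can be illustrated with a minimal sketch. This is not UFA's actual scheduler: the `Service` type, the `plan_failover` function, the service names, and all core counts are hypothetical and chosen only for illustration. The sketch assumes a simple greedy policy in which critical services are placed first and non-critical services keep running only if spare capacity remains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str       # hypothetical service name
    critical: bool  # True if the service must survive a failover
    cores: int      # cores needed to serve its failover load (illustrative)


def plan_failover(services, capacity_cores):
    """Greedy sketch: place critical services first, then fit non-critical
    services into whatever capacity is left; the rest are preempted."""
    kept, preempted = [], []
    remaining = capacity_cores
    # sorted() is stable, so services keep their input order within each class.
    for svc in sorted(services, key=lambda s: not s.critical):
        if svc.cores <= remaining:
            kept.append(svc.name)
            remaining -= svc.cores
        elif svc.critical:
            # Assumption: the region is always sized to hold all critical load.
            raise ValueError(f"not enough capacity for critical service {svc.name}")
        else:
            preempted.append(svc.name)
    return kept, preempted


services = [
    Service("trip-dispatch", critical=True, cores=60),
    Service("batch-reports", critical=False, cores=50),
    Service("payments", critical=True, cores=40),
    Service("ml-training", critical=False, cores=20),
]
kept, preempted = plan_failover(services, capacity_cores=130)
print(kept)       # ['trip-dispatch', 'payments', 'ml-training']
print(preempted)  # ['batch-reports']
```

In this toy run the surviving region has 130 cores against 170 cores of total demand, so one non-critical service is preempted while both critical services keep their full allocation. A real system would also need to handle ordering among non-critical services and restoring them after the failover ends, which this sketch leaves out.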