Uber's Failover Architecture: Reconciling Reliability and Efficiency in Hyperscale Microservice Infrastructure

Uber's Failover Architecture (UFA) replaces its costly uniform 2x capacity model with a differentiated, criticality-based approach that opportunistically shares resources and preempts non-critical services during peak failovers, thereby reducing steady-state provisioning from 2x to 1.3x and eliminating over one million CPU cores while maintaining 99.97% availability.

Mayank Bansal, Milind Chabbi, Kenneth Bogh, Srikanth Prodduturi, Kevin Xu, Amit Kumar, David Bell, Ranjib Dey, Yufei Ren, Sachin Sharma, Juan Marcano, Shriniket Kale, Subhav Pradhan, Ivan Beschastnikh, Miguel Covarrubias, Chien-Chih Liao, Sandeep Koushik Sheshadri, Wen Luo, Kai Song, Ashish Samant, Sahil Rihan, Nimish Sheth, Uday Kiran Medisetty

Published Tue, 10 Ma

Imagine Uber as a massive, bustling city where millions of people are constantly moving around, ordering food, or hailing rides. To keep this city running smoothly, Uber needs a huge fleet of computers (servers) to process every request.

The Old Problem: The "Double-Booking" Nightmare

For years, Uber operated like a hotel that always keeps two identical rooms for every single guest, just in case one room catches fire.

  • The Setup: They had two giant data centers (let's call them "Region A" and "Region B").
  • The Rule: Each region had to hold enough capacity to handle all of Uber's traffic on its own if the other region suddenly vanished.
  • The Waste: This meant Uber was paying for double the computers it actually needed. Most of the time, half of the fleet's capacity sat idle, doing nothing. It was like running two buses on every route, just in case one breaks down, leaving the second bus empty and expensive.

The Big Realization

Uber's engineers looked at the data and found a surprising truth: Catastrophic failures (where a whole region goes dark) are incredibly rare. Added together, they last less than 20 hours a year.

  • The Insight: Why pay for a "double bus" fleet 99.8% of the time when a disaster is so unlikely? They needed a smarter way to use their computers.
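
The 99.8% figure follows from the 20-hour estimate. Here is a quick check, assuming a 365-day year and treating 20 hours as the upper bound (the function name is made up for this illustration):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def share_of_year_without_failover(failover_hours: float) -> float:
    """Fraction of the year in which no region-wide failover is happening."""
    return 1 - failover_hours / HOURS_PER_YEAR


share = share_of_year_without_failover(20)
print(f"{share:.1%}")  # prints 99.8%
```

In other words, the spare capacity is needed for roughly 0.2% of the year at most.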

The Solution: Uber's Failover Architecture (UFA)

Think of UFA as a smart, tiered emergency plan that treats different services differently, much like a hospital triage system.

1. Sorting the Passengers (Service Tiers)

Uber realized not all services are equally important.

  • VIPs (Critical Services): These are the life-or-death apps, like matching a rider to a driver or processing a payment. If these stop, the business stops.
  • Regulars (Non-Critical Services): These are things like "showing a user their past ride history," "sending a promotional email," or "running internal tests." If these pause for a bit, the world doesn't end.

2. The "Smart Overbooking" Strategy

Instead of keeping a second full fleet of computers sitting idle, UFA uses a capacity sharing model:

  • Steady State (Normal Days): The "VIP" computers have a little extra space reserved for emergencies. The "Regular" services are allowed to sit in this extra space, using the idle power. It's like letting standby passengers use the empty reserved seats on a bus, as long as they promise to give them up immediately if a VIP needs to sit.
  • The Emergency (Failover): If Region A goes dark and everyone rushes to Region B:
    1. The VIPs get the seats: The system instantly kicks the "Regular" services off the bus (or pauses them).
    2. The Regulars move to the "Burst" zone: These paused services are quickly moved to a separate overflow area (such as capacity from a cloud provider) that can handle them temporarily.
    3. The VIPs take over: Now, the VIP services have all the power they need to handle the full load of the city.
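
The three failover steps above can be sketched in a few lines of Python. This is only a toy model, not Uber's implementation: the class names, the service names, the core counts, and the assumption that critical load exactly doubles during a failover are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    critical: bool
    cores: int  # cores the service uses in this region on a normal day


@dataclass
class Region:
    capacity: int  # total cores available in the region (hypothetical)
    services: list[Service] = field(default_factory=list)
    burst_zone: list[Service] = field(default_factory=list)

    def used(self) -> int:
        return sum(s.cores for s in self.services)

    def fail_over(self, surge_factor: float = 2.0) -> None:
        """Absorb the failed region's traffic for critical services."""
        # Step 1: preempt every non-critical service.
        preempted = [s for s in self.services if not s.critical]
        self.services = [s for s in self.services if s.critical]
        # Step 2: hand the preempted services to the burst zone.
        self.burst_zone.extend(preempted)
        # Step 3: critical services grow into the freed capacity.
        for s in self.services:
            s.cores = int(s.cores * surge_factor)
        if self.used() > self.capacity:
            raise RuntimeError("critical services do not fit after failover")


region_b = Region(
    capacity=130,
    services=[
        Service("ride-matching", critical=True, cores=40),
        Service("payments", critical=True, cores=20),
        Service("ride-history", critical=False, cores=50),
        Service("promo-email", critical=False, cores=20),
    ],
)
steady_state_used = region_b.used()  # the region is fully used on a normal day
region_b.fail_over()
print(steady_state_used, region_b.used(), [s.name for s in region_b.burst_zone])
```

On a normal day the region is full because the Regular services occupy the reserved space. After the failover, only the VIP services remain, and they fit within the same capacity.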

3. The Safety Net: "Fail-Open" vs. "Fail-Close"

This is the trickiest part. In the old days, if a "Regular" service (like a weather widget) crashed, it might accidentally take down a "VIP" service (like the ride-matching engine) because they were tangled together.

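  • The Fix: Uber spent years untangling these knots. They built tools to ensure that if a "Regular" service fails, the "VIP" service simply ignores it and keeps working. They call this "Fail-Open." The opposite, "Fail-Close," is when the VIP service stops working because something it depends on has failed.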
  • The Analogy: Imagine a restaurant. If the dessert menu breaks, the kitchen shouldn't stop cooking steaks. The steak chef (VIP) needs to be able to say, "Okay, no desserts today, but we'll keep grilling steaks."
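
In code, fail-open behavior usually comes down to catching the failure of the optional call and carrying on with a safe default. The sketch below follows the restaurant analogy; the helper `fail_open` and the functions around it are hypothetical names for this example, not Uber's tooling.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")


def fail_open(call: Callable[[], T], fallback: T, dependency: str) -> T:
    """Run a non-critical call; on any error, log it and return the fallback."""
    try:
        return call()
    except Exception:
        logging.warning("%s failed; continuing without it", dependency)
        return fallback


def fetch_dessert_menu() -> list[str]:
    # Stand-in for a "Regular" service that is currently down.
    raise TimeoutError("dessert service is down")


def take_order() -> dict:
    # The steak order must succeed even if the dessert menu is unavailable.
    desserts = fail_open(fetch_dessert_menu, fallback=[], dependency="dessert-menu")
    return {"mains": ["steak"], "desserts": desserts}


order = take_order()
print(order)  # prints {'mains': ['steak'], 'desserts': []}
```

A fail-close version would let the `TimeoutError` escape from `take_order`, so the dessert outage would also stop the steak order.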

The Results: A Win-Win

By implementing this architecture, Uber achieved something amazing:

  • Cost Savings: They eliminated over 1 million CPU cores (a massive chunk of their computer fleet). With provisioning down from 2x to 1.3x, that's like selling off most of the empty spare buses and cutting the cost that came with them.
  • Efficiency: Instead of running at 20% utilization (mostly sitting idle), their computers now run at 30%, getting more useful work out of each machine.
  • Reliability: Despite using fewer computers, their critical services (rides and payments) stayed up 99.97% of the time, even during real disasters.
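
The jump from 20% to 30% lines up with the move from 2x to 1.3x provisioning. The check below is back-of-the-envelope arithmetic, assuming the same amount of work is simply spread over a smaller fleet and that the provisioning ratio applies to the whole fleet. Both are simplifications, and the function names are made up for this example.

```python
def utilization_after(util_before: float, ratio_before: float, ratio_after: float) -> float:
    """Utilization once the fleet shrinks, assuming the total work is unchanged."""
    return util_before * ratio_before / ratio_after


def fleet_reduction(ratio_before: float, ratio_after: float) -> float:
    """Fraction of the original fleet that is no longer needed."""
    return 1 - ratio_after / ratio_before


new_util = utilization_after(0.20, ratio_before=2.0, ratio_after=1.3)
removed = fleet_reduction(ratio_before=2.0, ratio_after=1.3)
print(f"{new_util:.1%}")  # prints 30.8%
print(f"{removed:.0%}")   # prints 35%
```

Under these assumptions, the same work on a fleet 35% smaller gives roughly the 30% utilization reported above.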

The Human Element

The paper emphasizes that the technology wasn't the hardest part; the people were.

  • Uber has over 6,000 different software teams. Convincing all of them to change how they write code, test their apps, and handle emergencies was a massive cultural shift.
  • They used "drills" (fake disasters) to practice, ensuring that when a real emergency happened, the system could automatically kick out the low-priority apps and save the high-priority ones in minutes, not hours.

In Summary

Uber stopped keeping an identical spare parked next to every computer it runs. Instead, they built a smart, flexible fleet where:

  1. Important jobs get guaranteed priority.
  2. Less important jobs fill the empty seats when things are calm.
  3. When chaos hits, the less important jobs politely step aside to let the important ones finish the job, then they quickly find a new spot to wait until things calm down.

This saved them a fortune while keeping the city moving safely.