Affinity Tailor: Dynamic Locality-Aware Scheduling at Scale

Affinity Tailor is a userspace-guided kernel scheduling system that improves locality and throughput on large multicore systems by treating dynamic, demand-sized CPU sets as soft affinity hints rather than hard partitions, thereby balancing resource utilization with microarchitectural efficiency.

Original authors: Jin Xin Ng, Ori Livneh, Richard O'Grady, Josh Don, Peng Ding, Samuel Grossman, Luis Otero, Chris Kennelly, David Lo, Carlos Villavieja

Published 2026-05-01

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine a massive, high-tech kitchen with hundreds of stoves (the CPU cores), a crowd of cooks waiting to use them (the running threads), and a limited number of prep stations and pantries (the caches and memory). In a modern data center, this kitchen is incredibly crowded. To get the most out of the expensive equipment, the head chef (the operating system scheduler) tries to keep every single stove burning by constantly shuffling cooks between them.

This constant shuffling is the problem that Google's paper, "Affinity Tailor," addresses. Here is the story of how they fixed it, explained simply.

The Problem: The "Hot Potato" Shuffle

In the old way of doing things (Linux's Completely Fair Scheduler, or CFS), the scheduler's main rule was: "Never let a stove sit idle if there is a cook waiting to work."

To follow this rule, the scheduler would constantly grab a cook from one stove and drop them onto another one that just became free. While this kept the stoves busy, it caused a massive mess:

  • The "Cold Stove" Effect: Every time a cook moved to a new stove, they had to forget the recipe they were memorizing, throw away their pre-chopped ingredients (cached data), and re-learn the layout of the new station.
  • The Traffic Jam: Because cooks were constantly running back and forth across the kitchen to grab ingredients from a central pantry, the hallways got clogged. The pantry (memory bandwidth) couldn't keep up with the rush.
  • The Result: Even though the stoves were "busy," the food was taking longer to cook because the chefs were wasting time resetting their stations and fighting over the pantry.

The Solution: Affinity Tailor

Google introduced a new system called Affinity Tailor. Instead of forcing cooks to jump around to keep stoves busy, they gave each cooking team a "Preferred Zone."

Think of it like this:

  1. The "Soft" Fence: Instead of locking a team into a specific set of stoves (which would be a hard wall), Affinity Tailor paints a "Preferred Zone" on the floor. The system says, "Team A, please try to stay in this red zone. Team B, stay in this blue zone."
  2. The Prediction: A smart manager (a userspace controller) watches how much work each team is doing. If a team usually needs 3 stoves, the manager gives them a zone with 3 or 4 stoves. If they suddenly get a huge order (a "burst"), the manager lets them spill over into the neighboring zones temporarily, but only if necessary.
  3. The Benefit: Because the cooks stay in their "warm" zones, they keep their chopped ingredients and memorized recipes. They don't have to run across the kitchen to the pantry as often. The "hot" stove stays hot, and the "cold" stove stays cold for the next team that needs it.
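
For readers who think in code, the three steps above can be sketched roughly as follows. This is a toy illustration, not the paper's implementation: the function names, the 1.25 headroom factor, and the "spill over only when the zone has no idle stove" rule are all assumptions made for this example.

```python
import math

def size_preferred_zone(recent_cpu_usage, headroom=1.25):
    """Size a team's zone from its recent demand (measured in CPUs).

    The headroom factor is a made-up value for this sketch.
    """
    peak = max(recent_cpu_usage)
    return max(1, math.ceil(peak * headroom))

def pick_cpu(preferred, idle):
    """Pick a CPU for a waking task, treating the zone as a soft hint.

    preferred: set of CPU ids in the team's Preferred Zone
    idle:      set of CPU ids that are currently idle
    """
    in_zone = sorted(preferred & idle)
    if in_zone:
        return in_zone[0]      # stay on a "warm" stove
    spill = sorted(idle - preferred)
    if spill:
        return spill[0]        # burst: borrow a stove outside the zone
    return None                # nothing idle anywhere: the task waits

zone_size = size_preferred_zone([2.1, 3.0, 2.4])  # a team that needs about 3 stoves
zone = set(range(zone_size))
inside = pick_cpu(zone, idle={2, 3, 7})
outside = pick_cpu(zone, idle={7, 9})
```

The key point is the fallback in `pick_cpu`: a hard partition would return nothing when the zone is full, while a soft hint lets the team borrow a stove elsewhere. A real scheduler would also weigh how long to wait before spilling, which this sketch leaves out.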

How It Works in Real Life

The system uses two different strategies depending on the kitchen layout:

  • For "Chiplet" Kitchens (Split Pantries): Some modern processors are built like a city of small neighborhoods (chiplets), each with its own pantry. Moving between neighborhoods is slow and expensive. Affinity Tailor tries to keep entire teams inside a single neighborhood so they never have to cross the bridge.
  • For "Monolithic" Kitchens (One Big Pantry): In other processors, everyone shares one giant pantry. Here, the system focuses on keeping teams on specific stoves so they don't keep wiping each other's counters clean.
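
The chiplet strategy can be pictured as a packing problem. The sketch below is a hypothetical, simplified version (a largest-first, first-fit heuristic chosen for this example, not taken from the paper): it tries to place each team's zone entirely inside one chiplet and only spans chiplets when no single one has room.

```python
def assign_zones(demands, chiplets):
    """Assign each team a zone of CPU ids, preferring a single chiplet.

    demands:  {team name: number of CPUs wanted}
    chiplets: list of lists of CPU ids, one inner list per chiplet
    Both the data layout and the heuristic are assumptions for this sketch.
    """
    free = [list(c) for c in chiplets]   # copy so the input is not modified
    zones = {}
    # Place the biggest teams first; they are the hardest to fit.
    for team, need in sorted(demands.items(), key=lambda kv: -kv[1]):
        home = next((c for c in free if len(c) >= need), None)
        if home is not None:
            # The whole team fits in one neighborhood.
            zones[team] = [home.pop(0) for _ in range(need)]
        else:
            # No single chiplet has room: gather leftovers across chiplets.
            zone = []
            for c in free:
                while c and len(zone) < need:
                    zone.append(c.pop(0))
            zones[team] = zone
    return zones

zones = assign_zones(
    {"A": 3, "B": 3, "C": 2},
    chiplets=[[0, 1, 2, 3], [4, 5, 6, 7]],
)
```

In this example, teams A and B each get a zone inside a single chiplet, while team C ends up with one stove in each neighborhood because neither chiplet has two free stoves left.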

The Results: Speed vs. Waiting

The paper tested this in Google's real-world data centers (thousands of machines). Here is what they found:

  • Faster Cooking: The system made applications run 3% to 12% faster per CPU.
  • Less Waste: Because the cooks were more efficient, they needed less memory bandwidth, leading to 3% to 7% more throughput per gigabyte of memory.
  • The Trade-off: There was a small downside. Sometimes, a cook had to wait a tiny bit longer in line to get into their "Preferred Zone" instead of jumping to the nearest empty stove. This increased the "scheduling latency" (the time spent waiting) by up to 17% in the worst cases.

The Big Takeaway:
The paper argues that the old rule of "never let a stove sit idle" is actually hurting performance on modern, complex hardware. It is better to let a stove sit idle for a microsecond so the cook can stay in their "warm" zone and cook faster, rather than shuffling them to a "cold" zone to keep the stove busy.

In short: Affinity Tailor stops the chaotic shuffle, lets teams stay in their comfort zones, and lets the kitchen run smoother and faster.
