C-Koordinator: Interference-aware Management for Large-scale and Co-located Microservice Clusters

This paper presents C-Koordinator, an open-source platform developed at Alibaba that leverages multi-dimensional metrics to accurately predict CPI-based interference in large-scale, co-located microservice clusters, thereby achieving over 90.3% prediction accuracy and significantly reducing application latency across all percentiles compared to state-of-the-art systems.

Shengye Song, Minxian Xu, Zuowei Zhang, Chengxi Gao, Fansong Zeng, Yu Ding, Kejiang Ye, Chengzhong Xu

Published 2026-03-10

Imagine a massive, high-speed highway system where thousands of different vehicles are driving together. Some are Formula 1 race cars (critical apps like your bank or a live video stream) that need to go exactly 200 mph without any bumps. Others are delivery trucks (background tasks like data backups) that can go slower and don't mind a few potholes.

In the old days, cloud companies kept these vehicles on separate roads. But that's wasteful! So, they started putting them all on the same highway to save money. This is called co-location.

The Problem:
When the delivery trucks get too close to the race cars, they cause trouble. They might block the view, kick up dust, or hog the fuel. In computer terms, this is interference: co-located workloads compete for shared hardware such as CPU cores, caches, and memory bandwidth. The race car (your app) suddenly slows down, the "latency" (time it takes to react) spikes, and users get angry.

The paper introduces a new system called C-Koordinator to fix this. Here is how it works, using simple analogies:

1. The "Heartbeat" vs. The "Speedometer" (Why CPI?)

Usually, when a car slows down, you look at the speedometer (Response Time) to see the problem. But in a cloud, the speedometer is tricky. Sometimes a car slows down because the driver is tired (network issues), not because another car is blocking it.

C-Koordinator uses a different tool: CPI (Cycles Per Instruction).

  • The Analogy: Imagine a mechanic listening to the engine's rhythm. If the engine is running smoothly, the rhythm is steady. If another car is bumping into it or stealing its fuel, the engine starts "stuttering" or "choking."
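  • The Magic: CPI is the average number of CPU cycles the processor spends per executed instruction, read from hardware performance counters. When a neighbor pollutes the shared cache or saturates memory bandwidth, the same instructions take more cycles and CPI rises. It doesn't care what the car is doing; it only registers that the engine is struggling, which makes it a more reliable way to spot trouble before users feel the slowdown.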
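
To make the "engine rhythm" concrete, here is a minimal sketch of the arithmetic behind CPI. The counter readings are invented for illustration, and the helper names (`cpi`, `cpi_inflation`) are hypothetical, not taken from C-Koordinator.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction over one sampling window."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def cpi_inflation(observed: float, baseline: float) -> float:
    """Relative CPI increase over a baseline measured without noisy neighbors."""
    return (observed - baseline) / baseline


# Hypothetical counter readings for the same app in two sampling windows.
quiet = cpi(cycles=1_200_000, instructions=1_000_000)  # no noisy neighbor
noisy = cpi(cycles=1_800_000, instructions=1_000_000)  # neighbor thrashing the cache

print(f"quiet={quiet:.2f} noisy={noisy:.2f} inflation={cpi_inflation(noisy, quiet):.0%}")
```

The app executes the same number of instructions in both windows, but needs 50% more cycles in the noisy one. That extra cost per instruction is the "stutter" the mechanic hears.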

2. The Crystal Ball (The Prediction Model)

The system doesn't just wait for the engine to stutter; it tries to predict it.

  • The Analogy: Think of C-Koordinator as a super-smart traffic controller with a crystal ball. It watches the weather (system load), the number of trucks (resource usage), and the road conditions (cache misses).
  • How it works: It uses a smart algorithm (XGBoost, a gradient-boosted decision tree library, which is like a very fast, experienced coach) to look at these clues and say, "Hey, in 5 seconds, that delivery truck is going to block the race car's lane!"
  • The Result: It predicts interference with over 90.3% accuracy. It's like knowing a traffic jam is coming before you even see the brake lights.
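
The idea of learning CPI from system metrics can be sketched in a few lines. The example below is a toy stand-in, not XGBoost and not the paper's model: it is plain gradient boosting over one-split decision stumps, written with the standard library only. The three features and all training values are invented for illustration.

```python
from statistics import mean

# Synthetic training rows (invented for illustration):
# (node CPU utilisation, batch CPU share, cache misses per 1000 instructions)
ROWS = [
    (0.30, 0.10, 2.0), (0.35, 0.15, 2.5), (0.50, 0.30, 4.0), (0.60, 0.35, 5.0),
    (0.75, 0.50, 8.0), (0.80, 0.55, 9.0), (0.90, 0.65, 12.0), (0.95, 0.70, 13.0),
]
CPI = [1.10, 1.15, 1.30, 1.40, 1.75, 1.85, 2.20, 2.30]  # observed CPI per row


def fit_stump(rows, residuals):
    """Find the one-feature threshold split that minimises squared error."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [res for r, res in zip(rows, residuals) if r[f] <= t]
            right = [res for r, res in zip(rows, residuals) if r[f] > t]
            if not left or not right:
                continue  # skip splits that leave one side empty
            lv, rv = mean(left), mean(right)
            err = sum((x - lv) ** 2 for x in left) + sum((x - rv) ** 2 for x in right)
            if best is None or err < best[0]:
                best = (err, f, t, lv, rv)
    # Assumes at least one valid split exists (true for the rows above).
    return best[1:]  # (feature index, threshold, left value, right value)


def fit_boosted(rows, targets, rounds=20, lr=0.3):
    """Fit stumps one after another, each on the residuals left by the previous ones."""
    base = mean(targets)
    preds = [base] * len(rows)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(targets, preds)]
        f, t, lv, rv = fit_stump(rows, residuals)
        stumps.append((f, t, lv, rv))
        preds = [p + lr * (lv if r[f] <= t else rv) for p, r in zip(preds, rows)]
    return base, lr, stumps


def predict(model, row):
    """Sum the base value and every stump's shrunken contribution."""
    base, lr, stumps = model
    return base + sum(lr * (lv if row[f] <= t else rv) for f, t, lv, rv in stumps)


model = fit_boosted(ROWS, CPI)
quiet = predict(model, (0.32, 0.12, 2.2))   # lightly loaded node
busy = predict(model, (0.92, 0.68, 12.5))   # heavily co-located node
print(f"predicted CPI: quiet={quiet:.2f} busy={busy:.2f}")
```

Each round fits a small tree to whatever error the previous rounds left behind, which is the core idea behind gradient-boosted trees. A production library such as XGBoost adds deeper trees, regularisation, and many engineering optimisations on top of it.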

3. The Traffic Cop (The Mitigation Strategy)

Once the crystal ball predicts a problem, C-Koordinator acts immediately. It has two levels of enforcement:

  • Level 1: The Gentle Nudge (CPU Suppress)
    • Scenario: The interference is mild. The delivery truck is just a little too close.
    • Action: The traffic cop gently taps the delivery truck's brakes, telling it, "Slow down a bit, let the race car pass." The truck still moves, but it gives the race car more space.
  • Level 2: The Tow Truck (Pod Eviction)
    • Scenario: The interference is severe. The delivery truck is completely blocking the race car.
    • Action: The traffic cop calls a tow truck. It forcibly removes the delivery truck from this stretch of highway (the pod is evicted from the machine and can be rescheduled elsewhere) to clear the lane instantly for the race car.
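
The two enforcement levels can be sketched as a small decision function. This is an illustrative sketch only: the thresholds, the `Action` names, and the `suppressed_quota` helper are assumptions made for the example, not values or APIs from C-Koordinator.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    CPU_SUPPRESS = "cpu_suppress"  # Level 1: throttle the low-priority pod
    EVICT = "evict"                # Level 2: remove the low-priority pod


# Hypothetical thresholds on the predicted CPI inflation of the critical app.
MILD = 0.10
SEVERE = 0.40


def choose_action(predicted_inflation: float) -> Action:
    """Map predicted interference severity to an enforcement level."""
    if predicted_inflation >= SEVERE:
        return Action.EVICT
    if predicted_inflation >= MILD:
        return Action.CPU_SUPPRESS
    return Action.NONE


def suppressed_quota(current_cores: float, predicted_inflation: float,
                     floor: float = 0.5) -> float:
    """Shrink the batch pod's CPU allowance in proportion to the predicted
    interference, but never below `floor` cores so it keeps making progress."""
    return max(floor, current_cores * (1.0 - predicted_inflation))


for inflation in (0.05, 0.25, 0.60):
    print(inflation, choose_action(inflation).value)
```

Trying the cheaper action first matters because throttling is easy to undo, while an evicted pod loses its in-progress work on that machine and has to be restarted elsewhere.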

Why is this a big deal?

Before this system, if a race car slowed down, the highway operator (the cloud provider) often didn't know why until it was too late. By the time they fixed it, the user had already had a bad experience.

C-Koordinator changes the game by:

  1. Listening to the engine (CPI) instead of just watching the speed.
  2. Predicting the crash before it happens.
  3. Acting instantly to protect the important apps.

The Bottom Line:
In the real world, this system has been tested on Alibaba's massive network (millions of apps!). Compared with state-of-the-art systems, it reduced the "lag" (latency) for users by 16% to 36%. That means your video calls are smoother, your online shopping is faster, and the cloud is running more efficiently, all while keeping the "delivery trucks" and "race cars" happy on the same highway.