Enhancing OLAP Resilience at LinkedIn

This paper presents a holistic resiliency framework for Apache Pinot at LinkedIn, featuring Query Workload Isolation, Impact-Free Rebalancing, Maintenance Zone Awareness, and Adaptive Server Selection, which collectively ensure stable subsecond query latency and high availability for petabyte-scale OLAP workloads under failures and load spikes.

Praveen Chaganlal, Jia Guo, Vivek Vaidyanathan, Dino Occhialini, Sonam Mandal, Subbu Subramaniam, Siddharth Teotia, Tianqi Li, Xiaxuan Gao, Florence Zhang

Published Tue, 10 Ma

Imagine LinkedIn as a massive, high-speed library where millions of people are constantly asking questions about data (like "Who viewed my profile?" or "What ads should I show?"). This library is so big it holds petabytes of books (data), and it needs to answer questions in the blink of an eye (sub-second latency).

The system running this library is called Apache Pinot. But like any giant library, it faces three big problems:

  1. The "Noisy Neighbor" Problem: One person asking a super-hard question can tie up the librarian, making everyone else wait.
  2. The "Moving Shelves" Problem: When the library expands or fixes a broken shelf, moving the books usually causes chaos and stops people from finding things.
  3. The "Slow Librarian" Problem: If one librarian is having a bad day (slow computer), the whole group of librarians assigned to that section slows down, even if the others are fast.

This paper describes how LinkedIn built three "superpowers" to fix these problems and keep the library running smoothly, even when things go wrong.


1. The "Fairness Budget" (Query Workload Isolation)

The Problem: Imagine a shared kitchen where everyone is cooking. If one person decides to deep-fry a whole turkey (a heavy, complex query), they might use up all the stove burners and oil. Suddenly, the person trying to make a quick sandwich (a simple, fast query) has to wait forever. This is the "noisy neighbor" problem.

The Solution: LinkedIn introduced Query Workload Isolation (QWI).

  • The Analogy: Think of this as giving every group of cooks a personal budget of stove-time and oil.
  • How it works: Before a cook starts, the system checks their budget. If they try to use more oil than they have left, the system gently stops them before they clog up the stove.
  • The Magic: The budget check takes less than a millisecond, so it doesn't slow down the well-behaved cooks. If one group tries to hog resources, the rest of the kitchen keeps cooking at full speed. It's like having a bouncer who stops the rowdy party-goers from blocking the exit, without kicking out the polite guests.
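
The budget check described above can be sketched in a few lines of Python. This is a minimal illustration, not Pinot's implementation: the class names (WorkloadBudget, WorkloadIsolator), the choice of CPU milliseconds and memory bytes as the budgeted "stove-time and oil", and the idea that each query arrives with an up-front cost estimate are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class WorkloadBudget:
    """Resources one workload may still spend (hypothetical units)."""
    cpu_ms: float       # remaining CPU time, in milliseconds
    memory_bytes: int   # remaining memory, in bytes


class BudgetExceeded(Exception):
    """Raised when a query would overspend its workload's budget."""


class WorkloadIsolator:
    def __init__(self, budgets):
        # budgets: workload name -> WorkloadBudget
        self._budgets = dict(budgets)

    def remaining(self, workload):
        return self._budgets[workload]

    def admit(self, workload, est_cpu_ms, est_memory_bytes):
        """Check the budget before the query runs, then reserve its share."""
        budget = self._budgets[workload]
        if est_cpu_ms > budget.cpu_ms or est_memory_bytes > budget.memory_bytes:
            # Reject only this workload's query; other workloads are untouched.
            raise BudgetExceeded(f"workload {workload!r} is over budget")
        budget.cpu_ms -= est_cpu_ms
        budget.memory_bytes -= est_memory_bytes
```

In this sketch, a workload that has spent its budget gets BudgetExceeded on its next admit call, while a different workload with budget left is admitted as usual. The check is a couple of comparisons and subtractions, which is one way such a gate can stay cheap.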

2. The "Smart Moving Day" (Maintenance Zone Awareness & Impact-Free Rebalancing)

The Problem: Libraries need to move books around when they add new shelves or fix broken ones. Usually, this is a disaster: you have to take books off the shelves, move them, and put them back. While you're doing this, the books are missing, and people can't find them. Worse, if every copy of a book ends up in the same building and that building catches fire (a "zone failure"), that book is gone.

The Solution: LinkedIn created a Smart Moving Day strategy.

  • The Analogy: Imagine the library is spread across different buildings (Maintenance Zones).
    • Part A (Smart Placement): When you get new books, you don't just throw them in the nearest building. You spread them out so that if one building burns down, you still have copies of every book in the other buildings. The system uses a "greedy swap" algorithm to shuffle books around until they are perfectly balanced across all buildings.
    • Part B (Impact-Free Moving): When it's time to move books to a new shelf, the system doesn't just grab them and run. It first tells the librarians: "Stop asking for these specific books for a second." Once the librarians stop asking, the movers quietly swap the books. Then, the librarians start asking again.
  • The Magic: Because the system waits for people to stop asking before moving the heavy stuff, nobody notices the move. The library never goes offline, and no one loses data, even during massive upgrades.
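
Both parts can be sketched in Python. This is an illustration under stated assumptions, not the algorithm from the paper: the data layout (each book, or segment, mapped to a list of buildings, or zones, one per copy), the function names, and the Router class are all hypothetical. Part A is modeled as swapping copies between two segments whenever that removes a case of two copies sharing a zone. Part B is modeled as pause, drain, move, resume.

```python
from collections import Counter


def collisions(placement):
    """Count copies that share a zone with another copy of the same segment."""
    return sum(len(zones) - len(set(zones)) for zones in placement.values())


def greedy_swap(placement):
    """Part A: swap copies between segments until no swap removes a collision.

    placement: segment -> list of zones, one entry per copy. A swap exchanges
    the zones of two copies, so the number of copies per zone never changes.
    """
    placement = {seg: list(zones) for seg, zones in placement.items()}
    improved = True
    while improved:
        improved = False
        for seg_a, zones_a in placement.items():
            crowded = next((z for z, n in Counter(zones_a).items() if n > 1), None)
            if crowded is None:
                continue  # every copy of seg_a already sits in its own zone
            for seg_b, zones_b in placement.items():
                if seg_b == seg_a or crowded in zones_b:
                    continue
                # A zone that seg_b occupies and seg_a does not.
                target = next((z for z in zones_b if z not in zones_a), None)
                if target is None:
                    continue
                zones_a[zones_a.index(crowded)] = target
                zones_b[zones_b.index(target)] = crowded
                improved = True
                break
    return placement


class Router:
    """Hypothetical stand-in for the layer that sends queries to segments."""

    def __init__(self):
        self.paused = set()         # segments that must not get new queries
        self.in_flight = Counter()  # segment -> queries currently running


def move_segment(router, locations, segment, new_server, wait=lambda: None):
    """Part B: pause routing, let running queries drain, move, then resume."""
    router.paused.add(segment)
    while router.in_flight[segment] > 0:
        wait()  # caller-supplied, e.g. a short sleep between checks
    locations[segment] = new_server
    router.paused.discard(segment)
```

Each swap in greedy_swap removes at least one collision and never creates one, so the loop always terminates. It can stop with collisions remaining if no single swap helps, which is the usual trade-off of a greedy approach.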

3. The "Smart GPS" (Adaptive Server Selection)

The Problem: Imagine a delivery service with 100 drivers. The dispatcher usually sends packages to drivers in a circle (Round Robin). But what if Driver #5 has a flat tire and is moving at 1 mph? The dispatcher keeps sending packages to Driver #5 because it's their turn. The whole delivery is delayed because of one slow driver.

The Solution: LinkedIn built an Adaptive Server Selection (ADSS) system.

  • The Analogy: Instead of a rigid circle, the dispatcher now has a live GPS for every driver.
  • How it works: The system constantly checks: "Who is fast right now? Who is stuck in traffic?" If Driver #5 is slow, the system instantly stops sending packages to them and routes them to Driver #6, who is zooming along.
  • The Magic: It's not just about avoiding the slow driver; it's about predicting traffic. If Driver #6 looks like they are about to get stuck, the system sends the package to Driver #7 instead. This happens automatically. Even if a driver has a sudden "flat tire" (a server glitch), the system reroutes traffic within seconds, so the customers (users) never feel the delay.
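
The "live GPS" idea can be sketched as a selector that scores every driver (server) and picks the best one for each package (query). This is a minimal sketch, not the ADSS implementation: the scoring formula (a moving average of observed latency multiplied by the number of outstanding requests plus one), the smoothing factor alpha, the starting latency, and all names are assumptions chosen for the example. The outstanding-request count stands in for "predicting traffic": a server with a growing backlog looks worse before its latency has visibly risen.

```python
class AdaptiveSelector:
    def __init__(self, servers, alpha=0.3, initial_ms=1.0):
        self.alpha = alpha  # weight given to the newest latency sample
        self.latency_ms = {s: initial_ms for s in servers}
        self.in_flight = {s: 0 for s in servers}

    def score(self, server):
        """Lower is better: smoothed latency scaled by current backlog."""
        return self.latency_ms[server] * (self.in_flight[server] + 1)

    def pick(self):
        """Choose the server with the lowest score and record the request."""
        server = min(self.latency_ms, key=self.score)
        self.in_flight[server] += 1
        return server

    def on_response(self, server, observed_ms):
        """Fold the observed latency into the server's moving average."""
        self.in_flight[server] = max(0, self.in_flight[server] - 1)
        old = self.latency_ms[server]
        self.latency_ms[server] = self.alpha * observed_ms + (1 - self.alpha) * old
```

With this scoring, a single slow response from one server raises its score enough that later picks go to its faster peers, and requests spread across those peers as their backlogs grow. A production selector would also need a way to probe a penalized server again so it can recover; that is left out here.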

The Big Picture

By combining these three tools, LinkedIn has built a library that is:

  1. Fair: No single user can crash the system for everyone else.
  2. Resilient: You can move shelves, fix buildings, or upgrade the roof without ever closing the doors.
  3. Smart: It instantly reroutes traffic away from trouble spots, keeping everything running at top speed.

These aren't just theoretical ideas; they are used every day by millions of LinkedIn users, ensuring that when you check your feed or search for a job, the answer comes back in under a second, no matter what's happening behind the scenes.