Serving Compound Inference Systems on Datacenter GPUs

JigsawServe is a serving framework for compound inference systems that jointly selects model variants and spatially partitions GPUs to balance latency, accuracy, and GPU resource cost. It achieves up to 11.3x higher throughput and significantly better resource efficiency than prior serving systems.

Sriram Devata, Rahul Singh, Sarita Adve

Published Wed, 11 Ma

Imagine you run a busy, high-end restaurant kitchen. In the past, every order was simple: "Make me a burger." You had one chef for burgers, and that was it.

But today, customers are ordering "Compound Meals."

  • "I want a burger, but first, check if the meat is fresh, then grill it, then slice it, then wrap it, and finally, write a poem about the burger on the napkin."

This is what Compound Inference Systems are in the world of AI. Instead of one AI model doing one job, a single request triggers a chain of different AI models (a "task graph") working together.

The problem? Your kitchen (the Datacenter GPU) is huge, but most chefs (AI models) are either too big for the counter or sit around doing nothing while waiting for ingredients. You have a limited number of chefs, but a massive amount of complex orders coming in.

Enter JIGSAWSERVE, the new system proposed in this paper. Think of it as a Master Kitchen Manager who uses a "Jigsaw Puzzle" approach to solve the chaos.

Here is how it works, broken down into simple concepts:

1. The Three Superpowers of JIGSAWSERVE

To handle these complex orders efficiently, JIGSAWSERVE uses three tricks that previous managers didn't use all at once:

  • Trick A: The "Menu Flexibility" (Accuracy Scaling)

    • The Old Way: You only have one recipe for "Grilled Burger." It takes 10 minutes and tastes perfect.
    • JIGSAWSERVE: You have a menu of 5 recipes. One takes 10 minutes and is perfect. Another takes 2 minutes and is "pretty good." Another takes 30 seconds and is "okay."
    • The Magic: If the kitchen is swamped, the manager says, "For the burger wrapping step, let's use the 30-second 'okay' recipe because it doesn't ruin the whole meal." But for the "meat freshness check," they use the perfect 10-minute recipe because that's critical. They dial the quality up or down depending on how busy the kitchen is, ensuring the final meal is still good enough without wasting time.
  • Trick B: The "Smart Counter" (GPU Spatial Partitioning)

    • The Old Way: You have a giant counter (a powerful GPU). You assign one chef to the whole counter. If that chef is small, 80% of the counter sits empty. If the chef is huge, they can't fit.
    • JIGSAWSERVE: Imagine the counter is made of Lego bricks. The manager can slice the counter into tiny, isolated islands. One small island for a tiny chef, a medium island for a medium chef.
    • The Magic: Now, instead of one chef hogging the whole counter, you can fit four different chefs on one counter, working on four different parts of the order at the same time, without bumping into each other. This is called Spatial Partitioning.
  • Trick C: The "Map of the Order" (Task-Graph-Informed Budgeting)

    • The Old Way: The manager gives every step of the order the same amount of time and resources. "Everyone gets 5 minutes!"
    • JIGSAWSERVE: The manager looks at the whole order map. "Wait, the 'poem writing' step is fast and doesn't need a fancy chef. But the 'grilling' step is the bottleneck. Let's give the grilling step 3 chefs and the poem step 1 tiny chef."
    • The Magic: They allocate resources based on the flow of the order, not just a flat rule.
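The three tricks above can be sketched in a few lines of code. This is a deliberately simplified toy, not the paper's actual planner: all model names, latencies, and accuracy scores are invented, and the seven-slice GPU loosely mirrors MIG-style partitioning rather than any specific hardware configuration.

```python
# Illustrative sketch of Tricks A-C. All numbers are made up.
from dataclasses import dataclass

@dataclass
class Variant:
    name: str
    latency_ms: float   # runtime on a full GPU
    accuracy: float     # stand-in quality score, 0..1

# Trick A: each stage of the task graph has a menu of variants.
menu = {
    "freshness_check": [Variant("large", 10.0, 0.99), Variant("small", 2.0, 0.90)],
    "grill":           [Variant("large", 30.0, 0.98), Variant("small", 6.0, 0.92)],
    "wrap":            [Variant("large", 8.0, 0.97),  Variant("tiny", 0.5, 0.85)],
}

def pick_variants(menu, accuracy_floor):
    """For each stage, take the fastest variant that still clears that
    stage's accuracy floor (dial quality down where it's safe)."""
    chosen = {}
    for stage, variants in menu.items():
        ok = [v for v in variants if v.accuracy >= accuracy_floor[stage]]
        chosen[stage] = min(ok, key=lambda v: v.latency_ms)
    return chosen

def split_gpu(chosen, total_slices=7):
    """Tricks B + C: slice one GPU into isolated partitions and give each
    stage a share roughly proportional to its latency, so the bottleneck
    stage gets the most hardware. (A real planner would rebalance so the
    rounded shares exactly fill the GPU.)"""
    total = sum(v.latency_ms for v in chosen.values())
    return {stage: max(1, round(total_slices * v.latency_ms / total))
            for stage, v in chosen.items()}
```

With a strict floor on the critical freshness check and looser floors elsewhere, the planner keeps the "perfect" model only where it matters, and the freshness stage ends up with more GPU slices than the cheap wrapping stage.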

2. The "Jigsaw" Analogy

Why is it called JIGSAWSERVE?

Imagine you have a giant puzzle (the total computing power of the datacenter).

  • Old systems tried to force big, square puzzle pieces (whole GPUs) into the puzzle. They didn't fit well, leaving huge gaps of wasted space.
  • JIGSAWSERVE realizes that the puzzle pieces (AI tasks) come in all different shapes and sizes. Some are tiny, some are huge.
  • The system acts like a puzzle solver that picks the perfectly shaped pieces (model variants) and cuts the big puzzle board (GPUs) into custom shapes (spatial partitions) so that every single inch of the board is filled perfectly with no gaps.
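The packing intuition can be shown with a toy first-fit-decreasing bin packer. This is a stand-in for illustration, not JigsawServe's actual placement algorithm: each "piece" is the number of GPU slices a task's chosen variant needs, and each "board" is one GPU with seven slices.

```python
# Toy first-fit-decreasing packing of slice demands onto GPUs.
def pack(pieces, board_size=7):
    """Place each piece (slices needed) onto the first GPU with room,
    opening a new GPU only when no existing one fits.
    Returns the slices used on each GPU."""
    boards = []
    for piece in sorted(pieces, reverse=True):  # biggest pieces first
        for i, used in enumerate(boards):
            if used + piece <= board_size:
                boards[i] += piece
                break
        else:
            boards.append(piece)  # no gap big enough: open a new GPU
    return boards

print(pack([4, 3, 3, 2, 1, 1]))  # six tasks fit on two fully used GPUs
```

Giving every task a whole GPU here would need six GPUs; matching piece shapes to gaps fills two GPUs completely, which is the "no wasted space" picture above.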

3. The Results: A Miracle in Efficiency

The researchers tested this in a real datacenter setting (a kitchen with four datacenter-class GPUs).

  • The Result: JIGSAWSERVE could handle 11.3 times more orders than the best previous system.
  • The Cost: It used only 43% of the available GPU power to meet the same goals.
  • The Quality: The food (accuracy) was still delicious (accuracy goals were met), and orders almost never missed their latency deadlines.

4. Why This Matters

Before this, if you wanted to run complex AI (like an AR assistant that sees an object, describes it, and speaks to you), you needed a massive, expensive server farm, and it was often inefficient.

JIGSAWSERVE proves that by being smart about how we slice the hardware and flexible about the quality of the AI models, we can:

  1. Save massive amounts of money and electricity.
  2. Serve more users with the same equipment.
  3. Make complex AI applications (like self-driving cars or VR assistants) faster and more reliable.

In short: JIGSAWSERVE is the ultimate kitchen manager that knows exactly which chef to hire, how to cut the counter, and when to serve a "good enough" meal versus a "perfect" one, ensuring the restaurant runs at maximum efficiency without ever serving a bad meal.