Multi-DNN Inference of Sparse Models on Edge SoCs

This paper introduces SparseLoom, a system that employs model stitching to recombine subgraphs from sparse models without re-training, thereby significantly improving throughput, reducing memory overhead, and lowering Service Level Objective violation rates for multi-DNN inference on edge SoCs compared to state-of-the-art systems.

Jiawei Luo, Di Wu, Simon Dobson, Blesson Varghese

Published Wed, 11 Ma

Imagine you are running a busy food truck (the Edge SoC) that has to serve three different types of customers simultaneously: a hungry student who wants a quick sandwich (Speech Recognition), a tourist who wants a detailed map of the city (Image Classification), and a business traveler who needs a complex report (Sentiment Analysis).

Your food truck has three different chefs working in the kitchen:

  1. Chef CPU: Slow but very versatile and good at complex logic.
  2. Chef GPU: Fast at chopping and slicing (great for images).
  3. Chef NPU: A specialized robot chef who is incredibly fast at specific cooking tasks but can't do everything.

The Problem: The "One-Size-Fits-All" Menu

Currently, most food trucks operate with a fixed menu.

  • If the tourist wants a sandwich, you must give them the "Standard Sandwich."
  • If the business traveler wants a report, you must give them the "Standard Report."

The problem is that sometimes the "Standard Sandwich" takes too long to make (violating the tourist's time limit), or it tastes too bland (violating the quality requirement). Sometimes you have a "Lite Sandwich" (a pruned model) that is fast but only tastes okay, or a "Budget Sandwich" (a quantized model) made with cheaper ingredients that is fast but slightly less refined.

Existing systems try to pick the best pre-made sandwich from a small list. But if the customer has a very strict requirement (e.g., "I need a sandwich in 5 seconds that tastes 95% like the premium one"), and your menu's only 5-second option tastes 80% as good, you fail the customer. This is called a Service Level Objective (SLO) violation.

The Solution: "Model Stitching" (The Custom Sandwich Bar)

The paper introduces a new system called SparseLoom. Instead of just picking a pre-made sandwich, SparseLoom acts like a Master Chef who can instantly recombine ingredients from different recipes to create a perfect custom sandwich on the fly.

This technique is called Model Stitching.

  • How it works: Imagine you have three recipes:

    1. Recipe A (Dense): The full, heavy, delicious recipe.
    2. Recipe B (Pruned): A recipe where you removed the heavy meat to make it lighter/faster.
    3. Recipe C (Quantized): A recipe where you swapped expensive spices for cheaper, faster ones.

    Usually, you serve the whole Recipe A, B, or C.
    SparseLoom says: "Let's take the bun from Recipe A (because it's tasty), the patty from Recipe B (because it's fast), and the sauce from Recipe C (because it's cheap)."

    By stitching these parts together, you create a Brand New Sandwich that didn't exist before. It's fast enough for the tourist but tasty enough for the business traveler. And the best part? You don't need to retrain the chefs. You just rearrange the existing ingredients.
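The stitching idea above can be sketched in a few lines of Python. This is a toy illustration, not the paper's actual API: the variant names, stage split, and `stitch` helper are all invented. The key point is that each model variant is divided into the same sequence of stages, and a "stitch" simply picks one variant per stage.

```python
# Illustrative model-stitching sketch (all names are hypothetical).
# Each variant of a 3-stage network is split at the same stage boundaries.
VARIANTS = {
    "dense":     ["dense_head", "dense_body", "dense_tail"],
    "pruned":    ["pruned_head", "pruned_body", "pruned_tail"],
    "quantized": ["quant_head", "quant_body", "quant_tail"],
}

def stitch(choices):
    """Build a new model plan by taking stage i from variant choices[i]."""
    return [VARIANTS[v][i] for i, v in enumerate(choices)]

# The "bun" from the dense recipe, the "patty" from the pruned one,
# and the "sauce" from the quantized one:
plan = stitch(["dense", "pruned", "quantized"])
print(plan)  # ['dense_head', 'pruned_body', 'quant_tail']
```

No retraining happens anywhere in this loop: the stitched plan is just a new arrangement of subgraphs that already exist.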

The Three Big Challenges (and how SparseLoom solves them)

Creating these custom sandwiches sounds great, but it's chaotic. The paper identifies three big problems and solves them with three smart tools:

1. The "Too Many Options" Problem (Profiling Cost)

If you have 10 recipes and you can mix and match 3 parts, you suddenly have thousands of possible sandwiches. Testing every single one to see how long it takes to cook would take forever.

  • The Fix: SparseLoom uses a Crystal Ball (Estimators). Instead of actually cooking every single sandwich to time it, the system uses math to predict how long it will take and how good it will taste based on the ingredients used. This saves 99% of the time!
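One simple form such a "crystal ball" could take (a hedged sketch, not the paper's actual estimator) is an additive model: profile each stage variant once, then predict a stitched model's latency as the sum of its stages' latencies. The per-stage numbers below are invented for illustration.

```python
# Hypothetical additive latency estimator. Each (stage, variant) pair is
# profiled exactly once; stitched combinations are never run to be timed.
STAGE_LATENCY_MS = {
    ("head", "dense"): 12.0, ("head", "pruned"): 7.0, ("head", "quantized"): 5.0,
    ("body", "dense"): 30.0, ("body", "pruned"): 18.0, ("body", "quantized"): 14.0,
    ("tail", "dense"): 8.0,  ("tail", "pruned"): 5.0,  ("tail", "quantized"): 4.0,
}

def estimate_latency(plan):
    """plan maps each stage name to its chosen variant."""
    return sum(STAGE_LATENCY_MS[(stage, variant)]
               for stage, variant in plan.items())

# Predict the custom sandwich without ever cooking it:
print(estimate_latency({"head": "dense", "body": "pruned", "tail": "quantized"}))
# 12.0 + 18.0 + 4.0 = 34.0
```

With 3 variants per stage and 3 stages there are 27 possible stitches but only 9 profiling runs, and the gap widens rapidly as stages and variants grow.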

2. The "Wrong Chef" Problem (Processor Placement)

Just because you made a custom sandwich doesn't mean you know which chef should cook which part.

  • Scenario: If you give the "heavy meat" part to the slow CPU, the whole order is late. If you give the "chopping" part to the robot NPU, it flies.
  • The Fix: SparseLoom has a Smart Manager (Optimizer). It looks at the specific custom sandwich you just made and figures out the perfect order: "Chef NPU does the sauce, Chef GPU does the patty, Chef CPU does the bun." It finds the fastest route for every unique combination.
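A minimal way to picture what the optimizer decides (a brute-force sketch with invented costs, not SparseLoom's actual search) is to try every assignment of stages to processors and keep the cheapest, charging a small penalty whenever consecutive stages hop between chips.

```python
import itertools

# Toy placement search. Costs are made up: the CPU is slow but universal,
# the GPU is strong on the heavy "body", the NPU is fastest on small stages.
COST_MS = {
    "head": {"cpu": 20.0, "gpu": 6.0, "npu": 4.0},
    "body": {"cpu": 50.0, "gpu": 15.0, "npu": 40.0},
    "tail": {"cpu": 10.0, "gpu": 8.0, "npu": 3.0},
}
TRANSFER_MS = 3.0  # fixed cost when consecutive stages change processor

def best_placement(stages, processors=("cpu", "gpu", "npu")):
    best = None
    for assign in itertools.product(processors, repeat=len(stages)):
        total = sum(COST_MS[s][p] for s, p in zip(stages, assign))
        total += TRANSFER_MS * sum(a != b for a, b in zip(assign, assign[1:]))
        if best is None or total < best[1]:
            best = (assign, total)
    return best

placement, latency = best_placement(["head", "body", "tail"])
print(placement, latency)  # ('gpu', 'gpu', 'npu') 27.0
```

Note how the transfer penalty changes the answer: the NPU is fastest at the "head" in isolation, but keeping "head" and "body" together on the GPU avoids an extra hop. Each unique stitched model gets its own such plan.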

3. The "Fridge Space" Problem (Memory Overhead)

To switch sandwiches instantly, you usually need to keep every single possible sandwich pre-made in your fridge. But your fridge (memory) is tiny. You can't fit thousands of sandwiches.

  • The Fix: SparseLoom uses a Hot-Spot Tracker (Preloader). It realizes that most customers order the "bun" from Recipe A and the "patty" from Recipe B. It keeps only the most popular ingredients in the fridge and throws the rare ones away. When a customer orders a rare combo, it quickly grabs the missing piece from the pantry. This saves about 28% of your fridge space.
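The "hot-spot" idea can be sketched as a small popularity-based cache (an illustrative stand-in, not the paper's preloader): count how often each block is requested, keep only the most popular ones resident, and evict the least popular when space runs out.

```python
from collections import Counter

class HotSpotCache:
    """Keep only the most frequently requested blocks in memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = Counter()   # how often each block has been requested
        self.resident = set()     # blocks currently kept in memory

    def fetch(self, block):
        """Return True on a cache hit, False if the block had to be loaded."""
        self.counts[block] += 1
        hit = block in self.resident
        if not hit:
            # Grab the missing piece "from the pantry", then make room:
            self.resident.add(block)
            if len(self.resident) > self.capacity:
                # Evict the least popular resident block.
                coldest = min(self.resident, key=lambda b: self.counts[b])
                self.resident.discard(coldest)
        return hit

cache = HotSpotCache(capacity=2)
for b in ["bun_A", "patty_B", "bun_A", "sauce_C", "bun_A"]:
    cache.fetch(b)
```

After this request stream, the popular `bun_A` stays resident while rarer blocks compete for the remaining slot, so memory stays bounded at `capacity` blocks no matter how many combinations customers order.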

The Results: Why It Matters

When the researchers tested this system on real devices (like laptops and phone chips), the results were amazing:

  • Fewer Failures: They reduced the number of customers who got angry (SLO violations) by up to 74%.
  • Faster Service: They served 2.3 times more customers per hour (throughput).
  • Smaller Fridge: They used 28% less memory to run the system.

In a Nutshell

SparseLoom is like upgrading a rigid, pre-set menu into a dynamic, "build-your-own" kitchen. It uses a clever trick called Model Stitching to mix and match parts of different AI models without needing to retrain them. It then uses smart math to predict performance, assign the right tasks to the right computer chips, and save memory by only keeping the most popular "ingredients" ready.

The result? Your edge devices (like your phone or AR glasses) can run multiple smart tasks at once, faster, more accurately, and without running out of battery or memory.