Multi-DNN Inference of Sparse Models on Edge SoCs

This paper introduces SparseLoom, a system that employs model stitching to recombine subgraphs from sparse models without re-training, thereby significantly improving throughput, reducing memory overhead, and lowering Service Level Objective violation rates for multi-DNN inference on edge SoCs compared to state-of-the-art systems.

Jiawei Luo, Di Wu, Simon Dobson, Blesson Varghese

Published Wed, 11 Ma

Imagine you are running a busy food truck (the Edge SoC) that has to serve three different types of customers simultaneously: a hungry student who wants a quick sandwich (Speech Recognition), a tourist who wants a detailed map of the city (Image Classification), and a business traveler who needs a complex report (Sentiment Analysis).

Your food truck has three different chefs working in the kitchen:

  1. Chef CPU: Slow but very versatile and good at complex logic.
  2. Chef GPU: Fast at chopping and slicing (great for images).
  3. Chef NPU: A specialized robot chef who is incredibly fast at specific cooking tasks but can't do everything.

The Problem: The "One-Size-Fits-All" Menu

Currently, most food trucks operate with a fixed menu.

  • If the tourist wants a sandwich, you must give them the "Standard Sandwich."
  • If the business traveler wants a report, you must give them the "Standard Report."

The problem is that sometimes the "Standard Sandwich" takes too long to make (violating the tourist's time limit), or it tastes too bland (violating the quality requirement). Sometimes you have a "Lite Sandwich" (a pruned model) that is fast but only tastes okay, or a "Budget Sandwich" (a quantized model) made with cheaper ingredients that is fast but slightly less refined.

Existing systems try to pick the best pre-made sandwich from a small list. But if the customer has a very strict requirement (e.g., "I need a sandwich in 5 seconds that tastes 95% like the premium one"), and your menu's only 5-second option tastes 80% as good, you fail the customer. This is called a Service Level Objective (SLO) violation.

The Solution: "Model Stitching" (The Custom Sandwich Bar)

The paper introduces a new system called SparseLoom. Instead of just picking a pre-made sandwich, SparseLoom acts like a Master Chef who can instantly recombine ingredients from different recipes to create a perfect custom sandwich on the fly.

This technique is called Model Stitching.

  • How it works: Imagine you have three recipes:

    1. Recipe A (Dense): The full, heavy, delicious recipe.
    2. Recipe B (Pruned): A recipe where you removed the heavy meat to make it lighter/faster.
    3. Recipe C (Quantized): A recipe where you swapped expensive spices for cheaper, faster ones.

    Usually, you serve the whole Recipe A, B, or C.
    SparseLoom says: "Let's take the bun from Recipe A (because it's tasty), the patty from Recipe B (because it's fast), and the sauce from Recipe C (because it's cheap)."

    By stitching these parts together, you create a Brand New Sandwich that didn't exist before. It's fast enough for the tourist but tasty enough for the business traveler. And the best part? You don't need to retrain the chefs. You just rearrange the existing ingredients.
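The stitching idea above can be sketched in a few lines of Python. This is a toy illustration, not the paper's actual API: the variant names, stage split, and `stitch` helper are all invented. The key point is that each model variant is divided into the same sequence of stages, and a "stitch" simply picks one variant per stage.

```python
# Illustrative model-stitching sketch (all names are hypothetical).
# Each variant of a 3-stage network is split at the same stage boundaries.
VARIANTS = {
    "dense":     ["dense_head", "dense_body", "dense_tail"],
    "pruned":    ["pruned_head", "pruned_body", "pruned_tail"],
    "quantized": ["quant_head", "quant_body", "quant_tail"],
}

def stitch(choices):
    """Build a new model plan by taking stage i from variant choices[i]."""
    return [VARIANTS[v][i] for i, v in enumerate(choices)]

# The "bun" from the dense recipe, the "patty" from the pruned one,
# and the "sauce" from the quantized one:
plan = stitch(["dense", "pruned", "quantized"])
print(plan)  # ['dense_head', 'pruned_body', 'quant_tail']
```

No retraining happens anywhere in this loop: the stitched plan is just a new arrangement of subgraphs that already exist.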

The Three Big Challenges (and how SparseLoom solves them)

Creating these custom sandwiches sounds great, but it's chaotic. The paper identifies three big problems and solves them with three smart tools:

1. The "Too Many Options" Problem (Profiling Cost)

If you have 10 recipes and you can mix and match 3 parts, you suddenly have thousands of possible sandwiches. Testing every single one to see how long it takes to cook would take forever.

  • The Fix: SparseLoom uses a Crystal Ball (Estimators). Instead of actually cooking every single sandwich to time it, the system uses math to predict how long it will take and how good it will taste based on the ingredients used. This saves 99% of the time!
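One simple form such a "crystal ball" could take (a hedged sketch, not the paper's actual estimator) is an additive model: profile each stage variant once, then predict a stitched model's latency as the sum of its stages' latencies. The per-stage numbers below are invented for illustration.

```python
# Hypothetical additive latency estimator. Each (stage, variant) pair is
# profiled exactly once; stitched combinations are never run to be timed.
STAGE_LATENCY_MS = {
    ("head", "dense"): 12.0, ("head", "pruned"): 7.0, ("head", "quantized"): 5.0,
    ("body", "dense"): 30.0, ("body", "pruned"): 18.0, ("body", "quantized"): 14.0,
    ("tail", "dense"): 8.0,  ("tail", "pruned"): 5.0,  ("tail", "quantized"): 4.0,
}

def estimate_latency(plan):
    """plan maps each stage name to its chosen variant."""
    return sum(STAGE_LATENCY_MS[(stage, variant)]
               for stage, variant in plan.items())

# Predict the custom sandwich without ever cooking it:
print(estimate_latency({"head": "dense", "body": "pruned", "tail": "quantized"}))
# 12.0 + 18.0 + 4.0 = 34.0
```

With 3 variants per stage and 3 stages there are 27 possible stitches but only 9 profiling runs, and the gap widens rapidly as stages and variants grow.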

2. The "Wrong Chef" Problem (Processor Placement)

Just because you made a custom sandwich doesn't mean you know which chef should cook which part.

  • Scenario: If you give the "heavy meat" part to the slow CPU, the whole order is late. If you give the "chopping" part to the robot NPU, it flies.
  • The Fix: SparseLoom has a Smart Manager (Optimizer). It looks at the specific custom sandwich you just made and figures out the perfect order: "Chef NPU does the sauce, Chef GPU does the patty, Chef CPU does the bun." It finds the fastest route for every unique combination.
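A minimal way to picture what the optimizer decides (a brute-force sketch with invented costs, not SparseLoom's actual search) is to try every assignment of stages to processors and keep the cheapest, charging a small penalty whenever consecutive stages hop between chips.

```python
import itertools

# Toy placement search. Costs are made up: the CPU is slow but universal,
# the GPU is strong on the heavy "body", the NPU is fastest on small stages.
COST_MS = {
    "head": {"cpu": 20.0, "gpu": 6.0, "npu": 4.0},
    "body": {"cpu": 50.0, "gpu": 15.0, "npu": 40.0},
    "tail": {"cpu": 10.0, "gpu": 8.0, "npu": 3.0},
}
TRANSFER_MS = 3.0  # fixed cost when consecutive stages change processor

def best_placement(stages, processors=("cpu", "gpu", "npu")):
    best = None
    for assign in itertools.product(processors, repeat=len(stages)):
        total = sum(COST_MS[s][p] for s, p in zip(stages, assign))
        total += TRANSFER_MS * sum(a != b for a, b in zip(assign, assign[1:]))
        if best is None or total < best[1]:
            best = (assign, total)
    return best

placement, latency = best_placement(["head", "body", "tail"])
print(placement, latency)  # ('gpu', 'gpu', 'npu') 27.0
```

Note how the transfer penalty changes the answer: the NPU is fastest at the "head" in isolation, but keeping "head" and "body" together on the GPU avoids an extra hop. Each unique stitched model gets its own such plan.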

3. The "Fridge Space" Problem (Memory Overhead)

To switch sandwiches instantly, you usually need to keep every single possible sandwich pre-made in your fridge. But your fridge (memory) is tiny. You can't fit thousands of sandwiches.

  • The Fix: SparseLoom uses a Hot-Spot Tracker (Preloader). It realizes that most customers order the "bun" from Recipe A and the "patty" from Recipe B. It keeps only the most popular ingredients in the fridge and throws the rare ones away. When a customer orders a rare combo, it quickly grabs the missing piece from the pantry. This saves about 28% of your fridge space.
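The "hot-spot" idea can be sketched as a small popularity-based cache (an illustrative stand-in, not the paper's preloader): count how often each block is requested, keep only the most popular ones resident, and evict the least popular when space runs out.

```python
from collections import Counter

class HotSpotCache:
    """Keep only the most frequently requested blocks in memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = Counter()   # how often each block has been requested
        self.resident = set()     # blocks currently kept in memory

    def fetch(self, block):
        """Return True on a cache hit, False if the block had to be loaded."""
        self.counts[block] += 1
        hit = block in self.resident
        if not hit:
            # Grab the missing piece "from the pantry", then make room:
            self.resident.add(block)
            if len(self.resident) > self.capacity:
                # Evict the least popular resident block.
                coldest = min(self.resident, key=lambda b: self.counts[b])
                self.resident.discard(coldest)
        return hit

cache = HotSpotCache(capacity=2)
for b in ["bun_A", "patty_B", "bun_A", "sauce_C", "bun_A"]:
    cache.fetch(b)
```

After this request stream, the popular `bun_A` stays resident while rarer blocks compete for the remaining slot, so memory stays bounded at `capacity` blocks no matter how many combinations customers order.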

The Results: Why It Matters

When the researchers tested this system on real devices (like laptops and phone chips), the results were amazing:

  • Fewer Failures: They reduced the number of customers who got angry (SLO violations) by up to 74%.
  • Faster Service: They served 2.3 times more customers per hour (throughput).
  • Smaller Fridge: They used 28% less memory to run the system.

In a Nutshell

SparseLoom is like upgrading a rigid, pre-set menu into a dynamic, "build-your-own" kitchen. It uses a clever trick called Model Stitching to mix and match parts of different AI models without needing to retrain them. It then uses smart math to predict performance, assign the right tasks to the right computer chips, and save memory by only keeping the most popular "ingredients" ready.

The result? Your edge devices (like your phone or AR glasses) can run multiple smart tasks at once, faster, more accurately, and without running out of battery or memory.