Imagine you have a massive, incredibly smart library (a Large Language Model like the ones powering AI chatbots). This library is so big that it doesn't fit on your phone or even on a single computer. To make it work, the library is split into thousands of tiny, specialized experts.
- The Problem: When you ask a question, the library doesn't consult every expert on every query. It just calls on a few specific experts (say, a "math expert" and a "history expert") to answer you. This is called a Mixture-of-Experts (MoE) model.
- The Catch: Even though you only need a few experts at a time, the entire library is so huge that your phone can't store even a small shelf of it. If you try to run this AI on your phone, it crashes because of memory limits.
- The Old Solution (The "U-Shape"): Previously, people tried to split the work between your phone and a nearby server (the "Edge"). Your phone would send your question up, the server would do the heavy lifting, and the answer would come back down. But this is like sending a letter back and forth for every single word you type. It's slow, and it wastes a lot of bandwidth.
Enter "SlimCaching": The Smart Librarian
The paper introduces a new idea called SlimCaching. Think of it as a super-smart, distributed librarian system that knows exactly which books to keep where so you don't have to wait.
Here is how it works, using a simple analogy:
1. The Setup: A Neighborhood of Smart Shelves
Imagine you live in a neighborhood with many small libraries (Edge Servers) and you have a tiny bookshelf at home (your phone).
- Your Home: You keep the most common books you read every day (the "non-expert" parts of the AI that are always needed).
- The Neighborhood: The local libraries have limited shelf space. They can't hold the whole library, but they can hold specific, popular "expert" books.
2. The Challenge: The "Teamwork" Problem
In the old way of thinking, if you needed a book, you just checked if the library had it. If yes, great! If no, you asked the next library. This is easy if you only need one book at a time.
But in these advanced AI models, you often need two or more experts at the exact same time to answer a question (e.g., you need the "Math Expert" AND the "Science Expert" simultaneously).
- The Trap: If the "Math Expert" is at Library A and the "Science Expert" is at Library B, the system has to run back and forth between them to get both answers. This coordination is messy and slow.
- The Mistake: A simple "greedy" strategy (stocking each shelf with the individually most popular books) fails here. It might put the Math Expert at Library A and the Science Expert at Library B because each is popular on its own. If the two are rarely needed together, that split is harmless. But if requests usually need both at once, every question gets stuck shuttling between the two libraries.
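You can see this failure in a few lines of code. The toy numbers and expert names below are my own invention, not from the paper; the setup just assumes a request is served locally only when one cache holds its *entire* expert set:

```python
import itertools
from collections import Counter

# Each request activates a SET of experts and occurs with some frequency.
# Expert "A" is individually popular (it appears in many pairs), but each
# of its pairings is rare; "E" and "F" always appear together.
requests = [({"A", "B"}, 3),
            ({"A", "C"}, 3),
            ({"A", "D"}, 3),
            ({"E", "F"}, 7)]
capacity = 2  # the edge server can cache only two experts

def local_hits(cache):
    """Total frequency of requests whose whole expert set is cached."""
    return sum(freq for experts, freq in requests if experts <= cache)

# Greedy: rank experts by individual popularity, take the top two.
popularity = Counter()
for experts, freq in requests:
    for e in experts:
        popularity[e] += freq
greedy = {e for e, _ in popularity.most_common(capacity)}

# Combination-aware: try every possible cache of the same size.
best = max((set(c) for c in itertools.combinations(popularity, capacity)),
           key=local_hits)

print(local_hits(greedy))  # 0: greedy pairs "A" with "E" or "F"
print(local_hits(best))    # 7: caching {"E", "F"} serves the big pair
```

Greedy caches "A" (individual popularity 9) plus one of "E"/"F", a combination that serves zero requests locally, while the combination-aware choice serves the most frequent pair.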
3. The Solution: The "Successive Decomposition" Strategy
The authors of this paper realized that to fix this, you can't just look at one book at a time. You have to look at the combinations.
They developed a new algorithm (a set of rules for the librarians) that works like this:
- Step 1: Instead of trying to solve the whole neighborhood's problem at once (which is too hard), they break it down. They ask: "If Library 1 fills its shelves, what's the best we can do? Then, given Library 1's choices, what's the best for Library 2?"
- Step 2: They use a "Dynamic Programming" technique. Imagine a chess player who doesn't just look at the next move, but calculates the best outcome for a whole sequence of moves, considering how the pieces interact.
- Step 3: They found a way to speed this up. Since many "expert books" are the same size, they can group them and solve the puzzle much faster, like organizing books by height rather than one by one.
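The three steps above can be sketched in code. This is a simplified illustration under my own assumptions (brute-forcing each server's subproblem instead of the paper's dynamic-programming formulation, and reusing the toy requests from before), but it shows the server-by-server decomposition: solve Server 1's cache first, then Server 2's given what Server 1 already holds. Because every expert is the same size, each server's subproblem collapses to "pick the best k-expert subset":

```python
import itertools

# Toy workload (invented numbers): (expert set, frequency) per request.
requests = [({"A", "B"}, 3), ({"A", "C"}, 3),
            ({"A", "D"}, 3), ({"E", "F"}, 7)]

def place(num_servers, capacity):
    """Successive decomposition: fill one server's shelves at a time."""
    experts = sorted(set().union(*(s for s, _ in requests)))
    remaining = list(requests)  # request combinations not yet served
    placement = []
    for _ in range(num_servers):
        # Best capacity-sized subset for THIS server, given the
        # requests that earlier servers' caches already cover.
        cache = max(
            (set(c) for c in itertools.combinations(experts, capacity)),
            key=lambda c: sum(f for s, f in remaining if s <= c))
        placement.append(cache)
        remaining = [(s, f) for s, f in remaining if not s <= cache]
    return placement

caches = place(num_servers=2, capacity=2)
# Server 1 takes {"E", "F"} (serves 7), then Server 2, seeing those
# requests handled, takes an "A" pair (serves 3 more).
served = sum(f for s, f in requests if any(s <= c for c in caches))
print(served, "of", sum(f for _, f in requests))
```

Each server's choice is made *after* accounting for what earlier servers cached, which is exactly the "given Library 1's choices, what's best for Library 2?" idea, and the equal-size assumption is what lets the inner step be a fixed-size subset selection rather than a full knapsack.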
Why is this a Big Deal?
The paper proves that this new method is mathematically guaranteed to be much better than the old "pick the most popular" methods.
- Speed: It drastically reduces the time it takes for your phone to get an answer. Instead of waiting for data to travel back and forth between your phone, the server, and the cloud, the system often finds the experts right next to you, or in nearby servers that can talk to each other instantly.
- Privacy: Your personal data stays on your phone. Only the "hidden thoughts" (intermediate data) are sent to the servers, keeping your private conversations private.
- Efficiency: It saves battery and data because it stops the phone from constantly shouting to the cloud for help.
The Bottom Line
SlimCaching is like upgrading a chaotic, disorganized library system into a highly coordinated team. Instead of just stocking the most popular books, the system figures out exactly which groups of books need to be stored together in specific locations to ensure that when you ask a complex question, the right experts are already in the same room, ready to work together instantly.
This means faster AI on your phone, less battery drain, and a smarter way to handle the massive AI models of the future.