Safe Multi-Agent Deep Reinforcement Learning for Privacy-Aware Edge-Device Collaborative DNN Inference

This paper proposes HC-MAPPO-L, a safe hierarchical multi-agent deep reinforcement learning algorithm that optimizes privacy-aware collaborative DNN inference across edge devices and servers by jointly managing model deployment, partitioning, and resource allocation to satisfy strict delay constraints while minimizing energy and privacy costs.

Hong Wang, Xuwei Fan, Zhipeng Cheng, Yachao Yuan, Minghui Min, Minghui Liwang, Xiaoyu Xia

Published 2026-03-03

Imagine you are trying to solve a massive, complex puzzle (like a high-definition image recognition task) on your smartphone. Your phone is small and has a limited battery, but the puzzle is huge. You have two bad options:

  1. Do it all yourself: Your phone gets hot, the battery dies instantly, and it takes forever.
  2. Send it all to the cloud: You upload the whole picture to a giant server farm. It's fast, but you have to send your private photo over the internet, and a stranger (the server) might be able to reconstruct your face from the data they receive.

This paper proposes a smart middle ground called "Edge-Device Collaborative Inference." Instead of doing everything on the phone or sending everything to the cloud, the phone does the first few steps of the puzzle, then sends just the result of those steps to a nearby "Edge Server" (a mini-computer at a cell tower) to finish the job.

However, there's a catch: How much of the puzzle should the phone do?

  • If the phone does too little, the server gets a very clear picture of your data (privacy risk).
  • If the phone does too much, it drains the battery and takes too long (energy/delay risk).

The authors created a "Smart Manager" (an AI algorithm called HC-MAPPO-L) that figures out the right balance for many users at once. Here is how it works, broken down into simple concepts:

1. The Three-Layer Management Team

Imagine a large construction site with a complex project. You can't just tell every worker what to do at once; you need a hierarchy. The authors' AI acts like a three-tiered management team:

  • The Strategist (Slow Time): Every few minutes, this manager decides which blueprints (AI models) to keep in the local toolboxes (Edge Servers). It's like a warehouse manager deciding which tools to stock based on what the workers are likely to need soon. They don't change this often because moving heavy toolboxes takes time.
  • The Coordinator (Medium Time): Every time a user asks for help, this manager decides who helps whom and how much work the user does. It matches the user to the nearest server and decides: "You do the first 3 steps, then send the result to Server A." It also considers: "User B is very privacy-conscious, so they should do 8 steps."
  • The Foreman (Fast Time): This manager handles the instant traffic. If five people are asking for help at the exact same second, the Foreman quickly splits the server's power and internet bandwidth among them so no one gets stuck in a traffic jam.

2. The "Safety Net" (Lagrangian Relaxation)

In the real world, if you just tell a delivery driver "Go fast," they might speed and crash. If you tell them "Be fast, but don't run over anyone" without saying how to weigh speed against safety, they get confused.

Most AI systems try to learn by trial and error, but if they break the rules (like taking too long), they just get a "punishment" score. This often leads to the AI giving up or acting unpredictably.

This paper uses a Lagrangian Safety Net. Think of it like a strict coach with a stopwatch.

  • If the team is running too slow, the coach immediately tightens the leash (increases the penalty) and forces the AI to prioritize speed.
  • If the team is running comfortably fast, the coach loosens the leash, allowing the AI to focus on saving battery or protecting privacy.
  • This ensures the system never consistently breaks the speed limit, even while trying to be efficient.
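The "coach with a stopwatch" is the standard Lagrangian dual-ascent update, which the sketch below implements with made-up step sizes and limits. The idea: a multiplier grows when the delay constraint is violated and shrinks (but never below zero) when there is slack, and that multiplier scales the delay penalty inside the cost the agent minimizes.

```python
# Minimal sketch of the Lagrangian "safety net". The update rule is the
# textbook dual-ascent step; the step size and cost terms are illustrative.

def update_multiplier(lmbda, observed_delay, delay_limit, step=0.5):
    """Tighten the penalty when delay exceeds the limit, relax it when
    there is slack, but never let the multiplier go negative."""
    return max(0.0, lmbda + step * (observed_delay - delay_limit))

def penalized_cost(energy, privacy_leak, delay, delay_limit, lmbda):
    """What the agent actually minimizes: the base cost plus the
    multiplier-weighted delay violation."""
    return energy + privacy_leak + lmbda * max(0.0, delay - delay_limit)
```

When delay sits comfortably under the limit, the violation term vanishes and the multiplier decays toward zero, freeing the agent to chase energy and privacy savings; repeated violations drive the multiplier up until meeting the deadline dominates everything else.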

3. The Privacy vs. Speed Trade-off

The paper treats privacy like a dimmer switch.

  • Bright Light (Low Privacy): The phone sends raw data early. The server sees everything. Fast, but risky.
  • Dim Light (High Privacy): The phone processes the data until it's just abstract shapes (like "a dog" rather than "your dog"). The server sees only the shape. Slower, but safe.

The AI learns to adjust this dimmer switch automatically. If a user is in a hurry, it turns the light up (speed). If a user is worried about privacy, it turns the light down (safety).
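The dimmer switch amounts to choosing a partition layer that trades privacy risk against battery cost while respecting the delay cap. The sketch below makes that concrete with a brute-force search; every number in it (per-layer times, the leakage model, the energy model) is invented for illustration, whereas the paper learns this choice with its RL agents.

```python
# Illustrative "dimmer switch": pick the partition layer that minimizes
# a weighted privacy-plus-energy cost while meeting the delay cap.
# All timing, privacy, and energy numbers are made up for the sketch.

def choose_split(n_layers, delay_limit, privacy_weight):
    best_split, best_cost = None, float("inf")
    for split in range(n_layers + 1):
        device_time = 0.4 * split                 # phone is slow per layer
        server_time = 0.05 * (n_layers - split)   # server is fast
        upload_time = 0.3                         # send intermediate features
        delay = device_time + upload_time + server_time
        if delay > delay_limit:
            continue                              # infeasible: breaks the cap
        privacy_risk = 1.0 / (1 + split)          # deeper local work leaks less
        energy = 0.1 * split                      # but burns more battery
        cost = privacy_weight * privacy_risk + energy
        if cost < best_cost:
            best_split, best_cost = split, cost
    return best_split
```

A user who weights privacy heavily gets pushed toward a deeper split (dimmer light); a user with a tiny privacy weight offloads almost immediately (brighter light); and if the delay cap is too tight for any split, the function reports that no feasible choice exists.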

4. The Results: Why It's Better

The authors tested their "Smart Manager" against other methods (like simple greedy rules or standard AI).

  • The "Greedy" approach: Like a person who always picks the closest server without thinking. It often leads to traffic jams and privacy leaks.
  • The "Standard AI": Often ignores the rules and crashes the system by taking too long.
  • HC-MAPPO-L (The Winner): It consistently kept the "delivery time" under the limit (3 seconds) while saving the most battery and keeping the most secrets. It was like a traffic controller that never let a red light turn green until the intersection was clear, yet kept traffic flowing smoothly.

The Big Picture

This paper solves a modern dilemma: How do we use powerful AI on our phones without draining them or spying on us?

The answer is a hierarchical, safety-first AI that acts like a brilliant orchestra conductor. It doesn't just tell the musicians (devices and servers) to play loud or soft; it listens to the tempo (delay), watches the sheet music (privacy), and manages the instruments (battery) to ensure the symphony is perfect, on time, and safe for everyone.
