Imagine you are running a massive, high-speed digital theme park. This park has two types of attractions:
- The "AI" Rides: These are the heavy hitters, like a rollercoaster that requires a massive, specialized engine (a GPU) to run. They are powerful but expensive and hard to fit in.
- The "Microservice" Rides: These are the smaller, simpler attractions, like ticket booths, food stands, and security gates. They are lightweight and can go almost anywhere.
The Problem:
In the past, theme park managers (the "orchestrators") treated these two types of rides separately. They would try to figure out where to put the ticket booths without thinking about where the rollercoasters were, and vice versa.
But here's the catch: To ride the rollercoaster (the AI), you must first buy a ticket (the microservice) and pass through security. If the ticket booth is on the other side of the park from the rollercoaster, you waste time walking back and forth. If the park is crowded, the lines get huge, and the whole system slows down.
The paper argues that to make the park run fast and smoothly, you can't just look at the rides in isolation. You have to figure out where to build the rides AND how to guide the guests through them at the same time.
The Solution: SIL-GPO (The "Smart Park Manager")
The authors propose a new AI manager called SIL-GPO. Think of it as a super-intelligent park manager who learns by playing the game over and over again. Here is how it works, broken down into simple concepts:
1. Seeing the Whole Map (Graph Neural Networks)
Most managers just look at a list of rides. SIL-GPO, however, looks at the entire map as a living, breathing web.
- The Analogy: Imagine the park isn't just a list of buildings, but a spiderweb where every ride is connected to every other ride by invisible strings (data lines).
- How it helps: SIL-GPO uses "Graph Attention Networks" to see which rides are tightly connected. If the "Ticket Booth" and the "Rollercoaster" are on the same string, the manager knows, "Hey, let's build these two right next to each other so guests don't have to walk far!"
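To make the attention idea concrete, here is a minimal single-head graph attention layer in plain NumPy. Everything in it is illustrative, not from the paper: the three-service "park" graph, the feature sizes, and the random weights are all made-up stand-ins, and real systems would use a trained multi-head implementation (e.g. from a GNN library) rather than this loop-based sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy service graph: 0 = ticket booth, 1 = security gate, 2 = rollercoaster (GPU ride).
# adj[i][j] = 1 means service i sends data to service j (self-loops included).
adj = np.array([[1, 1, 0],
                [0, 1, 1],
                [0, 0, 1]], dtype=float)

feat_dim, out_dim = 4, 4
x = rng.normal(size=(3, feat_dim))        # per-service features (load, latency, ...)
W = rng.normal(size=(feat_dim, out_dim))  # shared linear transform
a = rng.normal(size=(2 * out_dim,))       # attention vector

def leaky_relu(v, slope=0.2):
    return np.where(v > 0, v, slope * v)

def gat_layer(x, adj, W, a):
    """One simplified Graph Attention layer (single head)."""
    h = x @ W
    n = h.shape[0]
    scores = np.full((n, n), -np.inf)     # -inf means "not a neighbour"
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                # Attention score for edge i -> j from the concatenated features.
                scores[i, j] = leaky_relu(a @ np.concatenate([h[i], h[j]]))
    # Softmax over each node's neighbours only.
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha = np.where(adj > 0, alpha, 0.0)
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ h  # each node aggregates its neighbours, weighted by attention

out = gat_layer(x, adj, W, a)
print(out.shape)  # (3, 4): one embedding per service
```

The key point matches the analogy: each service's new embedding is a weighted mix of its connected neighbors, and the learned attention weights decide which "strings" matter most.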
2. Learning from Its Best Days (Self-Imitation Learning)
Usually, reinforcement-learning AIs learn by trial and error: they try random things and only get punished when they fail. When good outcomes are rare, that useful feedback is sparse, so learning is slow and frustrating.
- The Analogy: Imagine a student taking a test. A normal teacher says, "You got a C, try again." But SIL-GPO is like a coach who says, "Remember that one time you got an A? Let's look at exactly what you did that day and do it again!"
- How it helps: The system keeps a special "Hall of Fame" of its best decisions (high-reward paths). When it gets stuck or confused, it looks back at its "Hall of Fame" and copies those successful moves. This helps it learn faster and avoid getting stuck in bad habits.
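The "Hall of Fame" can be sketched as a small priority buffer that keeps only the highest-reward episodes. This is a hypothetical simplification: the class name, capacity, and example returns are invented for illustration, and the real self-imitation objective also weights how strongly each stored episode is imitated (typically by how much its return beat the agent's own value estimate), which this sketch omits.

```python
import heapq
import random

class HallOfFame:
    """Tiny self-imitation buffer: keep only the K highest-return episodes."""
    def __init__(self, capacity=5):
        self.capacity = capacity
        self._heap = []   # min-heap of (return, counter, trajectory)
        self._count = 0   # unique tie-breaker so trajectories never get compared

    def add(self, ret, trajectory):
        item = (ret, self._count, trajectory)
        self._count += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif ret > self._heap[0][0]:
            heapq.heapreplace(self._heap, item)  # evict the weakest entry

    def sample(self):
        """Pick a past success to imitate (uniform over the hall of fame)."""
        ret, _, traj = random.choice(self._heap)
        return ret, traj

hof = HallOfFame(capacity=3)
for ret, traj in [(1.0, "a"), (5.0, "b"), (2.0, "c"), (9.0, "d")]:
    hof.add(ret, traj)
# Only the three best episodes survive; the return-1.0 episode was evicted.
print(sorted(r for r, _, _ in hof._heap))  # [2.0, 5.0, 9.0]
```

When training stalls, the agent samples from this buffer and nudges its policy toward the actions those high-reward episodes took, which is exactly the "copy your A-grade day" behavior described above.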
3. The "Step-by-Step" Strategy
Building the whole park at once is too hard. So, SIL-GPO builds it one ride at a time.
- The Analogy: Instead of trying to design the whole park in one day, it places one ticket booth, checks if the lines are shorter, then places a food stand, checks again, and so on.
- The Reward System: Every time it places a ride and the lines get shorter, it gets a "gold star" (reward). If lines get longer, it gets a "time-out" (penalty). Over time, it learns the perfect layout.
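The step-by-step loop can be sketched as follows. All the numbers and names here are hypothetical, and a greedy choice stands in for the learned policy; the real system would pick each node with its neural policy and learn from the rewards rather than hard-coding the choice. Each step's reward is the drop in total latency (equivalently, the negative of the latency the new placement adds), so the agent earns more by keeping the "lines" short.

```python
# Toy model: place each service on one of two edge nodes, one at a time.
SERVICES = ["ticket_booth", "security", "rollercoaster"]
NODES = ["edge_0", "edge_1"]

# Hypothetical per-(service, node) latency contributions in milliseconds.
LATENCY = {
    ("ticket_booth", "edge_0"): 5,   ("ticket_booth", "edge_1"): 9,
    ("security", "edge_0"): 4,       ("security", "edge_1"): 6,
    ("rollercoaster", "edge_0"): 30, ("rollercoaster", "edge_1"): 12,
}

def total_latency(placement):
    return sum(LATENCY[(s, n)] for s, n in placement.items())

placement, rewards = {}, []
for service in SERVICES:                      # one "ride" at a time
    before = total_latency(placement)
    # Greedy stand-in for the learned policy: pick the node that keeps
    # total latency lowest given everything placed so far.
    best = min(NODES, key=lambda n: total_latency({**placement, service: n}))
    placement[service] = best
    after = total_latency(placement)
    rewards.append(before - after)            # shorter lines => higher reward

print(placement)                 # GPU-heavy ride lands on the cheaper node for it
print(total_latency(placement))  # 5 + 4 + 12 = 21 ms
```

Note how the expensive "rollercoaster" ends up on `edge_1`, where its latency is lowest, while the lightweight services go wherever they are cheapest, mirroring the incremental build-and-check strategy described above.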
Why Does This Matter?
In the real world, this isn't just about theme parks; it's about Edge AI.
- Edge AI means running smart apps (like self-driving cars or factory robots) on local servers close to you, rather than sending data all the way to a giant cloud server far away.
- The Result: By using SIL-GPO, these local servers can handle requests much faster. The paper shows that this new manager reduces the time it takes to get a result by 15% to 30% compared to older methods, while using less electricity and computer power.
The Bottom Line
This paper introduces a smart, learning system that figures out the perfect way to arrange complex computer services and guide data through them. It does this by seeing the big picture (the graph), learning from its best moments (self-imitation), and optimizing the whole system together rather than piece by piece. The result? Faster apps, less lag, and happier users.