A Multi-Prototype-Guided Federated Knowledge Distillation Approach in AI-RAN Enabled Multi-Access Edge Computing System

This paper proposes a Multi-Prototype-Guided Federated Knowledge Distillation (MP-FedKD) approach for AI-RAN enabled Multi-Access Edge Computing systems. By integrating self-knowledge distillation, a conditional hierarchical agglomerative clustering strategy, and a novel loss function, MP-FedKD addresses non-IID data challenges, mitigates the information loss caused by single-prototype averaging, and outperforms state-of-the-art baselines in accuracy and error metrics.

Luyao Zou, Hayoung Oh, Chu Myaet Thwal, Apurba Adhikary, Seohyeon Hong, Zhu Han

Published Wed, 11 Ma

Imagine a massive, high-tech city called AI-RAN MEC. In this city, thousands of small devices (like your smart thermostat, self-driving cars, or security cameras) are constantly generating data. They want to learn together to become smarter, but they can't send all their private data to a central "brain" because of privacy laws and bandwidth limits.

This is where Federated Learning (FL) comes in. Instead of sending data to the center, the devices learn locally and only send their "lessons learned" (mathematical updates) to a central server. The server mixes these lessons to create a smarter "Global Brain," which is then sent back to the devices.
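The "mixing of lessons" the server performs is, in the standard FedAvg scheme, a weighted average of the clients' model parameters. A minimal sketch (the function name `fedavg` and the flattened-parameter representation are illustrative, not from the paper):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of client model parameters (standard FedAvg).

    client_weights: list of 1-D numpy arrays (flattened model parameters)
    client_sizes:   number of local samples per client (weights the average)
    """
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()                    # per-client mixing weight
    stacked = np.stack(client_weights)              # (num_clients, num_params)
    return (coeffs[:, None] * stacked).sum(axis=0)  # weighted parameter average

# Example: three clients, the third holding twice as much data
clients = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
global_params = fedavg(clients, [10, 10, 20])  # → array([3.5, 4.5])
```

Only these parameter vectors travel to the server; the raw local data never leaves the device.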

The Problem: The "Non-IID" Mess
The problem is that every device sees the world differently.

  • Device A (a security camera in a park) only sees dogs and trees.
  • Device B (a traffic camera) only sees cars and pedestrians.
  • Device C (a weather station) only sees clouds.

In the old way of doing things, the central server simply averaged everyone's lessons. That is like trying to make a single "average" recipe by mixing cake batter with a bowl of soup: the result is a muddy mess that tastes like neither. In technical terms, this is called non-IID data (data that is not independent and identically distributed across devices), and it causes the Global Brain to become confused and perform poorly.
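In FL experiments, this kind of label skew is commonly simulated with a Dirichlet split: each class's samples are divided across clients in random, uneven proportions. A sketch of that common technique (the function `dirichlet_partition` is a generic illustration, not the paper's exact setup):

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with Dirichlet label skew.

    Smaller alpha → more skewed (more non-IID) label mix per client.
    """
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Random proportion of this class assigned to each client
        props = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices
```

With `alpha=0.1`, one client may see almost only "dogs" while another sees almost only "cars", reproducing the park-camera vs. traffic-camera situation above.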

The Solution: MP-FedKD (The "Multi-Prototype" Approach)
This paper proposes a new, smarter way to teach these devices, called MP-FedKD. Think of it as a revolutionary study group with four special tricks:

1. Self-Knowledge Distillation (The "Time-Traveling Mentor")

Usually, to learn from a teacher, you need a big, smart teacher. But here, the devices don't have a pre-trained teacher.

  • The Analogy: Imagine you are studying for a test. Instead of hiring a tutor, you use your past self as the teacher: you take what you knew yesterday, compare it to what you are learning today, and use the difference to guide your study.
  • The Benefit: This helps the device learn from its own history without needing an external, heavy-duty teacher, making the learning process smoother and more consistent.
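In distillation terms, the previous round's local model supplies the soft targets. A minimal sketch of that idea, using a temperature-softened KL divergence (the function names, temperature value, and KL direction are assumptions for illustration; the paper's exact formulation may differ):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over a logit vector."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def self_distill_loss(cur_logits, prev_logits, T=2.0):
    """KL(teacher || student), where last round's model is the teacher."""
    p_teacher = softmax(prev_logits, T)  # soft targets from "yesterday's self"
    p_student = softmax(cur_logits, T)   # today's predictions
    return float(np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student))))

# Identical predictions → zero distillation penalty
self_distill_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # → 0.0
```

The temperature `T` smooths both distributions so the student also learns from the teacher's "almost right" guesses, not just its top answer.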

2. Multi-Prototype Generation (The "Clustered Library")

In the old method, the server tried to create just one "average" example (a prototype) for each category.

  • The Analogy: Imagine trying to describe "Dogs" with just one picture. If you average a Chihuahua and a Great Dane, you get a blurry, medium-sized dog that looks like neither. You lose the unique details of both.
  • The Fix: This paper uses a technique called CHAC (Conditional Hierarchical Agglomerative Clustering). Instead of one blurry picture, the device creates multiple distinct examples (prototypes).
    • One prototype for "Small Dogs."
    • One for "Big Dogs."
    • One for "Fluffy Dogs."
  • The Benefit: The Global Brain gets a rich library of specific examples rather than a single, muddy average. It preserves the unique details of the data.
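The core of agglomerative clustering is simple: start with every embedding as its own cluster and repeatedly merge the closest pair. A toy sketch of that mechanism (this is plain agglomerative clustering with a fixed cluster count; the paper's conditional variant, CHAC, additionally decides how many prototypes each class should keep):

```python
import numpy as np

def multi_prototypes(embeddings, num_prototypes):
    """Merge the closest pair of clusters until num_prototypes remain,
    then return each cluster's mean embedding as a prototype."""
    X = np.asarray(embeddings, dtype=float)
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > num_prototypes:
        cents = [X[c].mean(axis=0) for c in clusters]  # current centroids
        best, best_d = None, np.inf
        for a in range(len(clusters)):                  # find closest pair
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(cents[a] - cents[b])
                if d < best_d:
                    best, best_d = (a, b), d
        a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)     # merge the pair
    return np.stack([X[c].mean(axis=0) for c in clusters])

# "Small dogs" near 0 and "big dogs" near 10 stay distinct prototypes
protos = multi_prototypes([[0.0], [0.2], [10.0], [10.2]], num_prototypes=2)
```

A single averaged prototype here would sit near 5.1, resembling neither group; the two clustered prototypes (≈0.1 and ≈10.1) keep both dog types recognizable.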

3. Prototype Alignment (The "Historical Bridge")

When the server collects all these local examples to make the Global Brain, it usually just averages them again, which risks losing information.

  • The Analogy: Imagine the server is a librarian. Instead of just shoving all the books into a pile and averaging their titles, the librarian looks at the old books (from the previous round) and uses them to understand the new books better.
  • The Fix: The system forces the "Global Prototypes" to learn from the "Local Embeddings" (the detailed data points) of the previous round. It's like a bridge connecting the past knowledge to the present, ensuring no unique details are dropped during the averaging process.
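One simple way to realize such a bridge is to nudge each global prototype toward the previous round's local embeddings of the same class via gradient steps on a squared-distance objective. This is an illustrative sketch only; the function name, learning rate, and the exact alignment objective are assumptions, not the paper's definition:

```python
import numpy as np

def align_global_prototypes(global_protos, prev_local_embs, lr=0.1, steps=50):
    """Pull each global prototype toward the mean of the previous round's
    local embeddings for its class (gradient descent on ||p - target||^2)."""
    protos = {c: np.asarray(p, dtype=float).copy()
              for c, p in global_protos.items()}
    for _ in range(steps):
        for c, p in protos.items():
            target = np.asarray(prev_local_embs[c], dtype=float).mean(axis=0)
            p -= lr * 2.0 * (p - target)  # gradient of squared distance
    return protos
```

The key point the analogy makes is that aggregation is no longer a memoryless average: the previous round's detailed embeddings act as an anchor for the new global prototypes.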

4. LEMGP Loss (The "Magnet and Repeller")

Finally, the paper invents a new rule for how the devices should learn, called LEMGP Loss.

  • The Analogy: Think of the learning process as a dance floor.
    • The Attractive Force (Magnet): If you are dancing with someone who belongs to the same group (e.g., "Dogs"), you want to move closer to the "Global Dog Prototype."
    • The Repulsive Force (Repeller): If you see someone from a different group (e.g., "Cats"), you want to push away from the "Global Cat Prototype."
  • The Benefit: This new rule ensures that the devices clearly distinguish between different categories, keeping the groups separate and distinct, which prevents confusion.
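The magnet-and-repeller idea can be sketched as a contrastive-style penalty: squared distance to same-class prototypes (attraction) plus a hinge that only fires when a different-class prototype gets too close (repulsion). The function name, the hinge form, and the margin value are hypothetical; they illustrate the attraction/repulsion structure rather than the paper's exact LEMGP formula:

```python
import numpy as np

def proto_contrastive_loss(emb, label, global_protos, margin=1.0):
    """Attract an embedding to its own class's global prototypes and
    repel it from other classes' prototypes within a margin."""
    emb = np.asarray(emb, dtype=float)
    attract, repel = 0.0, 0.0
    for c, protos in global_protos.items():
        for p in np.asarray(protos, dtype=float):
            d = np.linalg.norm(emb - p)
            if c == label:
                attract += d ** 2                  # magnet: same class pulls in
            else:
                repel += max(0.0, margin - d) ** 2  # repeller: hinge pushes away
    return attract + repel

gp = {"dog": [[0.0, 0.0]], "cat": [[5.0, 5.0]]}
# A "dog" embedding sitting on the dog prototype incurs no penalty;
# the same embedding placed on the cat prototype is penalized heavily.
```

Minimizing such a loss pulls each class's embeddings into a tight cluster around its own prototypes while keeping rival classes at least a margin apart.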

The Result

The authors tested this new system on six different datasets (like images of clothes, cars, and landscapes) under messy, non-uniform conditions.

The Outcome:
Just like a student who uses a smart study group, multiple specific examples, and clear rules, the MP-FedKD system learned much faster and more accurately than the old methods.

  • It was more accurate (better test scores).
  • It made fewer mistakes (lower error rates).
  • It handled the "messy data" (Non-IID) much better than previous systems.

In Summary:
This paper teaches us that when you have a group of learners with different backgrounds, you shouldn't just average their answers. Instead, you should:

  1. Let them learn from their own past selves.
  2. Group their examples into specific, detailed clusters.
  3. Connect the new group to the old group to keep details.
  4. Use clear rules to keep different groups apart.

By doing this, the whole network becomes smarter, faster, and more reliable, even when everyone is working with different data.