Neural Collapse-Inspired Multi-Label Federated Learning under Label-Distribution Skew

Imagine a group of doctors from different hospitals trying to build a single, super-smart AI to diagnose diseases. This is the world of Federated Learning (FL). Instead of sharing patient records (which privacy laws usually forbid and which is risky in any case), they keep their data private and share only the "lessons" their local AI learns.

However, there's a big problem: The patients are different.

Hospital A (in a city) sees mostly heart issues and lung fluid.
Hospital B (in a rural area) sees mostly skin rashes and bone fractures.
Hospital C sees a mix, but mostly rare, complex cases.

This is called Label Skew. If they simply average their lessons together (as the standard method, FedAvg, does), the final AI becomes confused. It gets really good at diagnosing heart issues (because Hospital A is loud) but terrible at spotting skin rashes (because Hospital B is quiet). In medicine, missing a rare disease is dangerous.

The Paper's Solution: "FedNCA-ML"

The authors propose a new method called FedNCA-ML. To understand it, let's use a few analogies.

1. The Problem: The "Noisy Classroom"

Imagine a classroom where every student is trying to learn the same subject, but:

Student A only has textbooks about Cats.
Student B only has textbooks about Dogs.
Student C has a mix, but mostly Birds.

If they all try to teach a single "Global Teacher" at the same time, the Global Teacher gets a headache. They try to please everyone, but end up knowing a little bit about everything and a lot about nothing. They also get confused because Student A thinks "Fur" means "Cat," while Student B thinks "Fur" means "Dog."

2. The Secret Weapon: "Neural Collapse" (The Perfect Geometry)

The paper uses a mathematical concept called Neural Collapse.

The Analogy: Imagine a group of friends trying to stand in a room so they are all equally far apart from each other, forming a perfect, symmetrical shape (like the corners of a triangle or a pyramid).
In AI: This means forcing the AI to organize its knowledge so that every disease (or "class") has its own distinct, perfectly separated "seat" in its brain. No matter which hospital you come from, "Heart Disease" should always look the same to the AI, and it should be far away from "Skin Rash."

3. The Innovation: "The Specialized Detectors" (LADM)

The old way tried to force the AI to look at a whole picture and guess all diseases at once. This is like asking a detective to solve 10 different crimes in one room without separating the clues. It gets messy.

The new method uses a Label-Aware Disentanglement Module (LADM).

The Analogy: Instead of one detective looking at the whole crime scene, the AI puts on 10 different pairs of glasses.
- One pair of glasses is tuned only to look for Heart Disease clues.
- Another pair is tuned only for Lung Fluid.
- Another for Skin Rashes.
Even if a hospital only has pictures of Heart Disease, the "Heart Glasses" get really sharp training. The "Skin Glasses" might not get much data, but they don't get confused by the Heart data. They stay focused on their specific job.

4. The Glue: The "Shared Blueprint" (ETF)

How do we make sure the "Heart Glasses" at Hospital A look the same as the "Heart Glasses" at Hospital B?

The Analogy: The researchers give every hospital the exact same blueprint (called an ETF matrix).
This blueprint acts like a rigid ruler. It tells every local AI: "No matter what you see, your 'Heart Disease' answer must point in this exact direction."
This stops the hospitals from drifting apart and developing their own weird definitions of diseases.

5. The Cleanup Crew: "Noise Cancellation"

Sometimes, the AI gets confused by things that aren't there.

The Analogy: If you are looking for a "Dog," you shouldn't get excited if you see a "Cat."
The paper adds two special "cleaning rules":
1. Rejection Loss: "If there is no Dog in this picture, the Dog glasses should not light up just because there is a Cat." (Suppresses false alarms).
2. Contrastive Loss: "Make sure all the 'Dog' pictures you see look very similar to each other, and very different from 'Cat' pictures." (Groups similar things tightly).

The Result

By using this "Specialized Glasses + Shared Blueprint + Noise Cancellation" approach, the new AI:

Doesn't forget the rare diseases. (It treats the quiet hospitals with the same respect as the loud ones).
Understands the connections. (It knows that Heart Disease and Lung Fluid often happen together, but keeps them distinct).
Works better for everyone.

In short: The paper teaches a group of isolated AI doctors how to collaborate without sharing secrets, ensuring that the final team is an expert at every disease, not just the common ones. It turns a chaotic, biased group into a perfectly organized, balanced super-team.

1. Problem Statement

The paper addresses the challenge of Multi-Label Federated Learning (ML-FL) under Label-Distribution Skew. While standard Federated Learning (FL) allows collaborative training without sharing raw data, existing methods primarily focus on single-label classification. Real-world applications (e.g., medical imaging) often involve multi-label data where multiple conditions co-occur in a single sample.

The specific challenges in this setting are:

Severe Label Imbalance: Local client data often contains majority, minority, or even missing classes. Clients tend to overfit to dominant local labels, leading to a global model with poor generalization on rare classes.
Multi-Label Co-occurrence Bias: Frequent labels often appear with many others, dominating the training signal and suppressing the learning of discriminative features for rare, co-occurring conditions.
Cross-Client Inconsistency: Clients differ not only in label frequencies but also in label dependency structures (how labels co-occur). This heterogeneity causes optimization conflicts, leading to "client drift" where local models diverge significantly from the global objective.
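
The pull toward dominant clients can be seen in a toy version of the averaging step. The sketch below assumes FedAvg-style aggregation in which each client's parameters are weighted by its share of the training samples; the function name `fedavg`, the two-parameter "models", and the sample counts are illustrative and do not come from the paper.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Sample-size-weighted average of client parameter vectors."""
    sizes = np.asarray(client_sizes, dtype=float)
    weights = sizes / sizes.sum()          # each client's share of the data
    stacked = np.stack(client_params)      # shape: (num_clients, num_params)
    return weights @ stacked               # shape: (num_params,)

# Two hypothetical clients whose local models point in different directions.
params = [np.array([1.0, 0.0]),   # large client
          np.array([0.0, 1.0])]   # small client
global_params = fedavg(params, client_sizes=[900, 100])
```

With a 900-to-100 split, the averaged parameters land 90% of the way toward the larger client, which is the "loud hospital" effect in miniature.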

2. Methodology: FedNCA-ML

The authors propose FedNCA-ML (Federated Neural Collapse Alignment for Multi-Label Learning), a framework inspired by Neural Collapse (NC) theory. NC describes an ideal latent geometry in which the features of each class collapse to their class mean and the class prototypes form a Simplex Equiangular Tight Frame (ETF): $C$ equal-norm vectors whose pairwise cosine similarity is $-1/(C-1)$, the widest uniform angular separation that $C$ vectors can achieve.
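
For readers who want to see the geometry, the following is a minimal sketch of one standard way to build a simplex ETF: center the identity matrix, rescale it, and rotate it into the feature space with a matrix that has orthonormal columns. The function name `simplex_etf`, the seed, and the sizes are assumptions for illustration; the summary does not say how the paper generates its matrix.

```python
import numpy as np

def simplex_etf(num_classes, dim, seed=0):
    """Return a (dim, num_classes) matrix whose columns form a simplex ETF."""
    if dim < num_classes:
        raise ValueError("this construction needs dim >= num_classes")
    rng = np.random.default_rng(seed)
    # Orthonormal columns from the reduced QR of a random Gaussian matrix.
    U, _ = np.linalg.qr(rng.standard_normal((dim, num_classes)))
    C = num_classes
    centering = np.eye(C) - np.ones((C, C)) / C
    return np.sqrt(C / (C - 1)) * U @ centering

M = simplex_etf(num_classes=5, dim=16)
gram = M.T @ M   # diagonal entries: squared norms; off-diagonal: cosines
```

In this construction every column has unit norm and every pair of columns has cosine similarity $-1/(C-1)$, which is $-0.25$ for five classes. Because the matrix depends only on the seed and the sizes, every client that calls the function with the same arguments obtains the same prototypes.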

The framework consists of three core components:

A. Label-Aware Disentanglement Module (LADM)

In multi-label settings, a single pooled image representation often entangles evidence for multiple classes, causing gradient interference.

Mechanism: LADM uses a DETR-style cross-attention mechanism. It employs a set of fixed, class-specific query vectors (derived from the global ETF) to attend to image tokens.
Function: This extracts class-wise features ($\mathbf{h}_{ic}$) from the shared backbone features. Instead of forcing the backbone to learn mutually exclusive features, LADM disentangles the evidence, allowing the backbone to preserve semantic proximity while the queries handle specific class predictions.
Consistency: A shared query matrix is used across all clients to anchor class representations to the same directions, reducing inter-client drift.
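
The core of the mechanism can be sketched as single-head scaled dot-product cross-attention in which each class query pools the image tokens into its own feature vector. This is a simplification under stated assumptions: it omits the learned projections, multiple heads, and normalization layers that a DETR-style decoder normally has, and the random `queries` are only a stand-in for the fixed ETF-derived queries. All names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def class_wise_features(tokens, queries):
    """Pool image tokens into one feature per class query.

    tokens:  (num_tokens, dim) backbone outputs for one image
    queries: (num_classes, dim) fixed class queries, shared by all clients
    returns: (num_classes, dim) class-wise features
    """
    dim = tokens.shape[1]
    attn = softmax(queries @ tokens.T / np.sqrt(dim), axis=-1)  # (C, N)
    return attn @ tokens   # each row is a weighted average of the tokens

rng = np.random.default_rng(0)
tokens = rng.standard_normal((49, 16))    # e.g. a 7x7 grid of 16-d tokens
queries = rng.standard_normal((5, 16))    # stand-in for ETF-derived queries
h = class_wise_features(tokens, queries)  # shape (5, 16)
```

Each class receives its own attention map over the same tokens, so evidence for one label does not have to share a single pooled vector with evidence for another.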

B. Neural Collapse-Inspired Feature Alignment

Shared ETF Prior: The method fixes the classifier weights to a globally shared Simplex ETF matrix ($\mathbf{M}$). This matrix serves a dual purpose: it acts as the classifier weights and as the fixed query vectors for the LADM.
Alignment: By anchoring class-wise features to this fixed ETF geometry, the model enforces a consistent decision boundary across all clients, regardless of their local label distribution. This mitigates the tendency of clients to over-specialize to their local data.
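
One way to picture the fixed classifier is sketched below. It assumes that the score for class $c$ is the inner product between that class's feature and its own prototype, followed by a sigmoid; the summary does not state the exact scoring rule, so `class_logits`, the identity-matrix prototypes, and the feature values are illustrative only.

```python
import numpy as np

def class_logits(H, M):
    """Score class c with its own feature h_c and its own fixed prototype m_c.

    H: (num_classes, dim) class-wise features for one sample
    M: (num_classes, dim) fixed prototypes (rows); never updated in training
    """
    return np.einsum("cd,cd->c", H, M)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

M = np.eye(3)                      # toy prototypes, a stand-in for an ETF
H = np.array([[2.0,  0.0, 0.0],    # aligned with prototype 0
              [0.0, -2.0, 0.0],    # opposed to prototype 1
              [0.5,  0.5, 0.0]])   # orthogonal to prototype 2
logits = class_logits(H, M)        # [2., -2., 0.]
probs = sigmoid(logits)            # independent per-class probabilities
```

Because the prototypes are constants, the only way for a client to raise a class score is to move that class's feature toward the shared direction, which is the alignment effect described above.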

C. Regularization Losses

To further enhance clustering and robustness, two complementary losses are introduced:

Negative Feature Rejection Loss ($\mathcal{L}_{Neg}$): Prevents features of negative classes (labels not present in the sample) from spuriously aligning with other class prototypes. It penalizes high similarity between a negative feature and non-corresponding prototypes.
Positive Feature Contrastive Loss ($\mathcal{L}_{Pos}$): Encourages compact intra-class clustering. It pushes positive features closer to their own class prototype while pushing them away from others, using a prototype-based softmax.

Total Objective: The training loss combines Binary Cross-Entropy (BCE) with the two regularization terms, weighted by the hyperparameters $\lambda_1$ and $\lambda_2$:
$\mathcal{L}_{total} = \mathcal{L}_{BCE} + \lambda_1 \mathcal{L}_{Neg} + \lambda_2 \mathcal{L}_{Pos}$
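
The summary gives the two regularizers only in words, so the sketch below fills in plausible functional forms purely for illustration. It assumes that the rejection term is a hinge on cosine similarity, that the contrastive term is a softmax over prototypes with a temperature `tau`, and that class scores are inner products with the class's own prototype. The paper's actual definitions may differ, and every function name and default value here is hypothetical.

```python
import numpy as np

def cosine_matrix(H, M, eps=1e-12):
    """Entry [c, j] is the cosine similarity between h_c and prototype m_j."""
    Hn = H / (np.linalg.norm(H, axis=1, keepdims=True) + eps)
    Mn = M / (np.linalg.norm(M, axis=1, keepdims=True) + eps)
    return Hn @ Mn.T

def bce_with_logits(logits, y):
    # Numerically stable binary cross-entropy, averaged over classes.
    return float(np.mean(np.maximum(logits, 0) - logits * y
                         + np.log1p(np.exp(-np.abs(logits)))))

def neg_rejection(H, M, y):
    # ASSUMED form: for absent classes c, penalize positive cosine
    # similarity between h_c and every other prototype m_j (j != c).
    sim = cosine_matrix(H, M)
    mask = (y == 0)[:, None] & ~np.eye(len(y), dtype=bool)
    return float(np.maximum(sim, 0.0)[mask].mean()) if mask.any() else 0.0

def pos_contrastive(H, M, y, tau=0.1):
    # ASSUMED form: softmax over prototypes with temperature tau; the
    # target for a present class c is its own prototype m_c.
    sim = cosine_matrix(H, M) / tau
    shifted = sim - sim.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    pos = y == 1
    return float(-np.diag(log_softmax)[pos].mean()) if pos.any() else 0.0

def total_loss(H, M, y, lam1=1.0, lam2=1.0):
    logits = np.einsum("cd,cd->c", H, M)   # assumed scoring: <h_c, m_c>
    return (bce_with_logits(logits, y)
            + lam1 * neg_rejection(H, M, y)
            + lam2 * pos_contrastive(H, M, y))

M = np.eye(3)                    # toy prototypes, a stand-in for an ETF
y = np.array([1.0, 0.0, 0.0])    # only class 0 is present in this sample
H_clean = np.eye(3)              # every feature sits on its own prototype
H_leaky = H_clean.copy()
H_leaky[1] = M[0]                # absent class 1 drifts onto prototype 0
```

In this toy case the rejection term is zero for `H_clean` and rises to 0.25 for `H_leaky`, because one of the four (absent class, other prototype) pairs is perfectly aligned.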

3. Key Contributions

Problem Formalization: The first formalization of multi-label FL under the combined constraints of label skew (missing classes) and heterogeneous label co-occurrence patterns.
FedNCA-ML Framework: A novel pre-learning approach that leverages NC theory to enforce a shared geometric prior (ETF) across heterogeneous clients, mitigating representation drift.
Class-Wise Attention Mechanism: The introduction of LADM, which reformulates multi-label learning into per-class subproblems compatible with NC alignment, preserving semantic relationships in the backbone while ensuring class-specific discrimination.
Dual Regularization: The design of rejection and contrastive losses to suppress noisy negative signals and promote tight, well-separated feature clusters in the latent space.

4. Experimental Results

The method was evaluated on five benchmark datasets (CIFAR-10, PASCAL VOC, MS COCO, DermaMNIST, ChestX-ray14) under nine different FL settings (varying the Dirichlet concentration parameter $\beta$ and the class-presence ratio $\gamma$).
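
Label skew controlled by a Dirichlet concentration parameter is commonly simulated by drawing, for each class, a Dirichlet-distributed share per client and splitting that class's samples accordingly. The sketch below shows that recipe for single-label data only. How the paper extends it to multi-label samples, and how $\gamma$ restricts which classes a client sees, is not described in this summary, so neither is modeled here. `dirichlet_partition` and its arguments are illustrative names.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, beta, seed=0):
    """Split sample indices across clients with Dirichlet label skew."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        # Share of class c that each client receives.
        props = rng.dirichlet(beta * np.ones(num_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

labels = np.repeat(np.arange(10), 100)   # 10 classes, 100 samples each
parts = dirichlet_partition(labels, num_clients=5, beta=0.1)
```

Smaller values of `beta` concentrate each class on fewer clients, which produces the majority, minority, and missing classes described in the problem statement; larger values approach an even split.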

Performance Gains: FedNCA-ML consistently outperformed state-of-the-art baselines (FedAvg, FedProx, SCAFFOLD, FedLGT, etc.).
- CIFAR-10: Achieved up to 3.92% improvement in class-wise AUC and 4.57% in class-wise F1 score under severe skew.
- DermaMNIST: Improved class-wise F1 by 4.93% in highly imbalanced scenarios.
- ChestX-ray14: Improved class-wise AUC by up to 1.21%, indicating better recognition of minority disease classes than methods that over-predict the majority "No Finding" class.
Ablation Studies:
- Removing LADM caused performance degradation due to feature entanglement.
- Using fixed ETF queries outperformed learnable queries, confirming that shared geometric priors are crucial for stability in non-IID settings.
- Visualization (t-SNE and Grad-CAM) confirmed that FedNCA-ML produces more compact, semantically coherent clusters and focuses on correct image regions for specific classes.
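
The class-wise metrics quoted above reward models that do not neglect rare labels. The sketch below assumes "class-wise F1" means computing F1 separately for each class and then averaging (often called macro-F1); the summary does not define the metric, and the tiny label matrices are made up for illustration.

```python
import numpy as np

def class_wise_f1(y_true, y_pred):
    """Per-class F1 for binary arrays of shape (num_samples, num_classes)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = (y_true & y_pred).sum(axis=0)
    fp = (~y_true & y_pred).sum(axis=0)
    fn = (y_true & ~y_pred).sum(axis=0)
    denom = 2 * tp + fp + fn
    # Classes with no positives and no predictions get a score of 0.
    return np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)

# Class 0 is common, class 1 is rare; the model never predicts class 1.
y_true = np.array([[1, 0], [1, 0], [1, 1], [1, 0]])
y_pred = np.array([[1, 0], [1, 0], [1, 0], [1, 0]])
per_class = class_wise_f1(y_true, y_pred)   # [1., 0.]
macro_f1 = per_class.mean()                 # 0.5
```

A model that is perfect on the common class but blind to the rare one scores only 0.5 here, whereas a metric pooled over all label decisions would hide most of that failure.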

5. Significance

The paper's main contribution is to connect Neural Collapse theory with multi-label federated learning in non-IID environments.

Medical Relevance: The focus on label-skewed multi-label scenarios is highly relevant for medical imaging (e.g., chest X-rays, dermatology), where diseases are rare, co-occur, and data distributions vary significantly across hospitals.
Theoretical Insight: The results suggest that enforcing a specific geometric structure (Simplex ETF) on class-wise representations can mitigate the conflicts caused by heterogeneous label dependencies, offering a principled approach to the "client drift" problem in complex multi-label settings.
Practical Impact: The proposed method provides a robust, privacy-preserving solution for training global models on distributed, highly imbalanced multi-label data without requiring raw data sharing.