Imagine a group of students from different schools trying to solve a giant puzzle together, but they can't share their actual puzzle pieces (because of privacy rules). Instead, they send the teacher a small sketch of what their piece looks like. This is Federated Learning.
Usually, this works great. But in this paper, the authors discovered a specific problem that happens when the puzzle is unbalanced: some students have hundreds of pieces of the "Sky" category, but only one piece of the "Rare Bird" category.
Here is the story of how they fixed it, using simple analogies.
1. The Problem: The "Broken Compass" Loop
In the old way of doing this (called Prototype-Based Learning), every student calculates the "average" look of their "Sky" pieces and their "Rare Bird" pieces. They send these averages (called Prototypes) to the teacher. The teacher mixes them all together to make a "Global Average" and sends it back.
The Trap:
- Student A has 1,000 Sky pieces and 1 Rare Bird piece. Their "Rare Bird" average is shaky and unreliable because it's based on a single piece.
- Student B has 100 Sky pieces and 0 Rare Bird pieces.
- The teacher blindly mixes everyone's averages. The shaky "Rare Bird" average from Student A gets mixed in, making the Global Average for birds slightly wrong.
- The teacher sends this slightly wrong "Global Bird" back to everyone.
- The Loop: Now, Student A tries to match their pieces to this wrong "Global Bird." Because the guide is wrong, Student A's next sketch becomes even more wrong.
- This repeats every round. The "Rare Bird" gets more distorted, and the students get confused. The authors call this the "Prototype Bias Loop." It's like a broken compass that keeps pointing slightly off, and everyone keeps walking further off course because they trust the compass.
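The loop above starts with a plain, trust-everyone average. A toy numeric sketch (the counts and feature values below are illustrative, not from the paper) shows how one student's single-piece "Rare Bird" average can drag the global prototype:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D features: real "Rare Bird" pieces cluster around +2.0.
true_bird_mean = 2.0

# Student A has only 1 bird piece -> their local average is shaky.
client_a_birds = rng.normal(true_bird_mean, 1.0, size=1)
# Student C has 100 bird pieces -> their local average is reliable.
client_c_birds = rng.normal(true_bird_mean, 1.0, size=100)

proto_a = client_a_birds.mean()  # may land far from 2.0
proto_c = client_c_birds.mean()  # close to 2.0

# The old aggregation: a plain average that trusts both students equally,
# so the shaky prototype pulls the "Global Bird" off target.
global_proto = (proto_a + proto_c) / 2
```

Each round, clients then align to this pulled-off global prototype, which is what lets the small initial error compound into the "Prototype Bias Loop."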
2. The Solution: CAFedCL (The "Smart Team Captain")
The authors propose a new system called CAFedCL. Instead of blindly trusting everyone's sketch, the system adds a "Confidence Check."
Think of the teacher as a Smart Team Captain who doesn't just average the sketches; they weigh them based on how reliable the student is.
A. The "Confidence Score" (Weighing the Votes)
Before the teacher mixes the sketches, they ask each student: "How sure are you about your 'Rare Bird' sketch?"
- If a student has only one bird piece, they say, "I'm not very sure."
- If a student has 100 bird pieces, they say, "I'm very confident!"
- The Fix: The teacher gives less weight to the shaky sketches and more weight to the confident ones. This stops the "broken compass" from getting worse. It's like ignoring the opinion of a student who is guessing, so the group doesn't get led astray.
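The weighting idea can be sketched in a few lines. Here sample count stands in for the paper's confidence score, and the prototype values are made-up illustrative numbers:

```python
import numpy as np

def aggregate(protos, counts):
    """Average class prototypes, weighting each student's vote by
    how many examples backed it (a simple stand-in for the paper's
    confidence score)."""
    protos = np.asarray(protos, dtype=float)
    w = np.asarray(counts, dtype=float)
    w = w / w.sum()  # normalize weights so they sum to 1
    return (w * protos).sum()

# Student A's bird sketch from 1 piece vs. Student C's from 100 pieces.
shaky, confident = 5.0, 2.0       # hypothetical prototype values
naive = (shaky + confident) / 2   # plain average: pulled toward the guess
weighted = aggregate([shaky, confident], [1, 100])  # stays near 2.0
```

The confidence-weighted result sits almost exactly on the reliable student's prototype, so the shaky vote can no longer steer the "Global Bird."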
B. The "Generative Augmentation" (The Magic Photocopier)
For the students who have almost no "Rare Bird" pieces (the minority), the system gives them a Magic Photocopier (a generative model).
- This photocopier looks at the few real bird pieces they have and creates fake but realistic bird pieces to help them practice.
- This gives the student more data to work with, making their sketch more accurate before they even send it to the teacher.
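As a rough sketch of the idea (the paper uses a learned generative model; the simple Gaussian sampler below is only an illustrative stand-in, and all data is made up):

```python
import numpy as np

rng = np.random.default_rng(42)

# A student's three real "Rare Bird" feature vectors (toy 2-D features).
real_birds = np.array([[2.1, 1.9],
                       [1.8, 2.2],
                       [2.0, 2.0]])

# Toy "photocopier": draw synthetic pieces around the real pieces'
# mean and spread, so they are fake but plausible.
mu = real_birds.mean(axis=0)
sigma = real_birds.std(axis=0) + 1e-3
fake_birds = rng.normal(mu, sigma, size=(50, 2))

# Train on 3 real + 50 synthetic pieces instead of 3 alone.
augmented = np.vstack([real_birds, fake_birds])
```

With 53 examples instead of 3, the student's local "Rare Bird" prototype is far less noisy before it ever reaches the teacher.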
C. The "Geometry Regularizer" (The Fence)
Sometimes, when students are confused, they might accidentally mix up the "Sky" and the "Bird" categories, making them look too similar.
- The system puts up a fence (a mathematical rule) that forces the "Sky" group and the "Bird" group to stay far apart.
- This ensures that even if the data is messy, the categories don't collapse into a giant, confusing blob.
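One common way to build such a fence is a hinge-style penalty that charges the model whenever two class prototypes drift closer than a chosen margin. This is a generic sketch of that idea, not the paper's exact formula, and the margin and coordinates are illustrative:

```python
import numpy as np

def separation_penalty(protos, margin=4.0):
    """Penalize every pair of class prototypes that sits closer
    together than `margin`; pairs already far apart cost nothing."""
    loss = 0.0
    k = len(protos)
    for i in range(k):
        for j in range(i + 1, k):
            dist = np.linalg.norm(protos[i] - protos[j])
            loss += max(0.0, margin - dist) ** 2
    return loss

sky = np.array([-2.0, 0.0])
bird_far = np.array([2.0, 0.0])    # 4.0 apart -> outside the fence, no cost
bird_near = np.array([-1.0, 0.0])  # 1.0 apart -> penalized, pushed away
```

Adding this penalty to the training loss is what keeps "Sky" and "Bird" from collapsing into one blob.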
3. The Result: A Fairer, Smarter Team
By using this new method, the team achieves two big things:
- Better Accuracy: The final puzzle is solved much more correctly, especially for the rare pieces (the "Rare Birds").
- Fairness: In the old system, students with rare data got left behind and performed poorly. In this new system, the "Smart Team Captain" ensures that even the students with difficult, rare data get a fair shot at success.
Summary
The paper is about fixing a flaw where a group learning together gets stuck in a cycle of making mistakes because they trust bad data too much. They fixed it by:
- Listening to the experts (trusting confident students more).
- Helping the beginners (using AI to create more practice data for rare items).
- Keeping categories distinct (making sure "Birds" don't look like "Sky").
The result is a learning system that is robust, fair, and doesn't get confused by messy, unbalanced data.