Federated Learning Performance Depends on Site Variation in Global HIV Data Consortia

This study demonstrates that Federated Learning effectively enables privacy-preserving, multi-site machine learning for HIV care across diverse international cohorts, achieving performance comparable to centralized models while significantly outperforming local site-specific approaches.

Jackson, N. J., Yan, C., Caro-Vega, Y., Paredes, F., Ismerio Moreira, R., Cadet, S., Varela, D., Cesar, C., Duda, S. N., Shepherd, B. E., Malin, B. A.

Published 2026-03-27

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to build the ultimate HIV care recipe book. You want this book to be so good that it can predict who might get sick, who might need extra help, and how to save lives, no matter where they live.

To make this recipe book perfect, you need to taste-test it with ingredients from all over the world: Brazil, Haiti, Mexico, Chile, and Honduras. But here's the problem: Privacy laws and security rules mean you can't physically ship the ingredients (patient data) to one giant kitchen. Each country has to keep their ingredients in their own locked pantry.

This is where the researchers in this paper stepped in with a clever new cooking method called Federated Learning (FL).

The "Cook-Along" Analogy

Instead of shipping ingredients, imagine a Master Chef (the global AI model) who sends a blank recipe card to every local kitchen.

  1. The Local Cooks: Each local chef (the hospital in Haiti, Mexico, etc.) takes the blank card and cooks a dish using only their own local ingredients. They taste it, figure out what went wrong, and write down the changes they made to the recipe (e.g., "add more salt," "cook 2 minutes longer").
  2. The Secret Sauce: They send only the changes back to the Master Chef. They do not send the actual ingredients or the names of the people who ate the food.
  3. The Global Update: The Master Chef collects all the "change notes" from every kitchen, averages them out, and creates a new, improved global recipe card.
  4. Repeat: This new card goes back to the local kitchens, and the process repeats until the recipe is perfect.

This way, the Master Chef learns from everyone's experience without ever seeing anyone's private data.
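The four steps above describe the classic federated averaging loop (FedAvg). Here is a minimal sketch of that loop, assuming a simple linear model whose weights are NumPy arrays; all function names and data here are illustrative toys, not the paper's actual code or cohorts:

```python
import numpy as np

def local_update(global_weights, X, y, lr=0.1, epochs=5):
    """One 'local cook': refine the recipe on this site's private data only."""
    w = global_weights.copy()
    for _ in range(epochs):
        preds = X @ w                      # linear model for illustration
        grad = X.T @ (preds - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w - global_weights              # send back only the *changes*

def fed_avg(global_weights, site_deltas, site_sizes):
    """The 'Master Chef': average the change notes, weighted by site size."""
    total = sum(site_sizes)
    avg_delta = sum((n / total) * d for d, n in zip(site_deltas, site_sizes))
    return global_weights + avg_delta

# Simulate three sites holding private data that never leaves the site.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
sites = []
for n in (500, 200, 50):                   # one big site, two smaller ones
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    sites.append((X, y))

w = np.zeros(2)                            # the blank recipe card
for _ in range(20):                        # repeat until the recipe converges
    deltas = [local_update(w, X, y) for X, y in sites]
    w = fed_avg(w, deltas, [len(y) for _, y in sites])

print(w)  # converges close to true_w without ever pooling the raw data
```

Note that only the weight deltas cross the network; the raw `(X, y)` arrays stay in each site's "locked pantry."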

What Did They Find?

The researchers tested this "Cook-Along" method against two other ways of making the recipe:

  • The "All-Ingredients" Method (Centralized): Everyone sends their data to one kitchen. (This is the gold standard for accuracy, but privacy regulations prohibit it in many jurisdictions.)
  • The "Solo Chef" Method (Site-Specific): Each chef tries to make the recipe using only their own tiny pantry.

Here are the three big takeaways, explained simply:

1. The "Cook-Along" is Almost as Good as the "All-Ingredients" Method

The Federated Learning recipe was 99% as good as the one made with all the data combined. It was way better than the "Solo Chef" method.

  • Why it matters: We can build highly accurate medical AI without violating privacy laws, so hospitals can collaborate without costly centralized infrastructure or legal risk.

2. Size Matters (But Not How You Think)

You might think the biggest kitchens (like the one in Haiti with 13,000 patients) would benefit the most from this group cooking.

  • The Twist: Actually, the small kitchens benefited the most!
  • The Analogy: If you are a tiny restaurant with only 50 customers, you don't know much about what everyone else likes. But if you join a group of 100 similar restaurants, you suddenly learn from thousands of customers' preferences. Your menu gets amazing.
  • The Big Kitchen: The huge kitchen in Haiti already knew so much about its own customers that joining the group didn't change its recipe much. It was already a pro.

3. The "Different Flavors" Problem (Heterogeneity)

This is the most interesting part. Sometimes, the ingredients in one country are just too different from another.

  • The Analogy: Imagine the Haitian kitchen uses spicy, tropical ingredients, while the Chilean kitchen uses mild, root-vegetable ingredients. If the Master Chef tries to force a "one-size-fits-all" recipe on both, the Haitian dish might taste bland, and the Chilean dish might be too spicy.
  • The Solution: The researchers found that after the group cooking, it helps if each local chef does a little bit of "Fine-Tuning."
    • They take the global recipe and tweak it slightly to match their specific local taste.
    • Result: This "Fine-Tuning" made the recipes even better, especially for tricky tasks like predicting Tuberculosis.
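This fine-tuning step can be sketched as a few extra gradient steps on local data, starting from the converged global weights rather than from scratch. Again, this is an illustrative toy under the same linear-model assumption, not the paper's actual code:

```python
import numpy as np

def fine_tune(global_weights, X_local, y_local, lr=0.05, epochs=10):
    """Start from the shared global model, then nudge it toward local data."""
    w = global_weights.copy()
    for _ in range(epochs):
        preds = X_local @ w
        grad = X_local.T @ (preds - y_local) / len(y_local)
        w -= lr * grad
    return w                               # a personalized, site-specific model

# A site whose local 'flavor' differs from the global average.
rng = np.random.default_rng(1)
global_w = np.array([2.0, -1.0])           # what the consortium learned
local_w = np.array([2.5, -0.5])            # this site's true relationship
X = rng.normal(size=(80, 2))
y = X @ local_w + rng.normal(scale=0.1, size=80)

personal_w = fine_tune(global_w, X, y)
# personal_w lands between global_w and local_w: it keeps the consortium's
# shared knowledge while adapting toward the local population.
```

The trade-off is controlled by `lr` and `epochs`: more fine-tuning moves the model further toward the local optimum, at the cost of forgetting more of what the consortium learned.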

The Bottom Line

This paper shows that we can build highly capable medical AI that respects privacy. We don't need to share private patient data to learn from each other.

  • For small hospitals: Joining the group is a game-changer; it gives them the power of a giant database.
  • For big hospitals: They might not need the group as much, but they can still help others.
  • For everyone: If the local patients are very different from the rest of the world, the best strategy is to learn from the group, then tweak the model locally to fit your specific community.

It's like a global potluck where everyone brings a dish, but instead of eating the food, we just share the secret recipes to make the world's best meal, all while keeping everyone's family recipes safe in their own kitchens.
