Imagine you are the head of a global medical network. You have hospitals in big cities with supercomputers and massive databases (let's call them "Strong Hospitals") and small rural clinics with older computers and fewer patient records ("Weak Hospitals").
You want to build a single AI system to help all of them diagnose diseases. But there's a catch:
- Privacy: They can't send their patient data to a central server.
- Heterogeneity: The data in the city is very different from the data in the country, and the computers run different software.
The biggest problem? Uncertainty.
If the AI says, "I'm 99% sure this is a broken bone," but it's actually a sprain, the patient gets hurt. In a centralized system, you can easily measure how often the AI is wrong. But in this distributed network, the "Strong Hospitals" might be overconfident (thinking they are perfect), while the "Weak Hospitals" might be under-confident or just plain wrong, yet the average of all hospitals looks perfect. This hides the failures of the small clinics.
The Problem: The "Average" Lie
The paper argues that simply averaging the results from all hospitals is dangerous.
- Analogy: Imagine a classroom test. The top student gets 100%, and the struggling student gets 0%. The class average is 50%. If you tell the principal, "The class is doing fine at 50%," you are lying. The struggling student is failing, and that failure is hidden by the top student's success.
- In AI terms, this leads to "Silent Failures." The system looks good globally, but the small, under-resourced agents are making dangerous mistakes without anyone noticing.
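The hidden-failure arithmetic is easy to see in a few lines. This is a minimal sketch with made-up numbers (the 99%/60% coverage rates and the sample sizes are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-hospital outcomes: did the AI's prediction set
# contain the true diagnosis? (1 = covered, 0 = miss)
strong = rng.random(1000) < 0.99   # big hospital: ~99% coverage on 1000 cases
weak = rng.random(50) < 0.60       # small clinic: ~60% coverage on 50 cases

pooled = np.concatenate([strong, weak])
print(f"pooled coverage:  {pooled.mean():.1%}")  # looks healthy, near a 95% target
print(f"strong hospital:  {strong.mean():.1%}")
print(f"weak clinic:      {weak.mean():.1%}")    # the silent failure
```

The pooled number looks fine because the big hospital's 1000 cases drown out the clinic's 50, which is exactly the "average lie" above.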
The Solution: FedWQ-CP (The "Weighted Wisdom" System)
The authors propose a new method called FedWQ-CP. Think of it as a clever way to set a "Safety Margin" for the whole network without anyone sharing their secret data.
Here is how it works, using a simple analogy:
1. The Local Calibration (The "Practice Test")
Every hospital (agent) takes a "practice test" on their own local data.
- They ask: "How wrong are we usually?"
- They calculate a Threshold Score.
- Strong Hospital: "Our AI is usually very precise. We only need a small safety margin to be 95% sure."
- Weak Hospital: "Our AI is a bit shaky. We need a huge safety margin to be 95% sure."
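The "practice test" in step 1 is standard split-conformal calibration. Here is a minimal sketch, assuming the common nonconformity score 1 − p(true class) and the finite-sample-corrected quantile; the paper's exact score function may differ:

```python
import numpy as np

def local_threshold(softmax_scores, labels, alpha=0.05):
    """Split-conformal calibration on one hospital's held-out data.

    Nonconformity score = 1 - model probability assigned to the true class;
    the threshold is the finite-sample-corrected (1 - alpha) quantile.
    """
    n = len(labels)
    scores = 1.0 - softmax_scores[np.arange(n), labels]
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, q_level, method="higher")

# Toy calibration set: 3 classes, 5 examples (illustrative numbers only)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.3, 0.1],
                  [0.2, 0.2, 0.6]])
labels = np.array([0, 1, 2, 0, 2])
print(local_threshold(probs, labels, alpha=0.05))  # -> 0.6
```

A shaky model assigns low probability to the true class, so its scores are high and its threshold (safety margin) comes out wide, exactly the Strong vs. Weak contrast above.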
2. The Secret Exchange (The "One-Shot Whisper")
Instead of sending all their practice test scores (which would be too much data and a privacy risk), each hospital sends only two numbers to the central server:
- Their Threshold Score (How much safety margin they need).
- Their Sample Size (How many practice tests they took).
3. The Smart Aggregation (The "Weighted Average")
This is the magic part. The server doesn't just take a simple average. It uses a Weighted Average.
- Analogy: Imagine a town council voting on a new speed limit.
- If you have 100 residents, your vote counts for 100.
- If you have 5 residents, your vote counts for 5.
- You don't let the 5 residents outvote the 100 just because they are loud.
- In FedWQ-CP, the server gives more weight to the thresholds from hospitals that had more data. This ensures the final "Global Safety Margin" is stable and reliable, not skewed by a tiny clinic with very noisy data.
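Steps 2 and 3 together can be sketched in a few lines: each hospital reports only its (threshold, sample size) pair, and the server combines them. The plain sample-size-weighted average below illustrates the idea; the actual weighted-quantile rule in FedWQ-CP may be more refined, and all numbers are made up:

```python
import numpy as np

# One (threshold, sample_size) pair per hospital -- the entire one-shot payload.
reports = [
    (0.12, 5000),   # strong city hospital: tight margin, lots of data
    (0.15, 3000),
    (0.55, 40),     # small rural clinic: wide, noisy margin, little data
]

thresholds = np.array([t for t, _ in reports])
counts = np.array([n for _, n in reports], dtype=float)

# Weight each hospital's threshold by how many calibration samples back it up.
global_threshold = np.average(thresholds, weights=counts)
print(round(global_threshold, 4))
```

Note how the clinic's noisy 0.55 barely moves the result: its 40 samples carry 40 "votes" against the city hospitals' 8000.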
4. The Result (The "Universal Safety Net")
The server sends this single, smartly calculated Global Safety Margin back to everyone.
- Now, every hospital, big or small, uses this same margin to make predictions.
- The Outcome: The system guarantees that every hospital, whether strong or weak, gets the promised 95% coverage: the true answer lands inside the AI's prediction set at least 95% of the time. No more silent failures.
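Once the global margin is broadcast, making a prediction is just one comparison per class. A minimal sketch, continuing the 1 − p score from the calibration step (the 0.6 margin and the probabilities are illustrative):

```python
import numpy as np

def prediction_set(softmax_probs, global_threshold):
    """Include every class whose nonconformity score (1 - probability)
    falls within the shared global safety margin."""
    return [c for c, p in enumerate(softmax_probs) if 1.0 - p <= global_threshold]

# A confident case and an ambiguous case
print(prediction_set(np.array([0.96, 0.03, 0.01]), 0.6))  # -> [0]
print(prediction_set(np.array([0.50, 0.45, 0.05]), 0.6))  # -> [0, 1]
```

The ambiguous case honestly returns two candidate diagnoses instead of a single overconfident one; the guarantee is that the true class is in this set 95% of the time.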
Why is this a Big Deal?
- It's Fast: It only takes one round of communication. No back-and-forth chatting.
- It's Private: No raw data ever leaves the local hospital.
- It's Fair: It fixes the problem where big players hide the failures of small players.
- It's Efficient: It produces the smallest possible prediction sets.
- Analogy: If you are guessing a number, a "safe" guess might be "Between 1 and 100." A "smart" safe guess is "Between 48 and 52." FedWQ-CP gives you the tightest, most useful range that is still safe, rather than a huge, useless range.
Summary
The paper introduces a way to build a trustworthy AI network for a world where everyone is different (different data, different computers). It stops the "Average" from lying about the "Weak" players. By using a weighted voting system based on how much data each player has, it creates a single, reliable safety rule that protects everyone, everywhere, without compromising privacy or speed.