Learning Centre Partitions from Summaries

This paper proposes a sequential "Clusters-of-Centres" algorithm that uses multivariate Cochran-type tests on summary statistics to identify and merge homogeneous groups in multi-centre studies, establishing asymptotic distributions and proving that a multi-round bootstrap variant can recover the true centre partition with high probability.

Zinsou Max Debaly, Jean-Francois Ethier, Michael H. Neumann, Félix Camirand-Lemyre

Published Mon, 09 Ma

Imagine you are the head of a massive logistics company trying to figure out why flights are late. You have data from 22 different airports across the U.S. (like JFK, LAX, and ORD).

The Problem:
You can't just dump all the raw flight data into one giant spreadsheet. Why? Because of privacy laws and security. Each airport keeps its own data locked in its own server. All you can get from them are "summary reports"—like "average delay time" or "how much the weather affects delays."

Now, here's the tricky part: Are all these airports actually the same?

  • Maybe JFK and LaGuardia (both in NYC) have very similar delay patterns.
  • Maybe Salt Lake City is totally different because of its mountain weather.
  • Maybe some airports are outliers with unique problems.

If you just average all the data together, you might get a "fake average" that doesn't represent any real airport. It's like averaging the speed of a Ferrari and a tractor; the result is useless for understanding either vehicle. You need to group the airports that are similar and treat the different ones separately.
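The "fake average" pitfall is easy to see with numbers. Here is a tiny illustration (the speed figures are invented for the analogy, not from the paper):

```python
# Invented illustration: pooling two very different groups produces
# an average that describes neither group.
ferrari_speeds = [290, 305, 310]   # km/h
tractor_speeds = [25, 30, 28]      # km/h

pooled = ferrari_speeds + tractor_speeds
pooled_mean = sum(pooled) / len(pooled)

print(round(pooled_mean, 1))  # ~164.7 km/h: matches no real vehicle
```

The pooled mean sits in a no-man's-land between the two groups, which is exactly why you want to detect and separate heterogeneous centres before averaging.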

The Solution: The "CoC" Algorithm
The authors of this paper invented a smart, step-by-step method called the Clusters-of-Centres (CoC) algorithm. Think of it as a very strict, scientific bouncer at a club who decides who gets to stand in the same VIP group.

Here is how it works, using simple analogies:

1. The "Cochran Test" (The Lie Detector)

First, the algorithm looks at two airports (or groups of airports) and asks: "Are you guys actually the same, or are you just pretending?"

It uses a special statistical test (a multivariate Cochran-type test) that acts like a lie detector. It looks at the summary reports and checks if the differences between the airports are just random noise (like a few bad weather days) or if they are real, structural differences (like one airport being in a hurricane zone and the other in a desert).

  • If the test says "Same": They get to merge into a group.
  • If the test says "Different": They stay separate.
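To make the "lie detector" concrete, here is a minimal univariate sketch of a Cochran-type homogeneity test on summary statistics. The paper's test is multivariate; this toy version assumes each centre reports a single estimate and its standard error (all numbers are invented):

```python
def cochran_q(estimates, std_errors):
    """Cochran-type Q statistic: weighted spread of centre estimates
    around their inverse-variance-weighted mean."""
    w = [1.0 / se ** 2 for se in std_errors]      # inverse-variance weights
    theta_bar = sum(wi * t for wi, t in zip(w, estimates)) / sum(w)
    return sum(wi * (t - theta_bar) ** 2 for wi, t in zip(w, estimates))

# 95% quantile of chi-square with 1 degree of freedom (2 centres)
CHI2_CRIT_DF1 = 3.841

# Two centres with nearly identical summaries: Q is small, so "same".
q = cochran_q([1.02, 0.98], [0.10, 0.10])
print(q, q <= CHI2_CRIT_DF1)
```

Under homogeneity, Q follows (asymptotically) a chi-square distribution, so a small Q means the differences look like random noise and the centres may merge; a large Q keeps them separate.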

2. The "Merge" Dance (The Algorithm)

The algorithm doesn't just guess. It starts with every airport in its own little group. Then, it goes through them one by one:

  • "Hey, Airport A, do you look like Airport B?"
  • "Yes? Okay, merge!"
  • "No? Okay, stay separate."

It does this sequentially, building bigger and bigger groups of similar airports, but only merging them if the "lie detector" is convinced they are identical.
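The merge loop above can be sketched in a few lines. This is a toy version, not the paper's exact CoC algorithm: the real test is the Cochran-type statistic, stood in for here by a simple mean comparison with an invented tolerance:

```python
def same_group(a, b, tol=0.05):
    # Stand-in for the Cochran-type test: compare group means.
    return abs(sum(a) / len(a) - sum(b) / len(b)) <= tol

def sequential_merge(centre_summaries):
    """Start with singletons; merge each centre into the first
    existing group whose every member passes the pairwise test."""
    groups = []
    for summary in centre_summaries:
        for g in groups:
            if all(same_group(summary, s) for s in g):
                g.append(summary)       # test passed: merge
                break
        else:
            groups.append([summary])    # no match: new singleton group
    return groups

centres = [[1.00], [1.01], [2.50]]
print(len(sequential_merge(centres)))  # 2 groups: {1.00, 1.01} vs {2.50}
```

The key property is that a merge only ever happens when the test is passed, so structurally different centres stay apart.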

3. The "Bootstrap" (The Second Opinion)

Here is the genius part. In real life, data is messy. Sometimes the "lie detector" might make a mistake because of a fluke in the data. To fix this, the authors use a technique called Bootstrapping.

Imagine you are trying to decide if two people are twins. Instead of looking at them once, you ask 100 different judges to look at them and decide.

  • The algorithm takes the summary data and creates 100 slightly different "fake" versions of it (resampling).
  • It runs the merge test on all 100 versions.
  • If the two airports merge in 99 out of 100 versions, the algorithm is very confident they belong together.
  • If they only merge in 50 versions, the algorithm says, "I'm not sure, let's keep them separate."

This "multi-round" process ensures that the final grouping is rock-solid and not just a lucky guess.
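Here is a hedged sketch of that bootstrap voting idea: resample the data many times, re-run the merge decision on each resample, and only merge when a large fraction of the "judges" agree. The decision rule, tolerance, and vote threshold below are invented for illustration:

```python
import random

def merge_decision(x, y, tol=0.2):
    # Stand-in for the merge test: compare resampled means.
    return abs(sum(x) / len(x) - sum(y) / len(y)) <= tol

def bootstrap_merge(x, y, n_rounds=100, threshold=0.9, seed=0):
    """Merge only if at least `threshold` of the bootstrap rounds vote yes."""
    rng = random.Random(seed)
    votes = 0
    for _ in range(n_rounds):
        xs = rng.choices(x, k=len(x))   # resample with replacement
        ys = rng.choices(y, k=len(y))
        votes += merge_decision(xs, ys)
    return votes / n_rounds >= threshold

a = [1.0, 1.1, 0.9, 1.05, 0.95]
b = [1.0, 0.95, 1.05, 1.1, 0.9]
print(bootstrap_merge(a, b))
```

Because a single fluky sample can no longer flip the decision, the final partition is much more stable than one based on a single test.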

4. The Real-World Test (Airline Data)

The authors tested this on real U.S. airline data from 2007.

  • The Result: The algorithm looked at 22 major airports and decided... none of them should be grouped together.
  • Why? Even though some airports are close geographically, their specific delay patterns (how they react to rain, time of day, etc.) were so unique that the algorithm couldn't find any statistical proof that they were "the same."
  • The Takeaway: Every airport has its own unique personality. Treating them all as one big group would have been a mistake.

Why Does This Matter?

This paper solves a huge problem in the modern world: How do we learn from massive amounts of data without breaking privacy?

Whether it's:

  • Hospitals trying to find the best treatment for a disease without sharing patient records.
  • Banks trying to detect fraud without sharing customer transaction details.
  • Schools trying to improve teaching methods without pooling student grades.

This method allows these organizations to work together, find the "hidden groups" (like hospitals that treat similar patients), and make better decisions, all while keeping their sensitive data locked up tight.

In a nutshell:
The paper gives us a mathematical "glue" that only sticks things together if they are truly identical, and a "safety net" (the bootstrap) to make sure we don't glue the wrong things together by accident. It turns a messy pile of isolated data summaries into a clear, organized map of reality.