HeteroFedSyn: Differentially Private Tabular Data Synthesis for Heterogeneous Federated Settings

Here is an explanation of the paper HeteroFedSyn, broken down into simple concepts with creative analogies.

The Big Picture: The "Secret Recipe" Problem

Imagine you have a group of five different restaurants (let's call them Party 1 through Party 5). Each restaurant has its own secret recipe book (the data) with thousands of customer orders.

The Goal: They want to create a new, fake recipe book (a synthetic dataset) that closely mimics the combined real books, so they can share it with food critics or investors without revealing any single customer's actual order.
The Problem: They can't send their real recipe books to a central office because that would leak secrets. They also can't just mix their books together, because one restaurant mostly sells burgers while another mostly sells sushi. If they just mix them, the result is a weird, biased mess.
The Solution: They need a way to share statistics (like "50% of orders are burgers") rather than the actual orders, while adding just enough "static" (noise) to hide individual secrets, but not so much that the recipe becomes unrecognizable.

This is what HeteroFedSyn does. It's a new system for creating fake data from real data held by different groups, without anyone ever seeing the raw data.

The Three Main Hurdles (and How They Solved Them)

The paper identifies three big problems with doing this in a "federated" (distributed) setting and offers a clever fix for each.

1. The "Too Much Noise" Problem

The Analogy: Imagine trying to hear a whisper in a crowded stadium. If you ask 50 people to whisper their secret to a central recorder, and everyone adds a little bit of static to protect their identity, the final recording is just white noise. You can't hear the message.
The Paper's Fix: Instead of sending the whole "recipe book" (which is huge), they send a compressed sketch.

They use a mathematical trick called Random Projection. Think of it like taking a high-definition photo of a complex dish and shrinking it down to a tiny, low-resolution thumbnail.
Surprisingly, even though the thumbnail is small, it still keeps the shape of the dish (the relationship between ingredients). This saves a massive amount of data and reduces the amount of "static" needed to hide the details.

2. The "Distorted Picture" Problem

The Analogy: If you try to measure the distance between two points on a blurry, noisy photo, your ruler will give you the wrong answer. In this case, the "blur" is the privacy noise added by the restaurants. If the central office tries to calculate how "connected" two ingredients are (e.g., do people who order burgers also order fries?) using these blurry numbers, the math breaks.
The Paper's Fix: They invented a Mathematical De-blurring Tool.

The paper shows a specific formula that acts like a filter. It takes the blurry, noisy numbers sent by the restaurants and mathematically subtracts the "blur" to reveal the true connection between ingredients.
This allows the central office to estimate reliably which ingredients go together, even though it never saw the real data.

3. The "Redundant Clues" Problem

The Analogy: Imagine you are trying to guess a secret code. You ask for clues.

Clue 1: "The first letter is A."
Clue 2: "The second letter is B."
Clue 3: "The first and second letters are AB."
Clue 3 is useless because you already know it from Clues 1 and 2. In data, if you already know how "Burgers" relate to "Fries" and how "Fries" relate to "Soda," you don't need to spend your "privacy budget" (your limited allowance of noise) to tell you how "Burgers" relate to "Soda." It's already implied.
The Paper's Fix: They use an Adaptive Selection Strategy.
Instead of picking clues randomly or all at once, the system picks the most important clue, builds a fake dataset, and then checks what's missing.
If the system realizes it already knows the relationship between two items because of other clues it picked, it skips that pair. This saves the "privacy budget" for the clues that actually matter, making the final fake data much more accurate.

How It Works in Real Life (The Workflow)

The Sketch Phase: Each restaurant calculates simple stats about their own customers (e.g., "50% are men," "30% order at night"). They shrink these stats down (compression) and add a little privacy noise. They send these sketches to the central office.
The Detective Phase: The central office uses the "De-blurring Tool" to figure out which ingredients are strongly linked (e.g., "Burgers and Fries go together").
The Selection Phase: The office picks the most important links. It asks the restaurants to send the specific noisy stats for just those links.
The Cooking Phase: The office uses these selected links to cook up a new, fake recipe book. It keeps adjusting the fake book until it matches the statistics sent by the restaurants.
The Result: The fake book is released. It looks and behaves like the real combined data, but no one can trace a specific entry back to a specific customer.

Why Is This a Big Deal?

It's the First of Its Kind: Before this, most systems assumed all data was in one place (Centralized) or that everyone just added noise to their own data (Local). This is the first system designed specifically for groups of organizations (like hospitals or banks) working together without sharing raw data.
It Handles "Messy" Data: Real-world data is rarely perfect. One hospital might have mostly elderly patients, while another has mostly kids. This system handles that "heterogeneity" (difference) very well, ensuring the final fake data isn't biased toward just one group.
It Works: The experiments showed that even with all the extra noise from having multiple parties, the fake data was almost as good as if they had put all the real data in one room.

The Bottom Line

HeteroFedSyn is like a master chef who can recreate a complex, multi-restaurant menu by only tasting tiny, noisy samples from each kitchen. By using smart compression, math tricks to remove the noise, and a strategy to avoid asking for redundant information, it creates a "fake menu" that protects the secrets of every single restaurant while still being useful for analysis.

Here is a detailed technical summary of the paper "HeteroFedSyn: Differentially Private Tabular Data Synthesis for Heterogeneous Federated Settings."

1. Problem Statement

The paper addresses the challenge of generating Differentially Private (DP) synthetic tabular data in a horizontal federated learning (FL) setting.

Context: Multiple organizations (participants) hold disjoint subsets of data with the same attributes (e.g., hospitals with different patient records) but wish to collaborate to create a global synthetic dataset for downstream tasks (e.g., machine learning, range queries) without sharing raw data.
Limitations of Existing Approaches:
- Centralized DP Synthesis: Assumes all data is on a single server, which is often impractical due to privacy regulations.
- Local DP (LDP): Requires users to perturb individual records before sharing. This introduces excessive noise (scaling quadratically with dataset size) and fails to capture global correlations effectively.
- Naïve Federated Synthesis: Simply running local synthesis and merging results leads to biased mixtures because data distributions are heterogeneous across participants.
Core Challenge: How to collaboratively select the most informative statistical features (marginals) and synthesize a global dataset while preserving privacy, minimizing noise accumulation, and handling heterogeneous data distributions without direct access to raw data.

2. Methodology: HeteroFedSyn

The authors propose HeteroFedSyn, a framework built upon the PrivSyn paradigm (which uses 2-way marginals for synthesis) but adapted for distributed, heterogeneous environments. The framework consists of four main building blocks (A to D), together with a privacy budget allocation scheme (E):

A. Marginal Sharing with Random Projection

Process: Participants compute local 1-way and 2-way marginals.
Compression: To reduce communication overhead and noise sensitivity, 2-way marginals (which have high dimensionality $d_a \times d_b$ ) are compressed into a lower-dimensional vector of size $k$ using a random projection matrix ( $P_{a,b}$ ).
Noise Addition: Gaussian noise is added to the compressed marginals and 1-way marginals before transmission to the server.
Aggregation: The server aggregates these noisy, compressed marginals proportionally to participant dataset sizes to estimate the global distribution.
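
The four steps above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the Gaussian projection with entries drawn from $N(0, 1/k)$, the shared seed, the noise scale `sigma`, and all function names are assumptions made for the example.

```python
import math
import random

def projection_matrix(k, d, seed):
    """Shared k x d random projection (assumed Gaussian, entries ~ N(0, 1/k))."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0 / math.sqrt(k)) for _ in range(d)] for _ in range(k)]

def compress(marginal, proj):
    """Flatten a d_a x d_b marginal and project it down to k dimensions."""
    flat = [v for row in marginal for v in row]
    return [sum(p * v for p, v in zip(row, flat)) for row in proj]

def add_gaussian_noise(vec, sigma, rng):
    """Perturb each coordinate with independent Gaussian noise."""
    return [v + rng.gauss(0.0, sigma) for v in vec]

def aggregate(sketches, sizes):
    """Average the party sketches, weighted by each party's dataset size."""
    total = sum(sizes)
    k = len(sketches[0])
    return [sum(n * s[i] for s, n in zip(sketches, sizes)) / total for i in range(k)]

# Two parties with heterogeneous 2 x 3 marginals (hypothetical numbers).
proj = projection_matrix(k=4, d=6, seed=7)  # every party must use the same matrix
m1 = [[0.30, 0.10, 0.10], [0.10, 0.20, 0.20]]  # party 1, 100 records
m2 = [[0.05, 0.05, 0.40], [0.40, 0.05, 0.05]]  # party 2, 300 records
sizes = [100, 300]

rng = random.Random(0)
noisy = [add_gaussian_noise(compress(m, proj), sigma=0.01, rng=rng) for m in (m1, m2)]
global_sketch = aggregate(noisy, sizes)  # server-side estimate, length k
```

Because projection is linear, the size-weighted average of the compressed marginals equals the compression of the pooled marginal (before noise), which is why the server can aggregate sketches without ever seeing the uncompressed tables.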

B. Dependency Measurement (Unbiased Estimation)

Goal: Identify which attribute pairs have strong correlations (dependencies) to prioritize them for synthesis.
Metric: The framework uses an $\ell_2$-based dependency metric, InDif2, defined as the $\ell_2$ distance between the actual 2-way marginal and the marginal expected under independence, $\lVert M_{a,b} - M_a \times M_b \rVert_2$.
Challenge: Directly computing this on noisy, compressed data yields biased results due to noise interaction.
Solution: The authors derive a rigorous unbiased estimator for the squared InDif2 score. This estimator mathematically cancels out the noise terms introduced by the Gaussian mechanism and random projection, allowing the server to accurately estimate dependencies using only the noisy, compressed data.
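
To see why a correction is needed at all, consider a deliberately simplified case: Gaussian noise with known scale `sigma` is added directly to the difference vector $M_{a,b} - M_a \times M_b$, with no projection and no noise in the 1-way marginals. Squaring a noisy vector inflates its expected squared norm by $k\sigma^2$ for a length-$k$ vector, so subtracting that term removes the bias. The paper's estimator handles more terms than this; the sketch below (function names are hypothetical) only illustrates the principle.

```python
import random

def indif_squared(joint, row_marg, col_marg):
    """Squared l2 gap between a 2-way marginal and the independence product."""
    return sum((joint[i][j] - row_marg[i] * col_marg[j]) ** 2
               for i in range(len(row_marg)) for j in range(len(col_marg)))

def debiased_sq_norm(noisy_vec, sigma):
    """Unbiased estimate of ||x||^2 when noisy_vec = x + N(0, sigma^2 I)."""
    return sum(v * v for v in noisy_vec) - len(noisy_vec) * sigma ** 2

# Hypothetical 2 x 2 marginal with clear dependence between the attributes.
joint = [[0.35, 0.15], [0.15, 0.35]]
row, col = [0.5, 0.5], [0.5, 0.5]
true_score = indif_squared(joint, row, col)  # 4 * 0.1^2 = 0.04 up to rounding

sigma, trials = 0.1, 20000
rng = random.Random(0)
diff = [joint[i][j] - row[i] * col[j] for i in range(2) for j in range(2)]
naive_sum = debiased_sum = 0.0
for _ in range(trials):
    noisy = [d + rng.gauss(0.0, sigma) for d in diff]
    naive_sum += sum(v * v for v in noisy)          # biased upward by k * sigma^2
    debiased_sum += debiased_sq_norm(noisy, sigma)  # bias term subtracted
naive_mean = naive_sum / trials
debiased_mean = debiased_sum / trials
```

Averaged over many trials, `debiased_mean` lands close to `true_score`, while `naive_mean` overshoots it by roughly $k\sigma^2$.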

C. Marginal Selection Strategies

The framework offers two algorithms for selecting which 2-way marginals to release:

FedPrivSyn (Non-Adaptive): Uses a greedy selection strategy similar to centralized PrivSyn. It selects marginals based on initial dependency scores.
AdaFedPrivSyn (Adaptive): Introduces a dynamic selection mechanism.
- Logic: Once two marginals that share an attribute have been selected (e.g., $A, B$ and $B, C$), they may already constrain the correlation of a third pair ( $A, C$ ), making its selection redundant.
- Mechanism: After selecting a marginal and synthesizing a partial dataset, the algorithm updates the dependency scores of remaining marginals based on the current synthetic data. This avoids redundancy and maximizes the utility of the limited privacy budget.
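
The select, synthesize, re-score loop can be illustrated with a noise-free toy example. Everything here is an assumption made for illustration: three binary attributes, an explicit joint distribution fitted by iterative proportional fitting as a stand-in for record-level synthesis, an $\ell_2$ gap as the score, and a stopping threshold. It is not the AdaFedPrivSyn algorithm itself.

```python
import itertools
import math

ATTRS = ("A", "B", "C")
CELLS = list(itertools.product((0, 1), repeat=3))

def pair_marginal(joint, pair):
    """2-way marginal of `pair` under a joint distribution over CELLS."""
    i, j = ATTRS.index(pair[0]), ATTRS.index(pair[1])
    out = {}
    for cell, p in joint.items():
        key = (cell[i], cell[j])
        out[key] = out.get(key, 0.0) + p
    return out

def gap(joint, pair, target):
    """l2 distance between the model's marginal and the target marginal."""
    m = pair_marginal(joint, pair)
    return math.sqrt(sum((m[k] - target[k]) ** 2 for k in target))

def fit(selected, targets, sweeps=20):
    """Stand-in for synthesis: iterative proportional fitting from a uniform start."""
    joint = {c: 1.0 / len(CELLS) for c in CELLS}
    for _ in range(sweeps):
        for pair in selected:
            i, j = ATTRS.index(pair[0]), ATTRS.index(pair[1])
            m = pair_marginal(joint, pair)
            for c in CELLS:
                key = (c[i], c[j])
                if m[key] > 0:
                    joint[c] *= targets[pair][key] / m[key]
    return joint

def adaptive_select(targets, max_picks, threshold=1e-6):
    selected = []
    joint = fit(selected, targets)
    while len(selected) < max_picks:
        scores = {p: gap(joint, p, t) for p, t in targets.items() if p not in selected}
        if not scores:
            break
        best = max(scores, key=scores.get)
        if scores[best] < threshold:
            break  # every remaining pair is already implied by the current model
        selected.append(best)
        joint = fit(selected, targets)  # re-synthesize, then re-score next round
    return selected

# Chain A -> B -> C: A and C are correlated only through B.
true_joint = {(a, b, c): 0.5 * (0.9 if b == a else 0.1) * (0.9 if c == b else 0.1)
              for a, b, c in CELLS}
pairs = [("A", "B"), ("B", "C"), ("A", "C")]
targets = {p: pair_marginal(true_joint, p) for p in pairs}
picked = adaptive_select(targets, max_picks=3)
```

All three pairs look dependent when scored against the initial independent model, so a one-shot ranking would spend budget on each of them. The adaptive loop stops after `("A", "B")` and `("B", "C")`, because the refitted model already reproduces the `("A", "C")` marginal.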

D. Data Synthesis

The server uses the selected noisy marginals to generate the synthetic dataset.
It employs the GUM (Gradually Update Method) algorithm from PrivSyn, which iteratively duplicates and replaces values in a randomly initialized dataset to match the target noisy marginals.
Attributes not covered by selected 2-way marginals are handled using their 1-way marginals.
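
A simplified, GUM-style update for a single 2-way marginal might look like the sketch below. It is an illustration under stated assumptions rather than PrivSyn's actual procedure: targets are given as integer cell counts that cover every cell and sum to the number of records, a fixed step size `alpha` is used, and the function name `gum_like_step` is hypothetical.

```python
import math
import random
from collections import Counter

def gum_like_step(records, i, j, target_counts, alpha, rng):
    """Move a fraction alpha of the misplaced records toward target_counts.

    records: list of lists; target_counts: {(value_i, value_j): integer count}.
    Returns the number of records changed in this step.
    """
    counts = Counter((r[i], r[j]) for r in records)
    excess, deficit = [], []
    for cell, want in target_counts.items():
        diff = counts.get(cell, 0) - want
        (excess if diff > 0 else deficit).extend([cell] * abs(diff))
    rng.shuffle(excess)
    rng.shuffle(deficit)
    moves = math.ceil(alpha * min(len(excess), len(deficit)))
    for src, dst in zip(excess[:moves], deficit[:moves]):
        # Pick a record from an over-represented cell to overwrite.
        victim = rng.choice([t for t, r in enumerate(records) if (r[i], r[j]) == src])
        pool = [r for r in records if (r[i], r[j]) == dst]
        if pool:
            records[victim] = list(rng.choice(pool))  # duplicate an existing record
        else:
            records[victim][i], records[victim][j] = dst  # replace the two values
    return moves

# Randomly initialized dataset with three attributes (hypothetical sizes).
rng = random.Random(1)
records = [[rng.randint(0, 1), rng.randint(0, 1), rng.randint(0, 2)] for _ in range(200)]
target = {(0, 0): 90, (0, 1): 10, (1, 0): 10, (1, 1): 90}

steps = 0
while gum_like_step(records, 0, 1, target, alpha=0.5, rng=rng) > 0:
    steps += 1
```

Duplicating a whole record from the under-represented cell keeps that record's values on the other attributes intact, whereas replacing only the two values is the fallback when the cell is still empty. With several marginals, a step like this would be applied to each marginal in turn and repeated.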

E. Privacy Budget Allocation

The total privacy budget ( $\rho$ ) is split into three parts:
1. Initial Sharing: A small fraction ( $q$ ) for sharing all 1-way and compressed 2-way marginals for dependency measurement.
2. Selection: No budget consumed (server-side computation).
3. Final Release: The majority of the budget ($1 - 2q$) is reserved for releasing the selected informative marginals to ensure high utility.
The paper empirically suggests $q < 1/3$ to ensure the majority of the budget is used for the final, high-value marginals.
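
As a rough numerical illustration, the sketch below splits a budget and converts a share into a Gaussian noise scale. Two things are assumptions made here, not statements from the text: that $\rho$ is a zero-concentrated DP budget (for which the Gaussian mechanism with sensitivity $\Delta$ satisfies $\rho = \Delta^2 / (2\sigma^2)$), and that the pre-selection stage uses one share of $q$ for the 1-way marginals and another for the compressed 2-way marginals, so that the shares add up to the whole budget alongside the $1 - 2q$ final release.

```python
import math

def split_budget(rho_total, q):
    """Split a zCDP budget: 2q of it before selection, 1 - 2q for the final release.

    The equal q / q split of the initial stage is an assumption of this sketch.
    """
    if not 0 < q < 1 / 3:
        raise ValueError("q is expected to lie in (0, 1/3)")
    return {
        "one_way": q * rho_total,
        "compressed_two_way": q * rho_total,
        "final_release": (1 - 2 * q) * rho_total,
    }

def gaussian_sigma(sensitivity, rho):
    """Noise scale of the Gaussian mechanism under rho-zCDP."""
    return sensitivity / math.sqrt(2 * rho)

shares = split_budget(rho_total=0.5, q=0.1)
# Spreading the final share evenly over m selected marginals is a simplification.
m = 8
sigma_final = gaussian_sigma(sensitivity=1.0, rho=shares["final_release"] / m)
sigma_initial = gaussian_sigma(sensitivity=1.0, rho=shares["one_way"])
```

With $q < 1/3$ the final-release share is larger than either initial share, which matches the stated goal of spending most of the budget on the selected marginals.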

3. Key Contributions

First Framework for Heterogeneous FL: HeteroFedSyn is the first DP tabular data synthesis framework specifically designed for horizontal federated settings with heterogeneous data distributions.
Noise-Efficient Dependency Measurement:
- Introduced an $\ell_2$ -based dependency metric combined with random projection to reduce communication and noise.
- Developed an unbiased estimator that corrects for the bias introduced by the added noise and the compression, enabling accurate dependency scoring from noisy data.
Adaptive Marginal Selection: Proposed AdaFedPrivSyn, which dynamically updates dependency scores during the selection process to eliminate redundancy and optimize privacy budget usage.
Comprehensive Evaluation: Validated the approach across range queries, Wasserstein fidelity, and machine learning tasks (Random Forest, MLP, XGBoost).

4. Experimental Results

The authors evaluated HeteroFedSyn on five real-world datasets (Adult, Abalone, Obesity, Insurance, Shoppers) against baselines like centralized PrivSyn and naïve federated approaches.

Utility vs. Noise: Despite the inherent noise accumulation in federated settings (with $c$ participants, $O(c)$ times higher than in the centralized setting), HeteroFedSyn achieved utility comparable to centralized PrivSyn. The error rates remained within the same order of magnitude rather than degrading proportionally to the noise.
Performance Metrics:
- Range Queries & Fidelity: AdaFedPrivSyn consistently outperformed other methods, particularly on datasets with many attributes (e.g., Adult, Shoppers), demonstrating the value of adaptive selection.
- Machine Learning: Models trained on the synthetic data (RF, MLP, XGBoost) achieved performance close to that of models trained on the raw data.
Robustness: The framework remained effective even with:
- Varying numbers of participants ( $c=5$ to $25$).
- Heterogeneous data distributions (biased vs. uniform).
- Different privacy budget allocations (though tighter budgets benefit from allocating more to the final release phase).
Parameter Sensitivity: Random projection dimension $k=10$ was found to offer the best trade-off between compression error and noise reduction.

5. Significance

Bridging the Gap: This work bridges the gap between theoretical DP synthesis and practical federated deployment, addressing the critical issue of data heterogeneity which previous methods ignored.
Efficiency: By using random projection and adaptive selection, it significantly reduces communication costs and privacy budget waste, making DP synthesis feasible for large-scale, multi-party collaborations.
Practical Impact: It enables organizations (e.g., hospitals, banks) to share high-quality synthetic data for research and AI training without compromising individual privacy or requiring a trusted central data repository.
Future Direction: The paper highlights that while algorithmic optimizations help, future work may need to leverage public knowledge to further reduce privacy costs in distributed high-dimensional settings.

In summary, HeteroFedSyn provides a robust, efficient, and privacy-preserving solution for generating synthetic tabular data in federated environments, proving that high-utility data sharing is possible even under strict differential privacy constraints and heterogeneous data distributions.