Data Fusion with Distributional Equivalence Test-then-pool

This paper proposes a novel test-then-pool framework that leverages kernel-based distributional equivalence testing and resampling methods to safely fuse historical and concurrent control data in randomized controlled trials, thereby improving statistical power while rigorously controlling Type-I error rates.

Linying Yang, Xing Liu, Robin J. Evans

Published Fri, 13 Ma

Imagine you are a doctor trying to figure out if a new medicine works. The gold standard is a Randomized Controlled Trial (RCT): you give the medicine to one group of people (the Treatment Group) and a sugar pill to another (the Control Group), then compare the results.

But here's the problem: finding people to take the sugar pill is hard, expensive, and sometimes unethical. You might only have 50 people in your control group, but 200 in the treatment group. This makes your results "wobbly" and less reliable.

The Temptation: You look at your computer and see data from a previous trial where 100 people took the same sugar pill. "Why not just mix them together?" you think. "That gives me 150 control people! My results will be much stronger!"

The Danger: But wait. The people in the old trial might be different. Maybe they were older, lived in a different country, or were measured with different tools. If you blindly mix them, you might introduce bias. It's like trying to compare the speed of a Ferrari to a bicycle, but then adding a picture of a horse to the "bicycle" group. You'll get a confusing, wrong answer.

The Old Solution: "Test, then Pool"

Scientists have tried to solve this with a method called Test-then-Pool (TTP).

  1. Test: They check if the old data and new data look "similar."
  2. Pool: If they look similar, they mix them. If not, they keep them separate.

The Flaw: The old way of testing was too simple. It mostly checked if the average results were the same. But two groups can have the same average but very different shapes.

  • Analogy: Imagine two classes of students. Class A has scores of 50, 50, 50, 50, 50. Class B has scores of 0, 0, 0, 0, 250. Both have an average of 50. If you only check the average, you think they are the same. But Class B is wild and unpredictable, while Class A is steady. Mixing them would ruin your analysis.
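You can check the two-classes arithmetic directly: the averages match, but the spread gives the game away. A mean-only test sees nothing; a spread (or full-shape) check does:

```python
import statistics

class_a = [50, 50, 50, 50, 50]
class_b = [0, 0, 0, 0, 250]

# Both classes have exactly the same average...
print(statistics.mean(class_a))    # 50
print(statistics.mean(class_b))    # 50

# ...but wildly different spread, which a mean-only test would miss.
print(statistics.pstdev(class_a))  # 0.0
print(statistics.pstdev(class_b))  # 100.0
```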

The New Solution: "Distributional Equivalence Test-then-Pool"

The authors of this paper (Yang, Liu, and Evans) invented a smarter, more rigorous way to decide whether to mix the data. Think of it as a High-Tech Data Matchmaker.

Here is how their new method works, step-by-step:

1. The "Full-Body Scan" (Distributional Testing)

Instead of just checking the average (the "head"), they scan the entire body of the data. They use a mathematical tool called MMD (Maximum Mean Discrepancy).

  • Analogy: Imagine you are trying to match two fingerprints. The old method just checked if the ridges were the same height. The new method looks at the entire pattern, the swirls, the loops, and the tiny details. It asks: "Is the whole shape of this group of people the same as that group?"
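A minimal sketch of the standard (biased) MMD estimator with a Gaussian kernel shows the "full-body scan" idea: when two samples have the same shape, the estimate sits near zero; when the shapes differ, it grows. The kernel and bandwidth choices here are illustrative, not necessarily the paper's:

```python
import numpy as np

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel evaluated pairwise between two 1-D samples."""
    diff = x[:, None] - y[None, :]
    return np.exp(-diff**2 / (2 * bandwidth**2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2 * k_xy

rng = np.random.default_rng(0)
same = mmd_squared(rng.normal(0, 1, 200), rng.normal(0, 1, 200))
shifted = mmd_squared(rng.normal(0, 1, 200), rng.normal(1, 1, 200))
print(same < shifted)  # True — MMD is larger when the "shapes" differ
```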

2. The "Equivalence" Test (The Safety Margin)

This is the cleverest part. The old method asked: "Are they exactly identical?" (Which is impossible in real life).
The new method asks: "Are they close enough to be considered twins?"

  • They set a tolerance radius (let's call it θ).
  • If the difference between the old and new groups is smaller than this radius, they say, "Okay, these are close enough. We can mix them."
  • If the difference is larger, they say, "No way, they are too different. Keep them separate."
  • Why this matters: This keeps the "Type-I Error" (false alarms) under control. It guarantees that even if you mix the groups, you haven't introduced a bias large enough to make the medicine look effective when it isn't.
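The decision rule can be sketched as follows. Note the equivalence-test direction: you only pool when the estimated discrepancy is convincingly *below* the tolerance radius θ. The `margin` argument here is a hypothetical stand-in for the calibrated critical value, which the authors obtain via resampling, not a fixed number:

```python
def decide_pooling(mmd_estimate, theta, margin):
    """Equivalence-style decision: pool only when the discrepancy is
    clearly INSIDE the tolerance radius theta. (Illustrative interface;
    'margin' stands in for a resampling-calibrated critical value.)"""
    return "pool" if mmd_estimate + margin < theta else "keep separate"

print(decide_pooling(0.02, theta=0.10, margin=0.03))  # pool
print(decide_pooling(0.12, theta=0.10, margin=0.03))  # keep separate
```

Flipping the burden of proof this way is what controls false alarms: the default is "keep separate", and the data must demonstrate closeness to earn pooling.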

3. The "Partial" Safety Net (Bootstrap & Permutation)

Once they decide to mix the data, they need to run the final test to see if the medicine works. But because the groups were only "close enough" (not identical), standard math tricks don't work perfectly.

  • Analogy: Imagine you are weighing a package. Usually, you put it on a scale. But if the scale is slightly wobbly (because you mixed two slightly different groups), you can't trust the reading.
  • The Fix: The authors invented "Partial Bootstrap" and "Partial Permutation."
    • Imagine you have a bag of marbles. To check if your scale is accurate, you take out some marbles, weigh them, put them back, and do it 1,000 times to see how much the weight usually wobbles.
    • Their "Partial" method is smart: it simulates the wobble exactly as it would happen with the mixed groups, ensuring the final result is statistically valid, even if the groups weren't perfect twins.

Why This is a Big Deal

  1. It's Safer: It rigorously controls the risk of making a wrong conclusion (Type-I error).
  2. It's Smarter: It catches differences that simple averages miss (like the wild vs. steady student example).
  3. It's Powerful: By safely using historical data, researchers can run smaller, cheaper, and faster trials without sacrificing accuracy.

The Real-World Test

The authors tested this on the Prospera program in Mexico (a famous study on cash transfers for school attendance).

  • They took a small slice of the current data and tried to mix it with old data.
  • Result: Their new method found the program worked much more clearly (higher power) than the old methods, while still keeping the error rate low. It proved that you can safely "borrow" from the past to understand the future, as long as you use the right "matchmaking" rules.

In a nutshell: This paper gives scientists a new, super-secure way to combine old and new data. It's like having a strict but fair referee who says, "You can use the old team's stats, but only if they are truly similar enough, and we'll double-check the math to make sure no one cheats."