CREB: Consistent Reference External Batch Harmonization

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to teach a computer to recognize patterns in the human brain using MRI scans. You gather data from 28 different hospitals around the world. The problem? Each hospital uses a different MRI machine, different software, and different settings. It's like asking 28 different chefs to bake a "perfect cake" using different ovens, different brands of flour, and different measuring cups. The result is that the cakes (the brain data) look and taste slightly different, not because the ingredients (the actual brain biology) are different, but because of the kitchen equipment (the "site" effects).

If you try to train your computer on all these different cakes at once, it gets confused. It might learn that "blue frosting" means "happy brain" just because one specific chef always used blue frosting, not because it's actually true.

This is the problem the paper solves. Here is the breakdown of their solution, CREB, using simple analogies:

The Old Way: The "Big Pot" Problem

Traditionally, scientists used a method called ComBat to fix these differences. Imagine you have a giant pot of soup where you dump in all the data from every hospital (training, testing, and future data). You stir it all together to make the flavors consistent.

The Flaw: In machine learning, you aren't supposed to taste the "test" soup before you decide if your recipe works. If you mix the test data in with the training data to fix the flavors, you are cheating. You are letting the computer peek at the answers before the exam. This is called data leakage. It makes the computer look smarter than it really is. Plus, if a new hospital sends you data tomorrow, you can't use the old method because you'd have to dump their new soup into the giant pot and stir everything again, which is messy and requires you to share all your private training data.

The New Way: CREB (The "Master Recipe Card")

The authors created a new method called CREB (Consistent Reference External Batch Harmonization). Think of this as creating a Master Recipe Card (or a "Bundle") that fits in your pocket.

Here is how the two-step process works:

Step 1: CREB Learn (Writing the Recipe)

First, the scientists take only their training data (the data from the 28 hospitals they are using to teach the computer). They analyze it to figure out exactly how much "flavor distortion" each hospital adds.

They calculate the average "noise" for each hospital.
They write these numbers down on a tiny, lightweight digital card (a "bundle" that is only about 13MB—smaller than a single high-res photo).
Crucially: They throw away the actual brain data. They only keep the recipe for how to fix the differences.

Step 2: CREB Apply (Cooking the New Dish)

Now, imagine a new hospital sends you data from a patient they just scanned. You don't need to send them your training data, and you don't need to mix their data with yours.

You take your tiny "Master Recipe Card."
You look at the new data and say, "Ah, this hospital's machine adds a little too much salt."
You use the recipe to subtract that extra salt and adjust the flavor.
The new data is now aligned with your original training data, ready for the computer to analyze.

Why is this a Big Deal?

No Cheating (No Data Leakage): Because you never mix the test data with the training data to fix the flavors, the computer's exam results are honest. It proves the model actually learned the biology, not the quirks of the MRI machines.
Easy to Share: You can send the "Master Recipe Card" (the bundle) to anyone. It's tiny and doesn't contain any private patient data. They can use it to fix their own data instantly.
Future-Proof: If a brand new hospital joins the network next year, you don't need to retrain your whole system. You just use the same Master Recipe Card to fix their data.

Did it Work?

The authors tested this against the old method (NeuroHarmonize).

The Result: The new method (CREB) cleaned up the data just as well as the old method.
The Bonus: It kept the important biological signals intact. For example, the computer could still correctly see that "older brains have less gray matter" after the data was cleaned. It didn't accidentally scrub away the real science while cleaning up the noise.

The Bottom Line

CREB is like a universal translator for brain scans. It allows scientists to train AI on data from many different places without cheating, and then easily apply that AI to new patients from new places, all without ever needing to share the original private data. It makes the science of brain imaging more accurate, fair, and ready for the real world.

1. Problem Statement

Machine learning models in neuroimaging (specifically fMRI and structural MRI) increasingly rely on large, multi-site datasets to improve generalizability. However, data collected across different sites suffer from site effects (non-biological variability caused by different scanners, field strengths, acquisition protocols, and manufacturers).

The Challenge of Harmonization: Standard harmonization tools like ComBat and NeuroHarmonize require all data (training, validation, and test sets) to be available simultaneously to estimate site effects.
Data Leakage: In machine learning workflows, harmonizing training and test data together introduces data leakage, where information from the test set inadvertently influences the training process. This artificially inflates model performance and compromises the validity of downstream analyses.
Deployment Limitation: Traditional methods cannot harmonize new, unseen data (external test sets) without re-accessing the original training data, which is often impossible due to privacy constraints, data size, or sharing limitations.
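
The leakage problem can be illustrated with a minimal sketch. It uses plain per-feature standardization as a stand-in for harmonization (not ComBat itself), and the function names `fit_reference` and `apply_reference` are hypothetical. The point is only that parameters estimated from training and test data together differ from parameters estimated from training data alone.

```python
import numpy as np

def fit_reference(train):
    """Estimate per-feature reference parameters from the given data only."""
    return {"mean": train.mean(axis=0), "std": train.std(axis=0, ddof=1)}

def apply_reference(data, ref):
    """Standardize any dataset with previously learned parameters."""
    return (data - ref["mean"]) / ref["std"]

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(200, 3))
test = rng.normal(0.5, 1.0, size=(50, 3))  # held-out set with a shift

# Leaky: the held-out set helps shape the parameters.
leaky_ref = fit_reference(np.vstack([train, test]))
# Leak-free: the held-out set is never seen during fitting.
clean_ref = fit_reference(train)

# The held-out set is transformed with training-only parameters.
test_clean = apply_reference(test, clean_ref)
```

Separating "fit" from "apply" in this way is the same pattern that CREB follows with its Learn and Apply stages.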

2. Methodology: CREB (Consistent Reference External Batch)

The authors propose CREB, a novel two-stage extension of the ComBat algorithm based on the Empirical Bayes framework. Unlike traditional ComBat, which estimates priors across all sites simultaneously, CREB decouples the learning of site priors from the application of harmonization.

Core Workflow

The process consists of two distinct stages:

Stage 1: CREB Learn (Bundle Generation)

Input: A designated training dataset containing multiple sites.
Process:
1. A regression model is fitted using only biological covariates (e.g., age, sex) and an intercept. Crucially, site indicators are NOT included in the design matrix.
2. Residuals are calculated by removing the biological signal from the raw data.
3. Sufficient statistics (sample mean, variance, and sample size) of these residuals are computed for each site and feature.
4. Prior Estimation: Global empirical Bayes priors (Normal-Inverse-Gamma distributions for additive and multiplicative effects) are estimated across all training sites and features.
Output: A lightweight, distributable "bundle" (approx. 13MB) containing the regression coefficients, pooled variance, and the estimated prior distributions. This bundle serves as a consistent reference point.
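
The Stage 1 steps above can be sketched roughly as follows. This is not the authors' implementation: the function name `creb_learn_sketch`, the dictionary keys, the choice to standardize residuals before computing site statistics, and the method-of-moments prior estimates (borrowed from standard ComBat practice) are all assumptions made for illustration.

```python
import numpy as np

def creb_learn_sketch(Y, covars, site):
    """Hypothetical Stage 1 sketch: returns a small dict with no subject-level data.

    Y:      (n_subjects, n_features) training features
    covars: (n_subjects, n_covariates) biological covariates (e.g. age, sex)
    site:   (n_subjects,) site labels
    """
    site = np.asarray(site)
    n = Y.shape[0]
    # Design matrix: intercept plus biological covariates, no site indicators.
    X = np.column_stack([np.ones(n), covars])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta
    # Pooled residual variance per feature (assumed to be used for standardization).
    pooled_var = resid.var(axis=0, ddof=X.shape[1])
    Z = resid / np.sqrt(pooled_var)

    # Per-site, per-feature statistics of the standardized residuals.
    sites = np.unique(site)
    gamma_hat = np.array([Z[site == s].mean(axis=0) for s in sites])
    delta2_hat = np.array([Z[site == s].var(axis=0, ddof=1) for s in sites])

    # Global priors by method of moments, pooled over sites and features.
    gamma_bar, tau2 = gamma_hat.mean(), gamma_hat.var(ddof=1)
    m, v = delta2_hat.mean(), delta2_hat.var(ddof=1)
    lam = m**2 / v + 2.0     # inverse-gamma shape
    theta = m * (lam - 1.0)  # inverse-gamma scale
    return {"beta": beta, "pooled_var": pooled_var,
            "gamma_bar": gamma_bar, "tau2": tau2, "lam": lam, "theta": theta}
```

Note that the returned dictionary holds only regression coefficients, a per-feature variance, and four prior hyperparameters, which is why a bundle of this kind can stay small and free of subject-level data.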

Stage 2: CREB Apply (External Harmonization)

Input: New, unseen external data (test sets) and the "bundle" from Stage 1.
Process:
1. The external data is residualized and standardized using the regression coefficients and pooled variance stored in the bundle.
2. For each new site in the external data, the posterior distribution of site effects is updated using the priors learned from the training set.
3. The site effects (additive and multiplicative) are estimated and removed from the data.
Key Innovation: This allows new data to be harmonized to the training distribution without access to the original training data, and without the test data ever influencing the parameters learned in Stage 1, which prevents data leakage by construction.
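
The Stage 2 steps above can be sketched for a single new site as follows. This is an illustration under stated assumptions, not the authors' code: `creb_apply_sketch` and the bundle keys (`beta`, `pooled_var`, `gamma_bar`, `tau2`, `lam`, `theta`) are hypothetical names, and the alternating update is the one used in standard ComBat, applied here with fixed priors read from the bundle.

```python
import numpy as np

def creb_apply_sketch(Y_new, covars_new, bundle, n_iter=50, tol=1e-6):
    """Hypothetical Stage 2 sketch for ONE new site; uses only the bundle."""
    n = Y_new.shape[0]
    X = np.column_stack([np.ones(n), covars_new])
    fitted = X @ bundle["beta"]
    sd = np.sqrt(bundle["pooled_var"])
    # Residualize and standardize with the stored training parameters.
    Z = (Y_new - fitted) / sd

    g_bar, tau2 = bundle["gamma_bar"], bundle["tau2"]
    lam, theta = bundle["lam"], bundle["theta"]
    gamma_hat = Z.mean(axis=0)
    gamma = gamma_hat.copy()
    delta2 = Z.var(axis=0, ddof=1)
    # Alternate posterior-mean updates for the additive (gamma) and
    # multiplicative (delta2) site effects until they stop changing.
    for _ in range(n_iter):
        gamma_next = (n * tau2 * gamma_hat + delta2 * g_bar) / (n * tau2 + delta2)
        ss = ((Z - gamma_next) ** 2).sum(axis=0)
        delta2_next = (theta + 0.5 * ss) / (n / 2.0 + lam - 1.0)
        change = max(np.abs(gamma_next - gamma).max(),
                     np.abs(delta2_next - delta2).max())
        gamma, delta2 = gamma_next, delta2_next
        if change < tol:
            break
    # Remove site effects, then restore the training scale and covariate signal.
    return (Z - gamma) / np.sqrt(delta2) * sd + fitted
```

Nothing in this function touches the training subjects: every quantity it needs beyond the new site's own data comes from the bundle.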

Implementation Details

Updates: The method supports both closed-form joint updates (estimating mean and variance simultaneously) and iterative updates (alternating estimation until convergence).
Data Types: Validated on both functional connectivity (fMRI) and structural gray matter volume (T1-weighted MRI).
Software: Implemented in Python and publicly available on GitHub.
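
For the closed-form joint update mentioned above, one standard form is the textbook conjugate update for a Normal-Inverse-Gamma prior. The symbols here are assumptions introduced for illustration ($\mu_0, \kappa_0, \alpha_0, \beta_0$ for the prior; $z_1, \dots, z_n$ for one feature's standardized residuals at a new site, with sample mean $\bar{z}$), and the preprint's exact parametrization may differ:

$$
\kappa_n = \kappa_0 + n, \qquad
\mu_n = \frac{\kappa_0 \mu_0 + n \bar{z}}{\kappa_0 + n}, \qquad
\alpha_n = \alpha_0 + \frac{n}{2}, \qquad
\beta_n = \beta_0 + \frac{1}{2} \sum_{j=1}^{n} (z_j - \bar{z})^2 + \frac{\kappa_0 \, n \, (\bar{z} - \mu_0)^2}{2(\kappa_0 + n)}
$$

Under this form, the posterior mean of the additive site effect is $\mu_n$ and the posterior mean of the multiplicative (variance) effect is $\beta_n / (\alpha_n - 1)$ for $\alpha_n > 1$. Both are obtained in a single pass, whereas the iterative variant alternates between the two estimates until they converge.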

3. Key Contributions

Prevention of Data Leakage: The authors describe CREB as the first harmonization method designed specifically for machine learning pipelines that allows training and testing sets to be harmonized independently, eliminating the risk of test data influencing the training process.
Deployability: The creation of a small, shareable "bundle" (approximately 13MB) enables the harmonization of new, unseen data in real-world deployment scenarios without requiring the redistribution of massive training datasets.
Biological Signal Preservation: The method rigorously preserves biological associations (e.g., age-related changes in connectivity and gray matter volume) while removing site-specific noise.
Generalizability: Demonstrated effectiveness across diverse datasets spanning the adult lifespan (ages 18–97) and multiple imaging modalities.

4. Results

The authors evaluated CREB using data from 2,846 participants (training) across 9 studies and 1,113 participants (testing) across 3 studies.

Similarity to Standard Methods:
- CREB outputs were highly similar to NeuroHarmonize (the gold standard that uses joint harmonization).
- Euclidean Distance: Average distance between CREB and NeuroHarmonize outputs was 2.6.
- Mean Absolute Error (MAE): Average MAE was 0.019, with a maximum difference of ~0.08.
Site Effect Removal:
- Raw Data: Significant site differences existed (e.g., $p < 0.001$ for mPFC-PCC connectivity).
- Post-Harmonization: Both NeuroHarmonize and CREB removed nearly all significant site effects.
  - NeuroHarmonize: 3 edges remained significantly different.
  - CREB: 0 edges remained significantly different.
Biological Signal Preservation:
- Functional Connectivity: Linear regression of connectivity vs. age showed that $R^2$ values were preserved. For example, the share of variance in specific edge connectivity explained by age stayed at similar levels ( $R^2 \approx 0.13$ to $0.18$ ) after CREB harmonization, comparable to raw and NeuroHarmonized data.
- Gray Matter Volume: CREB preserved the negative correlation between age and total gray matter volume ( $R^2 = 0.41$ ), comparable to NeuroHarmonize ( $R^2 = 0.45$ ) and Raw data ( $R^2 = 0.38$ ).
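
The kinds of metrics reported above could be computed along these lines. This is a sketch with hypothetical helper names. The summary does not say how the Euclidean distance was averaged, so averaging per-subject distances is an assumption, as is the use of a simple one-predictor regression for $R^2$.

```python
import numpy as np

def mean_euclidean_distance(A, B):
    """Mean over subjects (rows) of the L2 distance between matching rows."""
    return float(np.linalg.norm(A - B, axis=1).mean())

def mean_absolute_error(A, B):
    """Mean absolute elementwise difference between two harmonized outputs."""
    return float(np.abs(A - B).mean())

def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (with intercept)."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return float(1.0 - resid.var() / y.var())
```

Here `A` and `B` would be the subjects-by-features outputs of two harmonization methods applied to the same data, while `x` and `y` would be age and a single feature such as total gray matter volume.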

5. Significance and Conclusion

CREB changes how neuroimaging data is harmonized for machine learning. By moving from a "joint estimation" model to a "learned prior" model, it addresses two critical bottlenecks: data leakage and deployment feasibility.

Impact on ML: It enables the training of robust, generalizable models on multi-site data and allows these models to be deployed on new, unseen clinical or research data without compromising data privacy or statistical integrity.
Practical Utility: The lightweight "bundle" format makes it feasible to integrate harmonization directly into machine learning pipelines, ensuring that data from new scanners or sites can be standardized to a common reference distribution instantly.
Limitations: The method assumes that the biological covariate distributions (e.g., age, sex) in the training and target datasets overlap sufficiently. It also requires generating a new bundle if the preprocessing pipeline or biological covariates change.

In summary, CREB provides a robust, leak-free, and easily deployable solution for standardizing multi-site neuroimaging data, facilitating the next generation of generalizable brain imaging machine learning models.