The Problem: The "Weather Forecast" Problem

Imagine you are a scientist trying to predict which molecules will make good medicines. You build a computer model to do this.

Now, imagine you train that model on a specific set of data. It predicts that Molecule A is a "winner" (it will work as a drug).

But then, you decide to retrain the model. You don't change the rules or the data source; you just use a slightly different random sampling of that same data (like drawing a new hand of cards from the same deck).

The Shocking Result:
When you retrain the model, it suddenly says Molecule A is a "loser" and Molecule B is the new winner.

The paper calls this "Cross-Sample Prediction Churn." It's the rate at which the model flips its decision just because you shuffled the training data slightly.

The Paper's Finding: Across 9 chemistry benchmarks, the model's overall accuracy moved only slightly between retrainings (about 1–4 percentage points). But the predicted class for individual molecules flipped 8% to 22% of the time.
The Analogy: Imagine a judge who is 95% accurate overall. But if you ask them to judge 100 specific cases, and you ask them to re-judge the same 100 cases after taking a different lunch break, they might change their verdict on 20 of them. That's a lot of instability for the specific cases that matter most.

Why Current "Fixes" Don't Work

Scientists have tried to fix this using standard "uncertainty" tools, like:

Deep Ensembles: Training 5 models on the same data, each from a different random starting point, and averaging their answers.
MC Dropout: Turning parts of the model "off" randomly during testing to see how much the answer wobbles.
Stochastic Weight Averaging: Averaging the model's weights over the course of a single training run.

The Paper's Verdict: These tools are like trying to fix a shaky camera by adjusting the lens focus (the model's internal settings) while the camera is still being held by a shaking hand (the data).

These methods fix the "lens" but ignore the "shaking hand."
The paper found these methods did not reduce the churn. They didn't stop the model from flipping its decisions when the data changed.

The Solution: Two New Methods

The authors propose two methods that actually work because they address the "shaking hand" (the data) rather than just the "lens."

1. K-Bootstrap Bagging (The "Committee" Approach)

How it works: Instead of training one model, you train a whole committee of models (e.g., 5 of them). Each member of the committee is trained on a slightly different random sample of the data. When you need an answer, you ask the whole committee and take the average vote.
The Result: This cuts the flipping rate by 40–54%.
The Catch: Training 5 models instead of 1 takes 5 times the computing power.

2. Twin-Bootstrap (The "Twin Sisters" Approach)

How it works: This is the paper's main invention. Imagine training two "twin" neural networks at the same time.
- Twin A learns from Sample X.
- Twin B learns from Sample Y (a slightly different sample).
- The Secret Sauce: Every time they learn, the twins are forced to talk to each other. If they disagree on a molecule, they get a "penalty" (a consistency loss) to force them to agree.
The Result:
- It reduces the flipping rate by a further 45% (median) compared with a two-member committee trained on the same computing budget.
- It needs only 2x the computing power of a single model (training two twins instead of five separate models).
- It keeps the accuracy just as high as the original model.

Why This Matters (The "Real World" Impact)

The paper argues that in scientific labs, decisions are made molecule-by-molecule.

The Scenario: A scientist uses the model to pick the top 10 molecules to synthesize in a lab.
The Risk: If the model has high "churn," the scientist might pick Molecule #1 today. But if they retrain the model tomorrow (which happens often in science), the model might say, "Actually, Molecule #1 is bad, let's try Molecule #10."
The Cost: This wastes time and money. The lab might synthesize the wrong molecule, or waste effort re-evaluating the same list.

The paper suggests that scientific reports should always include a "Churn Score" alongside accuracy. Just knowing a model is "90% accurate" isn't enough; you need to know whether its individual predictions hold steady or flip from one training run to the next.

Summary

The Issue: Scientific AI models often flip their specific predictions when retrained on slightly different data, even if their overall score looks good.
The Old Way: Standard uncertainty tricks (deep ensembles, MC dropout, stochastic weight averaging) don't fix this specific problem.
The New Way:
1. Bagging: Train a big committee of models (works well, but expensive).
2. Twin-Bootstrap: Train two models together and force them to agree (about as stable as a five-model committee, at the cost of training two models).
The Goal: Make scientific AI reliable enough that a scientist can trust the specific molecule it recommends, knowing the recommendation won't change just because they ran the training code one more time.

Technical Summary: Reducing Cross-Sample Prediction Churn in Scientific Machine Learning

Problem Definition: Cross-Sample Prediction Churn

Scientific machine learning (ML) benchmarks typically report aggregate predictive performance (e.g., accuracy, AUC) but fail to report the stability of individual predictions when the model is retrained on a different draw of the same training population. The authors define cross-sample prediction churn as the fraction of test predictions that change class labels between two models trained on independent bootstraps of the same training set.
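
As a rough illustration of the definition above, the following Python sketch computes churn as the fraction of positions where two lists of predicted labels disagree. The function names and the toy labels are illustrative assumptions, not code or data from the paper.

```python
def churn(preds_a, preds_b):
    """Fraction of test predictions whose class label differs between two models."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if not preds_a:
        raise ValueError("need at least one prediction")
    flips = sum(a != b for a, b in zip(preds_a, preds_b))
    return flips / len(preds_a)


def accuracy(preds, labels):
    """Fraction of predictions that match the reference labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Toy labels and predictions (illustrative only, not data from the paper).
labels  = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
model_a = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]  # wrong at indices 3 and 7
model_b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]  # wrong at indices 2 and 5

print(accuracy(model_a, labels), accuracy(model_b, labels), churn(model_a, model_b))
```

In this toy case both models get 8 of 10 test items right, yet they disagree on 4 of 10, which is the kind of gap between aggregate and per-prediction stability that the definition is meant to expose.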

While aggregate accuracy often remains stable (varying by only 1.3–4.2 percentage points across retrainings), the authors demonstrate that individual predictions are highly unstable. Across nine chemistry benchmarks, 8.0% to 21.8% of test molecules flip their predicted class between retrainings. This "per-prediction stability gap" is critical for operational workflows in closed-loop laboratories, Bayesian optimization, and virtual screening, where model outputs directly dictate experimental decisions (e.g., which molecule to synthesize). High churn implies that the specific molecules selected for synthesis or screening are sensitive to the random sampling of the training data, rendering the workflow non-reproducible.

Methodology and Proposed Solutions

The paper evaluates standard parameter-side uncertainty techniques against data-side methods to determine which can reduce this churn.

1. Failure of Parameter-Side Techniques

The authors test three standard methods that sample over model weights at fixed data:

Deep Ensembles: Averaging predictions from $K$ models with different initializations.
Monte Carlo (MC) Dropout: Averaging stochastic forward passes of a single model.
Stochastic Weight Averaging (SWA): Averaging weights from a single training trajectory.

Result: These methods do not consistently reduce cross-sample churn. Across the nine benchmarks, they shift the class-flip rate by $-22.3\%$ to $+12.5\%$ relative to Empirical Risk Minimization (ERM), with no consistent sign of improvement. The authors argue this is because these methods address parameter variance while holding the data axis constant, whereas the dominant source of variance in scientific ML with small datasets is the data sampling itself.

2. Data-Side Solution A: K-Bootstrap Bagging

The classical Bagging approach (Breiman, 1996) trains $K$ models on $K$ independent bootstraps of the training set and averages their predictions.

Performance: Reduces churn by 40–54% across all datasets compared to ERM.
Cost: Requires $K\times$ the compute of a single ERM training run (e.g., $5\times$ for $K=5$).
Accuracy: Achieves this reduction with no cost to aggregate accuracy.
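
A minimal sketch of K-bootstrap bagging, assuming a base learner that returns a function mapping an input to the probability of class 1. The toy learner `fit_midpoint`, the synthetic 1-D data, and all function names are hypothetical stand-ins chosen so that the example runs with the standard library alone; they are not the models or datasets used in the paper.

```python
import math
import random


def bootstrap(data, rng):
    """Resample len(data) examples with replacement."""
    return [rng.choice(data) for _ in range(len(data))]


def fit_midpoint(data):
    """Toy base learner for 1-D inputs: a logistic centred between the class means.

    Assumes both classes are present in `data` and that class 1 has the larger mean.
    """
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 / (1 + math.exp(-(x - mid)))


def fit_bagged(data, fit, k, rng):
    """Train k models on k independent bootstraps; predict with the mean probability."""
    members = [fit(bootstrap(data, rng)) for _ in range(k)]
    return lambda x: sum(m(x) for m in members) / k


def flip_rate(make_model, test_x):
    """Churn between two independently trained models on the same test inputs."""
    a, b = make_model(), make_model()
    return sum((a(x) > 0.5) != (b(x) > 0.5) for x in test_x) / len(test_x)


rng = random.Random(0)
# Synthetic data: class 1 centred at +1, class 0 at -1 (illustrative only).
train = [(rng.gauss(2 * y - 1, 1.5), y) for y in [0, 1] * 100]
test_x = [rng.gauss(0, 2) for _ in range(500)]

single = flip_rate(lambda: fit_midpoint(bootstrap(train, rng)), test_x)
bagged = flip_rate(lambda: fit_bagged(train, fit_midpoint, 5, rng), test_x)
print(f"single-model churn {single:.3f}, bagged (K=5) churn {bagged:.3f}")
```

Each call to `fit_bagged` trains `k` base models, which is where the $K\times$ compute cost comes from; the churn comparison at the end is a single toy run and should not be read as reproducing the paper's numbers.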

3. Data-Side Solution B: Twin-Bootstrap

The authors propose Twin-Bootstrap, a method that trains two networks ($\theta_A, \theta_B$) jointly on independent bootstraps ($S_A, S_B$) of the training set.

Mechanism: The networks are trained to minimize a combined loss consisting of standard cross-entropy on their respective bootstraps plus a symmetric KL-divergence consistency loss ($L_{cons}$) between their predictions on the union of the mini-batches.
Data Overlap: Due to bootstrap sampling with replacement, the two bootstraps share approximately 40% of the training indices in expectation. The consistency loss acts on this overlap, while the cross-entropy losses specialize on the non-shared remainder.
Hyperparameter ($\lambda$): The weight of the consistency loss is selected on a development set (BACE) using a rule that maximizes $\lambda$ while keeping accuracy within 0.02 of the ERM baseline. The selected value is $\lambda=300$ for the default MLP architecture.
Performance: At matched $2\times$ ERM compute (training two networks), twin-bootstrap reduces churn a further median 45% beyond bagging with $K=2$. It matches the performance of bagging with $K=5$ (which requires $5\times$ compute) in mean rank.
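
The objective and the overlap figure described above can be sketched in plain Python. This is a sketch under stated assumptions rather than the authors' implementation: the symmetric KL is taken as the sum of the two directions with no 1/2 factor, each loss term is averaged over its batch, and the function names and argument layout are hypothetical. The overlap function uses the standard fact that an index appears in a bootstrap of size $n$ with probability $1-(1-1/n)^n$, so it appears in two independent bootstraps with the square of that probability.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Assumed form of the consistency term: KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)


def cross_entropy(p, label, eps=1e-12):
    """Negative log-probability assigned to the true class."""
    return -math.log(p[label] + eps)


def twin_loss(batch_a, batch_b, union_preds, lam):
    """Combined loss for one training step.

    batch_a:     list of (probs from network A, label) on A's mini-batch from S_A
    batch_b:     list of (probs from network B, label) on B's mini-batch from S_B
    union_preds: list of (probs from A, probs from B) on the union of both mini-batches
    lam:         weight of the consistency term
    """
    ce_a = sum(cross_entropy(p, y) for p, y in batch_a) / len(batch_a)
    ce_b = sum(cross_entropy(p, y) for p, y in batch_b) / len(batch_b)
    cons = sum(symmetric_kl(pa, pb) for pa, pb in union_preds) / len(union_preds)
    return ce_a + ce_b + lam * cons


def expected_shared_fraction(n):
    """Expected fraction of n training indices that appear in both of two bootstraps."""
    in_one = 1 - (1 - 1 / n) ** n  # tends to 1 - 1/e, about 0.632
    return in_one ** 2             # tends to about 0.40


print(round(expected_shared_fraction(1000), 3))
```

When the two networks output identical distributions the consistency term is zero and the loss reduces to the two cross-entropy terms, so $\lambda$ only penalizes disagreement between the twins.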

Key Results

Magnitude of Churn

Churn Rates: On 9 chemistry benchmarks (MoleculeNet, TDC ADME/Tox, materials science), cross-sample churn flips 8.0–21.8% of test predictions.
Aggregate Stability: Aggregate accuracy moves only 1.3–4.2 percentage points between retrainings, hiding the significant per-prediction instability.
Minority Class Instability: On imbalanced datasets, minority-class predictions are $2$–$4\times$ more unstable than majority-class predictions, affecting the most critical "active" or "toxic" predictions.

Comparative Performance

Parameter-Side vs. Data-Side: Deep ensembles, MC dropout, and SWA fail to reduce churn consistently. Bagging and Twin-Bootstrap are the only methods that reliably reduce churn.
Efficiency: Twin-bootstrap achieves churn reduction comparable to $5\times$-compute Bagging ($K=5$) while only requiring $2\times$ ERM compute.
Distributional Agreement: Twin-bootstrap reduces the symmetric KL divergence (distributional disagreement) by an additional factor of $\sim 9\times$ beyond Bagging-$K=5$, indicating superior stabilization of the full probability distribution, not just the argmax.

Downstream Impact

Bayesian Optimization (BO): In BO simulations, twin-bootstrap significantly increases the Jaccard overlap of the top-10 selected molecules between retrainings (e.g., from 0.03 to 0.68 on the AMES dataset). It reduces the cross-trajectory standard deviation of the final-best acquired value by 34–100% in regression tasks.
Triage Workflow: Sorting test examples by their estimated churn (using a single extra retraining) allows practitioners to identify the most fragile predictions. Reviewing the top 30% of predictions ranked by churn captures 58–100% of all class flips, outperforming predictive entropy.
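
The two downstream measurements above can be sketched as follows. The summary does not specify how per-example churn is estimated, so `flip_capture` takes an arbitrary fragility score per example as input; that score, the function names, and the toy numbers are assumptions made for illustration.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring candidates."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def jaccard(a, b):
    """Jaccard overlap |a & b| / |a | b| of two index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def flip_capture(fragility, flipped, review_fraction=0.3):
    """Share of all class flips found among the most fragile examples.

    fragility: one score per test example (higher means less stable)
    flipped:   1 if the example's predicted class changed between retrainings, else 0
    """
    n_review = max(1, round(review_fraction * len(fragility)))
    ranked = sorted(range(len(fragility)), key=lambda i: fragility[i], reverse=True)
    total = sum(flipped)
    return sum(flipped[i] for i in ranked[:n_review]) / total if total else 0.0


# Toy scores from two hypothetical retrainings (not data from the paper).
scores_run1 = [0.9, 0.8, 0.1, 0.7]
scores_run2 = [0.9, 0.2, 0.8, 0.7]
print(jaccard(top_k(scores_run1, 2), top_k(scores_run2, 2)))
```

A Jaccard value near 1 means the two retrainings would send nearly the same molecules to synthesis, while a value near 0 means the selection is almost entirely determined by the training sample.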

Generalization

The method generalizes across architectures and tasks:

Architectures: Works on MLPs, Graph Isomorphism Networks (GIN), and pretrained backbones (ChemBERTa, ResNet-50).
Hyperparameter Tuning: While the optimal $\lambda$ value changes with architecture (e.g., $\lambda=300$ for MLP, $\lambda=10$ for GIN/ChemBERTa), the selection rule (maximize $\lambda$ subject to a small accuracy drop on the development set) transfers unchanged.
Tasks: The ranking of methods (Twin-Bootstrap $\approx$ Bagging-$K=5$ $>$ ERM) holds for both classification and regression tasks.
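
The $\lambda$ selection rule can be sketched as a small search over a candidate grid. The sketch assumes "within 0.02" means the development-set accuracy may fall at most 0.02 below the ERM baseline; the candidate grid and accuracy values in the example are made up for illustration and are not results from the paper.

```python
def select_lambda(candidates, dev_accuracy, erm_accuracy, tolerance=0.02):
    """Largest lambda whose dev-set accuracy stays within `tolerance` of ERM.

    candidates:   iterable of lambda values to consider
    dev_accuracy: dict mapping each candidate lambda to its dev-set accuracy
    erm_accuracy: dev-set accuracy of the ERM baseline
    Returns None if no candidate satisfies the accuracy constraint.
    """
    admissible = [lam for lam in candidates
                  if dev_accuracy[lam] >= erm_accuracy - tolerance]
    return max(admissible) if admissible else None


# Hypothetical grid and accuracies (illustrative only).
grid = [1, 10, 100, 300, 1000]
dev_acc = {1: 0.80, 10: 0.80, 100: 0.79, 300: 0.785, 1000: 0.74}
print(select_lambda(grid, dev_acc, erm_accuracy=0.80))
```

Because the rule depends only on development-set accuracy, the same function can be reused for a different architecture by supplying that architecture's own accuracy measurements.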

Significance and Claims

The paper argues that cross-sample prediction churn is a missing metric in scientific ML benchmarking. Without reporting this metric, parameter-side uncertainty methods (ensembles, dropout) and data-side methods (bagging, twin-bootstrap) appear indistinguishable on standard accuracy metrics, despite differing fundamentally in their ability to stabilize operational decisions.

The authors claim that:

Churn is the operational stability metric: In closed-loop labs and virtual screening, the reproducibility of the specific molecules selected is more critical than the aggregate accuracy.
Data resampling is the key lever: Stability is determined more by how the training procedure resamples data than by the model class itself.
Twin-Bootstrap offers a practical recipe: It provides a computationally efficient ($2\times$ ERM) method to design in cross-sample stability at training time without changing the deployment pipeline, simply by tuning a single hyperparameter on a development set.

The paper concludes that reducing churn has direct operational consequences, cutting wasted experimental work and making computational triage decisions reproducible, though it notes that low churn does not guarantee correctness (a stably wrong model is still wrong).
