Reducing cross-sample prediction churn in scientific machine learning

This paper introduces "cross-sample prediction churn" to describe how unstable scientific machine learning models are across different draws of the training data. It shows that data-side methods, namely K-bootstrap bagging and the proposed twin-bootstrap approach, significantly reduce this churn without sacrificing predictive accuracy, whereas standard parameter-side techniques do not.

Original authors: Gordan Prastalo, Kevin Maik Jablonka

Published 2026-05-14

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Problem: The "Weather Forecast" Problem

Imagine you are a scientist trying to predict which molecules will make good medicines. You build a computer model to do this.

Now, imagine you train that model on a specific set of data. It predicts that Molecule A is a "winner" (it will work as a drug).

But then you decide to retrain the model. You don't change the rules or the data source; you just use a slightly different random sample of that same data (like drawing a new hand of cards from the same deck).

The Shocking Result:
When you retrain the model, it suddenly says Molecule A is a "loser" and Molecule B is the new winner.

The paper calls this "Cross-Sample Prediction Churn." It's the rate at which the model flips its decision on the same item just because the training data was resampled slightly.

  • The Paper's Finding: Across 9 different chemistry tests, the model's overall accuracy changed only slightly (about 1–4%), but the specific decision for individual molecules flipped 8% to 22% of the time.
  • The Analogy: Imagine a judge who is 95% accurate overall. If you ask them to re-judge the same 100 cases after a lunch break, they might change their verdict on 20 of them. That's a lot of instability for the specific cases that matter most.
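
To make the gap between accuracy and churn concrete, here is a minimal sketch in Python. It assumes churn is measured as the fraction of test items on which two retrained models disagree; the paper's exact definition may differ, and the labels and predictions below are made up for illustration.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def prediction_churn(preds_a, preds_b):
    """Fraction of items whose predicted label differs between two models.

    Assumed definition for illustration, not necessarily the paper's.
    """
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if not preds_a:
        raise ValueError("need at least one prediction")
    flips = sum(a != b for a, b in zip(preds_a, preds_b))
    return flips / len(preds_a)


# Hypothetical labels and predictions for 10 molecules (1 = "winner").
labels  = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
model_a = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]  # trained on sample 1
model_b = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]  # retrained on sample 2

print(accuracy(model_a, labels))           # 0.8
print(accuracy(model_b, labels))           # 0.8
print(prediction_churn(model_a, model_b))  # 0.4
```

Both toy models score the same accuracy, yet they disagree on 4 of the 10 molecules, because each one makes its mistakes on different items.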

Why Current "Fixes" Don't Work

Scientists have tried to fix this using standard "uncertainty" tools, like:

  1. Deep Ensembles: Training 5 different models and averaging their answers.
  2. MC Dropout: Turning parts of the model "off" randomly during testing to see how much the answer wobbles.
  3. Stochastic Weight Averaging: Averaging the model's weights from several points along the training run to smooth them out.

The Paper's Verdict: These tools are like trying to fix a shaky camera by adjusting the lens focus (the model's internal settings) while the camera is still being held by a shaking hand (the data).

  • These methods fix the "lens" but ignore the "shaking hand."
  • The paper found these methods did not reduce the churn. They didn't stop the model from flipping its decisions when the data changed.

The Solution: Two Data-Side Methods

The authors examine two methods that actually work because they address the "shaking hand" (the data) rather than just the "lens": an existing technique, bagging, and their own proposal, twin-bootstrap.

1. K-Bootstrap Bagging (The "Committee" Approach)

  • How it works: Instead of training one model, you train a whole committee of models (e.g., 5 of them). Each member of the committee is trained on a slightly different random sample of the data. When you need an answer, you ask the whole committee and take the average vote.
  • The Result: This cuts the flipping rate by 40–54%.
  • The Catch: Training 5 models instead of 1 takes about 5 times the computing power.
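
The committee idea can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: the base learner (`fit_threshold`, a one-feature threshold rule), the data, and the choice of a majority vote are all assumptions made to keep the example self-contained.

```python
import random


def fit_threshold(data):
    """Toy base learner on (x, label) pairs.

    Places a threshold midway between the two class means and assumes
    the positive class has the larger x values.
    """
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def fit_bagged(data, k=5, seed=0):
    """Train k base learners, each on its own bootstrap sample."""
    rng = random.Random(seed)
    thresholds = []
    while len(thresholds) < k:
        # Bootstrap: draw len(data) items with replacement.
        sample = rng.choices(data, k=len(data))
        if len({y for _, y in sample}) < 2:
            continue  # redraw if the sample lost one of the classes
        thresholds.append(fit_threshold(sample))
    return thresholds


def predict_bagged(thresholds, x):
    """Majority vote of the committee (1 = "winner")."""
    votes = sum(x > t for t in thresholds)
    return int(2 * votes > len(thresholds))


# Hypothetical one-feature dataset.
data = [(0.1, 0), (0.4, 0), (0.7, 0), (0.9, 0),
        (2.1, 1), (2.5, 1), (2.8, 1), (3.0, 1)]

committee = fit_bagged(data, k=5)
print(len(committee))                # 5
print(predict_bagged(committee, 5.0))   # 1
print(predict_bagged(committee, -1.0))  # 0
```

Because each member sees a different resample, no single draw of the data decides the final answer, which is the intuition behind the reduced flipping rate.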

2. Twin-Bootstrap (The "Twin Sisters" Approach)

  • How it works: This is the paper's main invention. Imagine training two "twin" neural networks at the same time.
    • Twin A learns from Sample X.
    • Twin B learns from Sample Y (a slightly different sample).
    • The Secret Sauce: Every time they learn, the twins are forced to talk to each other. If they disagree on a molecule, they get a "penalty" (a consistency loss) to force them to agree.
  • The Result:
    • It reduces the flipping rate by an additional 45% compared to the standard committee method.
    • It achieves this with only 2x the computer power (training two twins instead of five separate models).
    • It keeps the accuracy just as high as the original model.
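
A rough sketch of what such a combined training objective could look like is shown below. The summary above does not give the paper's exact loss, so everything here is an assumption: the squared-difference form of the penalty, the weight `lam`, and the idea of evaluating both twins on a shared batch of inputs.

```python
import math


def bce(probs, labels):
    """Mean binary cross-entropy between predicted probabilities and labels."""
    eps = 1e-12  # guards against log(0)
    total = sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(probs, labels))
    return -total / len(labels)


def consistency_penalty(shared_a, shared_b):
    """Mean squared disagreement between the twins on the same inputs."""
    return sum((pa - pb) ** 2 for pa, pb in zip(shared_a, shared_b)) / len(shared_a)


def twin_loss(probs_a, labels_a, probs_b, labels_b, shared_a, shared_b, lam=1.0):
    """Each twin's own task loss plus a weighted penalty for disagreeing.

    probs_a / labels_a: twin A's predictions and labels on its bootstrap sample.
    probs_b / labels_b: the same for twin B on a different bootstrap sample.
    shared_a / shared_b: both twins' predictions on one shared batch.
    """
    task = bce(probs_a, labels_a) + bce(probs_b, labels_b)
    return task + lam * consistency_penalty(shared_a, shared_b)


# Hypothetical predicted probabilities.
probs_a, labels_a = [0.9, 0.2, 0.7], [1, 0, 1]
probs_b, labels_b = [0.8, 0.6, 0.3], [1, 1, 0]
shared_a = [0.9, 0.2, 0.6]
shared_b = [0.7, 0.4, 0.6]

agree = twin_loss(probs_a, labels_a, probs_b, labels_b, shared_a, shared_a)
disagree = twin_loss(probs_a, labels_a, probs_b, labels_b, shared_a, shared_b)
print(disagree > agree)  # True
```

In an actual training loop, this value would be minimized jointly over both networks' weights, so that each twin fits its own sample while being pulled toward the other's predictions.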

Why This Matters (The "Real World" Impact)

The paper argues that in scientific labs, decisions are made molecule-by-molecule.

  • The Scenario: A scientist uses the model to pick the top 10 molecules to synthesize in a lab.
  • The Risk: If the model has high "churn," the scientist might pick Molecule #1 today. But if they retrain the model tomorrow (which happens often in science), the model might say, "Actually, Molecule #1 is bad, let's try Molecule #10."
  • The Cost: This wastes time and money. The lab might synthesize the wrong molecule, or waste effort re-evaluating the same list.
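
The shortlist scenario can be illustrated with a small Python sketch that compares the top-k picks from two retrained models. The overlap measure, the molecule names, and the scores are all hypothetical and are not taken from the paper.

```python
def top_k(scores, k):
    """Names of the k highest-scoring items."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def top_k_overlap(scores_a, scores_b, k):
    """Fraction of the top-k shortlist shared by two models."""
    return len(top_k(scores_a, k) & top_k(scores_b, k)) / k


# Hypothetical scores from today's model and tomorrow's retrained model.
today = {"mol_1": 0.9, "mol_2": 0.8, "mol_3": 0.3, "mol_4": 0.2, "mol_5": 0.1}
tomorrow = {"mol_1": 0.4, "mol_2": 0.85, "mol_3": 0.9, "mol_4": 0.2, "mol_5": 0.1}

print(sorted(top_k(today, 2)))              # ['mol_1', 'mol_2']
print(sorted(top_k(tomorrow, 2)))           # ['mol_2', 'mol_3']
print(top_k_overlap(today, tomorrow, 2))    # 0.5
```

An overlap of 0.5 means half of the shortlist changed between retrains, even though nothing about the candidate molecules themselves changed.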

The paper suggests that scientific reports should always include a "Churn Score" alongside accuracy. Just knowing a model is "90% accurate" isn't enough; you need to know whether the individual predictions behind that number are stable or whether they change every time the model is retrained.

Summary

  • The Issue: Scientific AI models often flip their specific predictions when retrained on slightly different data, even if their overall score looks good.
  • The Old Way: Standard tricks to measure uncertainty (like ensembles) don't fix this specific problem.
  • The New Way:
    1. Bagging: Train a big committee of models (works well, but expensive).
    2. Twin-Bootstrap: Train two models together and force them to agree (works even better and is cheaper).
  • The Goal: Make scientific AI reliable enough that a scientist can trust the specific molecule it recommends, knowing the recommendation won't change just because they ran the training code one more time.
