A causally informed framework for robust confounder control in biomedical machine learning

⚕️

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content. Read full disclaimer

The Big Problem: The "Fake" Connection

Imagine you are a detective trying to figure out why people are getting sick. You notice that people who carry umbrellas are much more likely to get wet.

The Wrong Conclusion: "Carrying umbrellas causes people to get wet!"
The Real Truth: It's raining. Rain causes people to carry umbrellas and causes them to get wet. The umbrella didn't cause the wetness; the rain did.

In the world of medical AI (Machine Learning), computers are great detectives, but they are terrible at understanding why things happen. They just see patterns. If an AI is trained to predict a disease based on brain scans, it might accidentally learn that "older people get sick more often" and decide that age is the disease. Or, it might learn that "people who take a certain pill have a specific brain shape," and decide the pill is the cause of the brain shape, when actually the brain shape caused the need for the pill.

This is called confounding. The AI is learning "fake" connections (like the umbrella) instead of the real biological causes (like the rain). This makes the AI useless when it tries to help new patients in different hospitals or situations.

The Solution: A Three-Step "Causal" Framework

The authors of this paper say, "Stop guessing! We need a map." They propose a three-step framework to help AI researchers build better, more honest models.

Step 1: Draw the Map (The DAG)

Before you feed data to the computer, you need to draw a map of how you think the world works. The authors call this a DAG (Directed Acyclic Graph).

The Analogy: Imagine you are planning a road trip. You don't just drive randomly; you look at a map to see which roads connect to which.
In the Paper: Researchers use their medical knowledge to draw arrows showing what causes what. For example: Sex Hormones → Muscle Mass → Hand Grip Strength.
Why it helps: This map forces the researcher to think: "Does this variable actually cause the problem, or is it just a side effect?" It stops the AI from getting tricked by fake patterns.

Step 2: Pick the Right Filters (Deconfounders)

Once you have the map, you need to decide which variables to "block" so the AI only sees the real signal.

The Analogy: Imagine you are trying to listen to a specific singer in a noisy room. You need to put on noise-canceling headphones that block out the specific noises (the confounders) but let the singer through.
The Challenge: Sometimes the "noise" (the confounder) isn't even recorded in the data. Maybe the AI needs to know about "stress levels," but the hospital never measured stress.
The Fix: The paper suggests clever tricks for these missing pieces.
- The Proxy Trick: If you can't measure "stress," maybe you can measure "how much coffee they drink" or "how fast they blink." These are proxies—clues that hint at the missing stress.
- The Instrument Trick: Sometimes you need a "randomizer" (like a genetic lottery) that affects the brain but not the disease directly, helping to isolate the true cause.

Step 3: Clean the Data (Adjustment)

Now that you know what to block, you actually clean the data before teaching the AI.

The Analogy: If you are baking a cake and you know the flour is wet (confounded), you dry it out before mixing it. If you don't, your cake will be soggy.
The Paper's Warning: The authors point out that most scientists currently use a very simple, blunt tool to dry the flour (called "linear residualization"). It's like using a hairdryer on a delicate cake—it might work for simple things, but it ruins complex patterns.
The Better Way: They suggest using a more advanced tool called Double Machine Learning (DML). This is like using a smart, temperature-controlled oven that can dry the flour without cooking the cake. It's much better at handling complex, messy biological data.

The Big Takeaway: "Correlation is not Causation"

The paper ends with a very important warning. Even if you do all these steps perfectly, you still cannot claim the AI has found the "truth" or a "cure."

The Analogy: Think of the AI as a very smart parrot. If you teach the parrot to say "Fire causes smoke" by showing it pictures of fires and smoke, the parrot learns the pattern. But the parrot doesn't understand fire. If you show it a picture of smoke from a fog machine, the parrot might get confused.
The Reality: A "deconfounded" AI is a much smarter parrot. It won't get tricked by the umbrella/rain example. It will give you a much more reliable prediction. However, it is still just predicting patterns, not performing magic or proving a biological law.

Summary

This paper is a guidebook for medical AI researchers. It says:

Don't just guess which variables to ignore; draw a map of causes first.
Use smart tricks to handle missing data (proxies and instruments).
Use better cleaning tools (like Double Machine Learning) instead of simple ones.
Remember: Even with these tools, the AI is still a prediction machine, not a time-traveling scientist. But by removing the "fake" connections, we can finally trust its predictions enough to use them in real hospitals.

1. Problem Statement

Machine Learning (ML) in neurobiomedicine often produces predictive models that lack generalizability and biological validity. A primary cause of this failure is confounding bias, where models exploit spurious associations driven by third variables (confounders) rather than genuine biological mechanisms.

Current Limitations: Standard practices often define confounders heuristically (e.g., age, sex) or based solely on statistical correlations. This approach fails to distinguish between confounders (common causes of features and targets), colliders (common effects), and mediators (intermediate causes).
Consequences: Adjusting for the wrong variables (e.g., conditioning on a collider) can introduce bias (Berkson's paradox), while failing to adjust for true confounders leads to models that overfit to specific data distributions (e.g., demographic shifts) and fail in new settings.
The Gap: There is a lack of structured, causally grounded frameworks for selecting adjustment variables in supervised ML (SML), particularly when dealing with unmeasured confounders common in observational neurobiomedical data.

2. Methodology: A Three-Step Causal Framework

The authors propose a pragmatic, three-step framework to integrate causal inference principles into SML workflows. The framework uses Directed Acyclic Graphs (DAGs) to formalize domain knowledge and guide variable selection.

Step 1: Causal Analysis (DAG Construction)

Approach: A "bottom-up" strategy is used to construct a DAG based on domain expertise and literature.
Process: Researchers start with the target variable ( $Y$ ) and iteratively add variables that causally influence $Y$ or the input features ( $X$ ).
Goal: To explicitly map causal relationships, distinguishing between confounders, mediators, and colliders. This step forces researchers to articulate assumptions about the data-generating process rather than relying on arbitrary correlations.

Step 2: Identification of Deconfounders

Backdoor Criterion: The primary method for identifying a sufficient set of adjustment variables (deconfounders). A set $Z$ satisfies the backdoor criterion if it blocks all non-causal paths from $X$ to $Y$ without opening new paths (e.g., by conditioning on a collider).
Handling Unmeasured Confounders: Since neurobiomedical data often lacks key variables (e.g., hormone levels), the framework discusses alternative strategies:
- Alternative Backdoor Sets: Finding other measured variables that block the same paths (e.g., using "muscle mass" and "sex" instead of unmeasured "sex hormones").
- Front-door Criterion: Using an intermediate variable ( $F$ ) that intercepts the causal path ( $X \to F \to Y$ ) if backdoor adjustment is impossible.
- Instrumental Variables (IVs): Using variables that affect $X$ but not $Y$ directly to isolate causal variation.
- Proxy Variables: Using two proxies ( $P_1, P_2$ ) for an unmeasured confounder $Z$ to non-parametrically recover the influence of $Z$ , provided specific independence and relevance conditions are met.

Step 3: Statistical Evaluation and Adjustment

Validation: Identified deconfounders must be statistically associated with both $X$ and $Y$ to ensure they are actionable in the data.
Adjustment Strategies:
- Linear Residualization: The paper critiques the standard practice of linearly regressing confounders out of features (or targets) only. It highlights that this assumes linear relationships and often fails to remove non-linear confounding or misaligns feature-target distributions if only one side is adjusted.
- Double Machine Learning (DML): The authors propose adapting DML (originally for causal inference) to SML. DML uses cross-fitting and flexible ML models to residualize both features and targets, effectively removing non-linear confounding effects while maintaining robustness against overfitting.

3. Key Results and Case Study

The framework is demonstrated using a UK Biobank dataset to predict Hand Grip Strength (HGS) from Gray Matter Volume (GMV) features (1088 brain parcels).

The "Vanilla" Model: A linear Support Vector Regression (SVR) without confounder adjustment achieved a correlation of $r=0.48$ .
Causal Analysis: The DAG revealed that muscle mass and sex were valid deconfounders (blocking backdoor paths from GMV to HGS), whereas sex hormones were unmeasured.
The "Deconfounded" Model: After applying linear feature residualization for muscle mass and sex, the predictive performance collapsed to $r=0.00$ .
Interpretation: The initial high accuracy ( $r=0.48$ ) was entirely driven by confounding bias (demographic and behavioral correlates) rather than a true biological link between GMV and HGS. The collapse in performance after deconfounding proved the original model was spurious.
Implication: The lack of signal in the deconfounded model suggests that either no strong linear biological relationship exists, or that more complex non-linear models are required to capture the true signal once bias is removed.

4. Key Contributions

Structured Framework: Provides a reproducible, three-step workflow (Causal Analysis $\to$ Deconfounder Selection $\to$ Statistical Adjustment) to replace heuristic confounder selection in biomedical ML.
Handling Unmeasured Confounders: Systematically reviews and adapts causal inference tools (Front-door, IVs, Proxies) for SML contexts where data is incomplete.
Critique of Residualization: Demonstrates the limitations of standard linear feature residualization and advocates for dual adjustment (features and targets) and non-linear methods like Double Machine Learning (DML).
Clarification of Causal Claims: Explicitly distinguishes between debiased prediction and causal inference. The authors argue that while deconfounding improves generalizability and validity, SML models remain associative ( $P(Y|X)$ ) and cannot automatically be interpreted as causal ($P(Y|do(X))$) without further assumptions (e.g., ignorability, consistency, faithfulness).

5. Significance

Scientific Rigor: The paper addresses a critical reproducibility crisis in neurobiomedical ML by preventing models from learning spurious correlations (e.g., attributing brain changes to a disease when they are actually due to age or medication).
Clinical Utility: By ensuring models generalize across different populations and data distributions, the framework supports the development of reliable clinical decision-support tools.
Bridging Disciplines: It successfully bridges the gap between causal inference theory and practical machine learning application, offering a toolkit for researchers to build more robust, interpretable, and biologically meaningful predictive models.
Future Direction: The discussion on DML suggests a path forward for handling complex, non-linear confounding in high-dimensional biological data, moving beyond the limitations of simple linear adjustments.