Instrumental and Proximal Causal Inference with Gaussian Processes

This paper introduces a Deconditional Gaussian Process framework that unifies instrumental variable and proximal causal inference. The framework delivers reliable, well-calibrated epistemic uncertainty quantification and principled model selection while matching the predictive accuracy of existing methods.

Yuqi Zhang, Krikamol Muandet, Dino Sejdinovic, Edwin Fong, Siu Lun Chau

Published 2026-03-03

Imagine you are a doctor trying to figure out if a new medicine actually cures a disease. You look at your patient records and see that people who took the medicine got better. But wait! Maybe those people were also healthier to begin with, or maybe they ate better food. You can't tell if the medicine worked or if it was just their healthy lifestyle. In statistics, this hidden factor (like lifestyle) is called a "confounder," and it makes it very hard to prove cause and effect.

This paper introduces a smarter way to solve this problem, even when you can't observe those hidden factors. The authors call their methods GPIV and GPProxy.

Here is the breakdown using simple analogies:

1. The Problem: The "Hidden Puppeteer"

In many real-world situations (like economics or medicine), we can't run a perfect experiment where we control everything. We only have observational data.

  • The Confounder: Imagine a hidden puppeteer pulling strings on both the "Treatment" (the medicine) and the "Outcome" (getting better). If you just look at the data, you think the medicine caused the recovery, but really, the puppeteer did.
  • The Old Tools: Scientists have used two main tricks to find the puppeteer:
    • Instrumental Variables (IV): Using a "randomizer" (like a lottery for who gets the medicine) that isn't influenced by the puppeteer.
    • Proxies: Using "surrogate" clues (like a patient's mood or a side effect) that hint at what the puppeteer is doing.
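A tiny synthetic simulation (our own illustration, not from the paper) makes the puppeteer concrete: a hidden lifestyle score drives both who takes the medicine and who recovers, so a naive regression overstates the medicine's effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hidden "puppeteer": a healthy-lifestyle score we never get to observe.
u = rng.normal(size=n)

# Treatment uptake and recovery both depend on the hidden confounder.
# The true causal effect of the medicine on the outcome is 1.0.
treatment = 0.8 * u + rng.normal(size=n)
outcome = 1.0 * treatment + 2.0 * u + rng.normal(size=n)

# Naive regression slope of outcome on treatment (ignores the confounder):
naive_effect = np.cov(treatment, outcome, ddof=1)[0, 1] / np.var(treatment, ddof=1)
print("true effect:  1.0")
print(f"naive effect: {naive_effect:.2f}")  # biased upward by the confounder
```

Because the healthy people both take the medicine more often and recover more often, the naive slope is nearly double the true effect.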

2. The Flaw in Old Tools: "Guessing without a Safety Net"

The existing methods are good at giving you a single number (e.g., "The medicine improves health by 10%"). But they are terrible at telling you how sure they are.

  • It's like a weather forecaster saying, "It will rain tomorrow," but refusing to say if there's a 10% chance or a 90% chance.
  • If you are a doctor, you need to know: Is this 10% improvement a solid fact, or just a lucky guess? If it's a guess, you shouldn't prescribe the medicine to everyone.

3. The Solution: The "Gaussian Process" (The Flexible Rubber Sheet)

The authors propose using a Gaussian Process (GP).

  • The Analogy: Imagine a giant, stretchy rubber sheet stretched over a landscape of data points.
  • How it works: When you feed the data into this sheet, it doesn't just snap to a single line. It stretches and bends to fit the data, but it also "knows" how much it is stretching.
  • The Magic: The sheet gives you two things at once:
    1. The Prediction: The height of the sheet at any point (the estimated effect of the medicine).
    2. The Uncertainty: How wobbly or shaky the sheet is at that point. If the sheet is very wobbly, it means "I'm not sure here; I need more data." If it's flat and steady, it means "I'm very confident."
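The rubber-sheet picture corresponds to standard GP regression, which has a closed-form posterior mean ("the height of the sheet") and variance ("the wobble"). Here is a minimal sketch; the toy data and kernel settings are illustrative, not taken from the paper:

```python
import numpy as np

def rbf(a, b, length=1.0):
    """Squared-exponential kernel: similarity between inputs a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

# A few noisy observations of an unknown function.
rng = np.random.default_rng(1)
x_train = np.array([-2.0, -1.5, 0.0, 0.3, 2.5])
y_train = np.sin(x_train) + 0.05 * rng.normal(size=x_train.size)

# Standard GP posterior, assuming observation noise with std 0.05.
noise = 0.05**2
K = rbf(x_train, x_train) + noise * np.eye(x_train.size)
x_test = np.array([0.1, 5.0])  # one point near the data, one far away
k_star = rbf(x_test, x_train)

mean = k_star @ np.linalg.solve(K, y_train)
var = rbf(x_test, x_test).diagonal() - np.einsum(
    "ij,ji->i", k_star, np.linalg.solve(K, k_star.T)
)

# The "wobble": small variance near the data, large variance far from it.
print("posterior mean:", mean)
print("posterior var: ", var)
```

The variance at x = 5.0 (far from all observations) is close to the prior variance of 1, while near the data it collapses almost to zero: the sheet is steady where it is pinned down and wobbly where it is not.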

4. The Secret Sauce: "Deconditioning"

The paper uses a clever mathematical trick called Deconditioning.

  • The Analogy: Imagine you are trying to hear a whisper (the true effect) in a noisy room (the confounders).
  • The Trick: Instead of trying to shout over the noise, the authors use a "noise-canceling headphone" algorithm. They mathematically reverse the way the noise (the confounders) mixes with the signal.
  • The Result: Their method recovers the exact same "best guess" as the old, popular methods (so it's just as accurate), but it also keeps the "wobble" information (the uncertainty) that the old methods threw away.
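The paper's GPIV estimator is kernel-based, but its classical ancestor, two-stage least squares (2SLS), shows the core "reverse the mixing" idea in a few lines. The sketch below is plain 2SLS on synthetic data, not the authors' method; their contribution is recovering this kind of point estimate while also carrying the uncertainty through both stages:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
u = rng.normal(size=n)  # hidden confounder
z = rng.normal(size=n)  # instrument: e.g. a lottery, untouched by u
treatment = 0.7 * z + 0.8 * u + rng.normal(size=n)
outcome = 1.0 * treatment + 2.0 * u + rng.normal(size=n)

# Stage 1: predict the treatment from the instrument alone. This keeps
# only the part of the treatment that the puppeteer does NOT control.
beta1 = np.cov(z, treatment, ddof=1)[0, 1] / np.var(z, ddof=1)
t_hat = beta1 * z

# Stage 2: regress the outcome on the confounder-free treatment.
iv_effect = np.cov(t_hat, outcome, ddof=1)[0, 1] / np.var(t_hat, ddof=1)
print(f"IV estimate of the causal effect: {iv_effect:.2f}")  # ≈ 1.0
```

Unlike the naive regression, this two-stage procedure recovers the true effect of 1.0, because the instrument's variation is independent of the hidden confounder.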

5. Why This Matters: "Knowing When to Say 'I Don't Know'"

The paper shows that their new method is a game-changer for two reasons:

  • Better Decisions: Because it tells you how confident it is, you can make safer decisions. If the "wobble" is high, you can choose not to make a decision yet (like not prescribing a risky drug) until you have more evidence. This is called "selective prediction."
  • Self-Correction: The method can automatically tune itself to find the best settings without needing a human to guess, making it easier to use.
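Selective prediction can be sketched as a simple abstention rule on the posterior uncertainty; the threshold and the patient numbers below are hypothetical, chosen only to illustrate the idea:

```python
def selective_predict(means, stds, max_std=0.2):
    """Return a decision only where the model is confident enough;
    abstain (None) where the posterior 'wobble' exceeds the threshold."""
    return [
        (m if s <= max_std else None)  # abstain on high uncertainty
        for m, s in zip(means, stds)
    ]

# Hypothetical GP posterior for three patients: estimated treatment
# effect and its posterior standard deviation.
effects = [0.10, 0.12, 0.40]
stds = [0.03, 0.50, 0.05]

decisions = selective_predict(effects, stds)
print(decisions)  # [0.1, None, 0.4] — no decision for the uncertain case
```

The model answers confidently for the first and third patients but says "I don't know" for the second, exactly the behavior a cautious doctor would want.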

Summary

Think of the old methods as a crystal ball that gives you a single, blurry number and says, "Trust me."
The new method is like a smart GPS that gives you the route and tells you, "I'm 95% sure this is the right way, but this next turn is foggy, so drive carefully."

By combining the accuracy of old tools with the "confidence meter" of modern AI, this paper provides a safer, more reliable way to figure out cause and effect in a messy, confusing world.
