Imagine you are a detective trying to predict the future behavior of a specific suspect (let's call him Gene X) in a city with thousands of other people (genes). You have a database of past events (data) where different people were "intervened" upon—maybe they were arrested, given a new job, or moved to a different neighborhood.
Your goal is to draw a prediction circle around Gene X's future actions. You want this circle to be tight (precise) but also safe (guaranteed to catch the real outcome 95% of the time).
The Problem: The "Bad Apples" in the Database
Standard prediction methods say: "Look at everyone in the database to figure out how much things usually vary."
But here's the catch: In a causal world, if you mess with Gene A, it might change Gene X. But if you mess with Gene B, Gene X doesn't care at all.
- If you mix data from "Gene A" (who affects X) and "Gene B" (who doesn't) together, your prediction circle becomes huge and useless because you're mixing two different realities.
- The Ideal Solution: Only look at the people who don't affect Gene X. This gives you a tiny, super-precise circle. This is called Selective Conformal Inference.
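The selective idea above can be sketched with split conformal prediction. Everything below is illustrative, not the paper's actual pipeline: we fake "null" calibration scores (interventions that don't touch Gene X) and "affecting" scores, then compare the prediction radius you get from the mixed pool versus the selected null pool.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical residual scores for Gene X under past interventions:
# small scores when the intervention doesn't affect X, large when it does.
null_scores = np.abs(rng.normal(0.0, 1.0, size=200))       # "strangers"
affect_scores = np.abs(rng.normal(3.0, 1.0, size=200))     # "bad apples"

alpha = 0.05  # target: catch the truth 95% of the time

def conformal_radius(calib_scores, alpha):
    """Split-conformal radius: the ceil((1-alpha)*(n+1))-th smallest score."""
    n = len(calib_scores)
    k = int(np.ceil((1 - alpha) * (n + 1)))
    return np.sort(calib_scores)[min(k, n) - 1]

# Mixing the two populations blows up the circle; selecting only the
# null scores gives the tight circle selective conformal inference wants.
r_mixed = conformal_radius(np.concatenate([null_scores, affect_scores]), alpha)
r_selective = conformal_radius(null_scores, alpha)
```

Running this, `r_mixed` lands well above `r_selective`, which is the whole motivation for doing the selection at all.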
The New Challenge: We Don't Know Who Affects Whom
The problem is, we don't have a map of the city. We don't know which genes are "ancestors" (affect X) and which are "strangers" (don't affect X).
- If we try to draw the whole map (learn the full causal graph), it's like trying to map every street in a massive country while driving blind. It's too hard, too slow, and we'll make mistakes.
- If we guess wrong and include a "bad apple" (a gene that does affect X) in our "safe" group, our prediction circle becomes too small, and we fail to catch the real outcome.
The Paper's Solution: A Three-Part Strategy
The authors propose a clever, practical way to solve this without needing a perfect map.
1. The "Safety Net" Theorem (The Insurance Policy)
They realized that even if we make a few mistakes and accidentally include some "bad apples" in our safe group, we can still save the day.
- The Analogy: Imagine you are building a fence. You know you might accidentally leave a small gap (contamination). The authors proved a mathematical rule: "If you leave a gap of a known size (the contamination level), your fence will still hold up, provided you make the fence slightly taller to compensate."
- They created a formula that tells you exactly how much to "widen" your prediction circle based on how many mistakes you think you made. This guarantees you never lose your safety guarantee, even with imperfect knowledge.
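The paper's exact widening formula isn't reproduced here; the sketch below uses a generic stand-in in the same spirit: if at most `m` of the `n` calibration scores are "bad apples", they can displace the target order statistic by at most `m` ranks, so moving `m` ranks higher restores a valid (if slightly wider) circle. Treat the function and its parameters as hypothetical.

```python
import numpy as np

def widened_radius(calib_scores, alpha, m):
    """Conformal radius widened to tolerate up to m contaminated scores.

    Illustrative correction (not the paper's formula): shift the quantile
    index up by m ranks, capped at the largest calibration score.
    """
    s = np.sort(calib_scores)
    n = len(s)
    k = int(np.ceil((1 - alpha) * (n + 1))) + m
    return s[min(k, n) - 1]

rng = np.random.default_rng(1)
calib = np.abs(rng.normal(size=100))
r_plain = widened_radius(calib, 0.05, m=0)    # trust the safe group fully
r_safe = widened_radius(calib, 0.05, m=10)    # budget for 10 mistakes
```

The more mistakes you budget for, the wider the circle gets; with `m=0` you recover the ordinary conformal radius.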
2. The "Task-Driven" Shortcut (Don't Map the Whole City)
Instead of trying to learn the entire city map (the full causal graph), they asked: "Do we really need to know everything?"
- The Answer: No. We only need to know one specific thing for each pair: "Does this specific intervention affect this specific gene?" (Yes/No).
- The Analogy: Instead of learning the entire subway system, you just need to know: "If I take the Red Line, will I end up at the Museum?" You don't need to know the schedule of the Blue Line. This makes the job much easier and faster.
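One way to answer that single Yes/No question per (intervention, gene) pair is a plain two-sample test, with no graph learning anywhere. The sketch below uses a permutation test on the difference in means; the function name, data, and threshold are all illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def affects(control, intervened, n_perm=2000, level=0.05, seed=0):
    """Hypothetical pairwise test: does this intervention shift Gene X?

    Permutation test on the absolute difference in means. Returns a single
    Yes/No verdict per (intervention, gene) pair.
    """
    rng = np.random.default_rng(seed)
    obs = abs(control.mean() - intervened.mean())
    pooled = np.concatenate([control, intervened])
    n = len(control)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        hits += abs(pooled[:n].mean() - pooled[n:].mean()) >= obs
    return bool((hits + 1) / (n_perm + 1) <= level)  # True = "affects Gene X"

rng = np.random.default_rng(2)
ctrl = rng.normal(0, 1, 50)
strong = rng.normal(2, 1, 50)  # intervention with a real effect on Gene X
```

`affects(ctrl, strong)` comes back True, while comparing a sample against itself comes back False; each such verdict is one answered "pair", and the full map is never needed.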
3. The "Intersection Detective" (Finding the Truth)
How do we figure out who affects whom without a map? They used a clever trick called Perturbation Intersection.
- The Analogy: Imagine you have three suspects: A, B, and C.
- When you mess with A, a list of 10 people get upset.
- When you mess with B, a list of 10 people get upset.
- When you mess with C, a list of 10 people get upset.
- If you look at the people who get upset in all three lists, those are likely the "true descendants" (the people connected to the root cause).
- If someone only appears in A's list but not B's or C's, they were probably a "false alarm" (a fluke).
- By cross-referencing these lists (intersections), the algorithm filters out the noise and finds the true "safe" group of genes, even without a full map.
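The detective story above is literally a set intersection. Here is a toy reading of it (gene names and lists are made up): keep only the genes that react in every list, and treat everyone else as a "stranger" eligible for the safe calibration group.

```python
# Hypothetical "upset lists" from perturbing suspects A, B, and C.
upset_by_A = {"g1", "g2", "g3", "g9"}   # g3, g9 only appear here: likely flukes
upset_by_B = {"g1", "g2", "g4"}
upset_by_C = {"g1", "g2", "g5"}

# Perturbation intersection: true descendants show up in every list.
true_descendants = upset_by_A & upset_by_B & upset_by_C

# Everyone else joins the "safe" group used for calibration.
all_genes = {f"g{i}" for i in range(1, 10)}
safe_genes = all_genes - true_descendants
```

Here the intersection keeps only `g1` and `g2`, and one-off flukes like `g9` end up in the safe group instead of shrinking it.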
The Results: Does it Work?
They tested this on two things:
- Fake Data (Simulations): They created a fake world and intentionally messed up the "safe" group by adding 30% "bad apples."
- Without the fix: The prediction failed (only caught the truth 86% of the time).
- With the fix (widening the circle): It caught the truth 95%+ of the time, exactly as promised.
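A toy Monte Carlo in the spirit of that simulation (not the paper's exact setup): with some probability the test point is secretly a "bad apple" drawn from a shifted distribution, the naive circle undercovers, and widening the quantile by a few ranks pulls coverage back up. All numbers below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, n_cal = 0.10, 90

def coverage(widen_by, trials=4000):
    """Empirical coverage when 30% of test points are mislabeled 'safe'."""
    hit = 0
    for _ in range(trials):
        cal = np.abs(rng.normal(size=n_cal))          # clean null calibration
        test_is_bad = rng.random() < 0.3              # selection mistake
        test = abs(rng.normal(1.5 if test_is_bad else 0.0, 1.0))
        k = int(np.ceil((1 - alpha) * (n_cal + 1))) + widen_by
        r = np.sort(cal)[min(k, n_cal) - 1]
        hit += test <= r
    return hit / trials

cov_naive = coverage(widen_by=0)   # no fix: circle too small
cov_fixed = coverage(widen_by=8)   # widened circle (illustrative amount)
```

In this toy world the naive coverage falls well short of the 90% target while the widened circle clears it, mirroring the 86%-versus-95% pattern the paper reports.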
- Real Data (CRISPR Gene Editing): They used real data from a massive experiment where scientists cut genes in human cells.
- The "Corrected" method was the only one that stayed safe (coverage above 90%). The other methods failed because real-world biology is messy, and the "bad apples" were real.
The Bottom Line
This paper is like a guide for a detective who doesn't have a perfect map. It says:
"You don't need to know the whole city to catch the criminal. Just find the people who don't know the criminal, use a smart trick to filter out the liars, and if you accidentally include a liar, just widen your net a little bit to stay safe. You'll still catch the criminal every time."
It turns a mathematically impossible problem (learning the whole causal graph) into a manageable, practical task that keeps predictions safe and precise.