The Big Problem: The "Maybe" Map
Imagine you are trying to draw a map of how a city works. You have a lot of data about traffic, but you don't have a time machine to see what happens if you close a specific street.
Because of this, standard computer algorithms can only give you a Partial Ancestral Graph (PAG). Think of this as a map where:
- Some roads have clear arrows (A → B means "A causes B").
- Some roads have double-headed arrows or question marks (A ↔ B means "We know they are connected, but we don't know who is driving whom").
In the world of data science, these "maybe-arrows" are a nightmare. If you want to predict what happens if you change a variable (like "What if we ban smoking?"), you need a map with only one-way arrows. You can't make a decision based on a "maybe."
The Solution: CausalSAGE
The authors propose a new tool called CausalSAGE. Its job is to take that messy map full of "maybe-arrows" and turn it into a clean, fully directed map (a DAG) without breaking the rules of the data.
Here is how it works, broken down into three simple steps:
1. Zooming In: The "State-Level" Expansion
The Analogy: Imagine a light switch. Standard algorithms see it as just "On" or "Off." But CausalSAGE zooms in. It realizes that "On" isn't just one thing; it's a specific state. Maybe "On" at 60% brightness causes a different reaction than "On" at 100% brightness.
What they do: Instead of treating a variable as a single block, they break it down into its individual states (like a one-hot encoding). This gives the computer a much higher resolution view. It's like switching from a blurry, low-resolution photo to a 4K image. Suddenly, patterns that looked symmetric (ambiguous) start to look different because the specific details of the states reveal who is actually influencing whom.
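The state-level expansion above can be sketched in a few lines. This is an illustrative one-hot encoding, not the paper's actual implementation; the variable and its states are invented for the example:

```python
import numpy as np

# Hypothetical example: a variable "brightness" with three observed states.
samples = np.array(["off", "60%", "100%", "60%", "off"])
states = ["off", "60%", "100%"]

# One-hot ("state-level") expansion: one binary indicator column per state.
onehot = np.stack([(samples == s).astype(int) for s in states], axis=1)
print(onehot)
# Each column now acts as its own variable, so asymmetries between specific
# states (e.g. "60%" vs "100%") become visible to the algorithm.
```

The point of the sketch: after expansion, the algorithm no longer compares "brightness" to other variables as a single block, but compares each state column separately, which is where the extra resolution comes from.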
2. The Rules of the Road: Structural Constraints
The Analogy: Imagine you are trying to solve a maze. You don't want to guess randomly; you want to know which walls are real. The original "Maybe Map" (the PAG) tells you which walls definitely exist and which paths are impossible.
What they do: CausalSAGE uses the original map as a strict rulebook.
- If the original map says "A cannot connect to B," CausalSAGE builds a wall there.
- If the original map says "A definitely causes B," CausalSAGE locks that arrow in place.
- It only tries to guess the direction for the "maybe" roads, and even then, it only guesses within the boundaries of the original map.
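A minimal sketch of how such a rulebook could be encoded, assuming a simple edge-mark dictionary (the mark names and data structure are illustrative, not the paper's):

```python
# Hypothetical PAG encoding: "->" is a locked arrow, "o-o" is undetermined,
# and None means no edge is allowed (a "wall").
pag = {
    ("A", "B"): "->",    # locked: A definitely causes B
    ("B", "C"): "o-o",   # undetermined: direction must be learned
    ("A", "C"): None,    # wall: no connection may be invented here
}

def candidate_orientations(pag):
    """For each pair, list the directed edges the final DAG may use."""
    options = {}
    for (x, y), mark in pag.items():
        if mark is None:
            options[(x, y)] = []               # wall: nothing allowed
        elif mark == "->":
            options[(x, y)] = [(x, y)]         # arrow locked in place
        else:                                  # "maybe" road: both directions
            options[(x, y)] = [(x, y), (y, x)]
    return options

print(candidate_orientations(pag))
```

The search then only has to choose among the listed candidates, so it can never add an edge the original map forbids or flip an arrow the original map fixed.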
3. Breaking the Tie: The "Symmetry Breaker"
The Analogy: Imagine two equally strong teams in a tug-of-war. If they pull with exactly the same force, the rope doesn't move. The computer gets stuck in the middle, unable to decide which way the arrow should point.
What they do: To get the rope moving, CausalSAGE gives one side a tiny, gentle nudge.
- Random Nudge: It might randomly guess, "Hey, maybe A causes B," just to get the optimization started.
- Smart Nudge (LLM): If the variables have names (like "Smoking" and "Lung Cancer"), it asks a Large Language Model (like an AI expert) for a hint. The AI says, "Logically, Smoking causes Cancer." CausalSAGE uses this hint as a starting bias.
Once the rope starts moving in one direction, the math takes over. The system looks at the data to see which direction explains the observations better. The "nudge" just helps it escape the stalemate; the data does the heavy lifting.
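The tie-breaking idea can be sketched as a tiny asymmetric prior added to otherwise equal direction scores. The epsilon value, scoring, and hint format below are illustrative assumptions, not the paper's actual math:

```python
import random

def pick_direction(score_ab, score_ba, hint=None, eps=1e-3,
                   rng=random.Random(0)):
    """Choose an edge direction; a small nudge breaks exact ties."""
    if hint == "A->B":
        score_ab += eps            # smart nudge from an external (LLM) hint
    elif hint == "B->A":
        score_ba += eps
    elif score_ab == score_ba:     # exact stalemate: random nudge
        score_ab += rng.choice([eps, -eps])
    # The data-driven scores still dominate; the nudge only breaks ties.
    return "A->B" if score_ab > score_ba else "B->A"

print(pick_direction(0.5, 0.5, hint="A->B"))  # prints A->B
```

Note that when the data clearly favors one direction (say `score_ab = 0.9` vs `score_ba = 0.1`), the epsilon-sized nudge cannot overturn it, which mirrors the claim that the nudge only escapes the stalemate while the data does the heavy lifting.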
The Result: A Clear Map
After running this process, CausalSAGE produces a final map with only one-way arrows.
- No more "Maybe": Every connection has a clear direction.
- No new lies: It respects the original data constraints, so it doesn't invent fake connections.
- Speed: It can handle huge maps (up to 700+ variables) in just a few minutes on a normal computer.
Why This Matters
In the past, scientists had to choose between:
- Safe but useless: A map with "maybe-arrows" (PAGs) that is statistically correct but can't be used for decision-making.
- Risky: Guessing the direction of arrows without enough data, which often leads to wrong conclusions.
CausalSAGE is the middle ground. It takes the safe, statistically sound map and uses smart math and a little bit of "nudging" to resolve the ambiguities, giving us a clear, actionable map of cause and effect. It turns a "maybe" into a "definitely."