Causal Direction from Convergence Time: Faster Training in the True Causal Direction

This paper introduces Causal Computational Asymmetry (CCA), a method for identifying causal direction from a simple observation: neural networks trained in the true causal direction converge faster. The paper traces this speed gap to a formal optimization-time asymmetry, in which the reverse direction suffers from a higher irreducible loss floor and non-separable gradient noise.

Abdulrahman Tamim

Published 2026-02-27

The Big Question: Which Way Does the Arrow Point?

Imagine you see two things happening together: Ice Cream Sales go up, and Drowning Deaths go up.

  • Does eating ice cream cause people to drown? (No.)
  • Do drowning deaths cause people to buy ice cream? (No.)
  • The Truth: Hot weather causes both.

For decades, scientists have struggled with this. If you just look at data (statistics), you can't tell which way the arrow points. You need a way to distinguish Cause from Effect.

This paper proposes a clever new trick: Don't just look at the data; watch how fast a computer learns it.


The Core Idea: The "Learning Speed" Test

The authors suggest a simple experiment:

  1. Train a robot to guess the Effect based on the Cause (e.g., "Given the temperature, how much ice cream will be sold?").
  2. Train a different robot to guess the Cause based on the Effect (e.g., "Given the ice cream sales, what was the temperature?").
  3. Time them. Which robot learns the pattern faster?

The Rule: The robot that learns faster is looking in the Causal Direction. The one that struggles and takes longer is looking in the Reverse (Wrong) Direction.
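The three steps above can be sketched end to end. This is a minimal illustration, not the paper's implementation: the tiny network, learning rate, convergence threshold, and the `steps_to_converge` helper are all assumptions made for the sketch.

```python
import numpy as np

def steps_to_converge(x, y, threshold=0.05, lr=0.05, max_steps=5000, seed=0):
    """Train a one-hidden-layer tanh MLP to predict y from x with full-batch
    gradient descent; return how many steps it takes to push MSE below the
    threshold (or max_steps if it never gets there)."""
    rng = np.random.default_rng(seed)
    h = 16
    W1 = rng.normal(0, 0.5, (1, h)); b1 = np.zeros(h)
    W2 = rng.normal(0, 0.5, (h, 1)); b2 = np.zeros(1)
    X, Y = x.reshape(-1, 1), y.reshape(-1, 1)
    n = len(X)
    for step in range(1, max_steps + 1):
        A = np.tanh(X @ W1 + b1)          # hidden activations
        P = A @ W2 + b2                   # predictions
        err = P - Y
        if float(np.mean(err ** 2)) < threshold:
            return step
        # Manual backprop through the two layers.
        dP = 2 * err / n
        dW2, db2 = A.T @ dP, dP.sum(0)
        dA = (dP @ W2.T) * (1 - A ** 2)
        dW1, db1 = X.T @ dA, dA.sum(0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return max_steps

# Synthetic additive-noise pair: X causes Y through a nonlinear map.
rng = np.random.default_rng(1)
x = rng.normal(0, 1, 300)
y = np.tanh(2 * x) + 0.1 * rng.normal(0, 1, 300)       # Y = f(X) + noise
x = (x - x.mean()) / x.std()                            # z-score both
y = (y - y.mean()) / y.std()

forward = steps_to_converge(x, y)   # robot 1: Cause → Effect
reverse = steps_to_converge(y, x)   # robot 2: Effect → Cause
direction = "X → Y" if forward < reverse else "Y → X"
```

The decision rule is just the final comparison: whichever direction hits the loss threshold first is declared causal.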


The Analogy: The "Clean Kitchen" vs. The "Messy Kitchen"

Why does the causal direction learn faster? The paper uses a concept called the Additive Noise Model. Let's break it down with a kitchen analogy.

1. The Forward Direction (Cause → Effect): The Clean Kitchen

Imagine you are baking a cake.

  • The Recipe (The Cause): You follow a specific recipe (Temperature → Sales).
  • The Mistakes (The Noise): Sometimes you spill a little flour, or the oven is slightly off. These are random, small mistakes.
  • The Result: When you look at the finished cake (the data), the mistakes are just random sprinkles of flour. They don't tell you anything about the recipe.
  • Learning: A student trying to guess the recipe from the cake can easily ignore the random flour sprinkles. The path to the answer is clean and straight. They learn quickly.

2. The Reverse Direction (Effect → Cause): The Messy Kitchen

Now, imagine you are a detective trying to figure out the recipe just by looking at the finished cake.

  • The Problem: The cake is the result of the recipe PLUS all those random mistakes (spilled flour, oven quirks).
  • The Entanglement: The mistakes are now baked into the cake. You can't separate the "recipe" from the "spilled flour" anymore.
  • The Confusion: If you see a cake with a weird shape, is it because the recipe was weird, or because the oven was weird? You can't tell.
  • Learning: The student trying to guess the recipe is stuck in a messy, confusing landscape. Every time they try a guess, the random mistakes (noise) confuse them. They have to take many more steps, try many more guesses, and get stuck at "saddle points" (flat stretches where progress stalls) before they get close to the truth.

The Paper's Insight: Because the "Reverse" direction is mathematically messier (the noise is tangled with the signal), the computer takes more steps to learn it. The "Forward" direction is cleaner, so it learns faster.
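The kitchen analogy has a direct statistical counterpart: in an additive noise model, the forward residuals look like the original independent noise, while the reverse residuals stay entangled with the input. Here is a rough numpy check; the polynomial fits and the squared-residual correlation are illustrative choices for the sketch, not the paper's test.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 2000)
noise = 0.3 * rng.normal(0, 1, 2000)
y = x ** 3 + noise        # additive noise model: Y = f(X) + N, N independent of X

# Forward: residuals after fitting y on x recover the clean, independent noise.
fwd_fit = np.poly1d(np.polyfit(x, y, 5))
fwd_resid = y - fwd_fit(x)

# Reverse: residuals after fitting x on y still depend on y (the "baked-in flour").
rev_fit = np.poly1d(np.polyfit(y, x, 5))
rev_resid = x - rev_fit(y)

# Crude dependence check: correlate squared residuals with the squared input.
fwd_dep = abs(np.corrcoef(x ** 2, fwd_resid ** 2)[0, 1])
rev_dep = abs(np.corrcoef(y ** 2, rev_resid ** 2)[0, 1])
```

The forward dependence score hovers near zero (the noise really is separate), while the reverse score is clearly larger, which is exactly the entanglement the messy-kitchen story describes.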


The "Speedometer" of Causality

The authors call this signal Causal Computational Asymmetry (CCA).

  • If the computer learns quickly: You are looking at Cause → Effect.
  • If the computer learns slowly: You are looking at Effect → Cause.

It's like a speedometer. The "speed" of learning tells you the direction of the arrow.


The Rules of the Game (Boundary Conditions)

The paper is very honest about when this trick doesn't work. It's like a magic trick that fails if you don't follow the instructions:

  1. No Linear Relationships: If the relationship is a straight line (like Y = 2X), the "mess" looks the same in both directions. The trick fails. It needs a curve (non-linear) to work.
  2. No "One-to-Many" Maps: If two different causes can produce the exact same effect (like a broken lock that opens with any key), the reverse direction becomes impossible to solve. The trick fails.
  3. Must Normalize Data: You have to "level the playing field." If one variable is measured in "millions" and the other in "ones," the computer gets confused by the size of the numbers, not the logic. The paper insists on z-scoring (standardizing) the data first.
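Rule 3 (z-scoring) is a one-liner in practice; the variable names below are invented for the example.

```python
import numpy as np

def z_score(v):
    """Standardize to mean 0, std 1, so neither variable's scale dominates."""
    return (v - v.mean()) / v.std()

sales_millions = np.array([1.2e6, 2.5e6, 0.8e6, 3.1e6])   # measured in millions
temp_celsius = np.array([21.0, 30.0, 18.0, 33.0])          # measured in ones

s, t = z_score(sales_millions), z_score(temp_celsius)
```

After this step both variables live on the same scale, so any remaining learning-speed difference reflects the logic of the relationship, not the units.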

The Big Picture: CCL (The Full Toolkit)

The authors didn't just stop at the speed test. They built a whole framework called Causal Compression Learning (CCL).

Think of CCL as a Swiss Army Knife for causal discovery:

  • The Blade (CCA): The speed test we just discussed.
  • The Screwdriver (MDL): A tool that prefers simple explanations over complex ones (Occam's Razor).
  • The Pliers (Information Bottleneck): A tool that squeezes out useless information and keeps only the causal stuff.
  • The Handle (Reinforcement Learning): A tool that learns how to act in the world based on what it discovers.

They proved mathematically that if you use all these tools together, you can learn the structure of the world much faster and with less data than previous methods.
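To give a flavor of the MDL "screwdriver", here is a standard BIC-style approximation of two-part code length; the `mdl_score` helper and the polynomial-degree setup are illustrative assumptions, not the paper's exact scoring function. The winner is the model that minimizes parameter cost plus residual cost, i.e. Occam's Razor made numeric.

```python
import numpy as np

def mdl_score(x, y, degree):
    """BIC-style two-part code length: model cost + data-fit cost.
    Lower is better (a simpler model that still fits the data)."""
    n = len(x)
    coeffs = np.polyfit(x, y, degree)
    mse = np.mean((y - np.polyval(coeffs, x)) ** 2) + 1e-12
    model_bits = (degree + 1) / 2 * np.log(n)   # cost of describing the model
    data_bits = n / 2 * np.log(mse)             # cost of describing the residuals
    return model_bits + data_bits

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
y = 2 * x ** 2 + 0.1 * rng.normal(0, 1, 500)    # the truth is quadratic

scores = {d: mdl_score(x, y, d) for d in (1, 2, 8)}
best = min(scores, key=scores.get)               # degree 2 wins
```

The degree-1 model pays a huge residual cost, the degree-8 model pays an unnecessary parameter cost, and the true quadratic lands in between with the shortest total description.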


Why Does This Matter?

In the real world, getting the direction wrong is dangerous:

  • Medicine: If you think a biomarker causes a disease, you might try to lower the biomarker. But if the disease actually causes the biomarker, you are wasting time and potentially hurting patients.
  • Economics: If you think building more hospitals causes higher death rates, you might stop building them. But actually, sick people go to hospitals. You need to know the direction to fix the problem.

The Bottom Line:
This paper says: "Cause is easier to learn than Effect."
By simply measuring how fast a neural network learns a relationship, we can infer which way the causal arrow points. It turns a philosophical question ("What caused what?") into a practical engineering problem ("How fast does the computer learn?").
