Imagine you are trying to teach a robot how to drive a car. You show it thousands of videos of cars driving on sunny days. The robot learns that "when the sky is blue, the car moves forward."
But here's the problem: The robot didn't learn that pressing the gas pedal makes the car move. It learned that blue skies make the car move. Why? Because in all your training videos, the sky was always blue. The robot confused a background feature (the weather) with the actual cause (the gas pedal).
If you then ask this robot to drive on a rainy day, it panics. It thinks, "No blue sky? No driving!" and crashes. This is what happens with current AI models (Transformers). They are brilliant at spotting patterns, but they are terrible at understanding cause and effect. They see correlations (things happening together) and mistake them for laws of nature.
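The blue-sky confusion is easy to reproduce. In the toy sketch below (our example, not from the paper), every training clip is sunny and has the pedal pressed, so the two features are statistically indistinguishable, and a plain least-squares fit splits the credit between them:

```python
import numpy as np

# Toy illustration (our example, not from the paper): in every training
# clip the sky is blue AND the pedal is pressed, so the two features
# carry identical information. Least squares cannot tell them apart and
# (as the minimum-norm solution) splits the credit 50/50.

# features: [sky_is_blue, pedal_pressed]; label: 1.0 = car moves
X_train = np.ones((1000, 2))     # every clip: sunny, pedal down
y_train = np.ones(1000)          # every clip: the car moves

w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
print(w)                         # -> [0.5 0.5]: half the credit to the sky!

rainy_day = np.array([0.0, 1.0]) # no blue sky, pedal pressed
print(rainy_day @ w)             # -> 0.5: the model hesitates
```

With the confound broken at test time (rain), the model only half-believes the car will move, even though the pedal is down.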
The Solution: OrthoFormer
The paper introduces OrthoFormer, a new type of AI architecture designed to fix this "confusion." It forces the AI to stop looking at the "blue sky" and start looking at the "gas pedal."
Here is how it works, using simple analogies:
1. The Problem: The "Static Background" vs. The "Dynamic Flow"
Think of a person telling a story.
- The Static Background: Their accent, their voice pitch, and their personality. These don't change during the story.
- The Dynamic Flow: The actual plot of the story. What happens next depends on what happened before.
Standard AI models get lazy. They notice that "People with a British accent tend to tell stories about castles." They learn the accent (static) predicts the castle (dynamic). But if you ask them to tell a story about a spaceship, they fail because they never learned the logic of storytelling, only the style of the speaker.
OrthoFormer is designed to separate the Accent (the noise/confounder) from the Plot (the true cause).
2. The Tool: "Time Travel" as a Detective
In economics, there is a clever trick called an Instrumental Variable (IV). Imagine you want to know whether studying causes good grades. The problem: naturally bright kids may both study more and score higher, so a hidden factor drives both, and it's hard to tell what causes what.
The trick? Look at something that happened before the studying, like "Did the kid have a quiet room yesterday?"
- Having a quiet room doesn't directly produce a good grade; the only way it can affect the grade is by making studying more likely.
- But it does make it more likely you will study.
- Because the quiet room happened before the studying, it can't be influenced by the studying.
OrthoFormer uses Time Travel as its detective tool. It looks at the AI's memory from two steps ago (the "quiet room") to predict what happens now. By forcing the AI to use this "time-delayed" memory as a clue, it strips away the confusing background noise and isolates the true cause.
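The lagged-state trick can be sketched in a few lines. This is our own toy construction (variable names and numbers are ours, not the paper's): a confounder corrupts the naive fit of the true effect, but the state from two steps back, decided before the confounder existed, recovers it:

```python
import numpy as np

# Toy instrumental-variable sketch (our construction, not the paper's
# code). The true effect of x_t on y_t is 2.0, but a confounder u_t is
# tied to x_t's current shock, so a naive fit is biased. The state two
# steps back is the "quiet room": it predicts x_t but cannot be touched
# by today's confounder.

rng = np.random.default_rng(0)
n = 100_000
e = rng.normal(size=n)                  # shocks driving x
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.8 * x[t - 1] + e[t]
u = 0.6 * e + 0.3 * rng.normal(size=n)  # confounder tied to x's shock
y = 2.0 * x + u

x_now, y_now, z = x[2:], y[2:], x[:-2]  # z = state two steps ago

beta_naive = np.cov(x_now, y_now)[0, 1] / np.var(x_now)
beta_iv = np.cov(z, y_now)[0, 1] / np.cov(z, x_now)[0, 1]

print(f"naive fit: {beta_naive:.2f}, IV fit: {beta_iv:.2f}")
```

The naive slope overshoots 2.0 because it absorbs the confounder; the time-delayed instrument lands back on the true effect.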
3. The Architecture: The "Two-Stage Interrogation"
OrthoFormer doesn't just guess; it runs a strict two-step interrogation process:
- Step 1 (The Setup): The AI looks at the "time-travel clue" (the past state) and tries to predict the current state. It calculates the difference (the "residual"). Think of this as the AI saying, "Based on the past, I expected this. But the actual result was that. The difference is the 'noise' or the 'confusion'."
- Step 2 (The Truth): The AI then tries to predict the final answer using the "time-travel clue" AND the "noise" it just calculated.
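The two steps above can be sketched as a single forward pass. This is a hypothetical minimal sketch under our own naming (the paper's actual layers and dimensions will differ):

```python
import numpy as np

# Hypothetical sketch of the two-stage forward pass (our names, not the
# paper's). h_past is the "time-travel clue", h_now the current state.

rng = np.random.default_rng(0)
d = 4
W_stage1 = rng.normal(size=(d, d))    # detective: past -> expected present
W_stage2 = rng.normal(size=(2 * d,))  # judge: [clue, residual] -> answer

def orthoformer_step(h_past, h_now):
    # Step 1: predict the present from the lagged state; the gap is the
    # "noise"/confusion the clue could not explain.
    expected_now = W_stage1 @ h_past
    residual = h_now - expected_now
    # Step 2: predict the final answer from the clue AND the residual.
    features = np.concatenate([h_past, residual])
    return W_stage2 @ features

y = orthoformer_step(rng.normal(size=d), rng.normal(size=d))
```

Note what the model is forced to do: anything the clue could already explain goes through Stage 1, and only the leftover "surprise" is handed to Stage 2.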
The Critical Rule (The "Gradient Detach"):
Here is the most important part. Step 1 is never allowed to peek at Step 2's feedback: the errors the judge makes must not flow backward and reshape the detective's report.
- The Analogy: Imagine a detective (Step 1) who writes a report. Then a judge (Step 2) reads the report and gives a verdict.
- The Mistake: If the detective can see the judge's verdict while writing the report, the detective will cheat. They will write a report that makes the judge happy, rather than the truth.
- OrthoFormer's Fix: The paper calls this the "Neural Forbidden Regression." It cuts the gradient connection during training so the detective (Step 1) cannot change their report to please the judge (Step 2). This ensures the "noise" calculated is real, not faked to lower the error score.
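In autodiff frameworks this cut is literally one call. The sketch below (our toy, not the paper's code) uses PyTorch's `Tensor.detach()`: the residual is passed to Stage 2 as a constant, so Stage 2's loss cannot push on Stage 1's weights:

```python
import torch

# Toy illustration of the gradient cut (our minimal sketch, not the
# paper's implementation). Stage 1 ("the detective") estimates the
# current state from a lagged clue and computes a residual; Stage 2
# ("the judge") consumes that residual through .detach(), so gradients
# from the judge's loss never reach the detective's weights.

x_lag = torch.tensor([1.0])    # the time-travel clue
x_now = torch.tensor([2.0])    # the current state
target = torch.tensor([5.0])   # the final answer

w1 = torch.tensor([0.5], requires_grad=True)  # Stage-1 weight
w2 = torch.tensor([0.5], requires_grad=True)  # Stage-2 weight

# Stage 1: predict the present from the clue; the gap is the "noise".
residual = x_now - w1 * x_lag

# Stage 2: predict the target from the clue AND the (detached) residual.
pred_target = w2 * (x_lag + residual.detach())

loss1 = residual.pow(2).mean()                # detective's own loss
loss2 = (target - pred_target).pow(2).mean()  # judge's loss
(loss1 + loss2).backward()

# w1's gradient comes only from loss1; without .detach(), loss2 would
# also push on w1 and "fake" the residual to please the judge.
print(w1.grad)  # -> tensor([-3.])
```

Remove the `.detach()` and `w1.grad` picks up an extra term from `loss2`: the detective starts editing the report to flatter the judge.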
The Big Trade-off: The "Trilemma"
The paper discovers a three-way struggle, like trying to balance a triangle:
- Exogeneity (Purity): How "clean" is your time-travel clue? (Going further back in time makes it cleaner).
- Relevance (Strength): How strong is the link between the clue and the answer? (Going too far back makes the link weak).
- Variance (Stability): How stable are the model's estimates? (A weak link between clue and answer makes the estimates jump around wildly.)
OrthoFormer teaches us that you can't have it all. You have to pick the perfect "time delay" to get the best balance.
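The first two corners of the triangle can be made concrete with a small simulation. This is our own hypothetical setup (not an experiment from the paper): a short-memory confounder contaminates a persistent state, and as the lag grows, the clue gets cleaner but weaker:

```python
import numpy as np

# Hypothetical simulation (our toy setup, not the paper's experiment) of
# the lag trade-off: pushing the "time-travel clue" further back scrubs
# away the confounder (better Exogeneity) but also weakens its grip on
# the present (worse Relevance) -- and a weak clue is what makes the
# final estimate unstable (worse Variance).

rng = np.random.default_rng(0)
n = 50_000
u = np.zeros(n)  # short-memory confounder
x = np.zeros(n)  # persistent observed state
for t in range(1, n):
    u[t] = 0.5 * u[t - 1] + rng.normal()
    x[t] = 0.9 * x[t - 1] + 0.5 * u[t] + rng.normal()

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

stats = {}
for k in (1, 3, 6):
    relevance = corr(x[:-k], x[k:])           # clue vs. present state
    contamination = abs(corr(x[:-k], u[k:]))  # clue vs. current confounder
    stats[k] = (relevance, contamination)
    print(f"lag {k}: relevance {relevance:.2f}, "
          f"contamination {contamination:.2f}")
```

Both numbers fall as the lag grows; the art is stopping at the lag where contamination is already negligible but relevance is still strong.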
Why Does This Matter?
- Robustness: If you train OrthoFormer on sunny days, it will still work on rainy days because it learned the mechanism (gas pedal), not the correlation (blue sky).
- Reliability: It prevents the AI from making catastrophic mistakes when the world changes (Out-of-Distribution failure).
- Truth: It stops the AI from lying to itself by finding easy shortcuts (spurious correlations) and forces it to learn the hard, true rules of how the world works.
Summary
OrthoFormer is a new AI architecture that acts like a skeptical scientist. Instead of just memorizing patterns, it uses "time-travel clues" to separate the real causes from the background noise. It enforces strict rules to ensure it doesn't cheat, resulting in an AI that can make better decisions even when the world changes in unexpected ways.