Machine Learning for Stress Testing: Uncertainty Decomposition in Causal Panel Prediction

This paper proposes a novel framework for causal panel prediction in regulatory stress testing that decomposes uncertainty into estimation and confounding components. It combines iterated regression, bounded confounding identification, horizon-dependent error bounds, and conformal calibration to enable robust counterfactual inference without requiring a control group.

Yu Wang, Xiangchen Liu, Siguang Li

Published 2026-03-10

Imagine you are a bank manager. Every year, regulators (like the Federal Reserve) ask you a scary question: "If the economy crashes and unemployment skyrockets, how much money will we lose on loans?"

This is called Stress Testing.

Currently, banks try to answer this by looking at past data and drawing a straight line into the future. They say, "Unemployment went up 1% last time, so losses went up 5%. If it goes up 5% this time, losses will go up 25%."
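
In code, that "straight line" is a single slope scaled up to the scenario. Here is a minimal sketch, using the made-up numbers from the quote above:

```python
# The "straight line" approach, with the made-up numbers from the quote above.
past_unemployment_rise = 1.0  # percentage points
past_loss_rise = 5.0          # percentage points

slope = past_loss_rise / past_unemployment_rise  # 5 points of loss per point of unemployment

scenario_unemployment_rise = 5.0
predicted_loss_rise = slope * scenario_unemployment_rise
print(predicted_loss_rise)  # 25.0 -- one number, no honesty about what could go wrong
```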

The Problem: This approach has a hidden flaw. It assumes that only unemployment causes the losses. But in reality, when unemployment rises, it's usually because of a mix of bad things happening at once: the stock market is crashing, people are scared, and interest rates are changing. These are "confounders"—hidden factors that pull both unemployment and loan losses up together.

If you ignore these hidden factors, your prediction is just a guess with a false sense of confidence.

This paper proposes a new, smarter way to do stress testing. Think of it as upgrading from a crystal ball (which just guesses) to a safety harness with three layers of protection.

Here is how their new framework works, explained through simple analogies:

1. The "What We Know" vs. "What We Assume" (Causal Set Identification)

Imagine you are trying to predict how fast a car will go if you press the gas pedal harder.

  • Old Way: You look at past data where the driver pressed the pedal harder, and the car went faster. You assume the driver only pressed the pedal because they wanted to go faster.
  • The Reality: Maybe the driver pressed the pedal harder because they were being chased by a bear (a hidden factor). If you don't account for the bear, your prediction is wrong.

The New Solution: Instead of pretending the bear doesn't exist, the authors say, "Okay, let's admit we don't know exactly how strong the bear is."

  • They calculate a range of possibilities (a "set") rather than a single number.
  • They give you a "Breakdown Value." This is like a warning label: "Our conclusion is safe, UNLESS the hidden 'bear' is stronger than X." This tells regulators exactly how much hidden chaos they can tolerate before the prediction breaks (see the sketch after this list).
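
Here is a minimal sketch of that idea in Python, assuming a simple additive model where a confounder of strength `gamma` can shift the estimate by `sensitivity` per unit. The function names and the additive form are illustrative assumptions, not the paper's actual machinery:

```python
# Hypothetical sketch of set identification under bounded confounding.
# `sensitivity`, `gamma`, and the additive form are illustrative, not the paper's API.

def identified_set(point_estimate, sensitivity, gamma):
    """Range of causal effects consistent with a hidden confounder
    ("the bear") of strength at most `gamma`, when each unit of
    confounding can shift the estimate by `sensitivity`."""
    half_width = sensitivity * gamma
    return (point_estimate - half_width, point_estimate + half_width)

def breakdown_value(point_estimate, sensitivity):
    """Smallest confounder strength at which the identified set first
    touches zero -- beyond this, even the sign of the effect is in doubt."""
    return abs(point_estimate) / sensitivity

# Example: a +5.0 loss effect whose estimate shifts 2.0 per unit of confounding.
print(identified_set(5.0, 2.0, gamma=1.0))  # (3.0, 7.0): conclusion survives
print(breakdown_value(5.0, 2.0))            # 2.5: the "bear" must be this strong to break it
```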

2. The "Domino Effect" (Horizon-Dependent Error)

Stress testing isn't just about next month; it's about the next year.

  • The Analogy: Imagine you are trying to predict the weather for next week. If you make a tiny mistake predicting tomorrow's weather, that mistake gets bigger the next day, and even bigger the day after. By day 10, your prediction is useless. This is called error compounding.
  • The New Solution: The authors built a mathematical "speed limit" for how far ahead you can look.
    • If the economy is stable (the "dominoes" fall slowly), you can predict far into the future.
    • If the economy is volatile (the "dominoes" are falling fast), they tell you: "Stop here."
    • They provide a specific number (the "Amplification Factor") that tells you exactly how many months into the future your prediction is still reliable. If you try to go further, the math says, "Switch to a different method." (See the sketch after this list.)
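
Here is a hedged sketch of such a "speed limit," assuming one-step errors compound geometrically at a per-step amplification factor `lam`. The geometric model and all names are illustrative assumptions, not the paper's exact bound:

```python
import math

# Hedged sketch of a horizon "speed limit", assuming errors compound
# geometrically at a per-step amplification factor `lam`.

def error_bound(base_error, lam, horizon):
    """Worst-case error after `horizon` steps, when each step can amplify
    the previous step's error by a factor of `lam`."""
    return base_error * lam ** horizon

def max_reliable_horizon(base_error, lam, tolerance):
    """Longest horizon at which the compounded error stays under `tolerance`.
    If lam <= 1, errors never grow, so there is no finite limit."""
    if lam <= 1.0:
        return math.inf
    return math.floor(math.log(tolerance / base_error) / math.log(lam))

# Stable economy: dominoes fall slowly, you can look far ahead.
print(max_reliable_horizon(base_error=0.01, lam=1.05, tolerance=0.05))  # 32 steps
# Volatile economy: errors explode, so "stop here" after a few steps.
print(max_reliable_horizon(base_error=0.01, lam=1.5, tolerance=0.05))   # 3 steps
```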

3. The "Stress Test for the Prediction" (Conformal Calibration)

Even if your math is perfect, what if you are predicting a scenario that has never happened before? (Like a pandemic).

  • The Analogy: Imagine you have a weather model trained on 100 years of data. You ask it to predict a hurricane that is 10 times stronger than any in history. The model might give you an answer, but it's guessing wildly because it's never seen anything like that.
  • The New Solution: They use a technique called Conformal Calibration.
    • It acts like a "confidence meter."
    • If the stress scenario is similar to the past, the meter says, "High Confidence."
    • If the scenario is extreme (like a Black Swan event), the meter says, "I'm guessing. I'm not sure."
    • Crucially, it has an "Abstention Mechanism." If the scenario is too weird, the system refuses to give a specific number and instead says, "This is too risky to predict; here is a wide safety net instead." (A sketch of this abstention logic follows the list.)
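
Here is a minimal sketch of split conformal calibration with a crude abstention rule. The z-score novelty check and every name in it are illustrative stand-ins; the paper's actual abstention mechanism may differ:

```python
import numpy as np

# Minimal sketch of split conformal calibration with a crude abstention rule.
# The z-score novelty check and all names are illustrative stand-ins.

def conformal_interval(calib_residuals, prediction, alpha=0.1):
    """Split-conformal interval: pad the prediction by the (1 - alpha)
    quantile of absolute errors observed on a held-out calibration set."""
    n = len(calib_residuals)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample adjustment
    q = np.quantile(np.abs(calib_residuals), level)
    return prediction - q, prediction + q

def predict_or_abstain(scenario, calib_scenarios, calib_residuals,
                       prediction, alpha=0.1, novelty_limit=3.0):
    """Refuse to commit when the scenario sits too far (in standard
    deviations) from anything the model was calibrated on."""
    z = abs(scenario - calib_scenarios.mean()) / calib_scenarios.std()
    if z > novelty_limit:
        return "abstain: scenario too far outside calibration data"
    return conformal_interval(calib_residuals, prediction, alpha)

rng = np.random.default_rng(0)
calib_scenarios = rng.normal(5.0, 1.0, size=200)  # past unemployment shocks
calib_residuals = rng.normal(0.0, 0.5, size=200)  # past prediction errors

print(predict_or_abstain(5.5, calib_scenarios, calib_residuals, prediction=25.0))   # familiar -> interval
print(predict_or_abstain(15.0, calib_scenarios, calib_residuals, prediction=80.0))  # Black Swan -> abstain
```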

The Final Output: The "Three-Layer Cake"

Instead of giving you one scary number (e.g., "We will lose $1 billion"), this framework gives you a Three-Layer Cake of Uncertainty, combined in the sketch after this list:

  1. The Filling (Estimation Uncertainty): "We are pretty sure about the data we have." (This is the standard error from having limited data).
  2. The Frosting (Confounding Uncertainty): "We are less sure because of hidden factors like the 'bear'." (This is the range caused by things we can't see).
  3. The Wrapper (Extrapolation Risk): "We are very unsure because this scenario is new." (This is the warning if you are predicting something extreme).
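
As a toy illustration, here is how the three layers could stack into one reported range, assuming they simply add. The additive combination and every number below are invented for illustration, not taken from the paper:

```python
# Toy sketch of the three-layer decomposition. The additive stacking and
# all numbers below are invented for illustration.

def three_layer_interval(point, estimation_se, confound_width, extrapolation_pad):
    """Widen the point estimate by each uncertainty layer in turn:
    estimation (limited data), confounding (hidden factors),
    extrapolation (novel scenario)."""
    half = 1.96 * estimation_se   # layer 1: the filling
    half += confound_width        # layer 2: the frosting
    half += extrapolation_pad     # layer 3: the wrapper
    return point - half, point + half

# "$1B expected loss" becomes a labeled range instead of one scary number.
low, high = three_layer_interval(point=1.0, estimation_se=0.05,
                                 confound_width=0.2, extrapolation_pad=0.3)
print(f"${low:.2f}B to ${high:.2f}B")  # $0.40B to $1.60B
```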

Why Does This Matter?

Currently, banks often give regulators a single number and hope for the best. If they are wrong, it's a disaster.

This new framework is honest. It admits what it doesn't know. It separates the "guessing" from the "calculating."

  • For Regulators: It gives a clear language to say, "Your model is safe unless the hidden risks are this big."
  • For Banks: It prevents them from making dangerous bets based on false confidence.

In short, this paper turns stress testing from a magic trick (pulling a number out of a hat) into a transparent engineering process where every layer of risk is measured, labeled, and understood.