Robust Counterfactual Inference in Markov Decision Processes

This paper proposes a novel, scalable, non-parametric approach that computes tight bounds on counterfactual transition probabilities across all causal models compatible with the observed data in a Markov Decision Process. These bounds enable the identification of robust counterfactual policies that optimize worst-case reward, without relying on prohibitively expensive optimization or on fixed causal assumptions.

Jessica Lally, Milad Kazemi, Nicola Paoletti

Published 2026-03-04

Imagine you are a doctor who just treated a patient. The patient took a specific medicine, and unfortunately, they didn't get better. You want to know: "What would have happened if I had given them a different medicine instead?"

This is called counterfactual inference: thinking about "what if" scenarios. In Artificial Intelligence (AI), systems that make decisions over time (like self-driving cars or medical treatment plans) are typically modeled as Markov Decision Processes (MDPs).

The problem with current AI methods is that they try to answer "what if" by guessing a single, specific story about how the world works. But the real world is messy. There are many different stories (causal models) that could explain why the patient got sick, and each story leads to a different "what if" answer. Relying on just one guess is risky, especially in life-or-death situations.

This paper introduces a new, smarter way to handle these "what if" questions. Here is the breakdown using simple analogies:

1. The Problem: The "One-Story" Trap

Imagine you are a detective trying to solve a crime. You have a witness who saw a car crash.

  • Old Method: The detective picks one theory (e.g., "The driver was texting") and builds the entire case on it. If that theory is wrong, the whole case collapses.
  • The Reality: There are many theories that fit the evidence (texting, speeding, a tire blowout). Each theory suggests a different outcome if the driver had acted differently.
  • The Risk: In safety-critical fields (like healthcare or aviation), guessing the wrong theory could lead to dangerous advice.

2. The Solution: The "Fog of War" Map

Instead of picking one story, the authors propose looking at all possible stories that fit the evidence at the same time.

Think of it like navigating through a thick fog.

  • Old Method: You draw a single, thin line on a map saying, "The road goes exactly here." If you step off that line, you might fall off a cliff.
  • New Method: You draw a wide, shaded corridor (an interval) on the map. You say, "The road is somewhere in this wide area." You don't know the exact path, but you know the boundaries.

This "corridor" is called an Interval Counterfactual MDP. It doesn't give you a single number for the future; it gives you a range (a best-case and a worst-case scenario) that covers every plausible reality.
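The "corridor" idea can be sketched as a data structure: instead of storing one transition probability per (state, action, next-state) triple, we store a lower and an upper bound on it. A minimal sketch (the class and field names are illustrative, not taken from the paper):

```python
from dataclasses import dataclass

@dataclass
class IntervalMDP:
    """An MDP whose transition probabilities are intervals, not points.

    lower[s][a][s2] and upper[s][a][s2] bound the probability of moving
    from state s to state s2 under action a, across all causal models
    consistent with the observed data.
    """
    n_states: int
    n_actions: int
    lower: list  # lower[s][a][s2], each value in [0, 1]
    upper: list  # upper[s][a][s2], each value in [0, 1]

    def is_valid(self) -> bool:
        """Check each interval set admits at least one true distribution:
        the lower bounds must sum to <= 1 and the upper bounds to >= 1
        for every (state, action) pair."""
        for s in range(self.n_states):
            for a in range(self.n_actions):
                if not (sum(self.lower[s][a]) <= 1.0 <= sum(self.upper[s][a])):
                    return False
        return True
```

The validity check matters: an interval map is only a meaningful "corridor" if at least one genuine probability distribution fits inside every set of bounds.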

3. The Secret Sauce: The "Magic Formula"

Usually, calculating the boundaries of all these possible worlds is like trying to solve a puzzle with a billion pieces. It takes supercomputers ages to figure it out.

The authors discovered a mathematical shortcut (a closed-form expression).

  • Analogy: Imagine trying to find the highest and lowest points in a mountain range.
    • Old way: You hire a team to climb every single hill and valley to measure them. (Takes forever).
    • New way: You realize the mountains follow a specific pattern. You use a simple formula to instantly calculate the highest peak and the deepest valley without climbing a single one.

This makes their method 4 to 251 times faster than previous methods, allowing it to be used on large, complex systems like hospital networks or aircraft control.
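The flavor of such a closed-form shortcut can be illustrated for a single transition with classical Fréchet inequalities. Suppose we observed outcome y under action a, and ask how likely outcome y' would have been under the alternative action a'. Without further assumptions on the causal model, the joint probability of both events is only known to lie in a range, which yields interval bounds in one line of arithmetic. This is an illustrative sketch of the idea, not the paper's exact formula:

```python
def counterfactual_bounds(p_y_a: float, p_yp_ap: float) -> tuple:
    """Fréchet-style bounds on P(outcome y' under a' | observed y under a).

    p_y_a   : observed probability P(y | a), must be > 0 (we condition on it)
    p_yp_ap : observed probability P(y' | a')

    Over all causal models matching these marginals, the joint
    P(Y_a = y, Y_a' = y') lies between max(0, p_y_a + p_yp_ap - 1)
    and min(p_y_a, p_yp_ap); dividing by P(y | a) conditions on
    what was actually observed.
    """
    joint_lo = max(0.0, p_y_a + p_yp_ap - 1.0)
    joint_hi = min(p_y_a, p_yp_ap)
    return joint_lo / p_y_a, joint_hi / p_y_a

# Example: treatment a failed, and failures happen with probability 0.7;
# the alternative a' succeeds with probability 0.6. Given the observed
# failure, how likely would success under a' have been?
lo, hi = counterfactual_bounds(0.7, 0.6)  # lo = 0.3/0.7, hi = 0.6/0.7
```

No enumeration of causal models is needed: the endpoints come straight from the observed probabilities, which is what makes a closed-form approach so much cheaper than search.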

4. The Result: The "Cautious Captain"

Once they have this "foggy corridor" map, they use a strategy called Pessimistic Value Iteration.

  • The Metaphor: Imagine a ship captain navigating a stormy sea where the map is blurry.
    • A reckless captain might sail straight for the treasure, hoping the map is right.
    • A robust captain (the one this paper creates) assumes the worst-case scenario: "What if the fog is thickest right here? What if the rocks are closer than they look?"
    • The captain charts a course that guarantees safety and success even in the worst possible version of the fog.
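The "cautious captain" strategy is standard robust dynamic programming over interval transition bounds: for each action, evaluate the worst transition distribution the intervals allow, then pick the action with the best worst case. The worst distribution can be found greedily by pushing as much probability mass as the intervals permit toward low-value successor states. A compact sketch under assumed data layouts (nested lists for bounds and rewards; none of these names come from the paper):

```python
def worst_case_expectation(lower, upper, values):
    """Among distributions within [lower, upper] bounds (summing to 1),
    find the one minimising expected value: fill low-value states first."""
    p = list(lower)                       # start from the lower bounds
    budget = 1.0 - sum(p)                 # probability mass left to assign
    for i in sorted(range(len(values)), key=lambda i: values[i]):
        extra = min(upper[i] - lower[i], budget)
        p[i] += extra
        budget -= extra
    return sum(p[i] * values[i] for i in range(len(values)))

def pessimistic_value_iteration(lower, upper, rewards, gamma=0.9, iters=200):
    """Robust value iteration on interval bounds:
    V(s) = max over actions of [ r(s, a) + gamma * worst-case E[V] ].

    lower/upper[s][a][s2] : interval transition bounds
    rewards[s][a]         : immediate reward for action a in state s
    """
    n_states, n_actions = len(rewards), len(rewards[0])
    V = [0.0] * n_states
    for _ in range(iters):
        V = [max(rewards[s][a] + gamma * worst_case_expectation(
                     lower[s][a], upper[s][a], V)
                 for a in range(n_actions))
             for s in range(n_states)]
    return V
```

The inner minimisation is what makes the captain "cautious": every action is scored against the least favourable reality the fog still allows, so the returned values are guarantees, not hopes.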

Why Does This Matter?

  • Safety: In healthcare, if an AI suggests a treatment change, we want to be sure it won't hurt the patient even if our understanding of the disease is slightly off. This method guarantees that.
  • Speed: Because it's so fast, it can be used in real-time systems, not just in slow research labs.
  • Trust: It admits what it doesn't know. Instead of pretending to have a single "truth," it gives an honest range of possibilities, which is much more useful for human decision-makers.

Summary

The paper teaches AI to stop guessing a single "what if" story and instead calculate the entire range of possible "what if" stories. It does this using a clever math trick that is incredibly fast, allowing the AI to make decisions that are safe and reliable, even when the future is uncertain. It's the difference between betting on a single horse and buying tickets for every horse in the race that has a chance of winning.
