An Orthogonal Learner for Individualized Outcomes in Markov Decision Processes

This paper introduces the DRQ-learner, a novel meta-learner for estimating individualized potential outcomes in Markov Decision Processes. It combines double robustness, Neyman orthogonality, and quasi-oracle efficiency, and it outperforms existing state-of-the-art methods for sequential decision-making.

Emil Javurek, Valentyn Melnychuk, Jonas Schweisthal, Konstantin Hess, Dennis Frauen, Stefan Feuerriegel

Published Tue, 10 Ma

Imagine you are a doctor trying to decide the best treatment plan for a cancer patient. You have a massive notebook of past patient records (observational data), but you never actually tried every possible treatment on every patient. Some patients got high doses, some got low doses, some got them early, some late.

Now, you want to predict: "If we had given this specific patient a different sequence of treatments, how would they have done?"

This is the core problem the paper solves. It's like trying to figure out the outcome of a game you didn't play, just by watching replays of games where the players made different moves.

The Problem: The "Horizon" Trap

In the world of AI and medicine, we call this a Markov Decision Process (MDP). It's a sequence of decisions where today's choice affects tomorrow's state.
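In code, an MDP is just a state, a set of actions, and a transition rule that says how today's action shapes tomorrow's state. Here is a minimal toy sketch; the states, actions, and probabilities are invented for illustration and are not from the paper.

```python
import random

# A toy two-state medical MDP: today's action changes tomorrow's state.
# All states, actions, and probabilities are illustrative.
STATES = ["healthy", "sick"]
ACTIONS = ["treat", "wait"]

def step(state, action, rng):
    """Sample the next state given the current state and action."""
    if state == "sick" and action == "treat":
        p_healthy = 0.7   # treating a sick patient often helps
    elif state == "sick":
        p_healthy = 0.2   # waiting rarely helps
    else:
        p_healthy = 0.9   # healthy patients usually stay healthy
    return "healthy" if rng.random() < p_healthy else "sick"

rng = random.Random(0)
state = "sick"
trajectory = [state]
for _ in range(5):
    state = step(state, "treat", rng)   # follow the "always treat" policy
    trajectory.append(state)
print(trajectory)
```

The key Markov property is visible in the signature of `step`: the next state depends only on the current state and action, not on the full history.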

The paper points out a huge problem with existing methods: The Curse of the Horizon.

Think of it like a game of "Telephone."

  • If you want to know what happens after one step, it's easy.
  • If you want to know what happens after ten steps, you have to guess the outcome of step 1, then use that guess to guess step 2, and so on.
  • By step 10, your guess is so full of errors that it's basically nonsense.

Existing AI methods try to fix this, but they often rely on "naive" tricks. They take the data they have and just plug it into a formula. The problem is, if their initial guess about how the world works is slightly wrong, that error gets amplified massively over time. It's like trying to build a skyscraper on a foundation that's slightly crooked; the higher you go, the more the building leans until it collapses.
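The compounding above can be made concrete with a tiny numerical sketch. Suppose the quantity we care about grows by a fixed factor each step, and our learned one-step model is off by about 1%; rolling that model forward feeds each guess into the next, so the error grows with the horizon. The growth rates here are made up for illustration.

```python
# Toy illustration of the "curse of the horizon": a one-step model with a
# small multiplicative bias is applied recursively, and the bias compounds.
true_rate = 1.05    # true one-step growth factor (invented)
model_rate = 1.06   # learned model, off by roughly 1%

def rollout(rate, horizon, start=1.0):
    value = start
    for _ in range(horizon):
        value *= rate    # each step feeds the previous guess back in
    return value

for horizon in (1, 10, 50):
    truth = rollout(true_rate, horizon)
    guess = rollout(model_rate, horizon)
    rel_err = abs(guess - truth) / truth
    print(f"horizon={horizon:3d}  relative error={rel_err:.1%}")
```

A roughly 1% one-step error turns into a double-digit-percent error by step 10 and keeps growing: exactly the "Telephone" effect described above.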

The Solution: The "Orthogonal Learner" (DRQ-learner)

The authors introduce a new method called the DRQ-learner. To understand it, let's use a metaphor.

Imagine you are trying to hit a moving target (the true medical outcome) while standing on a wobbly boat (the noisy, imperfect data).

  • Old Methods: You try to aim your gun directly at the target. But because the boat is rocking (estimation errors), your aim is off. If the boat rocks a little, your bullet misses by a mile.
  • The DRQ-learner: This method is like a gyro-stabilized gun. It is designed so that the rocking of the boat (errors in the "nuisance" functions, like guessing the probability of a patient taking a certain drug) doesn't shake the aim of the gun.

The paper calls this Neyman-Orthogonality. In plain English, it means the method is insensitive to small mistakes in the parts of the model that aren't the main focus. Even if your guess about the "background noise" is slightly wrong, your final prediction for the patient's outcome remains accurate.
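Neyman orthogonality can be seen numerically in a one-step sketch (not the paper's sequential estimator): perturb both nuisance guesses by a small amount `delta`, and an orthogonal score's error shrinks like `delta**2`, while a naive plug-in estimator's error shrinks only like `delta`. The simulation setup below is invented for illustration.

```python
import numpy as np

# Sketch of Neyman orthogonality in a one-step setting: the orthogonal
# (AIPW-style) score's bias scales with the PRODUCT of the two nuisance
# errors, so small mistakes barely move it. Setup is illustrative.
rng = np.random.default_rng(1)
n = 200_000
x = rng.uniform(size=n)
e = np.full(n, 0.5)                         # true treatment probability
a = rng.binomial(1, e)                      # observed treatment
y = x + a + rng.normal(scale=0.1, size=n)   # so true E[Y(1) | x] = x + 1
truth = x.mean() + 1.0                      # target: mean outcome under treatment

for delta in (0.0, 0.05, 0.10):
    e_hat = e + delta                       # slightly wrong behavior model
    mu_hat = x + 1.0 + delta                # slightly wrong outcome model
    ipw = (a * y / e_hat).mean()            # plug-in: error grows like delta
    ortho = (mu_hat + a / e_hat * (y - mu_hat)).mean()  # error ~ delta**2
    print(f"delta={delta:.2f}  plug-in err={abs(ipw - truth):.4f}  "
          f"orthogonal err={abs(ortho - truth):.4f}")
```

The correction term `a / e_hat * (y - mu_hat)` is what stabilizes the aim: each nuisance error is multiplied by the other, so both must be large before the estimate moves noticeably.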

Why is it "Doubly Robust"?

The name DRQ stands for Doubly Robust Q-learner.

Think of it like a safety net with two layers:

  1. Layer 1: You have a model that predicts how patients behave (e.g., "If a patient has symptom X, they usually get drug Y").
  2. Layer 2: You have a model that predicts the outcome (e.g., "If they get drug Y, they recover").

In the old methods, if either model was wrong, your prediction failed.
In the DRQ-learner, you only need one of them to be right.

  • If your behavior model is perfect but your outcome model is sloppy? You still get the right answer.
  • If your outcome model is perfect but your behavior model is sloppy? You still get the right answer.
  • You only fail if both are wrong. This makes it incredibly reliable for high-stakes decisions like medicine.
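The two-layer safety net can be sketched in a one-step setting with the classic doubly robust (AIPW) construction; the paper's DRQ-learner applies this kind of correction at every time step, but the single-step version already shows the effect. The data-generating process and the deliberately biased outcome model below are invented for illustration.

```python
import numpy as np

# One-step doubly robust (AIPW) estimate of the mean outcome under
# treatment, E[Y(1)]. The outcome model is deliberately wrong, yet the
# estimate recovers because the behavior model is right.
rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(size=n)
prop = 0.2 + 0.6 * x                  # true (known-here) behavior model
a = rng.binomial(1, prop)             # observed treatment
y = 2.0 * x + a + rng.normal(scale=0.1, size=n)  # true E[Y(1) | x] = 2x + 1

mu_hat = 2.0 * x + 1.5                # sloppy outcome model, biased by +0.5
plug_in = mu_hat.mean()               # plug-in estimate inherits the bias
dr = (mu_hat + a / prop * (y - mu_hat)).mean()  # correction cancels it

truth = 2.0 * x.mean() + 1.0
print(f"truth ≈ {truth:.3f}, plug-in = {plug_in:.3f}, DR = {dr:.3f}")
```

Swapping the roles works too: with a correct outcome model and a sloppy behavior model, the correction term averages out to zero and the estimate is again right. Only when both layers fail does the net tear.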

The "Quasi-Oracle" Superpower

Finally, the paper claims the method achieves Quasi-Oracle Efficiency.

Imagine an "Oracle": a magical being who already knows the true nuisance functions, i.e., exactly how patients behave and how outcomes unfold.

  • Old methods: Without the Oracle, their errors in estimating those functions leak directly into the final answer.
  • DRQ-learner: This method performs almost as well as if you had the Oracle's help, even though you don't. Its first-stage estimation errors wash out, so it extracts close to the maximum possible information from the data you have.

Summary: What did they actually do?

  1. Reframed the problem: They looked at medical decision-making not just as a math problem, but as a Causal Inference problem (figuring out cause-and-effect).
  2. Found the flaw: They proved that current "plug-in" methods (just plugging data into formulas) are inherently biased and unstable over long time periods.
  3. Built a better tool: They created a new algorithm (DRQ-learner) that uses a special mathematical "de-biasing" technique.
  4. Proved it works: They showed mathematically that this tool is stable, accurate, and resistant to errors.
  5. Tested it: They ran simulations (using a "Taxi" driving game and a "Frozen Lake" game) and proved their new tool beats the current state-of-the-art, especially in difficult situations where data is scarce or the time horizon is long.

In a nutshell: The paper gives doctors and AI researchers a new, super-reliable calculator for predicting the long-term effects of treatments, one that doesn't fall apart just because the data is a little messy or the time horizon is long. It's a major step toward safer, personalized medicine.