The Big Idea: The "Uncontrollable Weather" Problem
Imagine you are playing a video game where you control a character, but the weather changes randomly every second. Sometimes it rains, sometimes it snows, and sometimes the sun shines.
In standard Reinforcement Learning (RL), the computer tries to learn how to win by treating everything as if it depends on your buttons. It thinks, "If I press 'Jump' while it's raining, I get a reward. If I press 'Jump' while it's sunny, I get a different reward." It tries to memorize the outcome of every single combination of Action + Weather.
The problem? The weather is exogenous. It doesn't care what you do. The rain isn't caused by your jump; it just happens. By trying to learn a relationship between your jump and the rain, the computer wastes massive amounts of time and data. It's like trying to learn how to drive a car by also trying to learn how to control the traffic lights, even though you have no control over them.
The Solution: The "PCMDP" (Partially Controllable MDP)
The authors of this paper propose a new way to teach computers: Split the world into two parts.
- The Endogenous Part (The Car): This is what you can control. Your steering wheel, your gas pedal, your inventory, your budget. You know how these work. If you turn the wheel left, the car goes left.
- The Exogenous Part (The Weather): This is what you cannot control. Stock prices, traffic jams, rain, or the next card in a deck. These things happen on their own, regardless of what you do.
The paper introduces a framework called PCMDP (Partially Controllable Markov Decision Process). It tells the AI: "Hey, stop trying to figure out how your actions change the weather. You already know how the weather changes (it's random). Just focus on learning how to drive the car."
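The split can be sketched in a few lines of code. This is an illustrative toy, not the paper's formal definition: the state is a pair (endogenous, exogenous), and the key property is that the action touches only the endogenous half while the exogenous half evolves on its own.

```python
import random

def step_endogenous(car_pos, action):
    """Known, controllable dynamics: the action moves the car."""
    return car_pos + (1 if action == "right" else -1)

def step_exogenous(weather, rng):
    """Uncontrollable dynamics: the weather changes no matter what you do."""
    return rng.choice(["sun", "rain", "snow"])

def step(state, action, rng):
    """One PCMDP-style transition: the factored state is (car, weather)."""
    car_pos, weather = state
    # Note the factorisation: the action influences only the endogenous part.
    return (step_endogenous(car_pos, action), step_exogenous(weather, rng))

rng = random.Random(0)
state = (0, "sun")
state = step(state, "right", rng)
```

Because `step_exogenous` never looks at `action`, an agent that knows this structure never has to ask "did my jump cause the rain?"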
The Two New Algorithms: The Smart Drivers
The authors created two new "drivers" (algorithms) that use this split-brain approach:
1. EXAVI (The Planner)
- The Metaphor: Imagine a chess player who knows the rules of the game perfectly (how pieces move) but doesn't know the opponent's strategy.
- How it works: This algorithm assumes it already knows how the "car" moves (the controllable part). It only needs to learn the "weather" (the uncontrollable part).
- The Magic: Because it doesn't need to guess how its actions change the weather, it doesn't need to be "optimistic" or guess wildly to explore. It just watches the weather happen, learns the pattern, and plans its moves.
- Result: It learns much faster because it ignores the noise.
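A toy version of this idea can be written down directly (this is a hedged sketch in the spirit of EXAVI, not the paper's exact algorithm): the agent estimates the exogenous distribution purely by watching, then plans with ordinary value iteration using the endogenous dynamics it already knows.

```python
from collections import Counter

weather_states = ["sun", "rain"]
positions = [0, 1, 2]
actions = [-1, +1]

# Step 1: learn the "weather" just by observing it -- no actions needed.
observed = ["sun", "sun", "rain", "sun", "rain", "rain", "sun", "sun"]
counts = Counter(observed)
p_weather = {w: counts[w] / len(observed) for w in weather_states}

# Step 2: plan with the KNOWN endogenous dynamics. Toy reward: reach
# position 2; rain halves the payoff (an illustrative choice).
def reward(pos, weather):
    return (1.0 if pos == 2 else 0.0) * (0.5 if weather == "rain" else 1.0)

def move(pos, a):
    return min(max(pos + a, 0), 2)

gamma = 0.9
V = {(p, w): 0.0 for p in positions for w in weather_states}
for _ in range(100):  # value iteration until (near) convergence
    V = {
        (p, w): max(
            reward(p, w)
            + gamma * sum(p_weather[w2] * V[(move(p, a), w2)] for w2 in weather_states)
            for a in actions
        )
        for p in positions
        for w in weather_states
    }
```

The only thing being estimated from data is `p_weather`; everything about the "car" is plugged in as known, which is exactly why no optimistic exploration of the action's effect on the weather is needed.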
2. EXAQ (The Learner)
- The Metaphor: Imagine a student taking a test. In a normal test, if you get a question wrong, you only learn about that specific question.
- How it works: In this new method, if the student sees a "rainy day" (an exogenous event), they don't just learn what to do in that specific rainy situation. They instantly update their knowledge for every possible driving scenario that could happen on a rainy day.
- The Magic: It's like "Counterfactual Reasoning." The AI thinks: "I saw it rain today. Even though I didn't drive on a hill today, I now know how to drive on a hill in the rain because I know how rain affects the car."
- Result: One single experience teaches the AI about thousands of situations.
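The counterfactual trick can also be sketched concretely (again a toy, not the paper's exact update rule): one observed exogenous transition, say "sun then rain," is replayed against every endogenous state and action, because the agent already knows how its own actions move the car.

```python
positions = [0, 1, 2]
actions = [-1, +1]
alpha, gamma = 0.5, 0.9

def move(pos, a):
    """Known endogenous dynamics: the action moves the car."""
    return min(max(pos + a, 0), 2)

def reward(pos, weather):
    """Toy reward: reach position 2; rain halves the payoff."""
    return (1.0 if pos == 2 else 0.0) * (0.5 if weather == "rain" else 1.0)

Q = {(p, w, a): 0.0 for p in positions for w in ["sun", "rain"] for a in actions}

def counterfactual_update(weather, next_weather):
    """Replay ONE exogenous observation across ALL endogenous states."""
    for p in positions:
        for a in actions:
            p2 = move(p, a)  # the agent knows where the car would have gone
            target = reward(p, weather) + gamma * max(
                Q[(p2, next_weather, a2)] for a2 in actions
            )
            Q[(p, weather, a)] += alpha * (target - Q[(p, weather, a)])

# One real observation ("sun" -> "rain") updates 3 positions x 2 actions
# = 6 Q-entries at once, including states the agent never visited today.
counterfactual_update("sun", "rain")
```

A standard Q-learner would have updated exactly one of those six entries per experience; this is the "one experience teaches thousands of situations" effect in miniature.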
Why Does This Matter? (The Sample Efficiency)
In the world of AI, "Sample Efficiency" means: How many tries does it take to get good?
- Standard AI (The Old Way): To learn a complex task, it might need 10,000 tries. It wastes 9,000 of them trying to figure out how the weather works.
- New AI (EXAVI/EXAQ): Because it ignores the weather and focuses only on the driving, it might only need 100 tries.
The paper proves mathematically that this isn't just a lucky guess: the authors also establish matching lower bounds, meaning no algorithm can learn fundamentally faster when some parts of the world are out of its control.
Real-World Examples
The authors tested this on real-world scenarios:
The Taxi Driver:
- Scenario: A taxi needs to pick up passengers in a city.
- The Twist: There are random traffic jams (exogenous) that block roads.
- Result: The new algorithm learned the best routes in a few minutes. The old algorithm took hours and got stuck in traffic patterns it couldn't control.
The Stock Trader:
- Scenario: A trader needs to sell a huge amount of stock without crashing the price.
- The Twist: The stock price moves randomly based on the global market (exogenous). The trader only controls how much they sell (endogenous).
- Result: The new algorithm learned to sell smoothly and profitably very quickly. The old algorithm panicked and sold everything too fast, losing money.
The Takeaway
This paper is like giving a computer a pair of glasses that filter out the noise.
Instead of screaming at the universe to change the weather, the computer learns to drive the car perfectly despite the weather. By acknowledging that some things are out of our control, we can learn to control the things that matter much, much faster.
In short: Don't try to learn how to control the uncontrollable. Learn to adapt to it, and you'll win the game in record time.