Minimax Optimal Strategy for Delayed Observations in Online Reinforcement Learning

This paper proposes a minimax optimal algorithm for tabular reinforcement learning with delayed state observations that combines augmentation and upper confidence bound methods to achieve a regret bound of $\tilde{\mathcal{O}}(H \sqrt{D_{\max} S A K})$, which is proven to be optimal up to logarithmic factors.

Harin Lee, Kevin Jamieson

Published 2026-03-05

Imagine you are playing a high-stakes video game, like a racing simulator or a strategy game. In a normal game, when you press a button to turn left, you immediately see your car turn. You learn instantly: "Turning left worked!" or "Turning left made me crash."

But in this paper, the authors are tackling a much harder version of the game: The "Delayed Vision" Game.

The Problem: The Foggy Windshield

In the real world (like self-driving cars or robotics), sensors often take time to process data. You might press the brake, but the car's computer doesn't "see" the result of that brake for a few seconds.

In the paper's terms, this is called Delayed State Observation.

  • The Agent: The AI trying to learn.
  • The Delay: A "foggy windshield" or a "laggy internet connection."
  • The Consequence: The AI has to make a whole sequence of moves blindly before it gets any feedback. It's like playing chess where you have to plan your next 10 moves without seeing if your opponent captured your pawn.

If the delay is long, the number of possible move combinations explodes. It becomes a nightmare to figure out which move was actually the good one.

The Solution: The "Backpack" Strategy

The authors propose a clever way to solve this. Instead of trying to guess the future or wait for the fog to clear, they tell the AI to carry a mental backpack.

  1. Augmentation (The Backpack):
    Normally, an AI only looks at the current state (e.g., "I am at the red light").
    With delays, the AI needs to remember: "I am at the red light, AND I just pressed 'Go', AND I pressed 'Go' again 2 seconds ago, AND I pressed 'Go' 3 seconds ago."

    The authors create a new, bigger version of the game world (an "Augmented MDP") where the "state" includes this entire history of actions in the backpack. Now, even though the AI can't see the result yet, it knows exactly what it has done so far.

  2. The "Optimistic Explorer" (UCB):
    Once the AI has this backpack, the authors use a standard, proven strategy called Upper Confidence Bound (UCB).

    • Analogy: Imagine you are in a dark forest with many paths. You don't know which path leads to treasure. The "Optimistic Explorer" assumes that every path you haven't tried yet might be the best one. It tries the unknown paths to learn about them, but it also sticks to the paths it knows are good.
    • The paper's algorithm is "optimistic" about the delayed feedback. It says, "I haven't seen the result of my last 5 moves yet, but I'm going to assume they might lead to a great reward, so I'll keep exploring to find out."
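
The two ingredients above can be sketched together in a few lines. This is a toy illustration under simplifying assumptions, not the paper's actual algorithm: `augment`, `ucb_bonus`, and `pick_action` are hypothetical helpers, and the bonus is a generic UCB-style term rather than the paper's exact confidence radius.

```python
import math
from collections import defaultdict

def augment(last_seen_state, action_backpack):
    """Augmented state: the last observed state plus the actions taken
    since then (the 'backpack'). Hashable, so it can index a Q-table."""
    return (last_seen_state, tuple(action_backpack))

def ucb_bonus(count, t, c=1.0):
    """Generic optimism bonus: large when a (state, action) pair has
    rarely been tried, shrinking as visits accumulate."""
    return c * math.sqrt(math.log(max(t, 2)) / max(count, 1))

def pick_action(Q, counts, aug_state, actions, t):
    """Act greedily on the optimistic value: estimate plus bonus."""
    return max(actions,
               key=lambda a: Q[(aug_state, a)]
                             + ucb_bonus(counts[(aug_state, a)], t))

# Tiny demo: the augmented state distinguishes histories that the raw
# (delayed) observation alone cannot tell apart.
Q, counts = defaultdict(float), defaultdict(int)
s0 = augment("red_light", ["go", "go"])
s1 = augment("red_light", ["go", "brake"])
print(s0 != s1)  # same observed state, different backpack → True
print(pick_action(Q, counts, s0, ["go", "brake"], t=1))
```

With an empty Q-table every action carries the same optimism bonus, so the explorer is indifferent at first; as counts grow, the bonus shrinks and the empirical estimates take over.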

The Big Win: Why This Matters

Previous methods for this problem were like trying to count every single grain of sand on a beach to find a specific one. They were slow and inefficient, especially when the delay was long.

The authors' method is like having a metal detector.

  • The Result: They proved mathematically that their method's regret matches the theoretical lower bound up to logarithmic factors, i.e., it is the best possible (Minimax Optimal).
  • The Analogy: If the delay is $D$ steps long, previous methods slowed down cubically with the delay (like $D^3$). This new method slows only with the square root of the delay ($\sqrt{D}$).
    • Think of it this way: If the delay doubles, old methods might take 8 times longer to learn. The new method only takes about 1.4 times longer. That is a massive speedup.
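
The arithmetic behind that comparison is simple enough to check directly (pure scaling, ignoring constants and the other factors in the bounds):

```python
# How much slower does learning get when the delay D doubles?
old_factor = 2 ** 3    # cubic dependence: (2D)^3 / D^3
new_factor = 2 ** 0.5  # square-root dependence: sqrt(2D) / sqrt(D)

print(old_factor)             # → 8
print(round(new_factor, 2))   # → 1.41
```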

The "Secret Sauce": Partial Knowledge

The paper also introduces a general framework called "Partially Known Dynamics."

  • The Metaphor: Imagine you are learning to drive a car where you know exactly how the steering wheel works (the "known" part), but you don't know how the road surface changes (the "unknown" part).
  • The authors realized that in the "Delayed" game, the "backpack" part of the state (the history of actions) is fully known. Only the result of those actions is unknown.
  • By separating the "known" history from the "unknown" result, they could ignore the massive complexity of the history and focus only on learning the road. This is why their algorithm is so efficient.
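
The "known" component can be made concrete. A minimal sketch (`backpack_transition` is a hypothetical helper, not from the paper): the action history evolves deterministically, so no samples need to be spent learning this part of the dynamics.

```python
def backpack_transition(backpack, new_action):
    """The fully 'known' part of the augmented dynamics: the action
    history shifts deterministically (drop the oldest action, append
    the newest). Only the environment's response must be estimated."""
    return backpack[1:] + (new_action,)

bp = ("go", "go", "brake")
print(backpack_transition(bp, "go"))  # → ('go', 'brake', 'go')
```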

Summary

This paper solves the problem of "learning while blindfolded."

  1. The Problem: AI gets delayed feedback, making it hard to learn.
  2. The Fix: Give the AI a "backpack" to remember its recent actions, turning a blindfolded game into a game with full memory.
  3. The Result: They created the fastest possible algorithm for this scenario, proving it's impossible to do significantly better.

It's a blueprint for making robots and self-driving cars smarter in the real world, where sensors are never perfectly instant.