Minimax Optimal Strategy for Delayed Observations in Online Reinforcement Learning

This paper proposes a minimax optimal algorithm for tabular reinforcement learning with delayed state observations that combines augmentation and upper confidence bound methods to achieve a regret bound of $\tilde{\mathcal{O}}(H \sqrt{D_{\max} S A K})$, which is proven to be optimal up to logarithmic factors.

Harin Lee, Kevin Jamieson

Published 2026-03-05

Imagine you are playing a high-stakes video game, like a racing simulator or a strategy game. In a normal game, when you press a button to turn left, you immediately see your car turn. You learn instantly: "Turning left worked!" or "Turning left made me crash."

But in this paper, the authors are tackling a much harder version of the game: The "Delayed Vision" Game.

The Problem: The Foggy Windshield

In the real world (like self-driving cars or robotics), sensors often take time to process data. You might press the brake, but the car's computer doesn't "see" the result of that brake for a few seconds.

In the paper's terms, this is called Delayed State Observation.

  • The Agent: The AI trying to learn.
  • The Delay: A "foggy windshield" or a "laggy internet connection."
  • The Consequence: The AI has to make a whole sequence of moves blindly before it gets any feedback. It's like playing chess where you have to plan your next 10 moves without seeing if your opponent captured your pawn.

If the delay is long, the number of possible move combinations explodes. It becomes a nightmare to figure out which move was actually the good one.

The Solution: The "Backpack" Strategy

The authors propose a clever way to solve this. Instead of trying to guess the future or wait for the fog to clear, they tell the AI to carry a mental backpack.

  1. Augmentation (The Backpack):
    Normally, an AI only looks at the current state (e.g., "I am at the red light").
    With delays, the AI needs to remember: "I am at the red light, AND I just pressed 'Go', AND I pressed 'Go' again 2 seconds ago, AND I pressed 'Go' 3 seconds ago."

    The authors create a new, bigger version of the game world (an "Augmented MDP") where the "state" includes this entire history of actions in the backpack. Now, even though the AI can't see the result yet, it knows exactly what it has done so far.

  2. The "Optimistic Explorer" (UCB):
    Once the AI has this backpack, the authors use a standard, proven strategy called Upper Confidence Bound (UCB).

    • Analogy: Imagine you are in a dark forest with many paths. You don't know which path leads to treasure. The "Optimistic Explorer" assumes that every path you haven't tried yet might be the best one. It tries the unknown paths to learn about them, but it also sticks to the paths it knows are good.
    • The paper's algorithm is "optimistic" about the delayed feedback. It says, "I haven't seen the result of my last 5 moves yet, but I'm going to assume they might lead to a great reward, so I'll keep exploring to find out."
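
The two ingredients above can be sketched together in a few lines. This is a toy illustration under simplifying assumptions, not the paper's actual algorithm: `augment`, `ucb_bonus`, and `pick_action` are hypothetical helpers, and the bonus is a generic UCB-style term rather than the paper's exact confidence radius.

```python
import math
from collections import defaultdict

def augment(last_seen_state, action_backpack):
    """Augmented state: the last observed state plus the actions taken
    since then (the 'backpack'). Hashable, so it can index a Q-table."""
    return (last_seen_state, tuple(action_backpack))

def ucb_bonus(count, t, c=1.0):
    """Generic optimism bonus: large when a (state, action) pair has
    rarely been tried, shrinking as visits accumulate."""
    return c * math.sqrt(math.log(max(t, 2)) / max(count, 1))

def pick_action(Q, counts, aug_state, actions, t):
    """Act greedily on the optimistic value: estimate plus bonus."""
    return max(actions,
               key=lambda a: Q[(aug_state, a)]
                             + ucb_bonus(counts[(aug_state, a)], t))

# Tiny demo: the augmented state distinguishes histories that the raw
# (delayed) observation alone cannot tell apart.
Q, counts = defaultdict(float), defaultdict(int)
s0 = augment("red_light", ["go", "go"])
s1 = augment("red_light", ["go", "brake"])
print(s0 != s1)  # same observed state, different backpack → True
print(pick_action(Q, counts, s0, ["go", "brake"], t=1))
```

With an empty Q-table every action carries the same optimism bonus, so the explorer is indifferent at first; as counts grow, the bonus shrinks and the empirical estimates take over.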

The Big Win: Why This Matters

Previous methods for this problem were like trying to count every single grain of sand on a beach to find a specific one. They were slow and inefficient, especially when the delay was long.

The authors' method is like having a metal detector.

  • The Result: They proved mathematically that their method's regret matches the theoretical lower bound up to logarithmic factors, i.e., it is the best possible (Minimax Optimal).
  • The Analogy: If the delay is $D$ steps long, previous methods slowed down cubically with the delay (like $D^3$). This new method slows only with the square root of the delay ($\sqrt{D}$).
    • Think of it this way: If the delay doubles, old methods might take 8 times longer to learn. The new method only takes about 1.4 times longer. That is a massive speedup.
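
The arithmetic behind that comparison is simple enough to check directly (pure scaling, ignoring constants and the other factors in the bounds):

```python
# How much slower does learning get when the delay D doubles?
old_factor = 2 ** 3    # cubic dependence: (2D)^3 / D^3
new_factor = 2 ** 0.5  # square-root dependence: sqrt(2D) / sqrt(D)

print(old_factor)             # → 8
print(round(new_factor, 2))   # → 1.41
```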

The "Secret Sauce": Partial Knowledge

The paper also introduces a general framework called "Partially Known Dynamics."

  • The Metaphor: Imagine you are learning to drive a car where you know exactly how the steering wheel works (the "known" part), but you don't know how the road surface changes (the "unknown" part).
  • The authors realized that in the "Delayed" game, the "backpack" part of the state (the history of actions) is fully known. Only the result of those actions is unknown.
  • By separating the "known" history from the "unknown" result, they could ignore the massive complexity of the history and focus only on learning the road. This is why their algorithm is so efficient.
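
The "known" component can be made concrete. A minimal sketch (`backpack_transition` is a hypothetical helper, not from the paper): the action history evolves deterministically, so no samples need to be spent learning this part of the dynamics.

```python
def backpack_transition(backpack, new_action):
    """The fully 'known' part of the augmented dynamics: the action
    history shifts deterministically (drop the oldest action, append
    the newest). Only the environment's response must be estimated."""
    return backpack[1:] + (new_action,)

bp = ("go", "go", "brake")
print(backpack_transition(bp, "go"))  # → ('go', 'brake', 'go')
```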

Summary

This paper solves the problem of "learning while blindfolded."

  1. The Problem: AI gets delayed feedback, making it hard to learn.
  2. The Fix: Give the AI a "backpack" to remember its recent actions, turning a blindfolded game into a game with full memory.
  3. The Result: They created the fastest possible algorithm for this scenario, proving it's impossible to do significantly better.

It's a blueprint for making robots and self-driving cars smarter in the real world, where sensors are never perfectly instant.