Imagine you are the captain of a ship navigating an endless ocean. You don't have a map, and you don't know where the treasure islands (high rewards) are or where the whirlpools (low rewards) hide. Your goal is to sail for as long as possible and collect as much gold as you can.
This is the essence of Reinforcement Learning (RL) in an infinite-horizon setting. Unlike a video game level that ends with a "Game Over" screen, this ocean never stops. You just keep sailing.
For a long time, computer scientists had a hard time teaching ships to sail these endless oceans efficiently. The old methods were like a captain who:
- Wasted a lot of time just figuring out how to steer before they even started looking for treasure (high "burn-in" cost).
- Treated every ocean the same, whether it was calm and predictable or chaotic and stormy. They didn't adapt to the specific conditions of the water.
This paper introduces a new, smarter captain named FOCUS (Fully Optimizing Clipped UCB Solver). Here is how it works, explained simply:
1. The Two Ways to Measure Success
The paper looks at two ways to judge the captain:
- The Average Reward: "Over the next 1,000 years, how much gold did you make per day compared to the best possible captain?"
- The Discounted Reward: "How much gold did you collect, where gold found sooner counts more than gold found later, compared to the best possible captain?"
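The two measures above can be sketched in a few lines of code. This is purely illustrative (the reward numbers and discount factor are made up, not from the paper): the average reward is gold per day over the whole voyage, while the discounted return shrinks each future day's gold by a factor gamma.

```python
# Illustrative sketch (toy numbers, not from the paper): the two
# performance measures applied to a stream of daily gold rewards.

def average_reward(rewards):
    """Long-run gold per day: total gold divided by days sailed."""
    return sum(rewards) / len(rewards)

def discounted_return(rewards, gamma=0.9):
    """Gold collected today counts fully; gold collected t days from
    now is shrunk by gamma**t, so the near future dominates."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 1.0, 3.0]  # gold found on each day
avg = average_reward(rewards)        # 1.4 gold per day
disc = discounted_return(rewards)    # ~5.3173 with gamma = 0.9
```

An algorithm's regret, under either measure, is the gap between its score and the score of the best possible captain.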
2. The Problem with Old Maps (Algorithms)
Previous algorithms were like captains who used a giant, heavy, one-size-fits-all map.
- The Burn-In Problem: Before they could start sailing fast, they had to spend years just learning the basics. If you only sailed for a short time, they performed terribly.
- The Failure to Adapt: If the ocean was perfectly calm (deterministic), they still acted like it was a hurricane. They never realized, "Hey, the water is still; I don't need to check every wave!"
3. The FOCUS Solution: A Smart, Adaptive Captain
The authors created FOCUS, a captain who carries a special tool: a Variance Detector.
- Variance = Chaos: Think of variance as how "jumpy" the ocean is.
- Low Variance (Calm Sea): The ship goes exactly where you point it.
- High Variance (Stormy Sea): You point the ship North, but a wave might push you East.
FOCUS's Superpower: It measures the chaos in real-time.
- If the sea is calm, FOCUS realizes, "I don't need to waste time checking every wave!" It stops exploring and starts exploiting the treasure it already found. The regret (lost gold) becomes almost zero.
- If the sea is stormy, FOCUS says, "Okay, it's chaotic. I need to be careful and explore more." It adjusts its strategy perfectly.
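One common way this kind of adaptivity is achieved is with a variance-dependent (Bernstein-style) exploration bonus. The sketch below is a hedged illustration of that general idea, not the paper's exact clipped-UCB formula: when the observed variance is near zero, the bonus collapses to a tiny 1/n term and exploration effectively stops; when variance is high, the bonus stays large and the captain keeps exploring.

```python
# Hedged sketch (generic Bernstein-style bonus, NOT the paper's exact
# clipped-UCB rule): the exploration bonus shrinks when the observed
# variance is low, so a calm sea triggers almost no exploration.
import math

def exploration_bonus(sample_variance, n_visits, confidence=1e-2):
    """Optimism bonus for a state-action pair.

    Calm sea (variance ~ 0): only the small log/n term survives.
    Stormy sea (high variance): the sqrt(variance/n) term dominates.
    """
    log_term = math.log(1.0 / confidence)
    return (math.sqrt(2.0 * sample_variance * log_term / n_visits)
            + 3.0 * log_term / n_visits)

calm = exploration_bonus(sample_variance=0.0, n_visits=100)
stormy = exploration_bonus(sample_variance=1.0, n_visits=100)
# calm is much smaller: less chaos means less time spent exploring
```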
This is the first time an algorithm has achieved this "Best of Both Worlds" performance for infinite sailing.
4. The "Span" and the "Burn-In" Mystery
There is a tricky concept in this paper called the "Bias Span" (let's call it the Complexity Score).
- Imagine the ocean has a "depth" of complexity. Some oceans are shallow and easy to navigate; others are deep and confusing.
- The Big Discovery: The paper proves a fundamental rule about Prior Knowledge.
- Scenario A (You know the Complexity Score): If you tell FOCUS, "Hey, this ocean is shallow," it can sail perfectly almost immediately.
- Scenario B (You don't know the Score): If you don't tell FOCUS, it has to spend extra time "feeling" the water to figure out how deep it is.
- The Gap: The paper proves that there is a fundamental gap between these two. If you don't know the complexity beforehand, you cannot avoid paying a "tax" in time (burn-in cost) to figure it out. No amount of clever math can bypass this; you simply have to sail a bit longer to learn the depth of the ocean.
5. How FOCUS Thinks (The "Full Optimization" Trick)
Old captains would check the map, take one step, check the map again, take one step.
FOCUS is different. When it stops to update its map, it doesn't just take one step. It runs the calculation in its head, over and over, until it has fully figured out the best path, and only then moves a single inch.
- Analogy: Imagine a chess player. Old algorithms make a move, then think about the next move. FOCUS thinks, "If I make this move, then you make that move, then I make this move..." and solves the whole game in its head before touching a piece. This allows it to be incredibly efficient and accurate.
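In planning terms, "solving the whole game in its head" means running value iteration on the estimated model until it converges, rather than doing a single backup per real step. The sketch below illustrates that general idea on a tiny made-up two-state ocean; the transition model, rewards, and discount factor are all assumptions for illustration, not the paper's algorithm.

```python
# Illustrative sketch (assumed details, not the paper's algorithm):
# "full optimization" = when the model is updated, run value iteration
# to convergence before taking the next real step.
import numpy as np

def plan_to_convergence(P, R, gamma=0.95, tol=1e-8):
    """Solve the estimated MDP completely before moving.

    P: transition probabilities, shape (S, A, S)
    R: rewards, shape (S, A)
    Returns the converged value function and the greedy policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:                    # iterate until the plan stops changing
        Q = R + gamma * P @ V      # one full Bellman backup for all (s, a)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Tiny 2-state, 2-action toy ocean (made up for illustration):
# action 1 in state 0 sails to state 1 (reward 1); action 0 in
# state 1 sails back to state 0 (reward 2). Best plan: cycle forever.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
V, policy = plan_to_convergence(P, R)
```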
Summary
This paper is a breakthrough because it gives us a reinforcement learning algorithm that:
- Adapts to the weather: It works perfectly in calm, predictable worlds and chaotic, random ones.
- Starts fast: It doesn't waste years just learning to steer.
- Knows its limits: It proves that if you don't know how "deep" or complex the problem is beforehand, you must pay a price in time to learn it. You can't cheat the laws of exploration.
In short, FOCUS is the first captain who knows exactly how much effort to spend based on how rough the ocean actually is, making it the most efficient sailor for endless voyages we've ever seen.