Revisiting Value Iteration: Unified Analysis of Discounted and Average-Reward Cases

This paper presents a unified geometry-based analysis demonstrating that Value Iteration achieves geometric convergence in both discounted and average-reward settings under a unique unichain optimal policy assumption, thereby resolving the discrepancy between classical theoretical bounds and observed empirical performance.

Arsenii Mustafin, Xinyi Sheng, Dominik Baumann

Published 2026-03-12

The Big Picture: The "Speed Limit" Myth

Imagine you are teaching a robot to navigate a maze to find the best treasure. The robot uses a method called Value Iteration (VI). Think of it as the robot repeatedly sweeping over its map: on each pass, it revisits every room and updates that room's estimated value using what it currently knows about the neighboring rooms.
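In code, one such sweep is a "Bellman update". Here is a minimal sketch on an invented 3-room maze; the transition probabilities `P`, rewards `R`, and discount `gamma` below are illustrative toy numbers, not taken from the paper:

```python
import numpy as np

# A tiny invented "maze" as a Markov Decision Process:
# P[a, s, s'] = probability of ending in room s' after move a from room s.
P = np.array([
    [[0.9, 0.1, 0.0],    # move 0
     [0.1, 0.8, 0.1],
     [0.0, 0.1, 0.9]],
    [[0.2, 0.8, 0.0],    # move 1
     [0.0, 0.2, 0.8],
     [0.1, 0.0, 0.9]],
])
R = np.array([[1.0, 0.0, 5.0],   # reward for move 0 in each room
              [0.5, 2.0, 0.0]])  # reward for move 1 in each room
gamma = 0.95                     # discount factor: how "impatient" the robot is

V = np.zeros(3)                  # the robot's map: estimated value of each room
for _ in range(500):
    Q = R + gamma * (P @ V)      # Q[a, s]: value of making move a from room s
    V = Q.max(axis=0)            # keep the best move's value for each room
```

After enough sweeps, `V` stops changing: it has reached the fixed point of the Bellman update, which is the optimal discounted value of each room.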

For decades, computer scientists believed there was a strict "speed limit" on how fast this robot could learn:

  1. The Discounted Case (The "Impatient" Robot): If the robot cares mostly about immediate rewards (like getting a cookie now rather than a cake later), theory said it would learn quickly, at a geometric rate set by the discount factor: the more "impatient" the robot, the stronger the guarantee.
  2. The Average-Reward Case (The "Patient" Robot): If the robot cares about the long-term average (like getting a steady paycheck over a lifetime), the old theory said, "Good luck! This might take forever." It suggested the robot would learn very slowly, getting only a tiny bit of progress with every step.
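The "patient" robot has a standard undiscounted variant of the sweep, relative Value Iteration: same update, no discount, but with one room's value subtracted off each pass so the map stays bounded. A sketch on the same invented toy maze (this is a textbook scheme, not the paper's specific construction):

```python
import numpy as np

# Same invented 3-room maze, but now there is NO discount:
# the patient robot cares about the long-run average reward per step.
P = np.array([
    [[0.9, 0.1, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.1, 0.9]],
    [[0.2, 0.8, 0.0],
     [0.0, 0.2, 0.8],
     [0.1, 0.0, 0.9]],
])
R = np.array([[1.0, 0.0, 5.0],
              [0.5, 2.0, 0.0]])

V = np.zeros(3)
for _ in range(500):
    V_new = (R + P @ V).max(axis=0)  # undiscounted Bellman sweep
    gain = V_new[0] - V[0]           # estimated average reward per step
    V = V_new - V_new[0]             # pin room 0 at zero so values stay bounded
```

Here `gain` settles to the best achievable long-run average reward, and `V` settles to the *relative* values of the rooms: how much better or worse each room is than room 0.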

The Problem: In the real world, when engineers actually run these robots, they don't move slowly. They zoom! The theory said "slow and steady," but the practice said "fast and furious." There was a mismatch between the math and reality.

The Paper's Discovery: The "Unichain" Key

This paper says, "The old math wasn't wrong, but it was looking at the problem through a foggy window."

The authors found that if the maze has a specific structure—where there is one clear "best path" that eventually loops back on itself (a standard structure known as a unichain)—the robot actually learns geometrically fast in both cases.

The Analogy: The "Heaven, Purgatory, and Hell" Maze
To prove this, the authors imagined a maze with three zones:

  • Heaven: A loop of high rewards.
  • Purgatory: A middle zone where you can slip out of Heaven.
  • Hell: A bad zone you might fall into.

The "old math" looked at the worst-case scenario where the robot gets stuck in a loop of bad information for a long time. But the authors realized: If the maze is connected enough (unichain), the "bad news" eventually gets washed out by the "good news."

They showed that even if the robot is "patient" (average reward), it doesn't crawl; it runs. It converges to the solution just as fast as the "impatient" robot, provided the maze isn't broken into isolated islands.

The Secret Weapon: A New Way to Look at the Map

How did they prove this? They changed the way they visualized the problem.

The Old View (The Foggy Window):
Imagine looking at a 3D landscape of the maze. The old math looked at the height of the mountains (the value of each state). When the robot became "patient" (average reward), the mountains flattened out into a single flat plain. It became impossible to see the differences, so the math got confused and predicted slow movement.
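This flattening can be checked numerically. The sketch below (with an invented 3-room chain and one fixed behavior for the robot) solves the discounted values exactly, then looks at the normalized map `(1 - gamma) * V` as `gamma` approaches 1:

```python
import numpy as np

# One fixed behavior in an invented 3-room chain:
# P[s, s'] = probability of moving from room s to room s'.
P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])
r = np.array([1.0, 0.0, 5.0])   # reward collected in each room

for gamma in (0.9, 0.99, 0.999):
    # Exact discounted values: V = (I - gamma * P)^{-1} r
    V = np.linalg.solve(np.eye(3) - gamma * P, r)
    flatness = np.ptp((1 - gamma) * V)  # height spread of the normalized map
    features = np.ptp(V)                # height spread of the raw map (the span)
    print(f"gamma={gamma}: normalized spread={flatness:.3f}, span={features:.2f}")
```

As `gamma` approaches 1, the normalized spread heads to zero—every room's normalized value collapses to the same flat level (the average reward)—while the span stays essentially fixed. The mountains only look flat because of how the old view normalized the picture.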

The New View (The "Outer Edge" Perspective):
The authors proposed looking at the edges of the landscape instead of the center.

  • Analogy: Imagine a group of people standing on a stage. The old math tried to measure the height of the person in the very center. If everyone stands on a flat floor, you can't tell who is who.
  • The New Trick: The authors said, "Let's measure the distance between the tallest person and the shortest person in the group." This is called the span. Even if the whole group is on a flat floor, if one person is even slightly taller than another, the difference (the span) tells you everything you need to know.

By focusing on this difference (span) rather than the absolute height, they found that the "flatness" of the average-reward case disappears. The robot's map still has distinct features, and it updates quickly.
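The same idea can be watched in action: run undiscounted sweeps on a toy maze and track the span of each update. On the invented unichain maze below (illustrative numbers, not the paper's example), the span shrinks by a roughly constant factor per sweep—geometric convergence:

```python
import numpy as np

def span(x):
    # The paper's yardstick: tallest person minus shortest person.
    return x.max() - x.min()

# Invented 3-room maze: P[a, s, s'] transition probabilities, R[a, s] rewards.
P = np.array([
    [[0.9, 0.1, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.1, 0.9]],
    [[0.2, 0.8, 0.0],
     [0.0, 0.2, 0.8],
     [0.1, 0.0, 0.9]],
])
R = np.array([[1.0, 0.0, 5.0],
              [0.5, 2.0, 0.0]])

V = np.zeros(3)
spans = []
for _ in range(30):
    V_new = (R + P @ V).max(axis=0)   # undiscounted Bellman sweep
    spans.append(span(V_new - V))     # how much the map's *shape* changed
    V = V_new                         # V itself rises by ~the gain each sweep;
                                      # the span ignores that uniform rise.

ratios = [spans[k + 1] / spans[k]
          for k in range(len(spans) - 1) if spans[k] > 1e-12]
```

Every ratio sits below 1: even though the absolute values drift upward forever, the span of the updates contracts geometrically, which is exactly the "full speed" behavior the paper proves under the unichain assumption.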

The "Unified" Breakthrough

Before this paper, scientists treated "impatient" robots and "patient" robots as two completely different species requiring two different rulebooks.

This paper says: "They are actually the same species."

By using this new geometric perspective (looking at the span/differences), they created one single rulebook that explains why both types of robots learn fast, as long as the maze allows them to travel between all areas (the unichain assumption).

Why Should You Care?

  1. It Fixes the Confusion: It explains why your AI models work better in practice than the textbooks say they should.
  2. Better Algorithms: If we know the robot learns fast, we can stop wasting time waiting for it to converge. We can tune our systems to be more efficient.
  3. Simplicity: It unifies two complex fields of math into one elegant geometric picture.

The Bottom Line

The paper is like finding out that a car you thought was stuck in mud (the average-reward case) is actually just driving on a smooth highway, provided you look at the road from the right angle. The "slow" speed limit was an illusion caused by looking at the wrong part of the map. Once you shift your perspective, you realize the robot is moving at full speed all along.