Revisiting Value Iteration: Unified Analysis of Discounted and Average-Reward Cases

This paper presents a unified geometry-based analysis demonstrating that Value Iteration achieves geometric convergence in both discounted and average-reward settings under a unique unichain optimal policy assumption, thereby resolving the discrepancy between classical theoretical bounds and observed empirical performance.

Arsenii Mustafin, Xinyi Sheng, Dominik Baumann

Published 2026-03-12

The Big Picture: The "Speed Limit" Myth

Imagine you are teaching a robot to navigate a maze to find the best treasure. The robot uses a method called Value Iteration (VI). Think of it as the robot repeatedly sweeping over its map: on each pass, it revisits every room and updates that room's estimated value using what it currently knows about the neighboring rooms.
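In code, one such sweep is a "Bellman update". Here is a minimal sketch on an invented 3-room maze; the transition probabilities `P`, rewards `R`, and discount `gamma` below are illustrative toy numbers, not taken from the paper:

```python
import numpy as np

# A tiny invented "maze" as a Markov Decision Process:
# P[a, s, s'] = probability of ending in room s' after move a from room s.
P = np.array([
    [[0.9, 0.1, 0.0],    # move 0
     [0.1, 0.8, 0.1],
     [0.0, 0.1, 0.9]],
    [[0.2, 0.8, 0.0],    # move 1
     [0.0, 0.2, 0.8],
     [0.1, 0.0, 0.9]],
])
R = np.array([[1.0, 0.0, 5.0],   # reward for move 0 in each room
              [0.5, 2.0, 0.0]])  # reward for move 1 in each room
gamma = 0.95                     # discount factor: how "impatient" the robot is

V = np.zeros(3)                  # the robot's map: estimated value of each room
for _ in range(500):
    Q = R + gamma * (P @ V)      # Q[a, s]: value of making move a from room s
    V = Q.max(axis=0)            # keep the best move's value for each room
```

After enough sweeps, `V` stops changing: it has reached the fixed point of the Bellman update, which is the optimal discounted value of each room.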

For decades, computer scientists believed there was a strict "speed limit" on how fast this robot could learn:

  1. The Discounted Case (The "Impatient" Robot): If the robot cares mostly about immediate rewards (like getting a cookie now rather than a cake later), theory said it would learn quickly, at a geometric rate set by the discount factor: the more "impatient" the robot, the stronger the guarantee.
  2. The Average-Reward Case (The "Patient" Robot): If the robot cares about the long-term average (like getting a steady paycheck over a lifetime), the old theory said, "Good luck! This might take forever." It suggested the robot would learn very slowly, getting only a tiny bit of progress with every step.
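The "patient" robot has a standard undiscounted variant of the sweep, relative Value Iteration: same update, no discount, but with one room's value subtracted off each pass so the map stays bounded. A sketch on the same invented toy maze (this is a textbook scheme, not the paper's specific construction):

```python
import numpy as np

# Same invented 3-room maze, but now there is NO discount:
# the patient robot cares about the long-run average reward per step.
P = np.array([
    [[0.9, 0.1, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.1, 0.9]],
    [[0.2, 0.8, 0.0],
     [0.0, 0.2, 0.8],
     [0.1, 0.0, 0.9]],
])
R = np.array([[1.0, 0.0, 5.0],
              [0.5, 2.0, 0.0]])

V = np.zeros(3)
for _ in range(500):
    V_new = (R + P @ V).max(axis=0)  # undiscounted Bellman sweep
    gain = V_new[0] - V[0]           # estimated average reward per step
    V = V_new - V_new[0]             # pin room 0 at zero so values stay bounded
```

Here `gain` settles to the best achievable long-run average reward, and `V` settles to the *relative* values of the rooms: how much better or worse each room is than room 0.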

The Problem: In the real world, when engineers actually run these robots, they don't move slowly. They zoom! The theory said "slow and steady," but the practice said "fast and furious." There was a mismatch between the math and reality.

The Paper's Discovery: The "Unichain" Key

This paper says, "The old math wasn't wrong, but it was looking at the problem through a foggy window."

The authors found that if the maze has a specific structure—where there is one clear "best path" that eventually loops back on itself (a standard structure known as a unichain)—the robot actually learns geometrically fast in both cases.

The Analogy: The "Heaven, Purgatory, and Hell" Maze
To prove this, the authors imagined a maze with three zones:

  • Heaven: A loop of high rewards.
  • Purgatory: A middle zone where you can slip out of Heaven.
  • Hell: A bad zone you might fall into.

The "old math" looked at the worst-case scenario where the robot gets stuck in a loop of bad information for a long time. But the authors realized: If the maze is connected enough (unichain), the "bad news" eventually gets washed out by the "good news."

They showed that even if the robot is "patient" (average reward), it doesn't crawl; it runs. It converges to the solution just as fast as the "impatient" robot, provided the maze isn't broken into isolated islands.

The Secret Weapon: A New Way to Look at the Map

How did they prove this? They changed the way they visualized the problem.

The Old View (The Foggy Window):
Imagine looking at a 3D landscape of the maze. The old math looked at the height of the mountains (the value of each state). When the robot became "patient" (average reward), the mountains flattened out into a single flat plain. It became impossible to see the differences, so the math got confused and predicted slow movement.
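This flattening can be checked numerically. The sketch below (with an invented 3-room chain and one fixed behavior for the robot) solves the discounted values exactly, then looks at the normalized map `(1 - gamma) * V` as `gamma` approaches 1:

```python
import numpy as np

# One fixed behavior in an invented 3-room chain:
# P[s, s'] = probability of moving from room s to room s'.
P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])
r = np.array([1.0, 0.0, 5.0])   # reward collected in each room

for gamma in (0.9, 0.99, 0.999):
    # Exact discounted values: V = (I - gamma * P)^{-1} r
    V = np.linalg.solve(np.eye(3) - gamma * P, r)
    flatness = np.ptp((1 - gamma) * V)  # height spread of the normalized map
    features = np.ptp(V)                # height spread of the raw map (the span)
    print(f"gamma={gamma}: normalized spread={flatness:.3f}, span={features:.2f}")
```

As `gamma` approaches 1, the normalized spread heads to zero—every room's normalized value collapses to the same flat level (the average reward)—while the span stays essentially fixed. The mountains only look flat because of how the old view normalized the picture.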

The New View (The "Outer Edge" Perspective):
The authors proposed looking at the edges of the landscape instead of the center.

  • Analogy: Imagine a group of people standing on a stage. The old math tried to measure the height of the person in the very center. If everyone stands on a flat floor, you can't tell who is who.
  • The New Trick: The authors said, "Let's measure the distance between the tallest person and the shortest person in the group." This is called the span. Even if the whole group is on a flat floor, if one person is even slightly taller than another, the difference (the span) tells you everything you need to know.

By focusing on this difference (span) rather than the absolute height, they found that the "flatness" of the average-reward case disappears. The robot's map still has distinct features, and it updates quickly.
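The same idea can be watched in action: run undiscounted sweeps on a toy maze and track the span of each update. On the invented unichain maze below (illustrative numbers, not the paper's example), the span shrinks by a roughly constant factor per sweep—geometric convergence:

```python
import numpy as np

def span(x):
    # The paper's yardstick: tallest person minus shortest person.
    return x.max() - x.min()

# Invented 3-room maze: P[a, s, s'] transition probabilities, R[a, s] rewards.
P = np.array([
    [[0.9, 0.1, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.1, 0.9]],
    [[0.2, 0.8, 0.0],
     [0.0, 0.2, 0.8],
     [0.1, 0.0, 0.9]],
])
R = np.array([[1.0, 0.0, 5.0],
              [0.5, 2.0, 0.0]])

V = np.zeros(3)
spans = []
for _ in range(30):
    V_new = (R + P @ V).max(axis=0)   # undiscounted Bellman sweep
    spans.append(span(V_new - V))     # how much the map's *shape* changed
    V = V_new                         # V itself rises by ~the gain each sweep;
                                      # the span ignores that uniform rise.

ratios = [spans[k + 1] / spans[k]
          for k in range(len(spans) - 1) if spans[k] > 1e-12]
```

Every ratio sits below 1: even though the absolute values drift upward forever, the span of the updates contracts geometrically, which is exactly the "full speed" behavior the paper proves under the unichain assumption.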

The "Unified" Breakthrough

Before this paper, scientists treated "impatient" robots and "patient" robots as two completely different species requiring two different rulebooks.

This paper says: "They are actually the same species."

By using this new geometric perspective (looking at the span/differences), they created one single rulebook that explains why both types of robots learn fast, as long as the maze allows them to travel between all areas (the unichain assumption).

Why Should You Care?

  1. It Fixes the Confusion: It explains why your AI models work better in practice than the textbooks say they should.
  2. Better Algorithms: If we know the robot learns fast, we can stop wasting time waiting for it to converge. We can tune our systems to be more efficient.
  3. Simplicity: It unifies two complex fields of math into one elegant geometric picture.

The Bottom Line

The paper is like finding out that a car you thought was stuck in mud (the average-reward case) is actually just driving on a smooth highway, provided you look at the road from the right angle. The "slow" speed limit was an illusion caused by looking at the wrong part of the map. Once you shift your perspective, you realize the robot is moving at full speed all along.