Here is an explanation of the paper "Ergodicity in Reinforcement Learning" using simple language and creative analogies.
The Big Idea: The "Average" vs. The "Real Life"
Imagine you are a financial advisor. You have a client who wants to invest their life savings. You show them a chart of a specific stock. You say, "Look at this! If we look at 1,000 different people investing in this stock for one year, the average person makes a 50% profit."
Your client asks, "Great! So, if I invest my money, I will make 50%?"
You say, "Well, statistically, yes. But there's a catch."
This paper is about that catch. It explains that in many real-world situations (like life, biology, or robotics), the average result of a group is completely different from the result of a single person living through time.
In the world of Reinforcement Learning (RL)—where AI agents learn by trial and error—most AI is trained to maximize that "group average." The paper argues that for a single agent trying to survive and thrive over a long time, this is a dangerous mistake.
The Analogy: The Russian Roulette Investment
To understand why, let's look at the "Coin Toss" example from the paper.
Imagine you have $100. You play a game where you flip a coin every day:
- Heads: You win 50% of your current money.
- Tails: You lose 40% of your current money.
The "Group Average" View (What the AI usually does):
If you look at 1,000 people playing this game for one day:
- 500 people gain 50% (ending with $150).
- 500 people lose 40% (ending with $60).
- The average is $105.
- So, on average, you make 5% a day. The AI says: "This is a great game! Bet everything you have!"
The "Single Life" View (What actually happens to you):
Now, imagine you play this game for 100 days. You don't get to reset and try again 1,000 times. You just live through the sequence of heads and tails.
- If you get a few tails in a row, your money shrinks.
- Because the math is multiplicative (you lose a percentage of what you have now), a loss hurts more than a gain helps.
- If you lose 40% twice, you have $36 left. If you win 50% twice, you have $225. But one loss and one win, in either order, leaves you with $100 × 0.6 × 1.5 = $90. Even with perfectly balanced luck, you lost money!
- The Result: If you play this game long enough, almost every single person will end up with $0. The "average" person is a fantasy that doesn't exist in reality.
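We can check this claim with a quick simulation. This is a minimal sketch (not code from the paper): many players each flip the +50%/−40% coin for 100 days, and we compare the expected-value prediction for the group against what the typical (median) player actually ends up with.

```python
import random

def coin_game(days=100, trials=10_000, seed=1):
    """Simulate many players of the +50%/-40% coin game.

    Returns (theoretical_ensemble_mean, median_final_wealth):
    the "group average" prediction versus the typical player's fate.
    """
    rng = random.Random(seed)
    finals = []
    for _ in range(trials):
        wealth = 100.0
        for _ in range(days):
            wealth *= 1.5 if rng.random() < 0.5 else 0.6
        finals.append(wealth)
    finals.sort()
    # Expected factor per flip is 0.5*1.5 + 0.5*0.6 = 1.05,
    # so the ensemble mean grows like 100 * 1.05**days.
    theoretical_mean = 100.0 * 1.05 ** days
    median = finals[trials // 2]
    return theoretical_mean, median

mean, median = coin_game()
# The group average predicts wealth in the thousands of dollars;
# the median player is left with well under a dollar.
```

The gap is dramatic: the mean is propped up by a vanishingly rare lucky minority, while almost everyone else goes broke.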
The paper calls this Non-Ergodicity.
- Ergodic: The average of the group = The average of one person over time. (Like rolling a die: if 1,000 people roll once, the average is 3.5. If one person rolls 1,000 times, the average is also 3.5).
- Non-Ergodic: The average of the group ≠ The average of one person over time. (Like the coin toss game above).
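The die-rolling case is easy to verify numerically. In this little sketch, the "group" estimate (many people rolling once) and the "one life" estimate (one person rolling many times) land on the same number:

```python
import random

rng = random.Random(0)

# Ensemble average: 100,000 people each roll a die once.
ensemble = sum(rng.randint(1, 6) for _ in range(100_000)) / 100_000

# Time average: one person rolls the same die 100,000 times.
time_avg = sum(rng.randint(1, 6) for _ in range(100_000)) / 100_000

# Both estimates converge to 3.5 -- dice rolling is ergodic,
# so the group average really does describe your personal experience.
```

Run the coin-game simulation above the same way and the two averages refuse to agree: that disagreement is exactly what "non-ergodic" means.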
Why This Matters for AI
Most AI robots and self-driving cars are trained using the "Group Average" method. They are told: "Maximize the expected reward."
The paper uses a Delivery Robot example:
- Route A (Fast): Takes 10 minutes. But there is a 1% chance a crowd destroys the robot. If it gets destroyed, the game is over (0 future rewards).
- Route B (Slow): Takes 20 minutes. 100% safe.
If the AI calculates the "average reward per trip," Route A looks better because 99% of the time it saves time.
But if the AI takes Route A, eventually (statistically almost surely), it will get destroyed. Once it's dead, it can't deliver anything ever again.
The "Average" AI chooses the fast route and dies. The "Real Life" AI chooses the slow route and lives forever.
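Here is a rough sketch of that trade-off in Python. The specific numbers (a 10,000-slot horizon, one delivery per fast trip) are illustrative assumptions, not figures from the paper; the point is that destruction is absorbing, so the fast route's deliveries stop permanently:

```python
import random

def route_a_deliveries(horizon_slots, p_destroy, rng):
    """Fast route: one delivery per time slot, but a 'p_destroy' chance
    per trip of being destroyed -- and destruction is absorbing."""
    done = 0
    for _ in range(horizon_slots):
        if rng.random() < p_destroy:
            return done  # destroyed: zero deliveries forever after
        done += 1
    return done

rng = random.Random(0)
horizon = 10_000  # time slots, each long enough for one fast trip

# Average Route A over many robot "lives":
runs = 1_000
avg_a = sum(route_a_deliveries(horizon, 0.01, rng) for _ in range(runs)) / runs

# Route B takes twice as long per trip but is never destroyed:
total_b = horizon // 2

# avg_a comes out near 100 (the robot survives about 1/0.01 trips),
# while the slow-but-safe route racks up 5,000 deliveries.
```

Per trip, Route A looks better; over a lifetime, it delivers roughly 2% as much.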
The Three Solutions (How to Fix the AI)
The paper reviews three clever ways to teach AI to stop chasing the "Group Average" and start caring about "Real Life" survival.
1. The "Magic Lens" (Ergodicity Transformations)
Imagine looking at the world through a special pair of glasses that changes how you see numbers.
- The Problem: The AI sees the raw money numbers, which are misleading.
- The Fix: The researchers teach the AI to look at the logarithm of the money (a mathematical trick that turns multiplication into addition).
- The Result: When the AI looks through this "Magic Lens," the game no longer looks like a trap. It sees that the safe route is actually the winning strategy. The AI learns to optimize for the "growth rate" of a single life rather than the average of a crowd.
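The "Magic Lens" is just two lines of arithmetic on the coin game. Through raw dollars the game looks profitable; through the logarithm, the same game reveals a negative growth rate:

```python
import math

# Raw expected growth factor per flip (the misleading "group average"):
raw = 0.5 * 1.5 + 0.5 * 0.6
# -> 1.05: looks like a guaranteed 5% gain per flip.

# Expected growth of log-wealth per flip (the "Magic Lens"):
log_growth = 0.5 * math.log(1.5) + 0.5 * math.log(0.6)
# -> about -0.053: a single player's wealth shrinks ~5% per flip.

# raw > 1 says "bet everything"; log_growth < 0 says "don't play."
```

An agent trained to maximize `log_growth` instead of `raw` optimizes the growth rate of its own single timeline, which is exactly what the transformation is for.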
2. The "Geometric Mean" (The Regularizer)
Imagine you are a coach training an athlete.
- The Problem: The athlete only cares about the average score of the whole team.
- The Fix: The coach adds a rule: "You must also care about your own personal streak."
- The Result: The AI is given a new goal. It tries to maximize the usual reward, but it also gets a "bonus" for maintaining a steady, positive growth rate over its own long journey. This prevents it from taking reckless risks that might kill its future.
3. The "Time Traveler" (Temporal Training)
Imagine you are playing a video game, but instead of playing one level and restarting, you play the whole story in one go.
- The Problem: The AI usually learns by taking one step, getting a reward, and forgetting the rest of the history.
- The Fix: The researchers force the AI to simulate a long timeline inside its training. It has to make a decision today, then imagine making decisions tomorrow, and the day after, all in one go.
- The Result: The AI realizes, "Oh, if I take this risky shortcut today, I won't be here to make decisions tomorrow." It learns to value the future of its own specific timeline, not just the average of all possible timelines.
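A tiny numerical sketch makes the "Time Traveler" point concrete. The rewards and survival probabilities here are invented for illustration: a risky action that wins any one-step comparison loses badly once the AI evaluates a whole simulated lifetime with an absorbing "death" state:

```python
def rollout_value(p_survive, reward, steps):
    """Exact expected total reward over a long trajectory: the agent
    earns `reward` each step it is still alive; death is absorbing."""
    total, alive = 0.0, 1.0
    for _ in range(steps):
        alive *= p_survive        # probability of having survived this far
        total += alive * reward
    return total

# One-step view: the risky action (reward 2, 99% survival)
# beats the safe action (reward 1, 100% survival).
one_step_risky = 0.99 * 2   # 1.98
one_step_safe = 1.0

# Whole-timeline view over 1,000 steps:
long_risky = rollout_value(0.99, 2.0, 1000)   # ~198: death cuts it short
long_safe = rollout_value(1.0, 1.0, 1000)     # 1000: survives the full run
```

Training on long simulated timelines bakes this comparison into the objective itself: the agent is rewarded for the future it will actually be alive to collect.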
The Takeaway
This paper is a wake-up call for the AI world.
We often build AI to be the "perfect statistical average." But in the real world, you only get one life. You don't get to play the game 1,000 times and average the results.
If an AI is going to drive your car, manage your money, or run a hospital, we don't want it to be the "average" hero who dies in a crash because the math said it was a good bet. We want an AI that understands Ergodicity: one that knows that for a single agent, survival and long-term growth matter more than short-term statistical averages.
The paper suggests that to build truly safe and effective AI for the real world, we need to stop optimizing for the "group average" and start optimizing for the "single journey."