The Sample Complexity of Online Reinforcement Learning: A Multi-model Perspective

This paper analyzes the sample complexity of online reinforcement learning for general nonlinear dynamical systems with continuous state and action spaces, proposing algorithms that achieve policy regret bounds ranging from $\mathcal{O}(N \epsilon^2 + d_\mathrm{u}\ln(m(\epsilon))/\epsilon^2)$ in the general case to $\mathcal{O}(\sqrt{d_\mathrm{u} N p})$ for parameterized models, while emphasizing the practical utility of these simple, prior-knowledge-incorporating methods.

Michael Muehlebach, Zhiyu He, Michael I. Jordan

Published 2026-03-02

Imagine you are the captain of a ship navigating through a thick, uncharted fog. You have a destination (optimizing performance), but you don't know the map (the dynamics of the ocean). You can only steer by turning the wheel (taking actions) and seeing how the ship reacts.

This paper is about how to learn the map quickly and safely while steering, without crashing the ship or getting lost for too long. The authors, Michael Muehlebach, Zhiyu He, and Michael I. Jordan, propose a new way to teach computers (specifically Reinforcement Learning algorithms) how to do this in complex, real-world situations where the rules aren't simple straight lines.

Here is the breakdown of their idea using everyday analogies:

1. The Core Problem: The "Exploration vs. Exploitation" Dilemma

In Reinforcement Learning, you face a classic catch-22:

  • Exploitation: You steer the ship the way you think is best right now to get to the destination fast.
  • Exploration: You steer the ship in a weird, random direction just to see what happens, hoping to learn something new about the ocean.

If you only exploit, you might get stuck in a bad spot because you never learned the whole map. If you only explore, you'll circle the ocean forever and never arrive. Most existing methods struggle to balance this, especially when the "ocean" is complex and non-linear (like a stormy sea rather than a calm lake).

2. The Solution: The "Gambler's Menu" Approach

The authors propose a clever strategy that combines guessing with learning. Imagine you have a menu of 100 different maps (models) of the ocean. You don't know which one is the real map, but you know the real map is somewhere in that list.

Their algorithm works like this:

  1. The Menu: You have a list of candidate maps (models).
  2. The Scorecard: Every time you steer the ship, you check: "Which map predicted this movement correctly?"
    • If Map A predicted the turn perfectly, its score goes up.
    • If Map B predicted the ship would go left but it went right, its score goes down.
  3. The Gamble (Posterior Sampling): Instead of just picking the single "best" map and sticking with it, the algorithm randomly picks a map from the menu.
    • Maps with high scores (good predictions) have a higher chance of being picked.
    • Maps with low scores have a tiny chance, but they aren't completely eliminated yet.
  4. The "Nudge" (Excitation): Here is the secret sauce. Even when you are following a map, you add a tiny, random "nudge" to the steering wheel. This is like shaking the ship slightly to make sure the water reacts. This ensures that if your current map is wrong, the ship will react in a way that exposes the error, allowing you to learn faster.
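
The four steps above fit in a few lines of code. Below is a minimal sketch for a toy scalar system with a finite menu of three candidate gains; the quadratic loss, learning rate, and nudge size are illustrative choices, not the paper's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# The menu: candidate scalar maps x_next = a*x + u, one per gain a.
candidate_a = [0.2, 0.7, 1.3]
true_a = 0.7                    # the real (unknown) ocean is on the menu

def policy(a, x):
    return -a * x               # steer toward 0 under the sampled map

def hedge_posterior_sampling(horizon=200, eta=2.0, nudge=0.1):
    log_w = np.zeros(len(candidate_a))   # the scorecard (log-weights)
    x = 1.0
    for _ in range(horizon):
        # The gamble: sample a map with probability proportional to exp(score).
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        k = rng.choice(len(candidate_a), p=p)
        # The nudge: small random excitation on top of the chosen action.
        u = policy(candidate_a[k], x) + nudge * rng.standard_normal()
        x_next = true_a * x + u          # the real dynamics respond
        # The scorecard: multiplicative (Hedge-style) update on squared error.
        preds = np.array([a * x + u for a in candidate_a])
        log_w -= eta * (preds - x_next) ** 2
        x = x_next
    return log_w

log_w = hedge_posterior_sampling()
best = int(np.argmax(log_w))     # index of the map with the highest score
```

Because the true gain 0.7 is on the menu, its prediction error is exactly zero, so its score never drops; meanwhile the nudge keeps the data informative enough to drive the wrong maps' scores down.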

3. The Three Scenarios

The paper tests this idea in three different levels of complexity:

  • Scenario A: The Finite Menu (Discrete Models)
    • Analogy: You have a physical stack of 50 printed maps.
    • Result: The algorithm quickly realizes which one or two maps are the best and stops wasting time on the bad ones. It learns very fast.
  • Scenario B: The Infinite Library (Continuous Models)
    • Analogy: You don't have a stack of maps; you have a library with infinite variations of maps (every possible curve and angle).
    • Result: The algorithm creates a "net" (a mathematical concept called a packing number) to catch the best map. It proves that even with infinite possibilities, you can still find the right path efficiently.
  • Scenario C: The Neural Network (Parametric Models)
    • Analogy: The map isn't a picture; it's a recipe with thousands of ingredients (parameters). You can tweak the amount of salt, sugar, or spice to change the map.
    • Result: This is the most practical scenario (like modern AI). The paper shows that even with thousands of "ingredients" to tune, the algorithm can find the perfect recipe and steer the ship safely.
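
Scenario B's "net" can be made concrete. The packing (or covering) number m(ε) counts how many representative maps you need so that every map in the infinite library is within ε of one of them. A one-dimensional interval version, with all numbers illustrative, looks like:

```python
import math

def covering_size(lo, hi, eps):
    """Number of grid points spaced eps apart needed to cover [lo, hi]:
    a 1-D stand-in for the packing number m(eps) over a model class."""
    return math.ceil((hi - lo) / eps) + 1

def eps_net(lo, hi, eps):
    """The representative maps themselves: a grid at resolution eps."""
    return [lo + i * eps for i in range(covering_size(lo, hi, eps))]

net = eps_net(0.0, 1.0, 0.1)    # 11 representatives cover [0, 1] at eps = 0.1
```

The point of the regret bound is that you only pay ln(m(ε)): halving ε roughly doubles the size of the net here, but adds only a constant to its logarithm.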

4. Why This is a Big Deal

Previous methods had two main problems:

  1. They were too theoretical: They worked well in simple, linear worlds (like a straight road) but broke down in complex, non-linear worlds (like a winding mountain road).
  2. They were "Bayesian": They relied on "belief" and probability in a way that was hard to guarantee in the real world.

This paper's breakthrough:

  • It works in the real world: It handles complex, non-linear systems (like a drone flying in a storm or a robot arm moving heavy objects).
  • It provides a "Frequentist" guarantee: Instead of saying "We think we are good," it says "We guarantee that after N steps, we will be this close to the optimal path." It's a mathematical promise of performance.
  • It's simple: The math behind it is surprisingly straightforward (using a "Hedge" update, which is like a smart way of betting), making it easier to implement in real engineering.
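
For reference, the "Hedge" update has a standard one-line form. This is the generic exponential-weights rule; in this setting the loss ℓ_t(i) would be something like model i's prediction error at step t:

```latex
% Hedge / exponential-weights update for the score of model i after step t,
% with learning rate \eta > 0 and per-step loss \ell_t(i):
w_{t+1}(i) \;=\; \frac{w_t(i)\, e^{-\eta\, \ell_t(i)}}
                      {\sum_{j} w_t(j)\, e^{-\eta\, \ell_t(j)}}
```

Sampling a map with probability w_t(i) is exactly the "gamble" step from Section 2: good predictors gain weight multiplicatively, bad ones fade but are never hard-eliminated.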

5. The "Benign Transient" Promise

In control theory, a "transient" is the messy, chaotic period at the beginning of a journey before things settle down.

  • Old methods: Might crash the ship or spin out of control while learning.
  • This paper: Guarantees that even while learning, the ship stays stable. The "nudge" is controlled, and the ship won't go off a cliff while trying to learn the map.

Summary

Think of this paper as a new navigation system for self-driving cars or robots. It says: "Don't just guess the map. Keep a list of possible maps, bet on the ones that look right, but occasionally wiggle the steering wheel to test your theories. This way, you will learn the true map faster than anyone else, and you won't crash while doing it."

The authors prove mathematically that this approach works for everything from simple linear systems to complex neural networks, offering a reliable path forward for AI in the real, messy world.
