Online Robust Reinforcement Learning with General Function Approximation

This paper proposes RFL-ϕ, a fully online distributionally robust reinforcement learning algorithm with general function approximation that learns robust policies through environment interaction alone. By leveraging a new complexity measure, the robust Bellman-Eluder dimension, it achieves sublinear regret bounds independent of the sizes of the state and action spaces.

Debamita Ghosh, George K. Atia, Yue Wang

Published 2026-03-05

Imagine you are teaching a robot to play a video game, like balancing a pole on a moving cart.

The Problem: The "Perfect Practice" Trap
Usually, we train robots in a perfect, simulated world. The robot learns that "if I push left, the cart moves left." It gets really good at this. But when you put the robot in the real world, things change. Maybe the floor is slippery, the wind is blowing, or the cart's wheels are slightly worn out. The robot, which was trained on "perfect" data, panics and fails. It's like a student who memorized the answers to a practice test but fails the real exam because the questions were slightly different.

The Old Solution: The "Safety Net" Approach
Previous methods tried to fix this by assuming we have a massive library of data or a "magic box" (a generative model) that can simulate every possible disaster scenario. They would say, "Okay, let's train the robot on 10 million different versions of the game, including ones where the floor is made of ice."

  • The Flaw: In the real world, we don't have infinite data or magic boxes. We only have the robot interacting with the real environment, one step at a time. Also, these older methods were tabular: like solving a puzzle with a million tiny pieces, one entry per state, they break down when the world gets too big or complex.

The New Solution: "The Paranoid Optimist"
This paper introduces a new algorithm called RFL-ϕ. Think of it as a Paranoid Optimist.

  1. The Optimist Part: Like standard AI, it wants to learn the best way to win. It tries things, learns from mistakes, and gets better.
  2. The Paranoid Part: But, it assumes that every time it takes a step, the universe might try to trick it. It asks, "What is the worst possible thing that could happen right now, given that my sensors might be slightly off or the ground might be slippery?"

Instead of just learning "Push left = Move left," it learns "Push left = Move left, unless the floor is icy, in which case I might slide right, so I need to be ready for that."
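In reinforcement-learning terms, the "paranoid" step replaces the usual Bellman backup with a worst-case backup over an uncertainty set of transition models. Here is a minimal sketch, assuming an explicit finite set of candidate models for illustration (function names and the candidate-set construction are mine, not the paper's):

```python
import numpy as np

def robust_backup(reward, gamma, candidate_models, next_values):
    """One robust Bellman backup: value an action by the least
    favorable transition model in the uncertainty set.

    candidate_models: probability vectors over next states, e.g. the
    nominal "push left -> move left" model plus perturbed "icy floor"
    variants.  (An illustrative finite set; the paper's uncertainty
    sets are built around an unknown nominal model.)
    """
    worst = min(float(np.dot(p, next_values)) for p in candidate_models)
    return reward + gamma * worst
```

A standard backup would use only the nominal model; the robust one lowers the value of any action whose outcome degrades badly under plausible perturbations.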

How It Works (The Creative Analogy)

Imagine you are a chef trying to perfect a soup recipe.

  • Old Way (Offline/Tabular): You have a giant library of every soup ever made. You taste them all, write down the exact recipe for every possible variation, and then try to memorize the whole library. This is impossible if the library is infinite (like a real-world environment).
  • The New Way (RFL-ϕ): You are cooking in a real kitchen. You taste the soup as you go.
    • The "Dual" Trick: Instead of just tasting the soup, you have a Skeptical Sous-Chef (the "Dual" part). Every time you think, "This soup tastes perfect," the Sous-Chef says, "Wait, what if the salt shaker was actually full of sugar? What if the heat was 10 degrees hotter?"
    • The Sous-Chef doesn't just guess; it uses a mathematical formula to calculate the worst-case flavor profile based on how much the ingredients might vary (the "uncertainty set").
    • You then adjust your recipe to make sure the soup tastes good even if the Sous-Chef's worst-case scenario happens.
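The Sous-Chef's "mathematical formula" is a dual reformulation: rather than searching over every distribution in the uncertainty set, the worst case can be computed directly from the nominal distribution. A hedged sketch for a total-variation ball, one common choice of uncertainty set (the paper may use a different divergence, with a correspondingly different dual):

```python
import numpy as np

def worst_case_expectation(next_values, nominal, radius):
    """min E_P[V] over distributions P within total-variation
    distance `radius` of `nominal`: shift up to `radius` of
    probability mass from the best outcomes onto the worst one."""
    p = np.asarray(nominal, dtype=float).copy()
    sink = int(np.argmin(next_values))       # worst outcome absorbs mass
    budget = radius
    for i in np.argsort(next_values)[::-1]:  # best outcomes lose mass first
        if budget <= 0:
            break
        if i == sink:
            continue
        move = min(p[i], budget)
        p[i] -= move
        p[sink] += move
        budget -= move
    return float(p @ np.asarray(next_values, dtype=float))
```

Here `radius` controls how paranoid the agent is: radius 0 recovers the ordinary expectation, while a large radius assumes the worst outcome always happens.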

The "Magic" Behind the Scenes

The paper uses some fancy math terms, but here is the simple translation:

  • General Function Approximation: Instead of memorizing every single state (like "Cart at position 1, speed 2"), the robot uses a flexible brain (like a neural network) to understand patterns. It learns the concept of balance, not just a list of rules.
  • Robust Bellman-Eluder Dimension: This is a fancy way of measuring how hard the puzzle is.
    • Imagine a maze. Some mazes are simple; you only need to remember a few turns. Others are complex; you need to remember thousands of paths.
    • This paper invented a new ruler to measure the complexity of the "worst-case" maze. It proves that even in a chaotic, changing world, the robot can learn efficiently if the "worst-case" maze isn't infinitely complex.
  • No Magic Data: The robot learns only by doing. It doesn't need a pre-collected library of disasters. It learns to be robust while it is learning to be good.
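As a concrete (and deliberately simple) instance of "general function approximation", here is a linear Q-function fitted by ridge regression to Bellman targets collected online. The feature map `phi` and all names are hypothetical stand-ins for whatever function class the algorithm is instantiated with:

```python
import numpy as np

def phi(state, action):
    """Hypothetical feature map: the learned function generalizes
    across states instead of tabulating each one."""
    s = np.asarray(state, dtype=float)
    return np.concatenate([s, [float(action)], [1.0]])

def fit_q_weights(transitions, targets, ridge=1e-3):
    """Ridge regression of Bellman targets (e.g. reward plus the
    discounted worst-case next value) onto features of the visited
    (state, action) pairs."""
    X = np.array([phi(s, a) for s, a in transitions])
    y = np.asarray(targets, dtype=float)
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def q_value(w, state, action):
    return float(phi(state, action) @ w)
```

The point of the pattern: the agent never stores a table indexed by state, so the same few weights make predictions at states it has never visited, which is what lets the method scale beyond tabular problems.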

Why This Matters

  • Safety: If you are training a self-driving car, you don't want it to crash just because it rains. This method teaches the car to drive safely even if the rain is heavier than expected.
  • Scalability: It works for huge, complex problems (like controlling a robot with 100 joints) where old methods would crash because they tried to memorize every possibility.
  • Efficiency: It learns faster because it focuses on the structure of the problem (the "worst-case" patterns) rather than brute-forcing every single scenario.

In a Nutshell
This paper teaches robots to stop being "naive optimists" who assume the world is perfect, and start being "smart pessimists" who prepare for the worst while still trying to win. It does this without needing a supercomputer to simulate every possible disaster, making it practical for real-world robots, self-driving cars, and medical AI.
