Model-based Offline RL via Robust Value-Aware Model Learning with Implicitly Differentiable Adaptive Weighting

This paper introduces ROMI, a robust offline reinforcement learning method. ROMI replaces RAMBO's unstable model gradient updates with a robust value-aware learning objective and implicitly differentiable adaptive weighting, achieving controllable conservatism and superior performance on out-of-distribution datasets.

Zhongjian Qiao, Jiafei Lyu, Boxiang Lyu, Yao Shu, Siyang Gao, Shuang Qiu

Published 2026-03-10

The Big Picture: The "Fake GPS" Problem

Imagine you are trying to learn how to drive a car, but you aren't allowed to get behind the wheel. Instead, you have a massive video library of other people driving (this is Offline Reinforcement Learning).

To get better, you build a simulation (a "dynamics model") based on those videos. You pretend to drive in this simulation to practice new moves. This is Model-Based Offline RL.

The Problem: Your simulation isn't perfect. It's like a GPS that sometimes makes up roads that don't exist.

  • If you trust the GPS too much, you might try to drive off a cliff because the GPS said, "Turn left here, it's a shortcut!" (This is called Model Exploitation).
  • To stop this, previous methods (like RAMBO) tried to be super pessimistic. They told the simulation: "Assume the worst possible outcome for every turn."
  • The Catch: Being too pessimistic is dangerous. It's like a GPS that says, "Don't drive anywhere, because you might crash." The car never moves, or the GPS starts glitching out (gradient explosion) because it's trying to predict a disaster that doesn't exist.

The Solution: ROMI (The Smart, Balanced Coach)

The authors propose a new method called ROMI. Think of ROMI as a smart driving coach who fixes the GPS without making it useless. They do this in two main ways:

1. The "Safety Bubble" (Robust Value-Aware Learning)

Instead of just saying "Assume the worst," ROMI creates a Safety Bubble around every prediction.

  • Old Way (RAMBO): "If you turn left, you might die. So, don't turn left." (Too scary, stops learning).
  • ROMI Way: "If you turn left, imagine you are in a small, fuzzy bubble of uncertainty. Inside that bubble, what is the worst thing that could happen? Okay, that's the value we use."
  • The Magic: The size of this bubble is adjustable.
    • Small bubble = You are confident, take more risks.
    • Big bubble = You are unsure, be very careful.
    • Why it works: This lets the AI control exactly how "scared" it should be, preventing the GPS from crashing (gradient explosion) while still keeping it safe.
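Stripped of the analogy, the "safety bubble" amounts to taking the worst-case value over a small uncertainty ball around the model's predicted next state, with the ball's radius as the pessimism knob. Here is a minimal NumPy sketch of that idea, assuming a hypothetical `value_fn` and approximating the minimum over an L2 ball by sampling; the paper's exact robust operator and radius schedule are not reproduced here.

```python
import numpy as np

def robust_value_target(value_fn, next_state, radius, n_samples=64, rng=None):
    """Worst-case value inside an L2 'safety bubble' of the given radius
    around the model's predicted next state. A larger radius means more
    pessimism; radius=0 recovers the plain (non-robust) value."""
    rng = np.random.default_rng(0) if rng is None else rng
    if radius == 0.0:
        return value_fn(next_state)
    # Sample perturbed next states uniformly-ish inside the ball:
    # random directions scaled by random radii up to `radius`.
    noise = rng.normal(size=(n_samples, next_state.shape[-1]))
    noise /= np.linalg.norm(noise, axis=-1, keepdims=True)
    scales = rng.uniform(0.0, radius, size=(n_samples, 1))
    candidates = next_state + scales * noise
    # The pessimistic target is the worst value found in the bubble.
    return min(value_fn(s) for s in candidates)
```

Because the minimum is taken only over a bounded neighborhood, the target can never be arbitrarily bad, which is what keeps the pessimism (and its gradients) under control.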

2. The "Smart Highlighter" (Implicitly Differentiable Adaptive Weighting)

Here is the second problem: The simulation is good at predicting what happens right now, but if you keep driving in the simulation for 10 steps, the errors pile up. It's like a game of "Telephone" where the message gets garbled after a few turns.

ROMI introduces a Smart Highlighter (a weighting network) that acts like a teacher grading your practice sessions.

  • The Inner Loop (The Student): The simulation tries to learn from the video data.
  • The Outer Loop (The Teacher): The "Highlighter" looks at the simulation's predictions.
    • If the simulation predicts a future state that leads to a bad outcome (a crash), the Highlighter says, "Hey, pay extra attention to this specific video clip! We need to learn how to avoid this."
    • If the simulation predicts a safe outcome, the Highlighter says, "Okay, that's fine, move on."
  • The Result: The simulation learns to focus on the dangerous parts of the data that matter most for safety, while still remembering how the car actually drives. It balances Dynamics Awareness (knowing how the car moves) and Value Awareness (knowing when it's dangerous).
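The inner/outer loop structure above can be illustrated with a toy alternating scheme. This sketch is an assumption-laden simplification: it uses a closed-form weighted least-squares fit as the inner "student" step and a made-up `danger` score (distance from the origin) as the outer "teacher" signal, rather than the paper's actual implicit differentiation through the inner optimization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D dynamics dataset: true transitions are s' = 0.9 * s.
states = rng.uniform(-1.0, 1.0, size=32)
true_a = 0.9
next_states = true_a * states

a = 0.0                       # inner-loop model parameter ("student")
w = np.ones(32) / 32          # outer-loop per-sample weights ("teacher")

def danger(s_next):
    # Hypothetical danger signal: predicted states far from the origin
    # stand in for "outcomes that lead to a crash".
    return np.abs(s_next)

for _ in range(200):
    # Inner loop: fit the dynamics model on the weighted data
    # (weighted least squares has a closed form in 1-D).
    a = np.sum(w * states * next_states) / np.sum(w * states**2)
    # Outer loop: up-weight transitions whose model-predicted outcomes
    # look dangerous, so the next inner fit "pays extra attention" there.
    logits = danger(a * states)
    w = np.exp(logits - logits.max())
    w /= w.sum()
```

The key design point the sketch preserves is the division of labor: the weights never change the model directly; they only reshape which data the model is asked to fit well, which is how dynamics awareness and value awareness stay balanced.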

The Analogy: Learning to Cook

Imagine you are learning to cook by watching old family recipes (Offline Data), but you don't have a kitchen to practice in. You build a mental model of how the food cooks.

  1. The Risk: Your mental model might think, "If I add salt, the soup will explode." So you never add salt. The soup tastes terrible.
  2. RAMBO's Approach: It tries to be super careful. "If I add salt, maybe it explodes. Maybe it burns. Maybe the kitchen catches fire." It gets so scared it stops cooking.
  3. ROMI's Approach:
    • The Bubble: It says, "Okay, let's assume the salt might make the soup too salty (the worst case in our safety bubble). We'll plan for that, but we won't assume the kitchen explodes."
    • The Highlighter: As you mentally practice cooking, ROMI highlights the specific moments where you almost burned the soup. It tells your brain, "Focus on this step of the recipe. That's where the danger is."

Why This Matters (The Results)

The authors tested ROMI on many different "driving" and "robot" tasks (like the D4RL and NeoRL datasets).

  • RAMBO (the old method) often failed or crashed when they tried to make it safer.
  • ROMI was able to be safely conservative without breaking.
  • The Score: ROMI beat almost every other method, including the current state of the art. It learned to drive faster and more safely than the competition, especially on tricky tracks where other methods gave up.

Summary

ROMI is a new way for AI to learn from past data without trying new things. It fixes the problem of being too scared to learn by using a tunable safety bubble and a smart highlighting system that teaches the AI exactly where to be careful. It's the difference between a GPS that tells you to stay in bed and a GPS that tells you, "Drive carefully, but you can get there."