Return Augmented Decision Transformer for Off-Dynamics Reinforcement Learning

This paper proposes the Return Augmented (REAG) method, which aligns return distributions between the source and target domains so that Decision Transformer frameworks can be adapted to offline off-dynamics reinforcement learning. The method comes with theoretical suboptimality guarantees and improved empirical performance.

Ruhan Wang, Yu Yang, Zhishuai Liu, Dongruo Zhou, Pan Xu

Published 2026-03-03

Imagine you are trying to learn how to drive a car, but you've never actually driven one before. You have two sources of information:

  1. The Target (Real Life): You have a tiny, crumpled notebook with just a few pages of notes about driving a specific car on a rainy day in New York. This is your Target Domain. It's the real deal, but you don't have much data.
  2. The Source (The Simulator): You have a massive, high-definition video game library with thousands of hours of driving footage. However, there's a catch: the cars in the game are slightly heavier, the tires are grippier, and the physics engine is a bit different from real life. This is your Source Domain.

The Problem:
If you just try to learn from the game (Source) and apply it to real life (Target), you might crash. The car in the game handles differently than the real car. This is called the "Dynamics Shift."

If you only try to learn from the tiny notebook (Target), you won't learn enough to be a good driver.

The Old Way (The "Rewrite" Strategy):
Previous methods tried to fix this by looking at the game footage and saying, "Okay, in this game, getting a 'perfect score' meant doing a specific turn. But in real life, a perfect turn looks different." So, they would go through the game footage and manually rewrite the scores (rewards) to match what a perfect turn looks like in real life.

Why the Old Way Failed for "Decision Transformers":
The paper focuses on a specific type of AI called a Decision Transformer. Think of this AI not as a student memorizing rules, but as a movie director.

  • The Director doesn't just learn how to drive; they learn to predict the next move based on a desired ending.
  • If you tell the Director, "I want a movie where the car ends up with a score of 100," the Director figures out the steps to get there.
  • The problem with the old "Rewrite" strategy is that it tried to change the score of the game footage to match the real world. But for this "Director" AI, the score is the script. If you change the script (the score) without changing the underlying story (the physics), the Director gets confused. It's like giving a director a script that says "The hero wins," but the movie they are watching shows the hero losing. The AI gets lost.
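The "desired ending" the Director conditions on is called the return-to-go: at each step, the total reward still expected from that point onward. A minimal sketch of how it is computed from a trajectory's rewards (the function name and layout here are illustrative, not the paper's code):

```python
# Hedged sketch: the return-to-go signal a Decision Transformer conditions on.
# At step t, the "script" is the sum of all rewards from t to the end.

def returns_to_go(rewards):
    """Suffix sums of rewards: rtg[t] = r[t] + r[t+1] + ... + r[T]."""
    rtg = []
    total = 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return list(reversed(rtg))

rewards = [1.0, 0.0, 2.0, 1.0]
print(returns_to_go(rewards))  # [4.0, 3.0, 3.0, 1.0]
```

The model is then trained on interleaved (return-to-go, state, action) tokens, so changing the return labels without changing the trajectories is exactly the mismatch the paper warns about.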

The New Solution: "Return Augmentation" (REAG)
The authors propose a new method called REAG (Return Augmented). Instead of trying to rewrite the script (the rewards), they change the expectation of the ending (the return) to match the real world.

Here is the analogy:
Imagine you are teaching a student (the AI) using a textbook from a different country (Source). The textbook uses a different currency.

  • Old Method: You go through the textbook and physically change every "$10" to "€9" so the numbers match your country. This is messy and breaks the logic of the math problems.
  • New Method (REAG): You keep the textbook exactly as it is, but you teach the student the exchange rate: "When a problem in this book says 'Goal: $10,' aim for the equivalent €9 in our currency."

You are augmenting the return: the goal labels attached to the source data are translated into their target-domain equivalents, while the data itself stays untouched. The AI learns that a path labeled with a given return in the game corresponds to the equivalent return in real life, even though the physics are slightly different.

How They Did It (Two Tricks):
The paper introduces two ways to do this translation:

  1. The "Mathematical Translator" (REAG-DARA): This uses math to estimate exactly how the game physics differ from real life and adjusts the score accordingly. It's like having a translator who knows precisely how to convert the game's physics into real-world terms.
  2. The "Statistical Matchmaker" (REAG-MV): This is the star of the show. Instead of modeling the physics directly, it looks at the average and spread (the mean and variance) of the scores.
    • Analogy: Imagine the game scores are like a bell curve (a mountain shape). The real-life scores are also a bell curve, but maybe the mountain is taller or wider.
    • This method simply stretches or shrinks the game's score mountain until it perfectly overlaps with the real-life score mountain. It's like taking a photo of the game's score distribution and using a filter to make it look exactly like the real world's distribution.
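The "stretch and shrink" described above amounts to an affine rescaling: standardize the game's scores under their own mean and spread, then rescale them to the real-world mean and spread. A hedged sketch of that idea (the exact estimator REAG-MV uses is in the paper; this is the textbook mean-variance version):

```python
# Hedged sketch of mean-variance return matching ("statistical matchmaker").
# Assumption: the source returns are mapped by an affine transform so they
# take on the target returns' mean and standard deviation.
import statistics

def match_mean_variance(source_returns, target_returns):
    mu_s = statistics.mean(source_returns)
    sd_s = statistics.pstdev(source_returns)
    mu_t = statistics.mean(target_returns)
    sd_t = statistics.pstdev(target_returns)
    # Standardize under the source distribution, then rescale to the target's.
    return [(r - mu_s) / sd_s * sd_t + mu_t for r in source_returns]

src = [80.0, 100.0, 120.0]   # simulator ("game") returns
tgt = [40.0, 50.0, 60.0]     # real-world returns
print(match_mean_variance(src, tgt))
```

After the transform, the simulator's score "mountain" sits exactly on top of the real-world one, so a goal stated in real-world units is meaningful for the relabeled simulator data.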

The Results:
The authors tested this on robot simulations (like a robot walking or running).

  • They gave the robots very little data from the "real world" (Target).
  • They gave them tons of data from the "game world" (Source) with different physics.
  • The Outcome: The robots trained with the new REAG method performed much better than those trained with older methods. They could take the massive amount of simulator data and successfully apply it to the target environment, even when the physics were different.

In Summary:
This paper solves the problem of "learning from a simulator that isn't quite real." Instead of trying to force the simulator to look like reality (which breaks the AI's logic), they taught the AI how to interpret the simulator's goals as if they were reality's goals. It's like teaching a student to read a foreign language not by translating every word, but by teaching them how to understand the story regardless of the language.

The result is a smarter, more adaptable AI that can learn from massive datasets in simulations and apply that knowledge to the messy, limited-data real world.
