Return Augmented Decision Transformer for Off-Dynamics Reinforcement Learning

This paper proposes the Return Augmented (REAG) method, which aligns return distributions between the source and target domains so that Decision Transformer frameworks can be adapted to offline off-dynamics reinforcement learning. The method comes with theoretical suboptimality guarantees and improved empirical performance.

Ruhan Wang, Yu Yang, Zhishuai Liu, Dongruo Zhou, Pan Xu

Published 2026-03-03

Imagine you are trying to learn how to drive a car, but you've never actually driven one before. You have two sources of information:

  1. The Target (Real Life): You have a tiny, crumpled notebook with just a few pages of notes about driving a specific car on a rainy day in New York. This is your Target Domain. It's the real deal, but you don't have much data.
  2. The Source (The Simulator): You have a massive, high-definition video game library with thousands of hours of driving footage. However, there's a catch: the cars in the game are slightly heavier, the tires are grippier, and the physics engine is a bit different from real life. This is your Source Domain.

The Problem:
If you just try to learn from the game (Source) and apply it to real life (Target), you might crash. The car in the game handles differently than the real car. This is called the "Dynamics Shift."

If you only try to learn from the tiny notebook (Target), you won't learn enough to be a good driver.

The Old Way (The "Rewrite" Strategy):
Previous methods tried to fix this by looking at the game footage and saying, "Okay, in this game, getting a 'perfect score' meant doing a specific turn. But in real life, a perfect turn looks different." So, they would go through the game footage and manually rewrite the scores (rewards) to match what a perfect turn looks like in real life.

Why the Old Way Failed for "Decision Transformers":
The paper focuses on a specific type of AI called a Decision Transformer. Think of this AI not as a student memorizing rules, but as a movie director.

  • The Director doesn't just learn how to drive; they learn to predict the next move based on a desired ending.
  • If you tell the Director, "I want a movie where the car ends up with a score of 100," the Director figures out the steps to get there.
  • The problem with the old "Rewrite" strategy is that it tried to change the score of the game footage to match the real world. But for this "Director" AI, the score is the script. If you change the script (the score) without changing the underlying story (the physics), the Director gets confused. It's like giving a director a script that says "The hero wins," but the movie they are watching shows the hero losing. The AI gets lost.
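The "desired ending" the Director conditions on is called the return-to-go: at each step, the total reward still expected from that point onward. A minimal sketch of how it is computed from a trajectory's rewards (the function name and layout here are illustrative, not the paper's code):

```python
# Hedged sketch: the return-to-go signal a Decision Transformer conditions on.
# At step t, the "script" is the sum of all rewards from t to the end.

def returns_to_go(rewards):
    """Suffix sums of rewards: rtg[t] = r[t] + r[t+1] + ... + r[T]."""
    rtg = []
    total = 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return list(reversed(rtg))

rewards = [1.0, 0.0, 2.0, 1.0]
print(returns_to_go(rewards))  # [4.0, 3.0, 3.0, 1.0]
```

The model is then trained on interleaved (return-to-go, state, action) tokens, so changing the return labels without changing the trajectories is exactly the mismatch the paper warns about.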

The New Solution: "Return Augmentation" (REAG)
The authors propose a new method called REAG (Return Augmented). Instead of trying to rewrite the script (the rewards), they change the expectation of the ending (the return) to match the real world.

Here is the analogy:
Imagine you are teaching a student (the AI) using a textbook from a different country (Source). The textbook uses a different currency.

  • Old Method: You go through the textbook and physically change every "$10" to "€9" so the numbers match your country. This is messy and breaks the logic of the math problems.
  • New Method (REAG): You keep the textbook exactly as it is, but you teach the student the exchange rate: "When a problem in this book says 'Goal: $10,' aim for the equivalent €9 in our currency."

You are augmenting the return: the goal labels attached to the source data are translated into their target-domain equivalents, while the data itself stays untouched. The AI learns that a path labeled with a given return in the game corresponds to the equivalent return in real life, even though the physics are slightly different.

How They Did It (Two Tricks):
The paper introduces two ways to do this translation:

  1. The "Mathematical Translator" (REAG-DARA): This uses math to estimate exactly how the game physics differ from real life and adjusts the score accordingly. It's like having a translator who knows precisely how to convert the game's physics into real-world terms.
  2. The "Statistical Matchmaker" (REAG-MV): This is the star of the show. Instead of modeling the physics directly, it looks at the average and spread (the mean and variance) of the scores.
    • Analogy: Imagine the game scores are like a bell curve (a mountain shape). The real-life scores are also a bell curve, but maybe the mountain is taller or wider.
    • This method simply stretches or shrinks the game's score mountain until it perfectly overlaps with the real-life score mountain. It's like taking a photo of the game's score distribution and using a filter to make it look exactly like the real world's distribution.
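The "stretch and shrink" described above amounts to an affine rescaling: standardize the game's scores under their own mean and spread, then rescale them to the real-world mean and spread. A hedged sketch of that idea (the exact estimator REAG-MV uses is in the paper; this is the textbook mean-variance version):

```python
# Hedged sketch of mean-variance return matching ("statistical matchmaker").
# Assumption: the source returns are mapped by an affine transform so they
# take on the target returns' mean and standard deviation.
import statistics

def match_mean_variance(source_returns, target_returns):
    mu_s = statistics.mean(source_returns)
    sd_s = statistics.pstdev(source_returns)
    mu_t = statistics.mean(target_returns)
    sd_t = statistics.pstdev(target_returns)
    # Standardize under the source distribution, then rescale to the target's.
    return [(r - mu_s) / sd_s * sd_t + mu_t for r in source_returns]

src = [80.0, 100.0, 120.0]   # simulator ("game") returns
tgt = [40.0, 50.0, 60.0]     # real-world returns
print(match_mean_variance(src, tgt))
```

After the transform, the simulator's score "mountain" sits exactly on top of the real-world one, so a goal stated in real-world units is meaningful for the relabeled simulator data.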

The Results:
The authors tested this on robot simulations (like a robot walking or running).

  • They gave the robots very little data from the "real world" (Target).
  • They gave them tons of data from the "game world" (Source) with different physics.
  • The Outcome: The robots trained with the new REAG method performed much better than those trained with older methods. They could take the massive amount of simulator data and successfully apply it to the target environment, even when the physics were different.

In Summary:
This paper solves the problem of "learning from a simulator that isn't quite real." Instead of trying to force the simulator to look like reality (which breaks the AI's logic), they taught the AI how to interpret the simulator's goals as if they were reality's goals. It's like teaching a student to read a foreign language not by translating every word, but by teaching them how to understand the story regardless of the language.

The result is a smarter, more adaptable AI that can learn from massive datasets in simulations and apply that knowledge to the messy, limited-data real world.
