ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning

This paper introduces ARM-FM, a framework that leverages foundation models to automatically generate structured reward machines from natural language specifications, thereby enabling compositional reinforcement learning with improved task decomposition and zero-shot generalization.

Roger Creus Castanyer, Faisal Mohamed, Pablo Samuel Castro, Cyrus Neary, Glen Berseth

Published Tue, 10 Ma

Imagine you are trying to teach a very smart, but very literal, robot how to cook a complex meal.

If you just say, "Make dinner," the robot might stare at the fridge, confused. If you say, "Make dinner, and if you fail, I'll give you a cookie," it might just eat the cookie and do nothing else. This captures one of the hardest problems in Reinforcement Learning (RL): precisely specifying what a computer should do, and how it can tell it's doing a good job, is surprisingly difficult.

This paper introduces a new system called ARM-FM (Automated Reward Machines via Foundation Models). Think of it as a super-smart translator and coach that bridges the gap between human language and robot logic.

Here is how it works, broken down with simple analogies:

1. The Problem: The "Black Box" Reward

In traditional RL, you have to hand-code a "reward function." This is like a scoreboard that tells the robot if it's winning or losing.

  • The Issue: If the robot has to walk 100 steps to get a coin, and you only give the coin at the very end, the robot gets no feedback for 99 steps. It's like playing a video game where you only get a "Game Over" message if you win, but no "Good job!" for jumping over a pit. The robot gives up.
  • The Risk: If you try to give points for every small step, the robot might find a "cheat code" (like spinning in circles to rack up points) instead of actually solving the problem. RL researchers call this reward hacking.
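The two failure modes above can be sketched in a few lines. This is a toy illustration (not the paper's code, and the function names are made up): a sparse reward gives no signal for 99 of 100 steps, while a naively shaped reward can be farmed without solving anything.

```python
# Toy illustration (hypothetical, not from the paper): sparse vs. naively
# shaped rewards and why both cause trouble.

def sparse_reward(step, goal_step=100):
    """Only rewards reaching the final goal: 99 steps of silence."""
    return 1.0 if step == goal_step else 0.0

def naive_shaped_reward(action):
    """Rewarding any movement invites 'cheat codes' like spinning in circles."""
    return 0.01 if action in ("left", "right", "forward") else 0.0

# An agent that just alternates left/right racks up shaping reward
# without ever approaching the goal -- the reward-hacking risk.
total = sum(naive_shaped_reward(a) for a in ["left", "right"] * 50)
print(total)  # roughly 1.0: as much as actually finishing the task
```

The spinning agent earns as much as a successful one, which is exactly why hand-tuned shaping is risky.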

2. The Solution: The "Reward Machine" (The GPS)

The authors use something called a Reward Machine. Imagine a GPS navigation system for the robot.

  • Instead of just saying "Go to the store," the GPS breaks the trip down: "Turn left," "Drive 2 miles," "Stop at the red light," "Turn right."
  • Every time the robot completes a small step (like turning left), the GPS gives it a tiny "ding" of encouragement.
  • This turns a scary, long journey into a series of small, manageable tasks.
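Under the hood, a reward machine is a small finite-state machine: its states track how far through the task you are, and rewards are attached to transitions between states. Here is a minimal sketch (simplified and with hypothetical state/event names, not the paper's implementation):

```python
# Minimal sketch of a reward machine: a finite-state machine whose states
# track subtask progress and whose transitions hand out small rewards
# ("dings") for completing each subtask.

class RewardMachine:
    def __init__(self, transitions, start, accepting):
        # transitions: {(state, event): (next_state, reward)}
        self.transitions = transitions
        self.state = start
        self.accepting = accepting

    def step(self, event):
        """Advance on an observed event; return the reward for that transition."""
        key = (self.state, event)
        if key in self.transitions:
            self.state, reward = self.transitions[key]
            return reward
        return 0.0  # event irrelevant to the current subtask

    @property
    def done(self):
        return self.state == self.accepting

# "Find a key, unlock the door, reach the green square."
rm = RewardMachine(
    transitions={
        ("u0", "got_key"):     ("u1", 0.1),
        ("u1", "opened_door"): ("u2", 0.1),
        ("u2", "at_goal"):     ("u3", 1.0),
    },
    start="u0",
    accepting="u3",
)

rewards = [rm.step(e) for e in ["got_key", "opened_door", "at_goal"]]
print(rewards, rm.done)  # small "dings" along the way, big reward at the end
```

Note how the long journey becomes three short hops, each with its own feedback signal, just like the GPS analogy.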

3. The Magic Ingredient: Foundation Models (The Creative Director)

Usually, a human expert has to build this GPS map by hand. It takes a long time and requires deep knowledge of the robot's world.

  • The Innovation: ARM-FM uses Foundation Models (large AI models trained on broad data, like modern LLMs) to build the GPS automatically.
  • You simply tell the AI in plain English: "The robot needs to find a key, unlock a door, and get to the green square."
  • The AI acts as a Creative Director. It instantly understands the story, breaks it down into steps, and writes the code for the GPS (the Reward Machine) all by itself. It even writes the "rules" for when the robot gets a "ding."
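In practice, "writing the GPS" means the foundation model emits a structured description of the reward machine that a training loop can consume. The JSON schema below is an assumption for illustration, not the paper's actual output format:

```python
import json

# Hypothetical example of the kind of structured output a foundation model
# might be prompted to emit for "find a key, unlock a door, reach the goal".
# The exact schema is an assumption, not the paper's.
llm_output = """
{
  "states": ["u0", "u1", "u2", "u3"],
  "initial": "u0",
  "transitions": [
    {"from": "u0", "event": "got_key",     "to": "u1", "reward": 0.1},
    {"from": "u1", "event": "opened_door", "to": "u2", "reward": 0.1},
    {"from": "u2", "event": "at_goal",     "to": "u3", "reward": 1.0}
  ]
}
"""

spec = json.loads(llm_output)
# Convert to a lookup table an RL training loop can query at every step.
table = {(t["from"], t["event"]): (t["to"], t["reward"])
         for t in spec["transitions"]}
print(table[("u0", "got_key")])  # ('u1', 0.1)
```

The key design point is that the model writes a machine-checkable artifact, not free-form advice, so the RL agent can query it deterministically at every step.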

4. The Secret Sauce: "Language Embeddings" (The Shared Vocabulary)

This is the most clever part.

  • Usually, if you teach a robot to open a red door, it doesn't know how to open a blue door. It has to relearn everything from scratch.
  • ARM-FM gives the robot a dictionary. Every time the robot is at a step (like "Find the key"), the AI gives it a special "language tag" (an embedding).
  • Because the AI understands language, it knows that "Find the red key" and "Find the blue key" are very similar concepts.
  • The Result: The robot learns a general skill called "finding keys." If you later ask it to find a green key (a task it has never seen before), it can use that same skill immediately. This is called Zero-Shot Generalization—solving a new puzzle without any new training.
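The intuition behind that shared vocabulary is that similar instructions land close together in embedding space. The sketch below uses a toy bag-of-words embedding and cosine similarity purely for illustration; ARM-FM uses real foundation-model embeddings, not this:

```python
from collections import Counter
import math

# Toy stand-in for a language-embedding model: bag-of-words vectors.
# (ARM-FM uses actual foundation-model embeddings; this only illustrates
# why similar instructions end up close together in vector space.)

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm

red  = embed("find the red key")
blue = embed("find the blue key")
door = embed("open the heavy door")

# "Find the red key" and "find the blue key" are nearly identical,
# so a policy conditioned on one embedding transfers to the other.
print(cosine(red, blue) > cosine(red, door))  # True
```

A policy conditioned on these embeddings sees "find the blue key" as a small perturbation of a task it already knows, which is what makes zero-shot reuse possible.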

Real-World Examples from the Paper

The team tested this in three very different worlds:

  1. MiniGrid (2D Mazes): The robot had to solve mazes with keys and doors. Standard robots got stuck; ARM-FM robots solved them easily.
  2. Craftium (Minecraft-like 3D World): The robot had to gather wood, stone, and iron to build a diamond pickaxe. This is a huge, complex task. The standard robot wandered aimlessly. The ARM-FM robot followed the AI-generated GPS and successfully mined the diamond.
  3. Robotics (Robotic Arms): They tested it on robotic-arm manipulation tasks, where the arm must pick up objects. The AI generated the instructions, and the arm learned to move smoothly without a human programming every tiny motor movement.

The Bottom Line

ARM-FM is like giving a robot a smart, talking coach that can:

  1. Listen to your vague human goals.
  2. Break them down into a perfect, step-by-step plan.
  3. Give the robot encouragement at every step.
  4. Help the robot apply what it learned in one situation to a totally new situation.

It turns the difficult art of "programming a robot's motivation" into a simple conversation, allowing robots to learn complex, long-horizon tasks much faster and more reliably.