ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning

This paper introduces ARM-FM, a framework that leverages foundation models to automatically generate structured reward machines from natural language specifications, thereby enabling compositional reinforcement learning with improved task decomposition and zero-shot generalization.

Roger Creus Castanyer, Faisal Mohamed, Pablo Samuel Castro, Cyrus Neary, Glen Berseth

Published Tue, 10 Ma

Imagine you are trying to teach a very smart, but very literal, robot how to cook a complex meal.

If you just say, "Make dinner," the robot might stare at the fridge, confused. If you say, "Make dinner, and if you fail, I'll give you a cookie," it might just eat the cookie and do nothing else. This captures one of the hardest problems in Reinforcement Learning (RL): precisely specifying what a computer should do, and how it can tell it's doing a good job, is surprisingly difficult.

This paper introduces a new system called ARM-FM (Automated Reward Machines via Foundation Models). Think of it as a super-smart translator and coach that bridges the gap between human language and robot logic.

Here is how it works, broken down with simple analogies:

1. The Problem: The "Black Box" Reward

In traditional RL, you have to hand-code a "reward function." This is like a scoreboard that tells the robot if it's winning or losing.

  • The Issue: If the robot has to walk 100 steps to get a coin, and you only give the coin at the very end, the robot gets no feedback for 99 steps. It's like playing a video game where you only get a "Game Over" message if you win, but no "Good job!" for jumping over a pit. The robot gives up.
  • The Risk: If you try to give points for every small step, the robot might find a "cheat code" (like spinning in circles to rack up points) instead of actually solving the problem. RL researchers call this reward hacking.
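The two failure modes above can be sketched in a few lines. This is a toy illustration (not the paper's code, and the function names are made up): a sparse reward gives no signal for 99 of 100 steps, while a naively shaped reward can be farmed without solving anything.

```python
# Toy illustration (hypothetical, not from the paper): sparse vs. naively
# shaped rewards and why both cause trouble.

def sparse_reward(step, goal_step=100):
    """Only rewards reaching the final goal: 99 steps of silence."""
    return 1.0 if step == goal_step else 0.0

def naive_shaped_reward(action):
    """Rewarding any movement invites 'cheat codes' like spinning in circles."""
    return 0.01 if action in ("left", "right", "forward") else 0.0

# An agent that just alternates left/right racks up shaping reward
# without ever approaching the goal -- the reward-hacking risk.
total = sum(naive_shaped_reward(a) for a in ["left", "right"] * 50)
print(total)  # roughly 1.0: as much as actually finishing the task
```

The spinning agent earns as much as a successful one, which is exactly why hand-tuned shaping is risky.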

2. The Solution: The "Reward Machine" (The GPS)

The authors use something called a Reward Machine. Imagine a GPS navigation system for the robot.

  • Instead of just saying "Go to the store," the GPS breaks the trip down: "Turn left," "Drive 2 miles," "Stop at the red light," "Turn right."
  • Every time the robot completes a small step (like turning left), the GPS gives it a tiny "ding" of encouragement.
  • This turns a scary, long journey into a series of small, manageable tasks.
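Under the hood, a reward machine is a small finite-state machine: its states track how far through the task you are, and rewards are attached to transitions between states. Here is a minimal sketch (simplified and with hypothetical state/event names, not the paper's implementation):

```python
# Minimal sketch of a reward machine: a finite-state machine whose states
# track subtask progress and whose transitions hand out small rewards
# ("dings") for completing each subtask.

class RewardMachine:
    def __init__(self, transitions, start, accepting):
        # transitions: {(state, event): (next_state, reward)}
        self.transitions = transitions
        self.state = start
        self.accepting = accepting

    def step(self, event):
        """Advance on an observed event; return the reward for that transition."""
        key = (self.state, event)
        if key in self.transitions:
            self.state, reward = self.transitions[key]
            return reward
        return 0.0  # event irrelevant to the current subtask

    @property
    def done(self):
        return self.state == self.accepting

# "Find a key, unlock the door, reach the green square."
rm = RewardMachine(
    transitions={
        ("u0", "got_key"):     ("u1", 0.1),
        ("u1", "opened_door"): ("u2", 0.1),
        ("u2", "at_goal"):     ("u3", 1.0),
    },
    start="u0",
    accepting="u3",
)

rewards = [rm.step(e) for e in ["got_key", "opened_door", "at_goal"]]
print(rewards, rm.done)  # small "dings" along the way, big reward at the end
```

Note how the long journey becomes three short hops, each with its own feedback signal, just like the GPS analogy.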

3. The Magic Ingredient: Foundation Models (The Creative Director)

Usually, a human expert has to build this GPS map by hand. It takes a long time and requires deep knowledge of the robot's world.

  • The Innovation: ARM-FM uses Foundation Models (large AI models trained on broad data, like modern LLMs) to build the GPS automatically.
  • You simply tell the AI in plain English: "The robot needs to find a key, unlock a door, and get to the green square."
  • The AI acts as a Creative Director. It instantly understands the story, breaks it down into steps, and writes the code for the GPS (the Reward Machine) all by itself. It even writes the "rules" for when the robot gets a "ding."
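In practice, "writing the GPS" means the foundation model emits a structured description of the reward machine that a training loop can consume. The JSON schema below is an assumption for illustration, not the paper's actual output format:

```python
import json

# Hypothetical example of the kind of structured output a foundation model
# might be prompted to emit for "find a key, unlock a door, reach the goal".
# The exact schema is an assumption, not the paper's.
llm_output = """
{
  "states": ["u0", "u1", "u2", "u3"],
  "initial": "u0",
  "transitions": [
    {"from": "u0", "event": "got_key",     "to": "u1", "reward": 0.1},
    {"from": "u1", "event": "opened_door", "to": "u2", "reward": 0.1},
    {"from": "u2", "event": "at_goal",     "to": "u3", "reward": 1.0}
  ]
}
"""

spec = json.loads(llm_output)
# Convert to a lookup table an RL training loop can query at every step.
table = {(t["from"], t["event"]): (t["to"], t["reward"])
         for t in spec["transitions"]}
print(table[("u0", "got_key")])  # ('u1', 0.1)
```

The key design point is that the model writes a machine-checkable artifact, not free-form advice, so the RL agent can query it deterministically at every step.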

4. The Secret Sauce: "Language Embeddings" (The Shared Vocabulary)

This is the most clever part.

  • Usually, if you teach a robot to open a red door, it doesn't know how to open a blue door. It has to relearn everything from scratch.
  • ARM-FM gives the robot a dictionary. Every time the robot is at a step (like "Find the key"), the AI gives it a special "language tag" (an embedding).
  • Because the AI understands language, it knows that "Find the red key" and "Find the blue key" are very similar concepts.
  • The Result: The robot learns a general skill called "finding keys." If you later ask it to find a green key (a task it has never seen before), it can use that same skill immediately. This is called Zero-Shot Generalization—solving a new puzzle without any new training.
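The intuition behind that shared vocabulary is that similar instructions land close together in embedding space. The sketch below uses a toy bag-of-words embedding and cosine similarity purely for illustration; ARM-FM uses real foundation-model embeddings, not this:

```python
from collections import Counter
import math

# Toy stand-in for a language-embedding model: bag-of-words vectors.
# (ARM-FM uses actual foundation-model embeddings; this only illustrates
# why similar instructions end up close together in vector space.)

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm

red  = embed("find the red key")
blue = embed("find the blue key")
door = embed("open the heavy door")

# "Find the red key" and "find the blue key" are nearly identical,
# so a policy conditioned on one embedding transfers to the other.
print(cosine(red, blue) > cosine(red, door))  # True
```

A policy conditioned on these embeddings sees "find the blue key" as a small perturbation of a task it already knows, which is what makes zero-shot reuse possible.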

Real-World Examples from the Paper

The team tested this in three very different worlds:

  1. MiniGrid (2D Mazes): The robot had to solve mazes with keys and doors. Standard robots got stuck; ARM-FM robots solved them easily.
  2. Craftium (Minecraft-like 3D World): The robot had to gather wood, stone, and iron to build a diamond pickaxe. This is a huge, complex task. The standard robot wandered aimlessly. The ARM-FM robot followed the AI-generated GPS and successfully mined the diamond.
  3. Robotics (Robotic Arms): They tested it on robotic-arm manipulation tasks, where the arm must pick up objects. The AI generated the instructions, and the arm learned to move smoothly without a human programming every tiny motor movement.

The Bottom Line

ARM-FM is like giving a robot a smart, talking coach that can:

  1. Listen to your vague human goals.
  2. Break them down into a perfect, step-by-step plan.
  3. Give the robot encouragement at every step.
  4. Help the robot apply what it learned in one situation to a totally new situation.

It turns the difficult art of "programming a robot's motivation" into a simple conversation, allowing robots to learn complex, long-horizon tasks much faster and more reliably.