The Big Problem: The "Silent" Teacher
Imagine you are training a dog to perform a complex trick, like "fetch the newspaper, bring it to the living room, and drop it on the rug."
In standard Reinforcement Learning (RL), the dog (the agent) only gets a "treat" (reward) at the very end if it succeeds. If it drops the paper in the kitchen, it gets nothing. If it brings it to the wrong room, it gets nothing. The dog has to guess what it did wrong just by looking at the lack of a treat. This is hard, slow, and often impossible for complex tasks.
To help, researchers invented Reward Machines (RMs). Think of an RM as a specialized coach standing next to the dog.
- How it worked before: The coach could only speak if the environment gave them a specific "flag." For example, the environment had to shout, "Hey Coach! The dog is at the newspaper!" (Label A) or "The dog is at the living room!" (Label B).
- The Catch: To make this work, a human had to manually write a complex program (a "labeling function") to watch the dog and shout these flags. If you wanted to train a robot in a new factory, you'd have to hire a human expert to write a new flag-shouting program for that specific factory. It's like hiring a translator for every single conversation you have. It's too much work and doesn't scale.
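The classic setup above can be sketched as a tiny state machine driven by environment-supplied labels. This is an illustrative sketch, not the paper's formalism; the state names, labels, and the labeling function are all made up for the dog example:

```python
# A minimal classic Reward Machine: it can only advance on labels ("flags")
# emitted by a hand-written labeling function watching the environment.

class RewardMachine:
    def __init__(self):
        # (current_state, label) -> (next_state, reward)
        self.delta = {
            ("u0", "at_newspaper"): ("u1", 0.0),
            ("u1", "at_living_room"): ("u2", 0.0),
            ("u2", "dropped_on_rug"): ("u3", 1.0),  # task complete: treat!
        }
        self.state = "u0"

    def step(self, label):
        """Advance on an observed label; irrelevant labels leave us in place."""
        next_state, reward = self.delta.get((self.state, label), (self.state, 0.0))
        self.state = next_state
        return reward


def labeling_function(env_state):
    """The hand-written 'flag shouter' a human must author per environment."""
    if env_state["dog_pos"] == env_state["newspaper_pos"]:
        return "at_newspaper"
    if env_state["dog_pos"] in env_state["living_room_cells"]:
        return "at_living_room"
    return None  # ...and so on, for every event the task cares about
```

Note that `labeling_function` is the expensive part: it encodes environment-specific knowledge (where the newspaper is, which cells are the living room) and has to be rewritten for every new environment.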
The Solution: Symbolic Reward Machines (SRMs)
The authors of this paper say: "Let's cut out the middleman." They propose Symbolic Reward Machines (SRMs).
Instead of waiting for the environment to shout a flag, the SRM looks directly at the dog and its surroundings. It uses symbolic formulas (math-style rules over the raw observations) to understand what's happening.
- The Old Way (RM): The environment says, "Label: 'At Newspaper'." The Coach sees the label and says, "Okay, good job!"
- The New Way (SRM): The Coach looks at the dog's coordinates and thinks, "Is the dog's x near the newspaper's x, and its y near the newspaper's y?" If yes, the Coach knows the dog is at the newspaper and gives a reward.
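The difference, in code: the machine's transitions carry symbolic predicates evaluated directly on the raw observation, so no labeling function is needed. This is a hedged sketch with made-up coordinates, not the paper's exact construction:

```python
# A Symbolic Reward Machine: edges are guarded by symbolic formulas
# evaluated directly on the raw observation -- no labeling function.

NEWSPAPER = (4.0, 1.0)       # illustrative newspaper coordinates
LIVING_ROOM_X = (0.0, 3.0)   # illustrative x-range of the living room


def near(p, target, eps=0.5):
    """True when p is within eps of target on both axes."""
    return abs(p[0] - target[0]) <= eps and abs(p[1] - target[1]) <= eps


class SymbolicRewardMachine:
    def __init__(self):
        # (state, guard predicate over raw obs, next_state, reward)
        self.delta = [
            ("u0", lambda obs: near(obs["pos"], NEWSPAPER), "u1", 0.0),
            ("u1", lambda obs: LIVING_ROOM_X[0] <= obs["pos"][0] <= LIVING_ROOM_X[1],
             "u2", 1.0),
        ]
        self.state = "u0"

    def step(self, obs):
        for state, guard, next_state, reward in self.delta:
            if state == self.state and guard(obs):
                self.state = next_state
                return reward
        return 0.0
```

Because the guards read raw coordinates, this machine can be dropped into any environment that exposes positions, with no per-environment glue code.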
Why is this a big deal?
- No Manual Translation: You don't need a human to write a "flag-shouting" program. The SRM reads the raw data (like coordinates or speed) directly.
- Plug-and-Play: You can take an SRM and drop it into any standard video game or robot simulation without changing the game's code. It just works.
- Interpretability: The rules the SRM learns are written in plain math (e.g., "If speed is low and position is high..."). A human can read this and understand why the robot is getting rewarded.
The Two New Algorithms: QSRM and LSRM
The paper introduces two new tools to use these machines:
1. QSRM (The "Expert Assistant")
Imagine you already have the perfect rulebook (the SRM) written down. QSRM is the algorithm that uses this rulebook to teach the agent super-fast.
- How it works: It's like a student who has the answer key. It learns much faster than a student guessing blindly.
- The Result: It learns just as well as the old method (QRM) but doesn't need the environment to be modified to emit labels; it works within the standard agent-environment interface.
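Mechanically, QSRM (like QRM before it) learns a Q-function indexed by both the environment state and the machine state, which is what lets the agent remember how far through the task it is. A hedged tabular sketch, with illustrative names and hyperparameters:

```python
from collections import defaultdict

def q_update(Q, env_state, rm_state, action, reward,
             next_env_state, next_rm_state,
             actions=(0, 1, 2, 3), alpha=0.1, gamma=0.95):
    """One tabular Q-learning update on the (env_state, machine_state) pair.

    Conditioning on the machine state is what gives the agent memory:
    'standing on the rug while carrying the paper' and 'standing on the
    rug after already delivering it' become different table entries.
    """
    best_next = max(Q[((next_env_state, next_rm_state), a)] for a in actions)
    key = ((env_state, rm_state), action)
    Q[key] += alpha * (reward + gamma * best_next - Q[key])
```

A plain Q-learner uses only `env_state` as the key, so it cannot distinguish "before" from "after" a sub-goal; that is exactly why flat methods struggle on sequential tasks.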
2. LSRM (The "Detective")
This is the real magic. What if you don't have the rulebook? What if you don't know the task?
- The Detective Analogy: LSRM starts with a blank notebook. It watches the agent try to solve the task.
- The agent tries something.
- The environment gives a reward (or not).
- LSRM checks its current "hypothesis" (its guess at the rulebook). If the hypothesis says "Good job!" but the environment gave a "Bad job" signal, LSRM says, "Aha! My rulebook is wrong."
- It then rewrites its rulebook to match what actually happened.
- The Outcome: LSRM learns the task end-to-end. It figures out the hidden rules of the game just by watching the rewards, without a human telling it what the rules are.
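The detective loop above is counterexample-driven refinement: keep a hypothesis, and whenever its predicted rewards disagree with what the environment actually paid out, record the episode and re-fit. A minimal sketch, where `learn_machine` stands in for a hypothetical solver that returns a hypothesis consistent with all recorded counterexamples:

```python
def refine_hypothesis(hypothesis, trace, true_rewards, learn_machine,
                      counterexamples):
    """One step of LSRM-style refinement (illustrative, not the paper's code).

    hypothesis: a callable obs -> predicted reward (the current 'rulebook').
    If it explains the episode's rewards, keep it; otherwise store the
    episode as a counterexample and ask `learn_machine` for a new rulebook
    consistent with every counterexample seen so far.
    """
    predicted = [hypothesis(obs) for obs in trace]
    if predicted == true_rewards:
        return hypothesis, counterexamples            # rulebook still fits
    counterexamples = counterexamples + [(trace, true_rewards)]
    return learn_machine(counterexamples), counterexamples
```

The key property is monotone progress: each disagreement permanently rules out the current hypothesis, so the learner can never cycle back to a rulebook already proven wrong.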
The Experiments: Did it work?
The authors tested this in two worlds:
- The "Office World" (Grid World): A robot moving through a grid of rooms.
- The "Mountain Car" (Continuous): A car trying to drive up a hill (where position and speed are continuous numbers, not just grid squares).
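In the continuous Mountain Car setting, an SRM predicate is just an inequality over the raw (position, velocity) observation. The 0.5 goal line below is the classic Gym convention for this environment, used here purely as an illustration:

```python
def at_hilltop(obs):
    """Symbolic goal predicate over Mountain Car's continuous state.

    obs = (position, velocity); in the classic Gym task the car has
    reached the flag once position >= 0.5.
    """
    position, _velocity = obs
    return position >= 0.5
```

No grid squares, no labeling function: the same predicate machinery works whether the state is a discrete cell or a pair of real numbers.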
The Results:
- Beating the Basics: Standard AI (like Q-Learning) struggled because it couldn't remember the sequence of steps needed. The new SRM methods crushed it.
- Matching the Pros: The new methods performed on par with the older approaches that required a hand-written labeling function.
- The Detective Wins: LSRM successfully figured out the rules of the game on its own. In the simple grid world, it learned the exact rules. In the complex continuous world, it learned rules that were slightly different mathematically but worked perfectly to get the car to the top of the hill.
Summary: Why Should You Care?
Think of Reinforcement Learning as teaching a child to cook.
- Old RL: You give the child a pot and say, "Make dinner." If they burn it, you say "Bad." They try again. It takes forever.
- Old Reward Machines: You hire a sous-chef who watches the kitchen and shouts flags like "Water is boiling!" or "Stove is on!" according to a manual you wrote. It works, but writing a new manual for every recipe and every kitchen is a nightmare.
- This Paper (SRMs): You give the child a smart tablet (the SRM) that can see the ingredients and the stove. The tablet says, "The water is boiling, so add the pasta."
- QSRM: You give the tablet the recipe, and the child learns instantly.
- LSRM: You give the child the tablet with a blank screen. The child tries things, the tablet observes the results, and writes the recipe itself as it goes.
This paper makes AI more flexible, easier to use in the real world, and gives us a way to understand how the AI is thinking by reading the math rules it learns.