RAMP: Hybrid DRL for Online Learning of Numeric Action Models

The paper proposes RAMP, a hybrid framework that couples deep reinforcement learning with online numeric action model learning in a positive feedback loop, allowing it to significantly outperform standard DRL algorithms on numeric planning problems.

Yarin Benyamin, Argaman Mordoch, Shahaf S. Shperberg, Roni Stern

Published 2026-04-13

Imagine you are trying to teach a robot how to drive a car, but you don't have a driver's manual. You also don't have a human instructor sitting in the passenger seat to tell the robot what to do. The robot has to learn entirely by itself, through trial and error.

This is the challenge the paper RAMP tackles. It introduces a clever new way for robots to learn how to navigate complex, number-heavy worlds (like managing fuel, inventory, or crafting items) by combining three different "brains" into one super-learner.

Here is the breakdown of how RAMP works, using simple analogies:

The Three Brains of RAMP

Think of RAMP as a team of three specialists working together in a loop:

  1. The Explorer (Deep Reinforcement Learning):

    • What it does: This is the "trial-and-error" brain. It's like a curious child who keeps pressing buttons to see what happens. It tries actions, sees if it gets a reward (like reaching a goal), and learns from mistakes.
    • The Problem: On its own, this explorer is often clumsy. It might try to drive a car with no gas, or crash into walls, because it doesn't understand the rules of the world yet. It takes a long time to learn.
  2. The Rulebook Writer (Action Model Learning):

    • What it does: This brain watches the Explorer. Every time the Explorer tries something and sees the result, the Rulebook Writer updates a "manual" of how the world works. It figures out: "Ah, if I push this button, the fuel goes down by 5," or "I can't open this door unless I have a key."
    • The Catch: In the past, these writers needed a perfect video of an expert driving to learn the rules. RAMP's writer learns on the fly, just by watching the Explorer's messy attempts.
  3. The Navigator (The Planner):

    • What it does: Once the Rulebook Writer has a decent manual, the Navigator takes over. It looks at the manual and the current situation, then calculates the perfect, most efficient path to the goal. It's like a GPS that knows exactly which turns to take to avoid traffic.
    • The Magic: The Navigator doesn't just drive; it teaches the Explorer. It says, "Don't guess anymore; follow this path I calculated."
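To make the "manual" concrete: it is a numeric action model, i.e., for each action, the numeric conditions that must hold before it can run and the numeric changes it causes. Here is a minimal sketch of one learned entry in plain Python; the field names and the `applicable` helper are illustrative assumptions, not the paper's actual representation (real planners use PDDL syntax).

```python
# One entry in a learned numeric action model, expressed as plain Python.
# Field names are illustrative; an actual planner would consume PDDL.

drive_action = {
    "name": "drive",
    "preconditions": [("fuel", ">=", 5)],   # "I can't drive unless fuel >= 5"
    "effects": [("fuel", "-=", 5),          # "the fuel goes down by 5"
                ("position", "+=", 1)],
}

def applicable(state, action):
    """Check every numeric precondition against the current state."""
    ops = {">=": lambda a, b: a >= b, "<=": lambda a, b: a <= b}
    return all(ops[op](state[var], val) for var, op, val in action["preconditions"])
```

With `state = {"fuel": 7, "position": 0}`, `applicable(state, drive_action)` holds; with only 3 fuel it does not, so a planner reading this model would never schedule the drive.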

The Positive Feedback Loop

The genius of RAMP is how these three talk to each other in a positive feedback loop:

  • Step 1: The Explorer tries to solve a problem. It makes mistakes, but it gathers data.
  • Step 2: The Rulebook Writer uses that data to write a better "manual" of the world.
  • Step 3: The Navigator reads the new manual and draws a perfect map (a plan) to the goal.
  • Step 4: The Explorer follows this perfect map. Because it is following a good plan, it succeeds faster and gathers better data.
  • Step 5: The Rulebook Writer gets even better data, writes an even better manual, and the cycle repeats.
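The five steps above can be sketched as a single training loop. Everything below is a hypothetical skeleton with stubbed-out components (the class names `Explorer`, `ModelLearner`, `Planner` are my own labels, not the paper's API); it only shows how the three brains hand data to each other.

```python
# A minimal, hypothetical sketch of RAMP's positive feedback loop.
# All names and internals are illustrative stubs, not the paper's code.

class Explorer:
    """Trial-and-error brain: collects (state, action, next_state) transitions."""
    def collect(self, plan=None):
        # With a plan, follow it; otherwise act randomly (Step 1 / Step 4).
        if plan:
            return [("s%d" % i, a, "s%d" % (i + 1)) for i, a in enumerate(plan)]
        return [("s0", "random_action", "s1")]

class ModelLearner:
    """Rulebook writer: refines an action model from observed transitions."""
    def __init__(self):
        self.model = set()
    def update(self, transitions):
        for _, action, _ in transitions:   # Step 2 / Step 5
            self.model.add(action)
        return self.model

class Planner:
    """Navigator: computes a plan from the learned model (stubbed here)."""
    def plan(self, model):
        return sorted(model)               # Step 3: stand-in for a numeric planner

def ramp_loop(iterations=3):
    explorer, learner, planner = Explorer(), ModelLearner(), Planner()
    plan = None
    for _ in range(iterations):
        transitions = explorer.collect(plan)   # act (plan-guided once a plan exists)
        model = learner.update(transitions)    # write a better manual
        plan = planner.plan(model)             # draw a better map
    return plan
```

The key design point is visible in the loop body: the plan produced at the end of one iteration becomes the data-collection policy of the next, which is exactly what makes the feedback positive.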

It's like a student (Explorer) who gets a tutor (Navigator) based on a textbook (Rulebook) that the student helped write. As the student learns, the textbook gets better, which makes the tutor smarter, which helps the student learn even faster.

The "Numeric" Challenge

Most robots are good at simple things (like "Is the light on?"). But real-world problems involve numbers (like "Do I have 15 gallons of fuel?" or "Is the temperature below 30 degrees?").

The authors had to build a special translator called Numeric PDDLGym. Imagine taking a complex math textbook written in a language only engineers speak (PDDL) and translating it into a video game format (Gym) that standard AI can play. This allowed them to test their robot in realistic scenarios, like:

  • Sailing: Managing wind and fuel to reach an island.
  • Depot: Moving packages with trucks that have limited cargo space.
  • Pogo Stick (Minecraft style): Gathering specific amounts of wood and stone to craft a tool.
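The "video game format" here is the standard Gym interface: `reset()` gives a starting state, `step(action)` returns the next state, a reward, and a done flag. Below is a toy numeric environment in that style (a fuel-management task I made up for illustration); the real Numeric PDDLGym builds such environments automatically from PDDL domain and problem files.

```python
# Illustrative Gym-style environment over a toy numeric planning problem.
# Hand-written for illustration; Numeric PDDLGym generates envs from PDDL.

class ToyNumericEnv:
    """State is numeric (position, fuel); goal is reaching position 3."""
    def reset(self):
        self.state = {"pos": 0, "fuel": 10}
        return dict(self.state)

    def step(self, action):
        # Numeric precondition: driving needs fuel >= 2.
        if action == "drive" and self.state["fuel"] >= 2:
            self.state["pos"] += 1
            self.state["fuel"] -= 2   # numeric effect: fuel drops by 2
        done = self.state["pos"] >= 3
        reward = 1.0 if done else 0.0
        return dict(self.state), reward, done, {}

env = ToyNumericEnv()
obs = env.reset()
done = False
while not done:
    obs, reward, done, _ = env.step("drive")
```

Because the environment speaks Gym, any off-the-shelf DRL agent (like PPO) can play it, while the underlying numeric preconditions and effects remain available for the model learner to discover.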

The Results: Why It Matters

When they tested RAMP against a standard DRL algorithm (PPO, Proximal Policy Optimization) that learns without a "Rulebook" or "Navigator":

  • Success Rate: RAMP solved way more problems. In the hardest scenarios, the standard AI gave up completely, while RAMP kept going.
  • Efficiency: RAMP didn't just solve the problems; it solved them with fewer steps. It didn't waste time wandering around; it took the efficient path.
  • Safety: The "Rulebook" RAMP writes is "safe": any plan built from it is guaranteed to execute successfully in the real environment, even while the robot is still learning.
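One way to get that safety guarantee is to be deliberately conservative: only declare an action applicable at values where it has actually been observed to succeed. The sketch below shows that idea for a single numeric variable; it is my own simplified illustration of conservative (safe) precondition learning, not the paper's exact algorithm.

```python
# Hedged sketch of "safe" numeric precondition learning: allow an action
# only when the current value lies inside the range of values at which the
# action was observed to succeed. Conservative: it may forbid some valid
# states, but never permits a state it has no evidence for.

def learn_safe_bounds(observed_pre_values):
    """Tightest interval covering every successful observation."""
    return min(observed_pre_values), max(observed_pre_values)

def is_safe_to_apply(value, bounds):
    lo, hi = bounds
    return lo <= value <= hi

# Example: "drive" was seen to succeed at fuel levels 4, 7, and 10.
bounds = learn_safe_bounds([4, 7, 10])
```

Under this scheme, driving at fuel level 5 is allowed (inside the observed range), while fuel level 3 is refused until new evidence widens the bounds. That is the trade-off behind "safe": plans never fail, at the cost of sometimes being overly cautious.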

The Bottom Line

RAMP is like giving a robot a self-updating instruction manual and a GPS while it learns to drive. Instead of blindly crashing into walls for hours, the robot learns the rules of the road, gets a perfect route, and drives straight to the destination. It solves the problem of teaching robots to handle complex, number-based tasks much faster and more reliably than before.
