RAMP: Hybrid DRL for Online Learning of Numeric Action Models

The paper proposes RAMP, a hybrid framework that couples deep reinforcement learning with online numeric action model learning in a positive feedback loop, allowing it to significantly outperform standard DRL algorithms on numeric planning problems.

Yarin Benyamin, Argaman Mordoch, Shahaf S. Shperberg, Roni Stern

Published 2026-04-13

Imagine you are trying to teach a robot how to drive a car, but you don't have a driver's manual. You also don't have a human instructor sitting in the passenger seat to tell the robot what to do. The robot has to learn entirely by itself, through trial and error.

This is the challenge the paper RAMP tackles. It introduces a clever new way for robots to learn how to navigate complex, number-heavy worlds (like managing fuel, inventory, or crafting items) by combining three different "brains" into one super-learner.

Here is the breakdown of how RAMP works, using simple analogies:

The Three Brains of RAMP

Think of RAMP as a team of three specialists working together in a loop:

  1. The Explorer (Deep Reinforcement Learning):

    • What it does: This is the "trial-and-error" brain. It's like a curious child who keeps pressing buttons to see what happens. It tries actions, sees if it gets a reward (like reaching a goal), and learns from mistakes.
    • The Problem: On its own, this explorer is often clumsy. It might try to drive a car with no gas, or crash into walls, because it doesn't understand the rules of the world yet. It takes a long time to learn.
  2. The Rulebook Writer (Action Model Learning):

    • What it does: This brain watches the Explorer. Every time the Explorer tries something and sees the result, the Rulebook Writer updates a "manual" of how the world works. It figures out: "Ah, if I push this button, the fuel goes down by 5," or "I can't open this door unless I have a key."
    • The Catch: In the past, these writers needed a perfect video of an expert driving to learn the rules. RAMP's writer learns on the fly, just by watching the Explorer's messy attempts.
  3. The Navigator (The Planner):

    • What it does: Once the Rulebook Writer has a decent manual, the Navigator takes over. It looks at the manual and the current situation, then calculates the perfect, most efficient path to the goal. It's like a GPS that knows exactly which turns to take to avoid traffic.
    • The Magic: The Navigator doesn't just drive; it teaches the Explorer. It says, "Don't guess anymore; follow this path I calculated."
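To make the "manual" concrete: it is a numeric action model, i.e., for each action, the numeric conditions that must hold before it can run and the numeric changes it causes. Here is a minimal sketch of one learned entry in plain Python; the field names and the `applicable` helper are illustrative assumptions, not the paper's actual representation (real planners use PDDL syntax).

```python
# One entry in a learned numeric action model, expressed as plain Python.
# Field names are illustrative; an actual planner would consume PDDL.

drive_action = {
    "name": "drive",
    "preconditions": [("fuel", ">=", 5)],   # "I can't drive unless fuel >= 5"
    "effects": [("fuel", "-=", 5),          # "the fuel goes down by 5"
                ("position", "+=", 1)],
}

def applicable(state, action):
    """Check every numeric precondition against the current state."""
    ops = {">=": lambda a, b: a >= b, "<=": lambda a, b: a <= b}
    return all(ops[op](state[var], val) for var, op, val in action["preconditions"])
```

With `state = {"fuel": 7, "position": 0}`, `applicable(state, drive_action)` holds; with only 3 fuel it does not, so a planner reading this model would never schedule the drive.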

The Positive Feedback Loop

The genius of RAMP is how these three talk to each other in a positive feedback loop:

  • Step 1: The Explorer tries to solve a problem. It makes mistakes, but it gathers data.
  • Step 2: The Rulebook Writer uses that data to write a better "manual" of the world.
  • Step 3: The Navigator reads the new manual and draws a perfect map (a plan) to the goal.
  • Step 4: The Explorer follows this perfect map. Because it is following a good plan, it succeeds faster and gathers better data.
  • Step 5: The Rulebook Writer gets even better data, writes an even better manual, and the cycle repeats.
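The five steps above can be sketched as a single training loop. Everything below is a hypothetical skeleton with stubbed-out components (the class names `Explorer`, `ModelLearner`, `Planner` are my own labels, not the paper's API); it only shows how the three brains hand data to each other.

```python
# A minimal, hypothetical sketch of RAMP's positive feedback loop.
# All names and internals are illustrative stubs, not the paper's code.

class Explorer:
    """Trial-and-error brain: collects (state, action, next_state) transitions."""
    def collect(self, plan=None):
        # With a plan, follow it; otherwise act randomly (Step 1 / Step 4).
        if plan:
            return [("s%d" % i, a, "s%d" % (i + 1)) for i, a in enumerate(plan)]
        return [("s0", "random_action", "s1")]

class ModelLearner:
    """Rulebook writer: refines an action model from observed transitions."""
    def __init__(self):
        self.model = set()
    def update(self, transitions):
        for _, action, _ in transitions:   # Step 2 / Step 5
            self.model.add(action)
        return self.model

class Planner:
    """Navigator: computes a plan from the learned model (stubbed here)."""
    def plan(self, model):
        return sorted(model)               # Step 3: stand-in for a numeric planner

def ramp_loop(iterations=3):
    explorer, learner, planner = Explorer(), ModelLearner(), Planner()
    plan = None
    for _ in range(iterations):
        transitions = explorer.collect(plan)   # act (plan-guided once a plan exists)
        model = learner.update(transitions)    # write a better manual
        plan = planner.plan(model)             # draw a better map
    return plan
```

The key design point is visible in the loop body: the plan produced at the end of one iteration becomes the data-collection policy of the next, which is exactly what makes the feedback positive.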

It's like a student (Explorer) who gets a tutor (Navigator) based on a textbook (Rulebook) that the student helped write. As the student learns, the textbook gets better, which makes the tutor smarter, which helps the student learn even faster.

The "Numeric" Challenge

Most robots are good at simple things (like "Is the light on?"). But real-world problems involve numbers (like "Do I have 15 gallons of fuel?" or "Is the temperature below 30 degrees?").

The authors had to build a special translator called Numeric PDDLGym. Imagine taking a complex math textbook written in a language only engineers speak (PDDL) and translating it into a video game format (Gym) that standard AI can play. This allowed them to test their robot in realistic scenarios, like:

  • Sailing: Managing wind and fuel to reach an island.
  • Depot: Moving packages with trucks that have limited cargo space.
  • Pogo Stick (Minecraft style): Gathering specific amounts of wood and stone to craft a tool.
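The "video game format" here is the standard Gym interface: `reset()` gives a starting state, `step(action)` returns the next state, a reward, and a done flag. Below is a toy numeric environment in that style (a fuel-management task I made up for illustration); the real Numeric PDDLGym builds such environments automatically from PDDL domain and problem files.

```python
# Illustrative Gym-style environment over a toy numeric planning problem.
# Hand-written for illustration; Numeric PDDLGym generates envs from PDDL.

class ToyNumericEnv:
    """State is numeric (position, fuel); goal is reaching position 3."""
    def reset(self):
        self.state = {"pos": 0, "fuel": 10}
        return dict(self.state)

    def step(self, action):
        # Numeric precondition: driving needs fuel >= 2.
        if action == "drive" and self.state["fuel"] >= 2:
            self.state["pos"] += 1
            self.state["fuel"] -= 2   # numeric effect: fuel drops by 2
        done = self.state["pos"] >= 3
        reward = 1.0 if done else 0.0
        return dict(self.state), reward, done, {}

env = ToyNumericEnv()
obs = env.reset()
done = False
while not done:
    obs, reward, done, _ = env.step("drive")
```

Because the environment speaks Gym, any off-the-shelf DRL agent (like PPO) can play it, while the underlying numeric preconditions and effects remain available for the model learner to discover.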

The Results: Why It Matters

When they tested RAMP against a standard DRL algorithm (PPO, Proximal Policy Optimization) that learns without a "Rulebook" or "Navigator":

  • Success Rate: RAMP solved way more problems. In the hardest scenarios, the standard AI gave up completely, while RAMP kept going.
  • Efficiency: RAMP didn't just solve the problems; it solved them with fewer steps. It didn't waste time wandering around; it took the efficient path.
  • Safety: The "Rulebook" RAMP writes is "safe": any plan built from it is guaranteed to execute successfully in the real environment, even while the robot is still learning.
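One way to get that safety guarantee is to be deliberately conservative: only declare an action applicable at values where it has actually been observed to succeed. The sketch below shows that idea for a single numeric variable; it is my own simplified illustration of conservative (safe) precondition learning, not the paper's exact algorithm.

```python
# Hedged sketch of "safe" numeric precondition learning: allow an action
# only when the current value lies inside the range of values at which the
# action was observed to succeed. Conservative: it may forbid some valid
# states, but never permits a state it has no evidence for.

def learn_safe_bounds(observed_pre_values):
    """Tightest interval covering every successful observation."""
    return min(observed_pre_values), max(observed_pre_values)

def is_safe_to_apply(value, bounds):
    lo, hi = bounds
    return lo <= value <= hi

# Example: "drive" was seen to succeed at fuel levels 4, 7, and 10.
bounds = learn_safe_bounds([4, 7, 10])
```

Under this scheme, driving at fuel level 5 is allowed (inside the observed range), while fuel level 3 is refused until new evidence widens the bounds. That is the trade-off behind "safe": plans never fail, at the cost of sometimes being overly cautious.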

The Bottom Line

RAMP is like giving a robot a self-updating instruction manual and a GPS while it learns to drive. Instead of blindly crashing into walls for hours, the robot learns the rules of the road, gets a perfect route, and drives straight to the destination. It solves the problem of teaching robots to handle complex, number-based tasks much faster and more reliably than before.
