Model Space Reasoning as Search in Feedback Space for Planning Domain Generation

This paper proposes an agentic language-model framework that generates high-quality planning domains from natural-language descriptions. It treats model-space reasoning as a search process in feedback space, using symbolic feedback such as landmarks and the outputs of the VAL plan validator to iteratively improve domain quality.

James Oswald, Daniel Oblinsky, Volodymyr Varha, Vasilije Dragovic, Harsha Kokel, Kavitha Srinivas, Michael Katz, Shirin Sohrabi

Published 2026-04-13

Imagine you are trying to teach a very smart, but slightly confused, robot how to play a new board game. You describe the rules to the robot in plain English: "You can move your piece forward if the square is empty," or "If you land on a red square, you lose."

The robot (an AI) tries to write down the official rulebook (called a PDDL domain, short for Planning Domain Definition Language) based on your description. But because the robot is still learning, it often makes mistakes. Maybe it forgets a rule, or it invents a rule that doesn't make sense, like "You can jump over walls." If you just let the robot write the rulebook once and stop, the game will be broken.

This paper is about a new way to help the robot fix its rulebook until it's perfect. The authors call their method "Model Space Reasoning as Search in Feedback Space." That's a mouthful, so let's break it down with some analogies.

The Problem: The Robot's First Draft is Messy

When Large Language Models (LLMs) try to turn your English description into a strict computer rulebook, they often get the syntax right (every rule is written in valid form) but the semantics wrong (the logic is broken). It's like writing a grammatically flawless sentence that makes no logical sense.
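To see the syntax/semantics gap concretely, here is a toy "rulebook" sketched as Python data (a stand-in for a real PDDL domain; the action and fact names are made up for this illustration). The second action parses fine but is logically dead, because nothing can ever make its precondition true:

```python
# Hypothetical toy rulebook: each action lists preconditions that must hold
# before it runs, and facts it adds to the state afterward.
RULEBOOK = {
    "move-forward": {
        "preconditions": {"square-ahead-empty"},
        "adds": {"moved"},
    },
    # Syntactically valid entry, but semantically broken: no action in the
    # rulebook ever makes "piece-can-fly" true, so this action can never fire.
    "jump-over-wall": {
        "preconditions": {"piece-can-fly"},
        "adds": {"wall-crossed"},
    },
}

def applicable(action: str, state: set) -> bool:
    """An action is legal only if all its preconditions hold in the state."""
    return RULEBOOK[action]["preconditions"] <= state

state = {"square-ahead-empty"}
print(applicable("move-forward", state))    # the precondition holds
print(applicable("jump-over-wall", state))  # unreachable precondition
```

A syntax checker would accept both actions; only a semantic check (like the feedback signals below) catches the broken one.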

The Solution: The "Feedback Loop"

Instead of just asking the robot to "try again," the researchers give it specific clues about what is wrong. They use two main types of clues:

  1. The "Landmark" Clue (The GPS Waypoints):
    Imagine you are giving the robot directions to a treasure. You tell it, "You must pass the old oak tree before you reach the cave."

    • In the paper: These are called Landmarks. They are critical steps that must happen in any valid plan. If the robot's rulebook allows a path that skips the oak tree, the system says, "Hey! You missed a mandatory stop!"
    • Analogy: It's like a GPS telling you, "You missed a turn; you can't get to the destination without passing this specific intersection."
  2. The "Plan Validator" Clue (The Trial Run):
    Imagine you ask the robot to actually play a round of the game using its new rulebook.

    • In the paper: They use a tool called VAL to try and run a plan. If the plan crashes (e.g., the robot tries to move a piece that doesn't exist), the system says, "This move is illegal because your rules are wrong."
    • Analogy: It's like a test drive. If the car stalls, the mechanic knows something is wrong with the engine, not just the driver.

The Secret Sauce: "Search in Feedback Space"

Here is the clever part. The robot doesn't just get one clue and fix it. It gets many possible clues.

Imagine the robot is in a dark room trying to find the light switch.

  • Random Walk (The Old Way): The robot just picks a random wall, pokes it, and if it's not the switch, it picks another random wall. This is slow and inefficient.
  • Heuristic Search (The New Way): The robot uses a "smart compass." It looks at all the possible clues (feedback messages) it could receive. It asks, "Which clue is most likely to get me closer to the perfect rulebook?" It picks the best clue, fixes the rulebook, and repeats.
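The "smart compass" idea can be sketched as a best-first search over repair candidates. This is a simplified illustration, not the paper's actual algorithm: the heuristic here just counts remaining errors, and `expand` stands in for "apply a feedback fix and collect the new feedback it produces":

```python
import heapq

def heuristic(candidate):
    # Illustrative score: fewer remaining errors = closer to a perfect rulebook.
    return candidate["errors_left"]

def best_first_repair(initial_candidates, expand, max_steps=10):
    """Repeatedly pop the most promising candidate, apply its fix, and
    push the successors, until a candidate with zero errors appears."""
    frontier = [(heuristic(c), i, c) for i, c in enumerate(initial_candidates)]
    heapq.heapify(frontier)
    counter = len(frontier)  # tie-breaker so the heap never compares dicts
    for _ in range(max_steps):
        if not frontier:
            break
        _, _, best = heapq.heappop(frontier)
        if best["errors_left"] == 0:
            return best  # perfect rulebook found
        for child in expand(best):
            heapq.heappush(frontier, (heuristic(child), counter, child))
            counter += 1
    return None

# Hypothetical expand: fixing the chosen error removes it from the rulebook.
def expand(candidate):
    return [{"errors_left": candidate["errors_left"] - 1}]

result = best_first_repair([{"errors_left": 2}], expand)
```

The random-walk baseline would pop candidates in arbitrary order instead of by `heuristic`; swapping the heap for a random choice is the only change needed to compare the two strategies.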

The researchers tested this by letting the AI try different combinations:

  • Just Landmarks.
  • Just Plan Validation.
  • Both together.
  • Randomly picking clues vs. using the "smart compass" to pick the best clues.

The Results: What Did They Find?

  1. Feedback is Magic: Giving the robot clues (feedback) made the rulebooks much better than just letting it guess once.
  2. The "Smart Compass" Works: Using a search strategy to pick the best clues generally worked better than just picking clues at random.
  3. It's Not One-Size-Fits-All: Sometimes "Landmarks" were the best clue; sometimes "Plan Validation" was better. It depends on the specific game (domain) the robot is trying to learn.
  4. The Winner: The best combination was using both types of clues and using the smart search to pick the best ones. With this method, they were able to generate a perfect rulebook (100% correct) for every single game they tested.

Why Does This Matter?

Currently, making these computer rulebooks requires expensive human experts who know complex coding languages. This paper shows that we can use AI to do the heavy lifting, as long as we give it the right kind of "corrections" and a smart way to choose which corrections to listen to.

In short: They taught an AI to write perfect game rules by letting it play, checking where it failed, and using a smart strategy to figure out exactly which rule to fix next. It's like having a tireless, super-smart editor who knows exactly how to turn a messy draft into a masterpiece.
