GenePlan: Evolving Better Generalized PDDL Plans using Large Language Models

Imagine you are trying to teach a very smart, but slightly chaotic, robot how to play a complex board game. You don't want to write a separate strategy for every single possible game setup (which would take forever). Instead, you want the robot to learn a general strategy that works for any setup of that game.

This is exactly what the GenePlan paper is about. It presents a method for getting Large Language Models (LLMs), such as the models behind ChatGPT or Claude, to write general strategies for "PDDL" planning problems (PDDL is a formal language used to describe logic puzzles, robot tasks, and logistics).

Here is the breakdown of how it works, using some everyday analogies.

1. The Problem: The "Lazy" Genius

Current AI models are like brilliant students who can write a great essay on a specific topic if you give them a prompt. But when you ask them to solve a logic puzzle (like moving blocks, delivering newspapers, or organizing a warehouse), they often get stuck. They might:

Make up rules that don't exist.
Get lost in the middle of the plan.
Give you a solution that works but takes 100 steps when it could be done in 10.

Even when they succeed, they are "satisficing": they settle for a solution, not the best one.

2. The Solution: GenePlan (The Evolutionary Coach)

The authors created GenePlan. Think of GenePlan not as a single teacher, but as a coach running a training camp for a team of AI students.

Instead of asking the AI to "write the perfect plan" once, GenePlan sets up an evolutionary tournament. Here is how the camp works:

Step 1: The Initial Drafts (The "Seed" Population)

The coach asks the AI to write a few different Python code scripts (strategies) to solve the puzzle. Some are terrible, some are okay, and maybe one is decent.

Analogy: Imagine asking 10 people to draw a map to a treasure. Most maps are wrong, but one is close.

Step 2: The Test Run (Fitness Evaluation)

The coach takes these 10 maps and tests them on 5 or 10 different versions of the treasure hunt.

If a map leads to a dead end, it gets a "failure" score.
If a map finds the treasure but takes a long, winding path, it gets a "slow" score.
If a map finds the treasure quickly, it gets a "gold star."

Step 3: The "Survival of the Fittest" (Evolution)

This is the magic part. The coach doesn't just pick the winner. The coach takes the best maps and mixes them together.

Crossover: Imagine taking the "turn left at the oak tree" part from Map A and the "cross the river at the bridge" part from Map B to create a new, super Map C.
Mutation: The coach makes tiny, random tweaks to Map C (e.g., "What if we skip the bridge and swim?"). Maybe this new idea is even better!

Step 4: The Loop

The coach discards the worst maps, keeps the new "hybrid" maps, and asks the AI to refine them again. This happens over and over (generations).

The Twist: The AI isn't just guessing; it's being told, "Hey, your last map failed here because you forgot the bridge. Try to fix that."

3. The Result: A Master Strategist

After a few hours of this "training camp," GenePlan produces a single, highly optimized Python script.

It's Fast: Once this script is written, it can solve new puzzles in less than half a second.
It's Cheap: The whole process costs about $1.82 per domain (a specific type of puzzle).
It's Smart: In tests, the evolved planners performed nearly as well as one of the world's best traditional planning systems, which has been refined for many years (an average quality score of 0.91 versus 0.93), and they did it by learning the strategy rather than having it hard-coded.

Why is this a big deal?

Think of it like this:

Old Way: You hire a human to write a specific instruction manual for every single new warehouse layout.
GenePlan Way: You hire a coach to train a robot to invent its own instruction manual that works for any warehouse layout, and the manual gets better over many rounds of practice before it is put to use.

The "Gotcha"

The paper admits that this doesn't work for every problem. If a puzzle is like Sokoban (a game where you push boxes and can get stuck in a corner with no way out), a simple "general strategy" doesn't exist. In those cases, the AI tries to build a complex search engine (like a GPS recalculating the route every second), which is slower. But for most standard logistics and planning problems, GenePlan is a game-changer.

Summary

GenePlan is a framework that uses Large Language Models as a "breeding ground" for code. It treats planning as a game of evolution: generate many strategies, test them, keep the best ones, mix them, and repeat. The result is a fast, cheap planner that solves complex logic puzzles better than standard AI prompting and almost as well as the strongest traditional planning software.

Here is a detailed technical summary of the paper "GenePlan: Evolving Better Generalized PDDL Plans using Large Language Models."

1. Problem Statement

The paper addresses the challenge of Generalized Planning in classical AI planning domains defined by the Planning Domain Definition Language (PDDL).

The Core Challenge: Traditional planners solve specific problem instances (a specific initial state and goal) but do not generalize to new instances within the same domain. Conversely, existing Large Language Model (LLM) approaches for planning often suffer from:
- Sub-par performance: Direct LLM planning often fails to produce valid or optimal plans.
- Lack of Optimization: Previous LLM-based generalized planning methods (e.g., Chain-of-Thought prompting) focus on generating satisficing solutions (any valid plan) without optimizing for plan quality (e.g., minimizing plan length).
- Integration Costs: Hybrid approaches often require translating natural language to PDDL and then running expensive search algorithms, which is computationally heavy and prone to translation errors.
The Goal: To generate a domain-dependent generalized planner (a Python function) that can solve any instance within a specific PDDL domain efficiently and with high-quality (short) plans, without requiring a full search algorithm at inference time.
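To make the idea of a generalized planner concrete, here is a minimal sketch for a hypothetical one-gripper delivery domain. This domain, the function name `plan`, the dictionary-based instance encoding, and the action names are all assumptions made for illustration and are not taken from the paper. The point is only that one fixed Python function produces a plan for any instance of the domain, with no search.

```python
def plan(robot_at, ball_at, goal_at):
    """Hypothetical generalized planner for a one-gripper delivery domain.

    robot_at: room where the robot starts
    ball_at:  dict mapping ball name -> current room
    goal_at:  dict mapping ball name -> required room
    Returns a list of ground action strings (illustrative names only).
    """
    actions = []
    here = robot_at
    for ball in sorted(goal_at):
        src, dst = ball_at[ball], goal_at[ball]
        if src == dst:
            continue  # this ball is already where it needs to be
        if here != src:
            actions.append(f"(move {here} {src})")
        actions.append(f"(pick {ball} {src})")
        actions.append(f"(move {src} {dst})")
        actions.append(f"(drop {ball} {dst})")
        here = dst
    return actions


# Two balls that need to swap rooms; the robot starts in room "a".
example = plan("a", {"b1": "a", "b2": "b"}, {"b1": "b", "b2": "a"})
```

The strategy is deliberately naive (one ball at a time, in name order). Shortening the plans such a function returns is the kind of improvement the evolutionary process is meant to find.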

2. Methodology: GenePlan Framework

The authors propose GenePlan, a framework that treats generalized planning as an optimization problem solved via an Evolutionary Algorithm (EA) assisted by an LLM.

A. Formulation as Optimization

The problem is defined as finding a planner function $\Phi$ (written in Python) that minimizes the average plan length across a set of training instances $\Pi_{train}$:
$\arg \min_{\Phi} \frac{1}{|\Pi_{train}|} \sum_{\Pi \in \Pi_{train}} |\Phi(\Pi)|$
where $|\Phi(\Pi)|$ is the number of actions in the plan generated for instance $\Pi$.
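The objective can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the names `fitness` and `is_valid` are hypothetical, and the value of `FAILURE_PENALTY` is invented, since the summary only says that a planner that fails on an instance receives a high penalty score.

```python
from statistics import mean

# Assumed value; the source only says "a high penalty score".
FAILURE_PENALTY = 1_000


def fitness(planner, train_instances, is_valid):
    """Average plan length over the training instances (lower is better).

    planner:  callable mapping an instance to a list of actions
    is_valid: callable (instance, plan) -> bool, standing in for a plan validator
    """
    lengths = []
    for instance in train_instances:
        try:
            candidate = planner(instance)
        except Exception:
            candidate = None  # a crashing planner counts as a failure
        if candidate is not None and is_valid(instance, candidate):
            lengths.append(len(candidate))
        else:
            lengths.append(FAILURE_PENALTY)
    return mean(lengths)


# Toy check: an "instance" is just the number of steps a valid plan must have.
toy_instances = [2, 4]
toy_planner = lambda n: ["step"] * n
toy_validator = lambda n, p: len(p) == n
toy_score = fitness(toy_planner, toy_instances, toy_validator)
```

Here `toy_score` is the mean of the two plan lengths. A planner that raised an exception on both instances would instead score `FAILURE_PENALTY`.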

B. The Evolutionary Loop

GenePlan evolves a population of candidate Python planners through the following steps:

Initialization: The population is seeded with initial planners generated via Chain-of-Thought (CoT) prompting or provided manually.
Selection: Planners are selected as "parents" based on their fitness score (average plan length on training tasks). The authors use a Boltzmann selection strategy with a hyperbolically decaying temperature. This encourages exploration (sampling diverse planners) early in the process and exploitation (focusing on high-performing planners) as the population size grows.
Prompt Construction: Selected parent planners (their code and performance feedback) are inserted into a prompt template along with the PDDL domain definition. The prompt explicitly instructs the LLM to perform crossover (combining code segments) and mutation (strategic modifications like efficiency improvements) to generate a new "offspring" planner.
Generation & Validation:
- The LLM generates new Python code.
- AST Parsing: The code is validated using an Abstract Syntax Tree (AST) parser to ensure it adheres to a safe subset of Python (preventing execution of arbitrary code).
- Execution: Valid code is compiled and executed on training tasks.
- Fitness Evaluation: The resulting plans are validated using a PDDL plan validator. The fitness score is the average plan length. If a planner fails to solve an instance, it receives a high penalty score.
Replacement: The population is updated using an elitist replacement strategy ($\mu + \lambda$ selection). The worst-performing planners are pruned, and the best candidates from the current generation and the new offspring are retained for the next generation.
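The selection and replacement steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation. The decay form `t0 / (1 + k * n)`, the constants `t0` and `k`, driving the schedule by the number of planners generated so far, and sampling parents with replacement are all assumptions. The function names are hypothetical.

```python
import math
import random


def temperature(n_generated, t0=1.0, k=0.1):
    """Hyperbolic decay T = t0 / (1 + k * n); t0 and k are assumed values."""
    return t0 / (1.0 + k * n_generated)


def boltzmann_select(population, n_parents, temp, rng=random):
    """Sample parents with probability proportional to exp(-fitness / temp).

    population: list of (planner_code, fitness) pairs; lower fitness is better.
    Sampling is with replacement (an assumption of this sketch).
    """
    best = min(f for _, f in population)
    # Subtracting the best fitness keeps every exponent <= 0, avoiding overflow.
    weights = [math.exp(-(f - best) / temp) for _, f in population]
    return rng.choices(population, weights=weights, k=n_parents)


def mu_plus_lambda(population, offspring, mu):
    """Elitist replacement: keep the mu best of parents and offspring combined."""
    return sorted(population + offspring, key=lambda pf: pf[1])[:mu]


population = [("planner_a", 5.0), ("planner_b", 9.0)]
parents = boltzmann_select(population, n_parents=2, temp=temperature(0))
survivors = mu_plus_lambda(population, [("planner_c", 3.0)], mu=2)
```

A high temperature makes the weights nearly uniform, which favors exploration. As the temperature falls, almost all of the probability mass moves to the planners with the shortest average plans.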

C. Key Technical Features

Interpretability: The output is executable Python code, making the planning logic transparent and debuggable.
No Search at Inference: Once evolved, the planner is a direct function mapping state to actions, eliminating the need for expensive search algorithms (like A*) during deployment.
Cost-Efficiency: The framework uses API-based LLMs (GPT-4o) but optimizes the number of calls via the evolutionary process.
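Because the output is LLM-written Python that gets executed, the AST validation step of the evolutionary loop matters in practice. The sketch below shows one way such a check could look using the standard `ast` module. The summary does not list the allowed subset, so the forbidden node types and names here are assumptions. A denylist like this is also weaker than a true allowlist or sandbox, and it should be read as an illustration only.

```python
import ast

# Assumed policy for illustration; the source does not specify the safe subset.
FORBIDDEN_NODES = (ast.Import, ast.ImportFrom, ast.Global, ast.Nonlocal)
FORBIDDEN_NAMES = {"eval", "exec", "open", "compile", "__import__",
                   "getattr", "setattr", "globals", "locals"}


def is_safe(source):
    """Return True if the code parses and uses none of the forbidden constructs."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        return False  # unparsable code is rejected outright
    for node in ast.walk(tree):
        if isinstance(node, FORBIDDEN_NODES):
            return False
        if isinstance(node, ast.Name) and node.id in FORBIDDEN_NAMES:
            return False
        # Block dunder attribute access such as obj.__class__.
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            return False
    return True


accepted = is_safe("def plan(state):\n    return []\n")
rejected = is_safe("import os\nos.remove('x')\n")
```

In a loop like the one described above, only candidates that pass such a check would go on to be compiled and run on the training tasks.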

3. Key Contributions

Novel Framework: Introduction of GenePlan, which integrates LLMs into an evolutionary optimization loop specifically for generating generalized PDDL planners.
Optimization Focus: Unlike prior LLM planning work that focuses on feasibility, GenePlan explicitly optimizes for plan quality (minimizing action count).
Performance Parity: Demonstrated that evolved LLM planners can closely match the plan quality of a state-of-the-art classical planner (Fast Downward) while being significantly faster at inference.
Ablation Studies: Provided empirical evidence on the importance of:
- Context: Providing full PDDL domain definitions is crucial; ablated names (generic names) severely degrade performance.
- Evaluation: An evaluator (fitness function) is essential; removing it leads to random performance.
- Model Choice: GPT-4o significantly outperforms GPT-4o mini for complex reasoning tasks.

4. Experimental Results

The authors evaluated GenePlan on 8 domains: 6 standard benchmarks (Heavypack, Hiking, Manyferry, Manygripper, Manymiconic, Trapnewspapers) and 2 new domains (Research, Trading).

Plan Quality (SAT Score):
- GenePlan achieved an average SAT score of 0.91.
- This closely matches Fast Downward (fd_1800) with a 30-minute time limit (SAT score 0.93).
- It significantly outperformed LLM baselines like Chain-of-Thought (CoT) prompting with GPT-4o (SAT score 0.64).
Inference Speed:
- The generated planners solve new instances in an average of 0.49 seconds per task.
- This is up to orders of magnitude faster than Fast Downward (which takes seconds to minutes per task).
Cost:
- The average cost to generate a high-quality planner using GPT-4o was $1.82 per domain.
Break-even Analysis:
- For domains with recurring planning needs, the one-time generation cost is offset quickly. For example, in the Trading domain, GenePlan becomes more efficient than Fast Downward after solving just 1.66 instances.
Limitations:
- In domains with no simple generalizable strategy (e.g., Sokoban or Blocksworld with irreversible states), GenePlan failed to find a solution, whereas search-based planners succeeded. This highlights that GenePlan is best suited for domains with exploitable structural patterns.
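The break-even reasoning reported above amounts to simple arithmetic: a one-time generation cost is recovered once the per-instance savings add up to it. The sketch below assumes that form, which is an assumption because the summary does not state the paper's exact formula or cost units. The function name is hypothetical and the input numbers are made up, so the result is not meant to reproduce the 1.66 figure.

```python
def break_even_instances(one_time_cost, baseline_cost_per_instance,
                         evolved_cost_per_instance):
    """Number of instances after which the evolved planner is cheaper overall.

    All three costs must use the same unit (for example seconds or dollars).
    Returns infinity if the evolved planner saves nothing per instance.
    """
    saving = baseline_cost_per_instance - evolved_cost_per_instance
    if saving <= 0:
        return float("inf")
    return one_time_cost / saving


# Hypothetical numbers: 600 units to evolve the planner, then 0.5 units per
# instance instead of 120.5 units per instance for the baseline.
n_break_even = break_even_instances(600.0, 120.5, 0.5)
```

With these made-up inputs each instance saves 120 units, so the one-time cost is recovered after 5 instances.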

5. Significance and Future Work

Paradigm Shift: GenePlan shifts the paradigm from "LLM as a planner" (generating a plan for one instance) to "LLM as an optimizer" (generating a planner for a whole domain).
Practicality: The approach offers a highly efficient, interpretable, and low-cost solution for domains where planning tasks are repetitive (e.g., logistics, resource management).
Future Directions:
- Developing early stopping criteria to reduce generation costs.
- Exploring optimization metrics beyond plan length (e.g., robustness, energy).
- Using LLMs as orchestrators to dynamically switch between GenePlan (for structured domains) and traditional search-based planners (for complex, non-generalizable domains).

In conclusion, GenePlan demonstrates that combining evolutionary algorithms with LLMs is a powerful method for synthesizing high-quality, domain-specific planning algorithms that rival traditional solvers in quality while vastly outperforming them in inference speed.