Agentic LLM Planning via Step-Wise PDDL Simulation: An Empirical Characterisation

This paper introduces PyPDDLEngine, an open-source PDDL simulation engine that enables agentic LLM planning through step-wise interaction. On Blocksworld tasks, this approach yields only a modest 3-percentage-point gain in success rate over direct LLM planning, while incurring significantly higher token costs and lacking the external verification signals that drive success in coding-agent applications.

Kai Göbel, Pierrick Lorang, Patrik Zips, Tobias Glück

Published 2026-03-09

The Big Question: Can AI "Think" Its Way Through a Puzzle?

Imagine you have a robot that needs to solve a complex puzzle, like stacking blocks to build a tower in a specific order. This is called Task Planning.

For decades, we've used "Classical Planners" (like a super-organized librarian) to solve these puzzles. They follow strict rules and math to find the perfect path.

Recently, we've started using Large Language Models (LLMs) (like advanced chatbots) to do the same thing. These chatbots are great at writing stories and coding because they've read almost everything on the internet. But can they actually plan a robot's moves, or are they just guessing based on what they've read before?

This paper asks: If we let the AI chatbot "play" the game step-by-step, checking its moves as it goes, will it get better at planning?


The Experiment: Two Ways to Play

The researchers set up a test using 102 different "Blocksworld" puzzles (ranging from easy to very hard). They compared four different players:

  1. The Speedster (Classical Planner): A traditional, math-based robot that solves the puzzle instantly and perfectly.
  2. The "One-Shot" Chatbot: An AI that tries to write down the entire solution in one go, like writing a whole essay without editing. If it gets it wrong, it starts over from scratch.
  3. The "Agent" Chatbot: An AI that plays the game one move at a time. After every move, it asks the game engine, "Did that work? What does the world look like now?" If it messes up, it can say, "Oops, let's reset and try a different path."
  4. The "Refined" Speedster: A classical robot that finds a solution quickly, then spends the rest of the time trying to make it shorter and better.

The Tools: PyPDDLEngine

To make the "Agent" Chatbot work, the researchers built a new tool called PyPDDLEngine.

  • Analogy: Imagine a video game console that doesn't just show you the screen, but also lets you press buttons to pause, rewind, and check your inventory after every single step. This tool gave the AI a "live" view of the game so it could react in real-time.
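Conceptually, the step-wise loop looks something like the Python sketch below. Everything here is illustrative: the class, the method names, and the two-block world are invented for this summary, not PyPDDLEngine's actual API.

```python
# Illustrative sketch of a step-wise simulation loop.
# Hypothetical API -- not PyPDDLEngine's real interface.

class ToySimulator:
    """A tiny two-block world: the goal is to stack block A on block B."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Both blocks start on the table ("rewind to the beginning").
        self.state = {"on_table": {"A", "B"}, "stacked": set()}
        return self.state

    def step(self, action):
        # Apply an action only if its precondition holds, mirroring how
        # a PDDL engine validates each move before executing it.
        name, block, target = action
        if name == "stack" and block in self.state["on_table"]:
            self.state["on_table"].discard(block)
            self.state["stacked"].add((block, target))
            return self.state, True   # action applied
        return self.state, False      # precondition failed; state unchanged

    def goal_reached(self):
        return ("A", "B") in self.state["stacked"]


sim = ToySimulator()
state, ok = sim.step(("stack", "A", "B"))  # the agent proposes one move
print(ok, sim.goal_reached())              # → True True
```

The design point is that `step()` returns the updated state plus whether the move was legal after every single action, and `reset()` lets the agent abandon a bad path, which is exactly the "pause, rewind, check your inventory" interaction the analogy describes.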

The Results: What Happened?

1. The Success Rate (Did they finish the puzzle?)

  • The Classical Speedster: Won 85% of the time. It's the reliable champion.
  • The "One-Shot" Chatbot: Won 64% of the time.
  • The "Agent" Chatbot: Won 67% of the time.

The Takeaway: The "Agent" Chatbot did slightly better (3 percentage points more wins) than the "One-Shot" version. But it consumed 5.7 times more tokens (the units of text the AI reads and writes, which is what you pay for) to get those extra wins. It's like hiring a team of 5 people to solve a puzzle that one person could nearly solve alone, just to get 3 extra points.

2. The Plan Quality (How efficient were the moves?)

Here is the weird part.

  • The Classical "Refined" Speedster spent its whole time trying to make the solution shorter.
  • Surprisingly, both Chatbot versions often found shorter solutions than the Speedster, even though the Speedster was trying harder to optimize.
  • Why? The researchers think the Chatbots aren't actually "thinking" or "planning." They are memorizing. Since Blocksworld puzzles are famous and appear in millions of books and websites, the AI is likely just recalling the answer from its training data, like a student who memorized the textbook answers rather than learning the math.

3. The "Early Exit" Problem

The "Agent" Chatbot had a strange habit. Sometimes, it would look at the puzzle, decide, "This is impossible," and quit before the time ran out.

  • The Twist: On some of these puzzles, the "One-Shot" Chatbot actually solved it!
  • The Lesson: The Agent Chatbot was bad at judging its own progress. It thought it was stuck when it wasn't.

The Big Insight: Why Coding Agents Win, but Planning Agents Struggle

The paper makes a crucial distinction between Coding Agents and Planning Agents.

  • Coding Agents (The Success Story): When an AI writes code, it runs the code. If it fails, the computer gives a clear error message: "Line 5 has a missing semicolon." This is external feedback. The computer tells the AI exactly what is wrong. The AI can fix it, run it again, and get better.
  • Planning Agents (The Struggle): In this Blocksworld game, the AI moves a block. The game says, "Okay, the block is moved." It doesn't say, "You are moving away from the goal!" or "You are stuck in a loop."
    • Analogy: Imagine playing a maze game where the walls don't tell you if you're going the right way. You just have to guess if you're getting closer. Without a clear "You're wrong!" signal, the AI is just guessing in the dark.
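The difference in feedback can be made concrete with a toy Python comparison. Both snippets are invented illustrations for this summary, not code from the paper:

```python
# Two kinds of feedback an LLM agent can receive (illustrative toy examples).

# 1. Coding agent: the runtime reports *what is wrong*.
try:
    eval("1 +")                              # deliberately broken code
    feedback_code = "no error"
except SyntaxError as e:
    feedback_code = f"SyntaxError: {e.msg}"  # a directed "you're wrong" signal

# 2. Planning agent: the simulator only reports *what happened*.
world = {"A": "table", "B": "table"}
world["A"] = "B"                             # move block A onto block B
feedback_plan = f"state is now {world}"      # no hint whether this helps the goal

print(feedback_code)
print(feedback_plan)
```

The coding agent gets an error message it can act on directly; the planning agent only gets a raw state description and must judge for itself whether it is any closer to the goal.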

The Conclusion

The researchers conclude that interaction alone isn't enough.

Giving an AI a "live" game to play (step-by-step) helps a little bit, but not as much as we hoped. The reason coding agents are so successful is that the computer gives them honest, objective feedback (errors). In pure planning, the AI has to judge its own progress without a referee.

The Final Metaphor:

  • Classical Planners are like a GPS: They calculate the perfect route mathematically.
  • LLMs are like a tourist who has read a travel guide. If the destination is famous (like Blocksworld), they can recite the route from memory.
  • Agentic LLMs are like that tourist walking the streets, asking for directions. But if the locals (the game engine) only say "You walked forward" without telling you if you're getting closer to the hotel, the tourist will eventually get lost, even if they are trying their best.

The Future: For robots to truly plan in the real world, we need to build systems that give the AI clear, honest signals about whether it is making progress, not just raw data about what happened.
