AI Planning Framework for LLM-Based Web Agents

This paper proposes an AI planning framework that maps LLM-based web agent architectures to classical search paradigms (BFS, DFS, best-first), enabling principled failure diagnosis. It introduces new evaluation metrics and a human-labeled dataset, and shows that different agent strategies excel at distinct aspects of task execution, such as human-like trajectory alignment versus technical element accuracy.

Orit Shahnovsky, Rotem Dror

Published 2026-03-16

Imagine you are trying to teach a very smart, but slightly confused, robot how to do your chores on the internet. You tell it, "Go to the store, buy milk, and then check the weather."

In the past, these tools were rigid scripts: "Click here, then click there." If the website changed slightly (say, the "Buy" button moved), the automation would break.

Now, we have AI Agents powered by Large Language Models (LLMs). These are like super-smart interns who can read and understand your request. But there's a problem: they often act like a black box. You see them click things, but you don't know why they clicked them, or why they suddenly forgot they were supposed to buy milk and started looking at cat videos instead.

This paper is like a diagnostic manual for these AI interns. It gives us a new way to understand how they think, how to spot when they are going wrong, and how to measure if they are actually doing a good job.

Here is the breakdown of their ideas using simple analogies:

1. The Three Ways to Plan a Trip

The authors say that all these AI agents are essentially trying to solve a maze. They categorize them into three distinct "planning styles," comparing them to classic search methods:

  • The "Step-by-Step" Agent (The Breadth-First Explorer):

    • Analogy: Imagine you are walking through a forest. At every single step, you stop, look at the three paths right in front of you, pick the best one, take a step, and then stop again to look at the new three paths.
    • How it works: The AI looks at the current webpage, decides on one action (like "click the login button"), does it, sees what happens, and then decides the next move.
    • Pros: It's very flexible. If the website changes, it adapts immediately.
    • Cons: It can get "lost" easily. It might forget the big picture (buying milk) because it's too focused on the immediate next step.
  • The "Tree Search" Agent (The Branching Detective):

    • Analogy: This agent is like a detective who draws a map of every possible path before moving. It imagines "If I click here, then what? If I click there, then what?" It explores many branches of the future at once to find the best route.
    • How it works: It keeps a mental tree of possibilities and picks the most promising branch to follow.
  • The "Full-Plan-in-Advance" Agent (The Depth-First Planner):

    • Analogy: This is the ultimate planner. Before leaving the house, it writes down the entire itinerary: "1. Open browser. 2. Go to Amazon. 3. Search milk. 4. Click buy. 5. Go to weather site." It commits to this entire list before taking a single step.
    • How it works: The AI generates a full list of steps first, then tries to execute them one by one.
    • Pros: It has a clear goal and is less likely to wander off.
    • Cons: If the website is slightly different than expected (e.g., the "Buy" button is in a different spot), the whole plan falls apart, and the agent might get stuck or give up.
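To make the contrast concrete, here is a minimal sketch of the three loop structures on a toy website modeled as a graph. Everything here is illustrative, not the paper's implementation: `SITE`, `SCORE` (a stand-in for an LLM judging how promising a page looks), and the function names are all made up for this example.

```python
import heapq

# Toy site: each page maps an available action to the page it leads to.
SITE = {
    "home":    {"search": "results", "account": "profile"},
    "results": {"open_item": "item", "back": "home"},
    "item":    {"add_to_cart": "cart"},
    "cart":    {"checkout": "done"},
    "profile": {},
}
GOAL = "done"

# Hypothetical "LLM score": how promising a page looks (higher is better).
SCORE = {"home": 1, "results": 2, "item": 3, "cart": 4, "done": 5, "profile": 0}

def reactive_agent(start, max_steps=10):
    """Step-by-step: look only at the current page, greedily take the best action."""
    page, path = start, []
    for _ in range(max_steps):
        if page == GOAL:
            return path
        actions = SITE.get(page, {})
        if not actions:
            return None  # dead end: no memory of the bigger picture to fall back on
        action = max(actions, key=lambda a: SCORE[SITE[page][a]])
        path.append(action)
        page = SITE[page][action]
    return None

def best_first_agent(start):
    """Tree search: keep a frontier of partial paths, expand the most promising one."""
    frontier = [(-SCORE[start], start, [])]
    seen = set()
    while frontier:
        _, page, path = heapq.heappop(frontier)
        if page == GOAL:
            return path
        if page in seen:
            continue
        seen.add(page)
        for action, nxt in SITE.get(page, {}).items():
            heapq.heappush(frontier, (-SCORE[nxt], nxt, path + [action]))
    return None

def full_plan_agent(start, plan):
    """Full-plan-in-advance: commit to a fixed plan, fail if reality diverges."""
    page = start
    for action in plan:
        if action not in SITE.get(page, {}):
            return None  # the live site doesn't match the plan: the agent is stuck
        page = SITE[page][action]
    return page == GOAL
```

Note how the failure modes from the bullets above fall directly out of the code: the reactive agent gives up at a local dead end, while the full-plan agent returns `None` the moment a single action in its itinerary no longer exists on the page.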

2. The Problem with "Pass or Fail"

Currently, we judge these agents like a teacher grading a test with only two options: Pass or Fail.

  • The Flaw: If the agent was supposed to find 5 reviews and it found 4, it gets a "Fail." But that's unfair! It did 80% of the work.
  • The Old Way: We only looked at the final result.
  • The New Way: The authors say we need to grade the process, not just the result. Did the agent take a weird detour? Did it repeat the same mistake 10 times? Did it recover when it made a mistake?
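The difference between the two grading schemes is easy to state in code. This is an illustrative sketch of the idea, not the paper's exact scoring formula:

```python
def binary_score(found, required):
    """Old way: pass only if everything the task asked for was found."""
    return 1.0 if found >= required else 0.0

def partial_score(found, required):
    """New way: credit proportional to how much of the task was completed."""
    return min(found, required) / required
```

Under the old scheme, finding 4 of 5 reviews scores `binary_score(4, 5) == 0.0`; under partial credit it scores `partial_score(4, 5) == 0.8`, which matches the "80% of the work" intuition.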

3. The New Report Card (5 New Metrics)

To fix the "Pass/Fail" problem, they created a new report card with five specific grades:

  1. Recovery Rate: If the agent takes a wrong turn, how quickly does it realize its mistake and get back on the right path? (Like a GPS recalculating your route).
  2. Repetitiveness Rate: Does the agent keep clicking the same button over and over because it's confused? (Like a dog chasing its tail).
  3. Step Success Rate: How many of the steps the agent took actually matched what a human would have done?
  4. Partial Success Rate: If the task was "list 3 movies," did it list 1, 2, or all 3?
  5. Element Accuracy: Did the agent say it was going to click "Submit," but actually clicked "Cancel"? This measures if the agent's intent matches its action.
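Three of these grades can be sketched as simple trajectory comparisons against the human "answer key." These definitions are my illustrative reading of the metrics, not the paper's exact formulas (recovery and partial success follow the same pattern but need task-specific bookkeeping):

```python
def step_success_rate(agent_steps, gold_steps):
    """Fraction of agent steps that match the human step at the same position."""
    if not agent_steps:
        return 0.0
    matches = sum(a == g for a, g in zip(agent_steps, gold_steps))
    return matches / len(agent_steps)

def repetitiveness_rate(agent_steps):
    """Fraction of steps that merely repeat the immediately preceding action."""
    if not agent_steps:
        return 0.0
    repeats = sum(a == b for a, b in zip(agent_steps, agent_steps[1:]))
    return repeats / len(agent_steps)

def element_accuracy(stated_intents, actual_actions):
    """How often the element the agent said it would act on is the one it acted on."""
    if not stated_intents:
        return 0.0
    hits = sum(i == a for i, a in zip(stated_intents, actual_actions))
    return hits / len(stated_intents)
```

For example, if the agent's trajectory is `["click_login", "click_login", "type_user", "submit"]` and the human's is `["click_login", "type_user", "type_pass", "submit"]`, the step success rate is 0.5 (first and last steps match) and the repetitiveness rate is 0.25 (one back-to-back repeat).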

4. The Experiment: The Human vs. The Robot

The authors built a new dataset of 794 tasks where humans actually solved the problems step-by-step. This became the "Gold Standard" (the perfect answer key).

They then tested two agents:

  1. The Step-by-Step Agent (The WebArena Agent): The standard, flexible one.
  2. The Full-Plan-in-Advance Agent (Their new creation): The rigid planner.

The Results:

  • The Step-by-Step Agent was better at following the human path. It was more flexible, recovered from mistakes better, and finished more tasks overall (38% success vs 36%).
  • The Full-Plan-in-Advance Agent was actually more precise when it did click things. It rarely clicked the wrong button (90% accuracy), but it was too rigid. If the plan didn't match the reality perfectly, it got stuck or gave up.

The Big Takeaway

There is no "one size fits all" AI agent.

  • If you are navigating a chaotic, changing website (like a social media feed or a live dashboard), you want the Step-by-Step agent. It's like a hiker who looks at the terrain as they go.
  • If you are navigating a strict, predictable system (like a bank portal or an e-commerce checkout), the Full-Plan-in-Advance agent might be better. It's like a train on a track; it's fast and efficient, but it can't handle a broken bridge.

In short: This paper gives us the vocabulary to stop treating AI agents like magic black boxes. It helps us understand how they think, why they fail, and which type of "thinking style" is best for the specific job we need them to do.
