Imagine you hire a very smart, well-read but slightly clumsy robot assistant to do your online shopping or research for you. You give it a big, complex instruction like, "Find me the cheapest winter coat for my dog in my size, and make sure it's on sale."
This paper is essentially an "autopsy report" on why these robot assistants (called LLM Web Agents) often fail at these tasks, even though they are getting smarter.
The authors argue that we've been looking at the wrong thing. Instead of just asking, "Did the robot finish the job? Yes or No?", they decided to look at how the robot thinks and acts. They broke the robot's brain down into three distinct layers, like a construction crew building a house.
Here is the breakdown using simple analogies:
1. The Three Layers of the Robot's Brain
The authors say every web agent has three jobs, and failures can occur at different points in each:
Layer 1: The Architect (High-Level Planning)
- The Job: This is the part that reads your request and makes a master plan. It says, "First, go to the store. Second, find the dog section. Third, filter by size."
- The Problem: Sometimes the Architect gets confused. It might write a plan that is too vague ("Find a coat") or too specific ("Click the third button on the left"), missing the big picture.
- The Fix: The paper found that if you force the Architect to write the plan in a strict, structured format (a formal planning language called PDDL, the Planning Domain Definition Language) rather than just chatting in normal English, the plans become much clearer and more logical. It's like switching from a messy handwritten note to a precise blueprint.
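To make the "blueprint vs. handwritten note" contrast concrete, here is a small illustrative sketch (not the paper's code, and the site name and field names are made up). The paper uses PDDL; a list of typed steps stands in for the same idea, because a structured plan can be checked mechanically while free-flowing prose cannot:

```python
# Free-flowing plan: easy to write, impossible to verify automatically.
free_text_plan = "Go to the store, look around for dog coats, pick a cheap one on sale..."

# Structured plan: each step has a known action type and explicit arguments.
# (Site and field names are hypothetical, for illustration only.)
structured_plan = [
    {"action": "navigate", "target": "petstore.example.com"},
    {"action": "search",   "query": "dog winter coat"},
    {"action": "filter",   "field": "size",    "value": "medium"},
    {"action": "filter",   "field": "on_sale", "value": True},
    {"action": "sort",     "field": "price",   "order": "ascending"},
]

def validate(plan):
    """A structured plan can be checked step by step before execution."""
    allowed = {"navigate", "search", "filter", "sort", "click"}
    return all(step["action"] in allowed for step in plan)

print(validate(structured_plan))  # → True
```

The point of the structure is exactly the "blueprint" benefit: a vague or nonsensical step fails validation up front, instead of being discovered mid-task.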
Layer 2: The Builder (Low-Level Execution)
- The Job: This is the part that actually touches the screen. It has to click the right button, type the right words, and scroll to the right spot.
- The Problem: This is where the robot fails the most. Even if the Architect draws a perfect blueprint, the Builder is clumsy. It might click the wrong link, get stuck in a loop clicking the same button over and over, or get distracted and wander off to a different website (like searching for "dog coats" on a flight tracker site).
- The Metaphor: Imagine a brilliant architect who designs a perfect bridge, but the construction crew keeps tripping over their own shoelaces and dropping bricks. The paper found that this clumsiness is the biggest bottleneck. The robot just isn't good enough at "seeing" the website and clicking the right things yet.
Layer 3: The Fixer (Replanning)
- The Job: When things go wrong (the website crashes, or the button doesn't work), this part has to say, "Okay, Plan A failed. Let's try a different way."
- The Finding: The paper found that giving the robot a "second chance" to rethink its strategy after it gets stuck actually helps a lot. It's like telling a lost hiker, "Okay, that path is blocked. Let's look at the map again and find a new route." This simple step of "re-planning" significantly improved success rates.
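The "second chance" idea can be sketched as a simple control loop. This is a minimal illustration under my own assumptions, not the paper's implementation: detect that the current plan has failed, hand the failure history back to the planner, and try the fresh plan instead of repeating the same broken action:

```python
# Hypothetical agent loop: execute(step) runs one step and reports the
# outcome; replan(history) produces a new plan from what went wrong.
def run_agent(plan, execute, replan, max_retries=2):
    history = []
    for _ in range(max_retries + 1):
        succeeded = True
        for step in plan:
            result = execute(step)
            history.append((step, result))
            if result == "failed":
                succeeded = False
                break  # stop executing Plan A as soon as it breaks
        if succeeded:
            return "success"
        plan = replan(history)  # the "look at the map again" step
    return "gave_up"
```

The key design choice mirrors the hiker analogy: the agent does not blindly retry the blocked path; it pauses, consults its history, and commits to a genuinely different route.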
2. The Big Takeaways
The authors ran experiments with three different "brains" (AI models) and found three main things:
- Structure helps the Planner: When you force the AI to write its plan in a structured format (PDDL) instead of free-flowing text, it makes better, more concise plans. It stops rambling and gets straight to the point.
- The "Hands" are the problem: The biggest issue isn't that the AI can't think; it's that it can't do. It struggles to translate a good idea into a specific click or type action. It's like having a genius chef who can describe a recipe perfectly but burns the toast every time they try to use the toaster.
- Getting stuck is normal, but fixing it works: Robots get confused easily. But if you let them pause, realize they are stuck, and try a new plan, they succeed much more often.
3. What Should We Do Next?
The paper suggests that to make these robots truly reliable (like a human), we need to stop trying to make one giant AI do everything perfectly. Instead, we should:
- Separate the jobs: Have a specialized module for planning (the Architect) and a different, specialized tool for clicking and typing (the Builder).
- Teach them to admit confusion: Instead of guessing and clicking the wrong thing, the robot should be able to say, "I'm not sure where to click, let me ask for help or try a different approach."
- Stop just looking at the final score: We need to grade them on how they did the task, not just if they finished it. This helps us see if they failed because they were stupid, or just because they were clumsy.
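The three suggestions above can be sketched together in a toy architecture. Everything here is hypothetical scaffolding (the class names, the matching score, and the confidence cutoff are all my own illustration, not from the paper): a Planner module separate from an Executor module, where the Executor admits confusion instead of guessing:

```python
class Planner:
    """Specialized module (the Architect): turns a goal into structured steps."""
    def plan(self, goal):
        return [{"action": "search", "query": goal}]

class Executor:
    """Separate module (the Builder): grounds a step to a concrete UI action."""
    CONFIDENCE_THRESHOLD = 0.7  # hypothetical cutoff for "I'm sure enough to click"

    def ground(self, step, page_elements):
        # Toy scoring: does the step's query appear in the element's text?
        best, score = None, 0.0
        for element in page_elements:
            s = 1.0 if step["query"] in element else 0.0
            if s > score:
                best, score = element, s
        if score < self.CONFIDENCE_THRESHOLD:
            # "Teach them to admit confusion": flag uncertainty, don't guess.
            return {"status": "unsure", "note": "ask for help or replan"}
        return {"status": "click", "element": best}

step = Planner().plan("dog coat")[0]
print(Executor().ground(step, ["dog coat link", "flight tracker"]))
```

Separating the modules also enables the last suggestion: you can grade the Planner's steps and the Executor's clicks independently, instead of only scoring the final outcome.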
In short: These AI web agents are like brilliant but clumsy interns. They can understand the assignment, but they trip over their own feet when trying to do the actual work. To fix them, we need to give them better instructions (structured planning) and better training on how to use a mouse and keyboard (perceptual grounding), rather than just hoping they get smarter at thinking.