Enhancing Web Agents with a Hierarchical Memory Tree

The Big Problem: The "Copy-Paste" Mistake

Imagine you hire a very smart intern (the AI Agent) to help you book flights, buy clothes, or manage your calendar on the internet. You teach this intern how to do these tasks on Website A (let's say, Expedia).

Now, you ask the intern to do the exact same thing on Website B (let's say, Trip.com).

The Old Way (Flat Memory):
The intern looks at their notes from Website A. The notes say: "Click the blue button with the ID number #btn-123 to search."
The intern goes to Website B, looks for a button with the ID #btn-123, and... nothing happens. That button doesn't exist on the new site. The intern gets confused, clicks the wrong thing, or gives up.

This is what the paper calls "Intention-Execution Entanglement." The intern mixed up the goal (book a flight) with the specific details of the old website (the button ID). It's like trying to open a new car door using the key from your old car; the shape of the key (the goal) is right, but the specific teeth (the website details) don't fit.

The Solution: The "Hierarchical Memory Tree" (HMT)

The authors propose a new way for the AI to remember things. Instead of a flat list of notes, they build a Tree of Knowledge with three distinct levels. Think of it like a Master Chef teaching a new kitchen assistant.

Level 1: The "Intent" (The Menu)

What it is: The high-level goal.
Analogy: Instead of remembering "Buy a flight to NYC on Expedia," the AI remembers the abstract concept: "Book a Flight."
Why it helps: It doesn't matter if the website is Expedia, Trip.com, or a tiny local agency. The goal is always "Book a Flight." This is the root of the tree.

Level 2: The "Stage" (The Recipe Steps)

What it is: The logical phases of the task.
Analogy: The Chef breaks the task down into steps:
1. Find the search form. (Pre-condition: The form must be visible).
2. Enter the destination.
3. Click "Search".
4. Select a flight.
Why it helps: The AI checks the current website. "Okay, I see a search form. I am at Stage 1." It doesn't jump straight to "Click Search" if the form isn't there yet. It aligns the memory with the current state of the page, not just the first instruction.

Level 3: The "Action" (The Ingredients & Tools)

What it is: How to actually do the step, but described by what the thing looks like, not its secret code.
Analogy:
- Old Way: "Click the button with ID #btn-123." (Useless on a new site).
- New Way (HMT): "Click the button that says 'Search', is located at the bottom right of the form, and looks like a magnifying glass."
Why it helps: Even if the new website uses a different code for the button, the button will usually still say "Search" and sit in a similar spot. The AI can find it by description, not by a secret ID.
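To make the difference concrete, here is a minimal Python sketch of the two lookup styles. Everything in it is hypothetical and chosen for illustration: the `Element` class, the two helper functions, and the ID `q-submit` are not from the paper, and a real page element carries far more information than three fields.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """A heavily simplified page element (illustrative assumption)."""
    dom_id: str
    role: str
    label: str


def find_by_id(page: List[Element], dom_id: str) -> Optional[Element]:
    """Old way: succeeds only if the exact ID exists on this page."""
    return next((el for el in page if el.dom_id == dom_id), None)


def find_by_description(page: List[Element], role: str, label: str) -> Optional[Element]:
    """New way: match on what the element is and what it says."""
    wanted = label.strip().lower()
    return next(
        (el for el in page
         if el.role == role and el.label.strip().lower() == wanted),
        None,
    )


# Website B uses different IDs than Website A, but still has a "Search" button.
site_b = [
    Element("nav-1", "link", "Deals"),
    Element("q-submit", "button", "Search"),
]

by_id = find_by_id(site_b, "btn-123")                         # the ID memorized on Website A
by_desc = find_by_description(site_b, "button", "search")     # the description
```

The ID lookup comes back empty on the new site, while the description lookup still lands on the right button.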

How It Works: The "Planner" and the "Actor"

The system splits the brain of the AI into two roles to make sure it doesn't get confused:

The Planner (The Manager):
- Job: Looks at the current website and asks, "Where are we in the recipe?"
- Action: It checks the "Stage" level of the tree. "Is the search form visible? Yes. Okay, we are ready to search." It prevents the AI from trying to "Book" a flight before it has even "Found" one.
- Safety Net: If the AI isn't sure which stage it's in, the Planner has a "panic button" (Confidence Fallback) to stop and try a different strategy instead of guessing wrong.
The Actor (The Doer):
- Job: Actually clicks the buttons.
- Action: The Manager hands the Actor a description: "Find the 'Search' button." The Actor scans the new website, finds the button that matches that description, and clicks it. It ignores the old, useless button IDs.

Why This Matters (The Results)

The researchers tested this on two big test suites (Mind2Web and WebArena).

The Result: The new system (HMT) was much better at working on websites it had never seen before compared to the old "Flat Memory" systems.
The Analogy:
- Old System: Like a tourist who memorized the exact GPS coordinates of a restaurant in Paris. If they go to London, the coordinates lead them to a park.
- New System (HMT): Like a tourist who knows how to ask for directions: "Find a place that serves croissants." They can find a bakery in Paris, London, or Tokyo, even if the buildings look different.

Summary

This paper introduces a smarter way for AI to learn from the internet. Instead of memorizing exact clicks (which break when websites change), it learns logical steps and visual descriptions. By separating the "What" (Goal) from the "How" (Action), the AI can carry its skills over to websites it has never seen, making it a true "Web Agent" rather than just a script for one specific page.

1. Problem Statement

Large Language Model (LLM) based web agents struggle to generalize to unseen websites, despite their strong reasoning capabilities. The core issue identified is Intention-Execution Entanglement:

Flat Memory Limitations: Current methods store interaction trajectories as flat, linear sequences of observations and actions.
The Entanglement: These flat structures conflate high-level task logic (transferable) with low-level, site-specific action details (non-transferable), such as specific DOM IDs or coordinates.
Consequence: When an agent retrieves a memory from a source website to apply to a target website, the high-level intent matches, but the specific action details (e.g., clicking #btn-123) are invalid on the new site. This leads to workflow mismatch (executing steps out of order) and context pollution (retrieving irrelevant actions), causing execution failures in cross-website and cross-domain scenarios.

2. Methodology: Hierarchical Memory Tree (HMT)

The authors propose HMT, a structured framework that explicitly decouples logical planning from action execution using a three-level abstraction hierarchy.

A. Memory Structure (The Tree)

HMT transforms raw interaction trajectories into a tree with three distinct levels:

Intent Level (Root): Maps diverse natural language instructions to standardized task goals (e.g., "Book Flight") and constraints. This normalizes phrasing variations to ensure consistent retrieval.
Stage Level (Intermediate): Defines reusable semantic subgoals (e.g., "Search Flights"). Crucially, each stage is characterized by observable pre-conditions and post-conditions (e.g., "Search form visible" → "Results list visible"). This allows the agent to align retrieval with its current progress based on the visual state of the page, not just the initial instruction.
Action Level (Leaf): Stores abstract action patterns paired with transferable semantic element descriptions. Instead of storing raw IDs (e.g., #btn-123), it stores semantic attributes like role, label, relative position, and structural context (e.g., "Button labeled 'Search' at the bottom-right of the form").
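
The three levels above map naturally onto nested records. The following is a minimal Python sketch of one possible representation, not the authors' implementation: the class names (`IntentNode`, `StageNode`, `ActionNode`) and field names are assumptions, chosen to mirror the attributes listed in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ActionNode:
    """Leaf: an abstract action plus a transferable element description."""
    action: str              # abstract pattern, e.g. "click" or "type"
    role: str                # e.g. "button"
    label: str               # e.g. "Search"
    position: str = ""       # relative position, e.g. "bottom-right"
    context: str = ""        # structural context, e.g. "search form"
    # Deliberately no DOM ID, XPath, or coordinate field.


@dataclass
class StageNode:
    """Intermediate: a reusable subgoal with observable conditions."""
    name: str
    pre_conditions: List[str]
    post_conditions: List[str]
    actions: List[ActionNode] = field(default_factory=list)


@dataclass
class IntentNode:
    """Root: a standardized task goal and its constraints."""
    goal: str
    constraints: Dict[str, str] = field(default_factory=dict)
    stages: List[StageNode] = field(default_factory=list)


tree = IntentNode(
    goal="Book Flight",
    constraints={"destination": "NYC"},
    stages=[
        StageNode(
            name="Search Flights",
            pre_conditions=["Search form visible"],
            post_conditions=["Results list visible"],
            actions=[
                ActionNode("click", "button", "Search",
                           position="bottom-right", context="search form"),
            ],
        ),
    ],
)
```

Leaving identifier fields out of `ActionNode` entirely is a design choice of this sketch: it makes it impossible for site-specific details to leak into the stored memory.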

B. Construction Pipeline

The memory is built via an automated pipeline:

Instruction Normalization: LLMs rewrite raw instructions into standardized intents.
Subgoal Segmentation: Trajectories are partitioned into contiguous segments based on semantic shifts, with strict consistency checks.
Step Abstraction: Raw actions are converted into abstract patterns with semantic descriptions, discarding site-specific identifiers.
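
The last two pipeline steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's pipeline: each raw step is assumed to already carry a `subgoal` label (in the paper, segmentation is driven by semantic shifts and consistency checks that are not reproduced here), and the set of keys treated as site-specific is hypothetical.

```python
from itertools import groupby

# Keys treated as site-specific and discarded (an assumption of this sketch).
SITE_SPECIFIC_KEYS = {"dom_id", "xpath", "coordinates"}


def abstract_step(raw_step):
    """Keep the action type and semantic attributes; drop site-specific keys."""
    return {
        key: value
        for key, value in raw_step.items()
        if key not in SITE_SPECIFIC_KEYS and key != "subgoal"
    }


def segment_by_subgoal(trajectory):
    """Group consecutive steps sharing a subgoal label into contiguous segments."""
    return [
        (subgoal, [abstract_step(step) for step in steps])
        for subgoal, steps in groupby(trajectory, key=lambda s: s["subgoal"])
    ]


trajectory = [
    {"subgoal": "Search Flights", "action": "type", "role": "textbox",
     "label": "Destination", "dom_id": "inp-7"},
    {"subgoal": "Search Flights", "action": "click", "role": "button",
     "label": "Search", "dom_id": "btn-123"},
    {"subgoal": "Select Flight", "action": "click", "role": "link",
     "label": "Select", "xpath": "//div[3]/a"},
]

segments = segment_by_subgoal(trajectory)
```

Because `groupby` only merges adjacent items, a subgoal that reappears later in the trajectory starts a new segment, which keeps every segment contiguous.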

C. Stage-Aware Inference Mechanism

At test time, HMT employs a Planner-Actor decomposition:

Planner: Performs state abstraction and Stage Selection. It matches the current observation against the pre/post-conditions of retrieved subgoals to ensure temporal consistency. It includes a Confidence-Aware Fallback mechanism: if the confidence in stage selection is low, it expands the search scope or reverts to a baseline policy to avoid error propagation.
Actor: Performs Action Grounding. It receives the selected stage and abstract semantic descriptions. It scans the current page's DOM to find the element that best matches the semantic description (e.g., finding the "Search" button) rather than relying on stored IDs.
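
A minimal sketch of the Planner's stage selection with a confidence fallback follows. The scoring rule (the fraction of a stage's pre-conditions found in the current observation), the 0.5 threshold, and the function name `select_stage` are all assumptions made for illustration; the summary does not specify how the paper computes confidence.

```python
def select_stage(stages, observed, threshold=0.5):
    """Return (stage_name, confidence); stage_name is None when the
    caller should fall back (e.g. widen the search or use a baseline)."""
    best_name, best_score = None, 0.0
    for stage in stages:
        pre = stage["pre_conditions"]
        post = stage["post_conditions"]
        # A stage whose post-conditions already hold is treated as finished.
        if post and all(cond in observed for cond in post):
            continue
        # Confidence: share of pre-conditions visible in the current state.
        score = sum(cond in observed for cond in pre) / len(pre) if pre else 0.0
        if score > best_score:
            best_name, best_score = stage["name"], score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


stages = [
    {"name": "Search Flights",
     "pre_conditions": ["Search form visible"],
     "post_conditions": ["Results list visible"]},
    {"name": "Select Flight",
     "pre_conditions": ["Results list visible"],
     "post_conditions": ["Flight details visible"]},
]

on_home_page = select_stage(stages, {"Search form visible"})
on_results_page = select_stage(stages, {"Search form visible", "Results list visible"})
on_unknown_page = select_stage(stages, {"Cookie banner visible"})
```

On the results page the search form is still visible, but "Search Flights" is skipped because its post-condition already holds, so the Planner moves on to "Select Flight". On the unrecognized page no stage clears the threshold and the function signals a fallback instead of guessing.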

3. Key Contributions

Hierarchical Memory Architecture: Introduced HMT to solve intention-execution entanglement by organizing memory into Intent, Stage, and Action levels.
Semantic Grounding: Developed a step-level abstraction method that stores transferable semantic descriptions instead of raw identifiers, enabling grounding on new websites without prior exposure.
Stage-Aware Inference: Created a Planner-Actor framework with a confidence-aware fallback to handle uncertain retrieval and ensure workflow alignment.
Empirical Validation: Demonstrated significant performance gains in cross-website and cross-domain generalization compared to flat-memory baselines.

4. Experimental Results

The authors evaluated HMT on Mind2Web (offline generalization) and WebArena (online interactive execution).

Mind2Web (Cross-Website): HMT significantly outperformed flat-memory methods.
- Step Success Rate (StepSR): Improved by 6.0 percentage points over the best baseline (AWM) in the Cross-Website split (39.7% vs. 33.7%).
- Cross-Domain: Showed consistent improvements, indicating that the approach also generalizes to entirely new domains.
WebArena: HMT achieved the highest total task success rate (38.7%), with substantial gains in logic-heavy domains like GitLab (+5.8%) and CMS (+5.0%).
Ablation Studies:
- Removing the hierarchy (Flat Memory) caused a significant drop in performance.
- Replacing semantic descriptions with raw identifiers caused a catastrophic failure in cross-website settings (StepSR dropped from 39.7% to 12.4%, a loss of 27.3 percentage points), underscoring the need for semantic abstraction.
- Removing the Planner (state verification) degraded performance, highlighting the need for stage alignment.
Efficiency: HMT reduced average context tokens by 72.7% and inference latency by 32.7% (3.5s vs 5.2s) by compressing raw HTML into semantic nodes.
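
The headline numbers mix two kinds of change, so a quick arithmetic check may help. The sketch below uses only figures quoted above; the helper names are hypothetical. It shows that the StepSR gap is an absolute difference in percentage points (about 17.8% in relative terms), while the latency figure is a relative reduction.

```python
def point_gain(new_pct, old_pct):
    """Absolute difference between two percentages, in percentage points."""
    return new_pct - old_pct


def relative_change(new, old):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


# StepSR on the Cross-Website split: HMT 39.7% vs. AWM 33.7%.
step_sr_points = point_gain(39.7, 33.7)
step_sr_relative = relative_change(39.7, 33.7)

# Inference latency: 3.5 s for HMT vs. 5.2 s for the baseline.
latency_reduction = -relative_change(3.5, 5.2)

print(f"StepSR gain: {step_sr_points:.1f} points ({step_sr_relative:.1f}% relative)")
print(f"Latency reduction: {latency_reduction:.1f}%")
```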

5. Significance and Limitations

Significance: HMT points toward a new standard for transferable web agents. By decoupling what needs to be done (logic) from how it is done on a specific site (execution), it enables robust lifelong learning and generalization across the open web. It addresses the "brittleness" of current agents that fail when DOM structures change.
Limitations:
- Ambiguous Grounding: In cases where semantic descriptions are too generic (e.g., "Click 'more'"), the agent may struggle to distinguish between distractors without hierarchical structural constraints.
- State Verification: In Single Page Applications (SPAs) where actions trigger visual updates (modals) without URL changes, the Planner's rigid post-condition checks (relying on URL changes) may fail, causing unnecessary retry loops.

In conclusion, the paper argues that structured, hierarchical memory is essential for the next generation of web agents to achieve true generalization beyond fixed environments.