WebExpert: domain-aware web agents with critic-guided expert experience for high-precision search

Imagine you are trying to solve a very tricky puzzle, like figuring out the best way to invest your money or understanding a complex medical diagnosis. You decide to ask a very smart, fast robot to search the internet for you.

The Problem with the "Generic" Robot
Most current web-search robots are like generalist tourists. They are fast and can read a lot, but they don't know the local customs.

If you ask, "What's the best time to buy stocks?", a tourist might just search for "best time to buy stocks."
They might miss crucial details like which country you are in, what the current laws are, or what season it is.
As a result, they wander around the internet, visit too many irrelevant websites, and often give you an answer that sounds good but is actually wrong or outdated. They waste time and get confused by "noise."

The Solution: Meet "WebExpert"
The researchers behind this paper built a new kind of robot called WebExpert. Think of WebExpert not as a tourist, but as a seasoned local guide who has read thousands of expert manuals and learned from the mistakes of others before you.

Here is how WebExpert works, broken down into three simple steps using analogies:

1. The "Experience Library" (Critic-Guided Extraction)

Before WebExpert even starts searching, it opens a special library of "Expert Notes."

How it works: The team took thousands of real-world questions and answers (like "How does asset correlation affect diversification?") and asked a super-smart AI to read them.
The Magic: Instead of just saving the whole answer, the AI acts like a critical editor. It cuts out the fluff and extracts the core rule.
- Example: Instead of saving a 5-page article, it saves a sticky note that says: "Diversification works best when assets are uncorrelated. Always check the time period and region."
The Result: WebExpert has a mental cheat sheet of "rules of thumb" for specific topics like finance, medicine, or law.

2. The "Smart Search Plan" (Schema-Light Facet Induction)

When you ask a question, WebExpert doesn't just type it into Google. It first checks its Expert Notes to see what specific details (facets) matter.

The Analogy: Imagine you are ordering a pizza. A generic robot just orders "Pizza." A good waiter first asks about size, toppings, and delivery address. WebExpert does the same for your question: "Wait, the expert notes say that for 'Finance' questions, I need to know the Region, the Time, and the Policy."
The Action: It automatically builds a better search query. Instead of "best stocks," it searches for "best stocks for US investors in 2024 under current SEC regulations."
The Safety Net: If the robot isn't confident that any expert note fits your question, it has a "fallback" mode. It doesn't guess; it switches to a safe, general search so that the wrong note doesn't narrow the search too much.

3. The "Deep Dive" (Preference-Optimized Planning)

Once it has a great search plan, WebExpert goes out and explores the web.

The Difference: Because it started with a precise plan, it doesn't need to click through as many websites to find the answer. It reaches the right page in fewer hops.
The Training: The robot was trained using a "reward system." If it finds the right evidence quickly, it gets a "gold star." If it wanders off-topic, it gets a "red card." Over time, it learned to be extremely efficient.

Why Does This Matter? (The Results)

The researchers tested WebExpert on hard puzzles involving finance, science, and general knowledge.

Generic Robots: Often got the answer wrong or took too many steps (like walking 8 miles to find a cup of coffee).
WebExpert: Got the answer right more often, by 1.5 to 3.6 percentage points (a meaningful gain on puzzles this hard), and took fewer steps (5.2 page visits instead of 8.1).

The Big Picture

Think of WebExpert as upgrading from a flashlight to a GPS with a local guide.

The flashlight (generic AI) just shines light everywhere, hoping to see something useful.
The GPS (WebExpert) knows the terrain, knows the traffic rules, and knows exactly which turn to take to get you to your destination without getting lost.

This paper shows that by giving AI access to "expert experience" and teaching it to ask better questions before it starts searching, we can make it much smarter, faster, and more reliable in specialized fields like medicine and finance.

1. Problem Statement

Specialized web tasks in high-stakes domains (finance, biomedicine, pharmaceuticals) pose significant challenges for current Large Language Model (LLM) web agents. Generic agents often fail in these scenarios due to:

Missing Domain Priors: They lack specific knowledge regarding seasonality, regional regulations, or industry-specific granularity.
Query Drift: Agents formulate off-target queries, leading to irrelevant search results.
Noisy Evidence: Agents wander through irrelevant pages, missing critical evidence.
Brittle Reasoning: Without structured domain constraints, reasoning chains collapse when evidence is ambiguous.

Existing Retrieval-Augmented Generation (RAG) and "Reason-then-Search" systems rely on generic priors and static hand-written lexicons, which are insufficient for dynamic, domain-specific contexts.

2. Methodology: WebExpert

WebExpert is an end-to-end domain-aware web agent designed to inject sentence-level expert priors into the search process. It operates via a three-step pipeline:

A. Offline: Critic-Guided Expert Experience Extraction

The system constructs a reusable "Expert Experience Base" ($E$) from annotated QA pairs and curated expert materials. This process involves:

Question Harvesting & Canonicalization: Normalizing surface forms of questions to extract canonical intents.
Multi-View Clustering: Using HDBSCAN or BERTopic to cluster QA tuples based on semantic similarity (combining question, answer, and co-encoded representations). This groups semantically similar problems even if answers differ in granularity.
Evidence Aggregation: Aggregating answers and rationales within clusters, applying Maximal Marginal Relevance (MMR) to ensure diversity and de-duplicate sources.
Contradiction-Aware Summarization: A specialized LLM (DeepSeek-R1) summarizes the cluster into a concise Rule ($r_m$). This rule includes conditions, core guidance, edge cases, and failure modes. A consistency check filters self-contradictory statements.
Schema-Light Facet Induction: Instead of relying on static hand-written schemas, the system automatically induces facet vocabularies (Time, Region, Policy, Industry) from weak supervision and corpus statistics.
Versioning: The experience base is maintained as a versioned store to allow streaming updates.
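
The Evidence Aggregation step above can be illustrated with a minimal sketch of Maximal Marginal Relevance. This is not the authors' implementation: the function names, the trade-off weight `lam`, and the toy 2-D vectors are assumptions chosen for illustration, and a real pipeline would use sentence embeddings instead.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, cand_vecs, k, lam=0.5):
    """Return indices of up to k candidates chosen by Maximal Marginal Relevance.

    Each step picks the candidate that maximizes
        lam * sim(candidate, query)
        - (1 - lam) * max sim(candidate, already selected).
    `lam` is a hypothetical trade-off weight, not a value from the paper.
    """
    selected = []
    remaining = list(range(len(cand_vecs)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(cand_vecs[i], query_vec)
            redundancy = max(
                (cosine(cand_vecs[i], cand_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy vectors: candidates 0 and 1 are near-duplicates, candidate 2 is distinct.
query = [1.0, 1.0]
candidates = [[1.0, 0.9], [1.0, 0.8], [0.0, 1.0]]

diverse = mmr_select(query, candidates, k=2, lam=0.5)    # [0, 2]
relevance_only = mmr_select(query, candidates, k=2, lam=1.0)  # [0, 1]
```

With `lam=0.5` the near-duplicate candidate 1 is skipped in favor of the distinct candidate 2, which is the de-duplication effect the step describes. With `lam=1.0` the selection degenerates to plain relevance ranking.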

B. Online Inference: Experience-Guided Planning

During inference, WebExpert integrates the experience module before deep browsing:

Experience Retrieval: Retrieves the top-$k$ relevant rules ($E^{(k)}$) based on cosine similarity between the query and the experience base.
Experience Gate: A lightweight gate computes retrieval confidence.
- If confidence is high, it biases decoding toward active facets (e.g., specific time spans, regions, policies) to generate domain-grounded queries.
- If confidence is low (below threshold $\theta=0.3$), it falls back to generic query generation to avoid over-constraint.
Deep Browsing: The generated multi-query plan is fed to a search-and-browse controller that interleaves retrieval and reasoning to synthesize the final answer.
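
The retrieval and gating steps above can be sketched as follows. Several pieces are assumptions made for illustration: the gate's confidence is taken to be the top cosine similarity, facet biasing is approximated by appending facet values to the question (the paper biases decoding instead), and the rule records, vectors, and function names are hypothetical. Only the threshold 0.3 and the top-5 default come from the text.

```python
from math import sqrt

THETA = 0.3  # fallback threshold stated in the text


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, rules, k=5):
    """Rank rules by cosine similarity to the query and keep the best k."""
    scored = [(cosine(query_vec, rule["vec"]), rule) for rule in rules]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def plan_queries(question, query_vec, rules, k=5, theta=THETA):
    """Return (mode, queries); mode is "experience" or "generic"."""
    hits = retrieve_top_k(query_vec, rules, k)
    # Assumption: gate confidence is the best similarity score.
    confidence = hits[0][0] if hits else 0.0
    if confidence < theta:
        # Low confidence: fall back to the unconstrained question.
        return "generic", [question]
    queries = []
    for score, rule in hits:
        if score >= theta:
            # Stand-in for facet-biased decoding: append the facet values.
            facets = " ".join(rule["facets"].values())
            queries.append(f"{question} {facets}")
    return "experience", queries


# Hypothetical experience base with toy 3-D vectors.
RULES = [
    {
        "text": "Diversification works best when assets are uncorrelated.",
        "vec": [0.9, 0.1, 0.0],
        "facets": {"region": "US", "time": "2024"},
    },
    {
        "text": "placeholder rule for an unrelated topic",
        "vec": [0.0, 0.2, 0.9],
        "facets": {"region": "EU"},
    },
]

on_topic = plan_queries("best stocks", [1.0, 0.0, 0.0], RULES)
off_topic = plan_queries("best stocks", [0.0, 1.0, 0.0], RULES)
```

Here `on_topic` takes the experience path because the first rule scores well above 0.3, while `off_topic` matches neither rule closely and falls back to the plain question.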

C. Training: Preference-Optimized Planning

The model (QwQ-32B) is fine-tuned using a multi-objective approach:

Token-Level Weighting: The loss function ($L_{plan}$) up-weights tokens that activate domain facets (time, region, etc.) and down-weights off-facet tokens.
Contrastive Retrieval Objective ($L_{ret}$): Optimizes the retrieval of high-quality experiences aligned with the query against hard negatives.
Coverage & Preference: Incorporates coverage-aware objectives and pairwise preference learning to ensure the agent selects the most relevant rules.
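
The summary does not give the exact form of these objectives, so the following is only one plausible reading, sketched under stated assumptions: $L_{plan}$ is written as a weighted mean negative log-likelihood with a binary facet/non-facet split, and $L_{ret}$ as a hinge loss with a margin over hard negatives. The weights `w_facet`, `w_other`, and `margin` are hypothetical values, not the paper's.

```python
from math import log


def weighted_plan_loss(token_probs, is_facet, w_facet=2.0, w_other=0.5):
    """Weighted mean negative log-likelihood over target query tokens.

    token_probs: model probability assigned to each target token.
    is_facet:    True where the token activates a domain facet.
    Facet tokens get the larger (hypothetical) weight w_facet.
    """
    weights = [w_facet if flag else w_other for flag in is_facet]
    nll = [-log(p) for p in token_probs]
    return sum(w * l for w, l in zip(weights, nll)) / sum(weights)


def margin_retrieval_loss(sim_pos, sim_negs, margin=0.2):
    """Hinge loss: each hard negative should score at least `margin`
    below the positive experience. Returns the mean violation."""
    if not sim_negs:
        return 0.0
    violations = [max(0.0, margin - sim_pos + s) for s in sim_negs]
    return sum(violations) / len(violations)


# Missing a facet token costs more than missing an ordinary token.
miss_facet = weighted_plan_loss([0.1, 0.9], [True, False])
miss_other = weighted_plan_loss([0.9, 0.1], [True, False])

# One negative (0.7) sits inside the margin of the positive (0.8).
ret_loss = margin_retrieval_loss(0.8, [0.5, 0.7])
```

In this toy case `miss_facet` is larger than `miss_other`, which is the intended effect of up-weighting facet tokens, and `ret_loss` is nonzero only because the second negative falls within the margin.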

3. Key Contributions

Critic-Guided Extraction Chain: A novel pipeline that converts raw expert data into reusable, sentence-level rules with explicit conditions and failure modes, rather than static knowledge graphs.
Schema-Light Facet Induction: Automatically induces domain-specific facets (time, region, policy, industry) from data statistics, eliminating the need for manual schema engineering.
Experience-Conditioned Planning: A training framework combining Supervised Fine-Tuning (SFT), retrieval margins, and preference optimization to jointly improve query planning and evidence retrieval.
Robust Inference Mechanism: A dynamic "experience gate" that balances domain-specific guidance with fallback mechanisms to prevent hallucination or over-constraint.

4. Experimental Results

The system was evaluated on GAIA, GPQA, HLE, and WebWalkerQA, comparing against strong baselines like Search-o1, WebThinker, and standard RAG.

Performance Gains: WebExpert achieved consistent improvements in Answer Exact Match (EM):
- GAIA: +1.5 pp over the strongest baseline (WebThinker).
- GPQA: +1.5 pp.
- HLE: +1.5 pp.
- WebWalkerQA: +1.8 pp.
- With SFT, gains ranged from 1.5 to 3.6 pp across datasets.
Efficiency: The number of Page Hops (pages visited per solved example) fell (e.g., from 8.1 to 5.2), indicating more direct and precise search paths.
Query Quality: Query Precision@3 (QP@3) improved from 49.3% (baseline) to 61.8% (WebExpert+SFT), a gain of 12.5 pp, indicating better retrieval of on-topic evidence.
Ablation Studies:
- Removing sentence-level embeddings or SFT caused the largest performance drops.
- Topic merging was crucial for generalizing rules.
- Retrieving top-5 experiences provided the best balance between precision and coverage.
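
To put the efficiency and query-quality figures on a common footing, the short calculation below converts them to relative changes. It uses only the numbers quoted above; the helper name `relative_change` is an illustrative choice.

```python
def relative_change(before, after):
    """Signed relative change from `before` to `after`."""
    return (after - before) / before


hop_change = relative_change(8.1, 5.2)      # page hops per solved example
qp3_gain_pp = 61.8 - 49.3                   # absolute gain, percentage points
qp3_relative = relative_change(49.3, 61.8)  # relative gain in QP@3

print(f"Page hops: {hop_change:.1%}")
print(f"QP@3: +{qp3_gain_pp:.1f} pp ({qp3_relative:.1%} relative)")
```

This works out to roughly 35.8% fewer page hops and a QP@3 gain of 12.5 pp, or about 25.4% relative to the baseline.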

5. Significance

WebExpert addresses a critical gap in AI agents: the transition from general-purpose browsing to domain-specialized precision. By distilling expert knowledge into dynamic, sentence-level rules and using them to steer query generation, the system reduces "token waste" and irrelevant browsing.

Practical Impact: It offers a scalable solution for industries (e.g., finance, healthcare) where accuracy and regulatory compliance are paramount, reducing the reliance on brittle, hand-crafted rules.
Technical Advancement: It demonstrates that pre-computed, critic-refined experience combined with preference-optimized training is more effective than relying solely on the LLM's internal reasoning or generic RAG.
Reproducibility: The authors provide the code and a clear pipeline (UMAP, HDBSCAN, BERTopic) for constructing domain-specific agent knowledge bases, facilitating adoption in other specialized fields.