WebDS: An End-to-End Benchmark for Web-based Data Science

This paper introduces WebDS, the first end-to-end benchmark for web-based data science comprising 870 complex, multi-step tasks across diverse websites, which reveals significant performance gaps between current LLM agents and human capabilities in real-world data acquisition, analysis, and insight generation.

Ethan Hsu, Hong Meng Yam, Ines Bouissou, Aaron Murali John, Raj Thota, Josh Koe, Vivek Sarath Putta, G K Dharesan, Alexander Spangher, Shikhar Murty, Tenghao Huang, Christopher D. Manning

Published 2026-03-05

Imagine you hire a brilliant, super-fast intern named "AI" to do a complex research project. You tell them: "Go find the latest unemployment numbers, compare them with health statistics from three different government websites, figure out if there's a connection, and write a report for our boss."

In the past, we tested these AI interns on very simple tasks, like "Find the price of a specific shoe on one website" or "Write a sentence about the weather." They were great at those. But in the real world, data science is messy. It's like trying to cook a gourmet meal while the kitchen is on fire, the ingredients are scattered across three different grocery stores, and the recipe keeps changing.

This paper introduces WebDS, a new, super-hard test designed to see if AI can actually handle that messy, real-world cooking.

Here is the breakdown of what they did and what they found, using some everyday analogies:

1. The Problem: The "Video Game" vs. The "Real World"

Think of previous AI tests like video games. In a video game, the rules are fixed, the map doesn't change, and the items are always in the same spot. If you practice enough, you can beat the level.

  • Old Tests: These were like video games. They asked AI to click a button or find a number on a static page.
  • The Real World: The internet is more like a live city. Buildings get renovated, signs change, traffic jams happen, and you have to walk from one neighborhood to another to get the full picture.

The authors realized that current AI agents are like video game champions who get completely lost when dropped into a real city. They can't handle the chaos of real data science.

2. The Solution: The "WebDS" Obstacle Course

The researchers built WebDS (Web Data Science). Instead of a video game, they built a massive obstacle course spanning 29 real websites.

  • The Course: It involves 870 different tasks. Some are easy (like finding a phone number), but most are hard.
  • The Challenge: The AI has to:
    • Navigate: Walk through different websites (like going from a library to a bank to a newsstand).
    • Mix Ingredients: Take a spreadsheet from one site and a news article from another and combine them.
    • Use Tools: Sometimes it has to download a file, run a math calculation, or write a code script.
    • Deliver: Finally, it has to write a summary or a report.

It's like asking the intern to go to the grocery store, the hardware store, and the library, buy specific items, mix them in a blender, and then bake a cake, all without getting distracted by the shiny toys in the aisles.

3. The Results: The "AI vs. Human" Gap

The researchers put the smartest AI models available (like GPT-4o and others) on this test. The results were shocking:

  • The AI Performance: The best AI only got about 13% to 22% of the tasks right.
  • Analogy: Imagine a student taking a final exam and scoring under 25%, a failing grade. They knew how to read the questions, but they couldn't connect the dots to solve the problem.
  • The Human Performance: Real humans (who were given the same tools and time) got about 90% right.
    • Analogy: The humans got an A+. They knew how to adapt when a website was confusing, how to double-check their work, and how to realize when they were going down the wrong path.

The Gap: There is a massive gap of nearly 70 points between the best AI and a human. This tells us that while AI is getting smarter at talking, it is still terrible at doing complex, multi-step work in the real world.

4. Why Did the AI Fail? (The "Failure Modes")

The paper analyzed why the AI failed, and it found some funny and frustrating reasons:

  • The "Hallucinating Librarian": The AI would find the right document but then read the wrong number from it. It's like finding the correct page in a book but reading the paragraph from the page before it.
  • The "Stuck Record": If a button didn't work, the AI would just keep clicking it 50 times, hoping it would work the next time. It didn't know how to say, "Okay, this isn't working, let's try a different door."
  • The "Shortcut Taker": Instead of doing the hard work of downloading a file and analyzing it, the AI would just guess the answer based on a quick Google search, often getting it wrong.
  • The "Lost Tourist": The AI would get confused by similar names (e.g., going to the "Public Transportation" site instead of the "Physical Therapy" site) and wander off to the wrong neighborhood.
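The "Stuck Record" failure has a standard engineering fix: cap the number of retries and fall back to a different action instead of repeating the failing one forever. A minimal sketch (the function and the toy "actions" are illustrative, not from the paper):

```python
def act_with_fallback(actions, max_retries=3):
    """Try each candidate action a bounded number of times.

    `actions` is a list of zero-argument callables; a return value of
    None means "this attempt failed".
    """
    for action in actions:                # the "different doors" to try
        for _ in range(max_retries):      # bounded, never 50 clicks
            result = action()
            if result is not None:        # success: stop immediately
                return result
    return None                           # every door failed

# Toy demo: the first "button" never works; the fallback link does.
broken_button = lambda: None
working_link = lambda: "page loaded"

print(act_with_fallback([broken_button, working_link]))  # page loaded
```

The point is the bound: a fixed retry budget forces the agent to say "this isn't working" and move on, which is exactly what the failed agents in the paper did not do.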

5. Why This Matters

This paper is a wake-up call. For a long time, we thought AI was almost ready to take over our jobs. This test shows that while AI is great at answering simple questions, it is not yet ready to be a reliable data scientist or office worker.

The Takeaway:
We need to stop testing AI on easy "video game" tasks and start testing them on these messy, real-world "obstacle courses." If we want AI to be truly useful in the future, we need to teach it how to navigate the chaos of the real internet, not just the clean, predictable world of video games.

In short: The AI is a brilliant scholar who can recite the dictionary but can't navigate a busy city to buy groceries. WebDS is the map we need to help it learn how to get there.