WebDS: An End-to-End Benchmark for Web-based Data Science

This paper introduces WebDS, the first end-to-end benchmark for web-based data science comprising 870 complex, multi-step tasks across diverse websites, which reveals significant performance gaps between current LLM agents and human capabilities in real-world data acquisition, analysis, and insight generation.

Ethan Hsu, Hong Meng Yam, Ines Bouissou, Aaron Murali John, Raj Thota, Josh Koe, Vivek Sarath Putta, G K Dharesan, Alexander Spangher, Shikhar Murty, Tenghao Huang, Christopher D. Manning

Published 2026-03-05

Imagine you hire a brilliant, super-fast intern named "AI" to do a complex research project. You tell them: "Go find the latest unemployment numbers, compare them with health statistics from three different government websites, figure out if there's a connection, and write a report for our boss."

In the past, we tested these AI interns on very simple tasks, like "Find the price of a specific shoe on one website" or "Write a sentence about the weather." They were great at those. But in the real world, data science is messy. It's like trying to cook a gourmet meal while the kitchen is on fire, the ingredients are scattered across three different grocery stores, and the recipe keeps changing.

This paper introduces WebDS, a new, super-hard test designed to see if AI can actually handle that messy, real-world cooking.

Here is the breakdown of what they did and what they found, using some everyday analogies:

1. The Problem: The "Video Game" vs. The "Real World"

Think of previous AI tests like video games. In a video game, the rules are fixed, the map doesn't change, and the items are always in the same spot. If you practice enough, you can beat the level.

  • Old Tests: These were like video games. They asked AI to click a button or find a number on a static page.
  • The Real World: The internet is more like a live city. Buildings get renovated, signs change, traffic jams happen, and you have to walk from one neighborhood to another to get the full picture.

The authors realized that current AI agents are like video game champions who get completely lost when dropped into a real city. They can't handle the chaos of real data science.

2. The Solution: The "WebDS" Obstacle Course

The researchers built WebDS (Web Data Science). Instead of a video game, they built a massive obstacle course spanning 29 real websites.

  • The Course: It involves 870 different tasks. Some are easy (like finding a phone number), but most are hard.
  • The Challenge: The AI has to:
    • Navigate: Walk through different websites (like going from a library to a bank to a newsstand).
    • Mix Ingredients: Take a spreadsheet from one site and a news article from another and combine them.
    • Use Tools: Sometimes it has to download a file, run a math calculation, or write a code script.
    • Deliver: Finally, it has to write a summary or a report.

It's like asking the intern to go to the grocery store, the hardware store, and the library, buy specific items, mix them in a blender, and then bake a cake, all without getting distracted by the shiny toys in the aisles.

3. The Results: The "AI vs. Human" Gap

The researchers put the smartest AI models available (like GPT-4o and others) on this test. The results were shocking:

  • The AI Performance: The best AI only got about 13% to 22% of the tasks right.
  • Analogy: Imagine a student taking a final exam and scoring under 25%, a failing grade. They knew how to read the questions, but they couldn't connect the dots to solve the problem.
  • The Human Performance: Real humans (who were given the same tools and time) got about 90% right.
    • Analogy: The humans got an A+. They knew how to adapt when a website was confusing, how to double-check their work, and how to realize when they were going down the wrong path.

The Gap: There is a massive gap of nearly 70 points between the best AI and a human. This tells us that while AI is getting smarter at talking, it is still terrible at doing complex, multi-step work in the real world.

4. Why Did the AI Fail? (The "Failure Modes")

The paper analyzed why the AI failed, and it found some funny and frustrating reasons:

  • The "Hallucinating Librarian": The AI would find the right document but then read the wrong number from it. It's like finding the correct page in a book but reading the paragraph from the page before it.
  • The "Stuck Record": If a button didn't work, the AI would just keep clicking it 50 times, hoping it would work the next time. It didn't know how to say, "Okay, this isn't working, let's try a different door."
  • The "Shortcut Taker": Instead of doing the hard work of downloading a file and analyzing it, the AI would just guess the answer based on a quick Google search, often getting it wrong.
  • The "Lost Tourist": The AI would get confused by similar names (e.g., going to the "Public Transportation" site instead of the "Physical Therapy" site) and wander off to the wrong neighborhood.
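The "Stuck Record" failure has a standard engineering fix: cap the number of retries and fall back to a different action instead of repeating the failing one forever. A minimal sketch (the function and the toy "actions" are illustrative, not from the paper):

```python
def act_with_fallback(actions, max_retries=3):
    """Try each candidate action a bounded number of times.

    `actions` is a list of zero-argument callables; a return value of
    None means "this attempt failed".
    """
    for action in actions:                # the "different doors" to try
        for _ in range(max_retries):      # bounded, never 50 clicks
            result = action()
            if result is not None:        # success: stop immediately
                return result
    return None                           # every door failed

# Toy demo: the first "button" never works; the fallback link does.
broken_button = lambda: None
working_link = lambda: "page loaded"

print(act_with_fallback([broken_button, working_link]))  # page loaded
```

The point is the bound: a fixed retry budget forces the agent to say "this isn't working" and move on, which is exactly what the failed agents in the paper did not do.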

5. Why This Matters

This paper is a wake-up call. For a long time, we thought AI was almost ready to take over our jobs. This test shows that while AI is great at answering simple questions, it is not yet ready to be a reliable data scientist or office worker.

The Takeaway:
We need to stop testing AI on easy "video game" tasks and start testing them on these messy, real-world "obstacle courses." If we want AI to be truly useful in the future, we need to teach it how to navigate the chaos of the real internet, not just the clean, predictable world of video games.

In short: The AI is a brilliant scholar who can recite the dictionary but can't navigate a busy city to buy groceries. WebDS is the map we need to help it learn how to get there.