TimeWarp: Evaluating Web Agents by Revisiting the Past

The paper introduces TimeWarp, a benchmark that evaluates web agents across evolving UI versions to expose their vulnerability to design changes. It also proposes TimeTraj, a plan-distillation algorithm that significantly improves agent robustness by training on trajectories collected from multiple web versions.

Md Farhan Ishmam, Kenneth Marino

Published 2026-03-06

Imagine you are teaching a robot butler how to order groceries online. You spend weeks showing it exactly how to click buttons, fill out forms, and find the cheapest cookies on a specific website. The robot becomes a master at that one version of the site.

But then, the website owner decides to redesign the store. They move the search bar from the top to the bottom, change the color of the "Buy" button, and add a pop-up ad that blocks the screen. Suddenly, your robot butler is completely lost. It doesn't know where to click, and it can't find the cookies anymore.

This is the problem the TimeWarp paper sets out to solve.

Here is a simple breakdown of what the researchers did, using some everyday analogies:

1. The Problem: The "Time Travel" Gap

The internet is like a living, breathing city that never stops changing. Websites get new layouts, new buttons, and new features every day.

  • The Old Way: Most researchers train their AI agents (robots) on a "frozen" version of a website. It's like training a driver only on a map of a city from 1990. When they try to drive in the city today, they crash because the roads have changed.
  • The New Reality: The paper asks: If we train a robot today, will it still work tomorrow when the website changes?

2. The Solution: The "Time Machine" Benchmark (TimeWarp)

To test this, the researchers built a special testing ground called TimeWarp.

  • The Analogy: Imagine a video game level that has six different "skins" or eras.
    • Era 1 (1999): The website looks like a plain text document from the early internet.
    • Eras 2-4 (2000s-2010s): The site gets colorful, adds menus, and changes layouts.
    • Eras 5-6 (Today): The site is modern, full of icons, pop-ups, and complex designs.
  • The Mission: They created 1,386 different tasks (like "Find a recipe" or "Buy a $5 cookie") and asked the AI to solve them on all six versions of the website. This simulates the internet evolving over 25 years in a controlled lab.
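
As a rough illustration, this evaluation protocol amounts to a nested loop over eras and tasks. Everything below (the era list, `load_site`, `agent.solve`) is a hypothetical sketch for intuition, not the paper's actual harness:

```python
# Minimal sketch of a cross-era evaluation loop.
# The era years, load_site(), and agent.solve() are hypothetical
# stand-ins, not names from the TimeWarp benchmark itself.

ERAS = [1999, 2004, 2009, 2014, 2019, 2025]  # six "skins" of the same site

def evaluate(agent, tasks, load_site):
    """Run every task on every era of the site; report per-era success rates."""
    results = {}
    for era in ERAS:
        successes = 0
        for task in tasks:
            site = load_site(era)                 # same store, era-specific UI
            successes += agent.solve(task, site)  # 1 on success, 0 on failure
        results[era] = successes / len(tasks)
    return results
```

Because every agent faces the same tasks on every era, any drop in a per-era score isolates the effect of the UI redesign itself.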

3. The Discovery: Robots are Fragile

When they tested their AI agents, they found something surprising:

  • The "One-Size-Fits-All" Failure: If they trained a robot only on the 2025 version of the site, it failed miserably on the 1999 version. It was like teaching someone to drive a Ferrari and then asking them to drive a Model T Ford; the controls are too different.
  • Visual Confusion: Robots that rely on "seeing" the screen (like looking at a screenshot) were the most confused when the design changed. Robots that "read" the code (text) were a bit more robust, but still struggled.

4. The Fix: The "Master Planner" Strategy (TimeTraj & TimeWarp-BC)

The researchers realized that just showing the robot the answer isn't enough. The robot needs to learn how to think, not just what to click.

They introduced a three-step training method:

  1. The Master Planner (Human + AI): Instead of recording every single mouse click a human makes (which is tedious and hard to copy), a human writes a high-level plan.
    • Analogy: Instead of teaching a student every single step of a math problem, you give them the outline of the solution: "First, find the ingredients. Second, check the price. Third, buy the cheapest one."
  2. The Teacher Robot: An advanced AI (the "Teacher") takes that high-level plan and executes it on all six versions of the website automatically.
    • It figures out how to click the button in 1999, how to scroll in 2010, and how to ignore pop-ups in 2025.
  3. The Student Robot: The student robot watches the Teacher. Crucially, it doesn't just watch the clicks; it watches the thinking, the planning, and the memory the Teacher used to solve the problem.
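
The plan-distillation pipeline above can be sketched in a few lines. The interfaces here (`teacher.execute`, `student.train`) are hypothetical stand-ins for the TimeTraj procedure, not the paper's real code:

```python
# Sketch of plan distillation: one human-written plan is grounded by a
# teacher agent on every era of the site, and the student trains on the
# resulting trajectories. All interfaces here are illustrative.

def distill(plan, eras, teacher, student):
    """Distill one high-level plan into era-specific training trajectories."""
    trajectories = []
    for era in eras:
        # The teacher turns each abstract step ("find the search bar")
        # into era-specific actions, recording its reasoning along the way.
        traj = teacher.execute(plan, era)  # e.g. a list of (thought, action)
        trajectories.append(traj)
    student.train(trajectories)            # behavior cloning across all eras
    return trajectories
```

The key design choice is that one human plan amortizes across all eras: the teacher, not the human, pays the cost of re-grounding it whenever the UI changes.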

5. The Result: Super-Resilient Robots

By training the student robot on these "Teacher" trajectories across all six versions of the website, the researchers saw striking gains:

  • Before: Some robots had a 0% success rate on new website versions.
  • After: The same robots jumped to a 27% to 37% success rate.
  • The Magic: The robots learned to be flexible. They learned that "finding a search bar" is the goal, whether that bar is at the top, the bottom, or hidden behind an icon.

Why This Matters

This paper is a wake-up call for AI developers. It shows that if we want AI to be useful in the real world, we can't just train it on a static snapshot. We have to teach it to adapt to change.

The Big Takeaway:
Instead of spending years manually recording how a human clicks through a website every time the design changes, we can now write one smart plan and let an AI figure out how to execute it on any version of the site. It's the difference between having a student memorize a single map and teaching them to navigate with a compass.