Generating Expressive and Customizable Evals for Timeseries Data Analysis Agents with AgentFuel

This paper introduces AgentFuel, a framework that lets domain experts generate customizable, expressive evaluation benchmarks for timeseries data analysis agents. It addresses critical gaps in existing tests and reveals performance limitations in current state-of-the-art models.

Aadyaa Maddi, Prakhar Naval, Deepti Mande, Shane Duan, Muckai Girish, Vyas Sekar

Published 2026-03-16

Imagine you have built a super-smart, conversational robot assistant designed to talk to your company's data. You want to ask it things like, "Why did our server slow down yesterday?" or "How many customers added items to their cart but left without buying?"

You expect the robot to be a genius. But in reality, when you ask it these specific, time-sensitive questions, it often gives you nonsense, makes up facts, or just crashes.

This paper introduces AgentFuel, a tool designed to fix this problem. Think of AgentFuel as a "Flight Simulator for Data Robots."

Here is the breakdown of the problem and the solution, using simple analogies:

The Problem: The "Textbook" vs. The "Real World"

Currently, companies test their data robots using standard, generic questions. It's like testing a pilot by having them fly in a calm, empty sky with no wind, no storms, and no other planes.

  • The Result: The pilot (the robot) looks perfect. They get 90%+ on the test.
  • The Reality: When you put that same pilot in a real storm (a real business crisis, like a sudden server crash or a weird user behavior pattern), they freeze up. They don't know how to handle the chaos because they were never trained on it.

The authors found that existing data robots are great at simple math (like "What was the average sales last month?") but terrible at complex stories (like "Did the sales drop after the website crashed, and how long did it take to recover?").

The Solution: AgentFuel (The Flight Simulator)

AgentFuel is a system that helps companies build their own custom "flight simulators" to test their robots before they let them talk to real customers.

Here is how AgentFuel works, step-by-step:

1. Building the "Fake" World (Data Generation)

Instead of using boring, generic data, AgentFuel lets experts create a synthetic world that looks exactly like their real business.

  • The Analogy: Imagine a video game designer creating a level. They don't just make a flat road; they add potholes, sudden rainstorms, and traffic jams.
  • In AgentFuel: You tell the system, "I want a dataset where 500 sensors are working fine, but then suddenly 10 of them start overheating, and the network slows down." AgentFuel generates this fake data automatically, complete with the "disasters" you want to test.
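To make the idea concrete, here is a minimal sketch of what injecting a "disaster" into synthetic sensor data could look like. This is not AgentFuel's actual generator; all function and parameter names here are illustrative assumptions.

```python
import random

def generate_sensor_data(n_sensors=500, n_steps=60, faulty=None, fault_start=30,
                         base_temp=40.0, fault_temp=85.0, seed=0):
    """Generate per-sensor temperature series. Sensors listed in 'faulty'
    start overheating at 'fault_start'; everything else stays normal.
    Returns {sensor_id: [reading_at_t0, reading_at_t1, ...]}."""
    rng = random.Random(seed)  # fixed seed so the 'world' is reproducible
    faulty = set(faulty or [])
    data = {}
    for s in range(n_sensors):
        series = []
        for t in range(n_steps):
            # The key trick: ground truth is known by construction,
            # because we decide exactly when and where the fault happens.
            mean = fault_temp if (s in faulty and t >= fault_start) else base_temp
            series.append(mean + rng.gauss(0, 1.5))
        data[s] = series
    return data

# "500 sensors working fine, but 10 of them start overheating" from the text:
data = generate_sensor_data(faulty=range(10))
```

Because the fault is injected deliberately, the correct answer to any question about it is known exactly, which is what makes automatic grading possible later.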

2. Asking the "Tricky" Questions (Query Generation)

Once the fake world is built, AgentFuel creates questions that are specifically designed to trip up a dumb robot but are easy for a smart one.

  • The Analogy: Instead of asking, "What color is the car?", the simulator asks, "The car turned left, then stopped for 5 minutes, then turned right. How long was it stopped after the left turn?"
  • In AgentFuel: It generates questions about sequences and incidents. "Did the user abandon their cart within 10 minutes of adding three items?" or "How many cells lost connection while the core server was overloaded?"
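A sketch of how a question and its ground-truth answer could be generated together from the synthetic data. The template and function names are hypothetical, not AgentFuel's real API; the point is that the answer is computed from the same data the question is about.

```python
def make_incident_query(data, threshold=70.0, t=45):
    """Build a templated question about the synthetic world, plus the
    ground-truth answer computed directly from that world, so any
    agent's reply can be checked exactly."""
    hot = [s for s, series in data.items() if series[t] > threshold]
    question = (f"How many sensors were reading above "
                f"{threshold} degrees at step {t}?")
    return question, len(hot)

# Tiny hand-built world: sensors 0 and 2 are hot at every step.
data = {0: [80.0] * 50, 1: [40.0] * 50, 2: [75.0] * 50}
question, answer = make_incident_query(data)
```

Swapping in different templates ("Did the user abandon their cart within N minutes of ...?") is the same pattern: the generator fills in the template and derives the answer from the data at the same time.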

3. The Stress Test (Evaluation)

Now, you run your data robot through this custom simulator.

  • The Result: The robot fails. It might say, "I don't know," or give the wrong number.
  • The Value: This is good! You found the bug before the robot went live. You now know exactly where the robot is weak.
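The evaluation loop itself can be very simple once questions come paired with ground truth. The harness below is a minimal sketch under that assumption; the toy agent and tolerance are illustrative, not from the paper.

```python
def evaluate(agent, benchmark, tolerance=1e-6):
    """Run the agent on each (question, ground_truth) pair and report
    per-question pass/fail plus overall accuracy."""
    results = []
    for question, truth in benchmark:
        try:
            got = agent(question)
            ok = abs(float(got) - float(truth)) <= tolerance
        except Exception:
            ok = False  # crashes and non-numeric replies count as failures
        results.append((question, ok))
    accuracy = sum(ok for _, ok in results) / len(results)
    return results, accuracy

# A toy "robot" that only handles one kind of question:
def toy_agent(question):
    if "average" in question:
        return 42.0
    raise ValueError("I don't know")

bench = [("What is the average temperature?", 42.0),
         ("How many sensors overheated?", 10)]
results, accuracy = evaluate(toy_agent, bench)
```

Note that exceptions are scored as failures rather than crashing the run, so "the robot just crashes" shows up in the accuracy number instead of aborting the test.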

Why This Matters: The "Stateful" Gap

The paper highlights a specific type of failure called the "Stateful Gap."

  • Stateless (Easy): "What is the average temperature?" (Just look at the numbers and average them).
  • Stateful (Hard): "What was the temperature after the alarm went off?" (The robot has to remember the alarm happened, find that moment in time, and then look at the temperature).
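The two bullets above differ in code as well as in difficulty. A minimal sketch (data layout and field names are my own, not the paper's): the stateless query is one aggregation, while the stateful one must first locate an event and only then aggregate.

```python
def average_temp(readings):
    """Stateless: aggregate over all values, no memory of events needed."""
    return sum(r["temp"] for r in readings) / len(readings)

def avg_temp_after_alarm(readings):
    """Stateful: first find WHEN the alarm fired, then aggregate only
    the readings that come after that moment."""
    alarm_t = next(r["t"] for r in readings if r.get("alarm"))
    after = [r["temp"] for r in readings if r["t"] > alarm_t]
    return sum(after) / len(after)

readings = [
    {"t": 0, "temp": 40.0},
    {"t": 1, "temp": 41.0, "alarm": True},
    {"t": 2, "temp": 70.0},
    {"t": 3, "temp": 72.0},
]
# average_temp(readings)         -> 55.75 (blends pre- and post-alarm)
# avg_temp_after_alarm(readings) -> 71.0  (only what happened after)
```

The stateful version returns a very different number because it carries the alarm event forward as context, which is exactly the "remembering the story" step that agents tend to fumble.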

Existing robots are like people with very short memories. They can do math, but they can't remember the "story" of what happened before. AgentFuel forces them to practice remembering the story.

The Payoff: Training the Robot

The paper also shows that if you use AgentFuel to test the robot, you can actually teach it to get better.

  • The Analogy: If you keep failing a driving test in the simulator, you can adjust your steering or braking.
  • The Result: The authors used AgentFuel to tweak the robot's instructions (prompts). After just a little bit of "training" using these custom tests, the robot's accuracy jumped by 17%.
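One plausible shape for that prompt-tweaking loop, sketched below. This is an assumption about the workflow, not the authors' actual procedure: score each candidate prompt against the custom benchmark and keep the best one.

```python
def tune_prompt(agent_factory, prompt_variants, benchmark):
    """Score each candidate system prompt on the benchmark and keep the
    winner. 'agent_factory(prompt)' builds a callable agent; all names
    here are illustrative."""
    def score(agent):
        correct = sum(1 for q, truth in benchmark if agent(q) == truth)
        return correct / len(benchmark)
    best = max(prompt_variants, key=lambda p: score(agent_factory(p)))
    return best, score(agent_factory(best))

# Toy demonstration: this fake agent only answers correctly when its
# prompt reminds it to track state.
def factory(prompt):
    return lambda q: 10 if "state" in prompt else 0

bench = [("How many sensors overheated?", 10)]
best, accuracy = tune_prompt(factory, ["be concise", "track state over time"], bench)
```

Because the benchmark is custom-built for your own data and failure modes, improvements found this way target the weaknesses that actually matter for your deployment.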

Summary

AgentFuel is a toolkit that says: "Don't just test your data robot with easy questions. Build a realistic, messy, disaster-filled simulation of your own business, throw your robot in there, and see if it survives. If it fails, fix it before it talks to your customers."

It turns the evaluation of data agents from a simple pop quiz into a rigorous, real-world stress test.
