Generating Expressive and Customizable Evals for Timeseries Data Analysis Agents with AgentFuel

This paper introduces AgentFuel, a framework that lets domain experts generate customizable, expressive evaluation benchmarks for timeseries data analysis agents. It addresses critical gaps in existing tests and reveals performance limitations in current state-of-the-art models.

Aadyaa Maddi, Prakhar Naval, Deepti Mande, Shane Duan, Muckai Girish, Vyas Sekar

Published 2026-03-16

Imagine you have built a super-smart, conversational robot assistant designed to talk to your company's data. You want to ask it things like, "Why did our server slow down yesterday?" or "How many customers added items to their cart but left without buying?"

You expect the robot to be a genius. But in reality, when you ask it these specific, time-sensitive questions, it often gives you nonsense, makes up facts, or just crashes.

This paper introduces AgentFuel, a tool designed to fix this problem. Think of AgentFuel as a "Flight Simulator for Data Robots."

Here is the breakdown of the problem and the solution, using simple analogies:

The Problem: The "Textbook" vs. The "Real World"

Currently, companies test their data robots using standard, generic questions. It's like testing a pilot by having them fly in a calm, empty sky with no wind, no storms, and no other planes.

  • The Result: The pilot (the robot) looks perfect. They get 90%+ on the test.
  • The Reality: When you put that same pilot in a real storm (a real business crisis, like a sudden server crash or a weird user behavior pattern), they freeze up. They don't know how to handle the chaos because they were never trained on it.

The authors found that existing data robots are great at simple math (like "What was the average sales last month?") but terrible at complex stories (like "Did the sales drop after the website crashed, and how long did it take to recover?").

The Solution: AgentFuel (The Flight Simulator)

AgentFuel is a system that helps companies build their own custom "flight simulators" to test their robots before they let them talk to real customers.

Here is how AgentFuel works, step-by-step:

1. Building the "Fake" World (Data Generation)

Instead of using boring, generic data, AgentFuel lets experts create a synthetic world that looks exactly like their real business.

  • The Analogy: Imagine a video game designer creating a level. They don't just make a flat road; they add potholes, sudden rainstorms, and traffic jams.
  • In AgentFuel: You tell the system, "I want a dataset where 500 sensors are working fine, but then suddenly 10 of them start overheating, and the network slows down." AgentFuel generates this fake data automatically, complete with the "disasters" you want to test.
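To make the idea concrete, here is a minimal sketch of what injecting a "disaster" into synthetic sensor data could look like. This is not AgentFuel's actual generator; all function and parameter names here are illustrative assumptions.

```python
import random

def generate_sensor_data(n_sensors=500, n_steps=60, faulty=None, fault_start=30,
                         base_temp=40.0, fault_temp=85.0, seed=0):
    """Generate per-sensor temperature series. Sensors listed in 'faulty'
    start overheating at 'fault_start'; everything else stays normal.
    Returns {sensor_id: [reading_at_t0, reading_at_t1, ...]}."""
    rng = random.Random(seed)  # fixed seed so the 'world' is reproducible
    faulty = set(faulty or [])
    data = {}
    for s in range(n_sensors):
        series = []
        for t in range(n_steps):
            # The key trick: ground truth is known by construction,
            # because we decide exactly when and where the fault happens.
            mean = fault_temp if (s in faulty and t >= fault_start) else base_temp
            series.append(mean + rng.gauss(0, 1.5))
        data[s] = series
    return data

# "500 sensors working fine, but 10 of them start overheating" from the text:
data = generate_sensor_data(faulty=range(10))
```

Because the fault is injected deliberately, the correct answer to any question about it is known exactly, which is what makes automatic grading possible later.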

2. Asking the "Tricky" Questions (Query Generation)

Once the fake world is built, AgentFuel creates questions that are specifically designed to trip up a dumb robot but are easy for a smart one.

  • The Analogy: Instead of asking, "What color is the car?", the simulator asks, "The car turned left, then stopped for 5 minutes, then turned right. How long was it stopped after the left turn?"
  • In AgentFuel: It generates questions about sequences and incidents. "Did the user abandon their cart within 10 minutes of adding three items?" or "How many cells lost connection while the core server was overloaded?"
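A sketch of how a question and its ground-truth answer could be generated together from the synthetic data. The template and function names are hypothetical, not AgentFuel's real API; the point is that the answer is computed from the same data the question is about.

```python
def make_incident_query(data, threshold=70.0, t=45):
    """Build a templated question about the synthetic world, plus the
    ground-truth answer computed directly from that world, so any
    agent's reply can be checked exactly."""
    hot = [s for s, series in data.items() if series[t] > threshold]
    question = (f"How many sensors were reading above "
                f"{threshold} degrees at step {t}?")
    return question, len(hot)

# Tiny hand-built world: sensors 0 and 2 are hot at every step.
data = {0: [80.0] * 50, 1: [40.0] * 50, 2: [75.0] * 50}
question, answer = make_incident_query(data)
```

Swapping in different templates ("Did the user abandon their cart within N minutes of ...?") is the same pattern: the generator fills in the template and derives the answer from the data at the same time.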

3. The Stress Test (Evaluation)

Now, you run your data robot through this custom simulator.

  • The Result: The robot fails. It might say, "I don't know," or give the wrong number.
  • The Value: This is good! You found the bug before the robot went live. You now know exactly where the robot is weak.
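The evaluation loop itself can be very simple once questions come paired with ground truth. The harness below is a minimal sketch under that assumption; the toy agent and tolerance are illustrative, not from the paper.

```python
def evaluate(agent, benchmark, tolerance=1e-6):
    """Run the agent on each (question, ground_truth) pair and report
    per-question pass/fail plus overall accuracy."""
    results = []
    for question, truth in benchmark:
        try:
            got = agent(question)
            ok = abs(float(got) - float(truth)) <= tolerance
        except Exception:
            ok = False  # crashes and non-numeric replies count as failures
        results.append((question, ok))
    accuracy = sum(ok for _, ok in results) / len(results)
    return results, accuracy

# A toy "robot" that only handles one kind of question:
def toy_agent(question):
    if "average" in question:
        return 42.0
    raise ValueError("I don't know")

bench = [("What is the average temperature?", 42.0),
         ("How many sensors overheated?", 10)]
results, accuracy = evaluate(toy_agent, bench)
```

Note that exceptions are scored as failures rather than crashing the run, so "the robot just crashes" shows up in the accuracy number instead of aborting the test.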

Why This Matters: The "Stateful" Gap

The paper highlights a specific type of failure called the "Stateful Gap."

  • Stateless (Easy): "What is the average temperature?" (Just look at the numbers and average them).
  • Stateful (Hard): "What was the temperature after the alarm went off?" (The robot has to remember the alarm happened, find that moment in time, and then look at the temperature).
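The two bullets above differ in code as well as in difficulty. A minimal sketch (data layout and field names are my own, not the paper's): the stateless query is one aggregation, while the stateful one must first locate an event and only then aggregate.

```python
def average_temp(readings):
    """Stateless: aggregate over all values, no memory of events needed."""
    return sum(r["temp"] for r in readings) / len(readings)

def avg_temp_after_alarm(readings):
    """Stateful: first find WHEN the alarm fired, then aggregate only
    the readings that come after that moment."""
    alarm_t = next(r["t"] for r in readings if r.get("alarm"))
    after = [r["temp"] for r in readings if r["t"] > alarm_t]
    return sum(after) / len(after)

readings = [
    {"t": 0, "temp": 40.0},
    {"t": 1, "temp": 41.0, "alarm": True},
    {"t": 2, "temp": 70.0},
    {"t": 3, "temp": 72.0},
]
# average_temp(readings)         -> 55.75 (blends pre- and post-alarm)
# avg_temp_after_alarm(readings) -> 71.0  (only what happened after)
```

The stateful version returns a very different number because it carries the alarm event forward as context, which is exactly the "remembering the story" step that agents tend to fumble.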

Existing robots are like people with very short memories. They can do math, but they can't remember the "story" of what happened before. AgentFuel forces them to practice remembering the story.

The Payoff: Training the Robot

The paper also shows that if you use AgentFuel to test the robot, you can actually teach it to get better.

  • The Analogy: If you keep failing a driving test in the simulator, you can adjust your steering or braking.
  • The Result: The authors used AgentFuel to tweak the robot's instructions (prompts). After just a little bit of "training" using these custom tests, the robot's accuracy jumped by 17%.
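One plausible shape for that prompt-tweaking loop, sketched below. This is an assumption about the workflow, not the authors' actual procedure: score each candidate prompt against the custom benchmark and keep the best one.

```python
def tune_prompt(agent_factory, prompt_variants, benchmark):
    """Score each candidate system prompt on the benchmark and keep the
    winner. 'agent_factory(prompt)' builds a callable agent; all names
    here are illustrative."""
    def score(agent):
        correct = sum(1 for q, truth in benchmark if agent(q) == truth)
        return correct / len(benchmark)
    best = max(prompt_variants, key=lambda p: score(agent_factory(p)))
    return best, score(agent_factory(best))

# Toy demonstration: this fake agent only answers correctly when its
# prompt reminds it to track state.
def factory(prompt):
    return lambda q: 10 if "state" in prompt else 0

bench = [("How many sensors overheated?", 10)]
best, accuracy = tune_prompt(factory, ["be concise", "track state over time"], bench)
```

Because the benchmark is custom-built for your own data and failure modes, improvements found this way target the weaknesses that actually matter for your deployment.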

Summary

AgentFuel is a toolkit that says: "Don't just test your data robot with easy questions. Build a realistic, messy, disaster-filled simulation of your own business, throw your robot in there, and see if it survives. If it fails, fix it before it talks to your customers."

It turns the evaluation of data agents from a simple pop quiz into a rigorous, real-world stress test.
