xaitimesynth: A Python Package for Evaluating Attribution Methods for Time Series with Synthetic Ground Truth

The paper introduces **xaitimesynth**, an open-source Python package that streamlines the evaluation of time series attribution methods by providing a reusable infrastructure for generating synthetic datasets with known ground truth masks and calculating standard localization metrics.

Gregor Baer

Published 2026-03-10
📖 4 min read · ☕ Coffee break read

Imagine you are a detective trying to figure out why a machine learning model made a specific prediction about a stock market trend or a patient's heart rate. The model says, "This patient is at risk," but it doesn't tell you which part of the heart rate data caused that decision.

To solve this, we use AI "explanation" tools (called attribution methods) that highlight the important parts of the data, like using a highlighter pen on a textbook.

The Problem:
The trouble is, how do you know if your highlighter is working correctly? In the real world, we rarely have an answer key that says, "Yes, the doctor was worried about exactly this 5-second spike." Without an answer key, we can't tell if the AI is highlighting the right thing or just guessing.

Researchers usually try to fix this by creating fake (synthetic) data where they do know the answer. They build a fake heart rate signal, plant a "clue" in a specific spot, and see if the AI finds it. But until now, every researcher had to build their own fake data factory from scratch, like every chef in a city milling their own flour. It was messy, inconsistent, and hard to compare results.

The Solution: xaitimesynth
This paper introduces xaitimesynth, a new Python tool that acts like a universal "fake data factory" specifically designed for time-based data (like stock prices, weather, or heartbeats).

Here is how it works, using some simple analogies:

1. The "Signal + Noise" Recipe

Think of a time series (like a heartbeat) as a cup of coffee.

  • The Background Signal (Noise): This is the regular coffee flavor. It's the normal, boring stuff happening all the time (like a steady heartbeat or random market noise).
  • The Feature (The Clue): This is a specific, strong flavor added to the cup, like a shot of espresso, but only for a few seconds. This is the "clue" that tells the AI which class the data belongs to (e.g., "Sick" vs. "Healthy").

xaitimesynth automatically mixes these two together. It creates thousands of cups of coffee where it knows exactly when and where it added the espresso shot. It keeps a secret "answer key" (a ground truth mask) that says, "The clue was between second 40 and second 50."
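The recipe can be sketched in a few lines of plain NumPy, independent of the package itself. The series length, the 40–50 window, and the bump shape below are illustrative choices, not xaitimesynth's actual API or defaults:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100                             # length of the series (100 time steps)
series = rng.normal(0, 0.5, T)      # background "coffee": plain Gaussian noise

# Plant the "espresso shot": a short Gaussian bump between steps 40 and 50.
start, stop = 40, 50
t = np.arange(start, stop)
series[start:stop] += 2.0 * np.exp(-0.5 * ((t - 45) / 2.0) ** 2)

# The secret answer key (ground truth mask): 1 where the clue lives, 0 elsewhere.
mask = np.zeros(T, dtype=int)
mask[start:stop] = 1
```

Because the mask is created at the same moment the feature is planted, the "answer key" is exact by construction rather than estimated after the fact.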

2. The "Fluent" Builder

Instead of writing complex code to build these fake datasets, the tool uses a Lego-like builder.

  • You can say: "For Class A, give me a random walk background and a 'peak' feature."
  • You can say: "For Class B, give me a seasonal wave and a 'pulse' feature."
  • You can even save these recipes in a simple text file (YAML) so you can share them with a colleague, ensuring you are both testing on the exact same "fake world."
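To show what a "fluent" builder looks like in general, here is a toy sketch of the pattern. The class and method names below are invented for illustration and are not xaitimesynth's real API; the key idea is that each call returns the builder itself, so the configuration reads like the sentences above:

```python
import numpy as np

class ToyDatasetBuilder:
    """Toy fluent builder (illustrative only, not the xaitimesynth API)."""

    def __init__(self, length=100, seed=0):
        self.length = length
        self.rng = np.random.default_rng(seed)
        self.classes = {}  # class label -> (background fn, feature fn)

    def add_class(self, label, background, feature):
        self.classes[label] = (background, feature)
        return self  # returning self is what makes the builder "fluent"

    def build(self, n_per_class=5):
        samples, labels = [], []
        for label, (bg, feat) in self.classes.items():
            for _ in range(n_per_class):
                # Mix the background signal with the planted feature.
                samples.append(bg(self.rng, self.length) + feat(self.rng, self.length))
                labels.append(label)
        return np.stack(samples), labels

# "For Class A, give me a random walk background and a 'peak' feature."
random_walk = lambda rng, n: np.cumsum(rng.normal(0, 0.1, n))
peak = lambda rng, n: np.where(np.arange(n) == 50, 3.0, 0.0)

X, y = (ToyDatasetBuilder(length=100)
        .add_class("A", random_walk, peak)
        .build(n_per_class=2))
```

Serializing such recipes to YAML then amounts to storing the class labels and the names and parameters of their background and feature generators.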

3. The "Scorecard"

Once you run your AI explanation tool on this fake data, xaitimesynth acts as a strict teacher grading the test. It compares the AI's "highlighter marks" against the secret answer key.

  • Did the AI highlight the right seconds?
  • Did it highlight too much?
  • Did it miss the clue entirely?

It uses standard localization metrics (such as AUC-PR or Relevance Mass Accuracy) to give the AI a grade, telling you exactly how good the explanation method is at finding the truth.

Why This Matters

Before this tool, if you wanted to test a new AI explanation method, you had to build your own fake data factory, which took weeks and might have been different from your neighbor's factory.

xaitimesynth is like opening a public park where everyone uses the same playground equipment. Now, researchers can stop reinventing the wheel and start comparing their AI tools fairly. It ensures that when we say an AI is "explainable," we actually have proof that it's looking at the right things, not just making lucky guesses.

In short: It's a standardized toolkit that lets us build fake time series with known answers, so we can finally test whether our AI detectives are actually solving the case.