xaitimesynth: A Python Package for Evaluating Attribution Methods for Time Series with Synthetic Ground Truth

The paper introduces **xaitimesynth**, an open-source Python package that streamlines the evaluation of time series attribution methods by providing a reusable infrastructure for generating synthetic datasets with known ground truth masks and calculating standard localization metrics.

Gregor Baer

Published 2026-03-10
📖 4 min read · ☕ Coffee break read

Imagine you are a detective trying to figure out why a machine learning model made a specific prediction about a stock market trend or a patient's heart rate. The model says, "This patient is at risk," but it doesn't tell you which part of the heart rate data caused that decision.

To solve this, we use AI "explanation" tools (called attribution methods) that highlight the important parts of the data, like using a highlighter pen on a textbook.

The Problem:
The trouble is, how do you know if your highlighter is working correctly? In the real world, we rarely have an answer key that says, "Yes, the doctor was worried about exactly this 5-second spike." Without an answer key, we can't tell if the AI is highlighting the right thing or just guessing.

Researchers usually try to fix this by creating fake (synthetic) data where they do know the answer. They build a fake heart rate signal, plant a "clue" in a specific spot, and see if the AI finds it. But until now, every researcher had to build their own fake data factory from scratch, like every chef in a city milling their own flour. It was messy, inconsistent, and hard to compare results.

The Solution: xaitimesynth
This paper introduces xaitimesynth, a new Python tool that acts like a universal "fake data factory" specifically designed for time-based data (like stock prices, weather, or heartbeats).

Here is how it works, using some simple analogies:

1. The "Signal + Noise" Recipe

Think of a time series (like a heartbeat) as a cup of coffee.

  • The Background Signal (Noise): This is the regular coffee flavor. It's the normal, boring stuff happening all the time (like a steady heartbeat or random market noise).
  • The Feature (The Clue): This is a specific, strong flavor added to the cup, like a shot of espresso, but only for a few seconds. This is the "clue" that tells the AI which class the data belongs to (e.g., "Sick" vs. "Healthy").

xaitimesynth automatically mixes these two together. It creates thousands of cups of coffee where it knows exactly when and where it added the espresso shot. It keeps a secret "answer key" (a ground truth mask) that says, "The clue was between second 40 and second 50."
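The recipe can be sketched in a few lines of plain NumPy, independent of the package itself. The series length, the 40–50 window, and the bump shape below are illustrative choices, not xaitimesynth's actual API or defaults:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100                             # length of the series (100 time steps)
series = rng.normal(0, 0.5, T)      # background "coffee": plain Gaussian noise

# Plant the "espresso shot": a short Gaussian bump between steps 40 and 50.
start, stop = 40, 50
t = np.arange(start, stop)
series[start:stop] += 2.0 * np.exp(-0.5 * ((t - 45) / 2.0) ** 2)

# The secret answer key (ground truth mask): 1 where the clue lives, 0 elsewhere.
mask = np.zeros(T, dtype=int)
mask[start:stop] = 1
```

Because the mask is created at the same moment the feature is planted, the "answer key" is exact by construction rather than estimated after the fact.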

2. The "Fluent" Builder

Instead of writing complex code to build these fake datasets, the tool uses a Lego-like builder.

  • You can say: "For Class A, give me a random walk background and a 'peak' feature."
  • You can say: "For Class B, give me a seasonal wave and a 'pulse' feature."
  • You can even save these recipes in a simple text file (YAML) so you can share them with a colleague, ensuring you are both testing on the exact same "fake world."
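To show what a "fluent" builder looks like in general, here is a toy sketch of the pattern. The class and method names below are invented for illustration and are not xaitimesynth's real API; the key idea is that each call returns the builder itself, so the configuration reads like the sentences above:

```python
import numpy as np

class ToyDatasetBuilder:
    """Toy fluent builder (illustrative only, not the xaitimesynth API)."""

    def __init__(self, length=100, seed=0):
        self.length = length
        self.rng = np.random.default_rng(seed)
        self.classes = {}  # class label -> (background fn, feature fn)

    def add_class(self, label, background, feature):
        self.classes[label] = (background, feature)
        return self  # returning self is what makes the builder "fluent"

    def build(self, n_per_class=5):
        samples, labels = [], []
        for label, (bg, feat) in self.classes.items():
            for _ in range(n_per_class):
                # Mix the background signal with the planted feature.
                samples.append(bg(self.rng, self.length) + feat(self.rng, self.length))
                labels.append(label)
        return np.stack(samples), labels

# "For Class A, give me a random walk background and a 'peak' feature."
random_walk = lambda rng, n: np.cumsum(rng.normal(0, 0.1, n))
peak = lambda rng, n: np.where(np.arange(n) == 50, 3.0, 0.0)

X, y = (ToyDatasetBuilder(length=100)
        .add_class("A", random_walk, peak)
        .build(n_per_class=2))
```

Serializing such recipes to YAML then amounts to storing the class labels and the names and parameters of their background and feature generators.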

3. The "Scorecard"

Once you run your AI explanation tool on this fake data, xaitimesynth acts as a strict teacher grading the test. It compares the AI's "highlighter marks" against the secret answer key.

  • Did the AI highlight the right seconds?
  • Did it highlight too much?
  • Did it miss the clue entirely?

It uses standard localization metrics (such as AUC-PR or Relevance Mass Accuracy) to give the AI a grade, telling you exactly how good the explanation method is at finding the truth.

Why This Matters

Before this tool, if you wanted to test a new AI explanation method, you had to build your own fake data factory, which took weeks and might have been different from your neighbor's factory.

xaitimesynth is like opening a public park where everyone uses the same playground equipment. Now, researchers can stop reinventing the wheel and start comparing their AI tools fairly. It ensures that when we say an AI is "explainable," we actually have proof that it's looking at the right things, not just making lucky guesses.

In short: It's a standardized toolkit that lets us build fake time series with known answers, so we can finally test whether our AI detectives are actually solving the case.