The Big Problem: The "Frozen Test" Trap
Imagine you are training a student to predict the weather.
- The Old Way (Static Benchmarks): You give the student a textbook full of weather data from 2010 to 2020, then a "final exam" using data from 2021. But the 2021 data has been public for years, so the student has effectively already seen the answers. They memorize them, get a perfect score, and you declare them a genius.
- The Reality: In the real world, the weather changes every day. A hurricane might hit, or a new climate pattern might emerge. If you only tested them once on a "frozen" set of data, you wouldn't know if they could actually handle a new storm next week.
In the world of AI, researchers have been building "Foundation Models" (super-smart AI forecasters) that claim they can predict anything. But most of them are being tested on static benchmarks. These are like the "frozen exam." The AI might have accidentally "cheated" by seeing the test questions during its training, or the test might just be too easy because the world hasn't changed since the test was written.
The Solution: The "Live Stream" Exam
The authors of this paper introduce Impermanent. Think of Impermanent not as a final exam, but as a live, unscripted reality show.
Instead of giving the AI a frozen test, Impermanent puts the AI in a live studio where the data is a constantly flowing river.
- The Setup: The AI has to make a prediction for tomorrow right now.
- The Wait: The AI has to wait. It cannot see tomorrow's data yet.
- The Score: Once tomorrow actually happens, the system checks if the AI was right.
- The Repeat: This happens every single day, week, or month.
This is called a "Live Benchmark." It tests if the AI can keep performing well as the world changes, shifts, and surprises it. It prevents cheating because the AI can't memorize the answers before the test starts.
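The four steps above amount to a rolling, one-step-ahead evaluation loop. Here is a minimal sketch of that idea (the names `live_evaluate`, `stream`, and `naive_last` are illustrative, not the paper's actual code):

```python
from statistics import mean

def live_evaluate(series, forecast):
    """Score a forecaster the 'live benchmark' way: at every step it
    sees only the past, predicts the next value, and is scored once
    that value actually arrives."""
    errors = []
    for t in range(1, len(series)):
        history = series[:t]              # everything up to "today"
        prediction = forecast(history)    # predict "tomorrow" now
        actual = series[t]                # tomorrow arrives...
        errors.append(abs(prediction - actual))  # ...and gets scored
    return mean(errors)                   # mean absolute error over the stream

# Toy daily-activity stream with a sudden spike:
stream = [3, 4, 3, 5, 4, 20, 18, 6, 5, 4]
naive_last = lambda history: history[-1]  # "tomorrow will be like today"
print(live_evaluate(stream, naive_last))  # 37/9, roughly 4.11
```

The key property is that `forecast` never receives `series[t]` before being scored on it — which is exactly the guarantee a frozen test set cannot make.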
The Playground: GitHub Activity
To build this live stream, the authors used GitHub (the website where programmers share code). They watched 400 popular software projects and tracked four things:
- New bugs reported (Issues).
- New code suggestions (Pull Requests).
- Code updates (Pushes).
- New fans (Stargazers).
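The paper's data pipeline isn't reproduced here, but turning raw GitHub event timestamps into forecastable series is conceptually simple: count events per day, keeping explicit zeros for the quiet days. A hypothetical sketch with made-up dates:

```python
from collections import Counter
from datetime import date, timedelta

def daily_counts(event_dates, start, end):
    """Turn raw event timestamps (e.g. issue-opened dates) into a
    gap-free daily count series, with explicit zeros for quiet days."""
    counts = Counter(event_dates)
    days = (end - start).days + 1
    return [counts[start + timedelta(days=i)] for i in range(days)]

# Hypothetical issue-opened dates for one repository:
events = [date(2024, 1, 1), date(2024, 1, 1), date(2024, 1, 3)]
series = daily_counts(events, date(2024, 1, 1), date(2024, 1, 4))
print(series)  # [2, 0, 1, 0]
```

Keeping the zeros matters: the quiet stretches between spikes are part of what makes this data hard to forecast.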
Why GitHub? Because software development is chaotic and unpredictable.
- The Analogy: Imagine a busy coffee shop. Sometimes it's quiet (low activity). Then, a famous celebrity walks in, and suddenly everyone rushes to order (a "spike" or "burst"). Then, the espresso machine breaks (a "structural break").
- GitHub data is full of these spikes, quiet periods, and sudden changes. It's the perfect "stress test" for a forecasting AI. If an AI can predict when a software project will get busy or quiet, it's a truly robust model.
The Results: Who Won the Live Show?
The paper ran a competition between different types of forecasters:
- The "Naive" Guessers: Models that just guess "tomorrow will be like today" or "tomorrow will be zero."
- The "Statistical" Veterans: Old-school math models (like AutoARIMA) that have been around for decades.
- The "Foundation" Giants: The new, massive AI models trained on huge amounts of data.
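To make the "naive" entrants concrete, here is a sketch of both guessers (the statistical and foundation models are far too large to inline, so this only covers the simplest tier). On sparse, mostly-zero count series, "always guess zero" can actually beat "repeat today" — which is why such trivial baselines are worth keeping in the race:

```python
def naive_last(history):
    """'Tomorrow will be like today.'"""
    return history[-1]

def naive_zero(history):
    """'Tomorrow will be zero' (surprisingly strong on sparse counts)."""
    return 0

def mae(series, forecast):
    """Mean absolute error of one-step-ahead forecasts over a stream."""
    errs = [abs(forecast(series[:t]) - series[t]) for t in range(1, len(series))]
    return sum(errs) / len(errs)

# A mostly-quiet issue stream with two small bursts:
quiet = [0, 0, 1, 0, 0, 0, 2, 0, 0]
print(mae(quiet, naive_zero))  # 0.375
print(mae(quiet, naive_last))  # 0.75
```

Here the zero-guesser halves the error of the repeat-today guesser, because every burst makes `naive_last` wrong twice (once missing the spike, once repeating it).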
The Surprise:
In this live, changing environment, the Foundation Models (the big AIs) generally won. They were better at adapting to the sudden spikes and changes in the GitHub data. However, the paper notes that even the winners aren't perfect. Their rankings shift over time. Sometimes a statistical model does better for a while, then an AI model takes over.
This proves that no single model is the "king" forever. In a changing world, you need to keep testing models constantly.
Why Does This Matter?
The authors are saying: "Stop trusting the static test scores."
Just because an AI says it's great at predicting time series based on a 2023 report doesn't mean it will work in 2026. The world is "impermanent" (it changes).
The Takeaway:
- Impermanent is a new tool that acts like a continuous fitness tracker for AI forecasters.
- It doesn't just check if they are strong once; it checks if they can run a marathon while the terrain keeps changing.
- It helps us figure out which AI models are actually ready for the real world, where nothing stays the same.
In a Nutshell
If traditional benchmarks are like taking a driving test in an empty parking lot, Impermanent is like dropping the driver into rush-hour traffic in a city where the roads change every day. It's the only way to know if the driver (the AI) is truly safe and skilled.