It's TIME: Towards the Next Generation of Time Series Forecasting Benchmarks

This paper introduces TIME, a next-generation, task-centric benchmark comprising 50 fresh datasets and 98 forecasting tasks. It addresses critical limitations in existing evaluations by ensuring data integrity, aligning tasks with real-world requirements, and proposing a novel pattern-level perspective to rigorously assess the zero-shot generalization of time series foundation models.

Zhongzheng Qiao, Sheng Pan, Anni Wang, Viktoriya Zhukova, Yong Liu, Xudong Jiang, Qingsong Wen, Mingsheng Long, Ming Jin, Chenghao Liu

Published 2026-03-05

Imagine you are trying to teach a robot how to predict the future. Specifically, you want it to predict things like tomorrow's weather, next month's stock prices, or how much electricity a city will need.

For a long time, scientists have been testing these robots using a set of "practice exams" called benchmarks. But according to this paper, those old exams are broken. They are like using a driving test from 1990 to judge a self-driving car in 2026. The roads are different, the cars are different, and the test doesn't actually tell you if the robot can handle real life.

The authors introduce a new, upgraded system called TIME (Towards the Next Generation of Time Series Forecasting Benchmarks). Here is how they fixed the problem, explained simply:

1. The Problem: The "Old Textbook" Trap

The old tests had four big problems:

  • Recycled Data: They kept using the same old datasets (like old weather logs from 10 years ago). It's like a student memorizing the answers to last year's exam. If the robot just memorized the data, it would get a perfect score but fail in the real world.
  • Dirty Data: Some of the old data was messy, full of errors or missing pieces, like a recipe with torn pages.
  • Fake Scenarios: The tests asked the robot to predict things in ways that don't make sense in real life. For example, asking a robot to predict the stock market for next Saturday, when the market is closed on weekends.
  • Blind Grading: The old tests just gave a single number (like "85% accuracy"). But why did it get an 85%? Did it guess the trend right but miss the spikes? The old tests didn't tell you.

2. The Solution: The "TIME" Benchmark

The authors built a brand-new testing ground called TIME. Think of it as a gym for time-traveling robots.

  • Fresh Ingredients: Instead of old data, they gathered 50 brand-new datasets from real-world sources (like new traffic sensors, fresh energy grids, and recent economic reports). This ensures the robot hasn't seen these specific numbers before, so it can't cheat by memorizing.
  • The Human-in-the-Loop: They didn't just dump the data in. They used a mix of AI and human experts to clean the data, like a chef tasting a soup and removing the bad spices. They made sure the data was clean and the questions asked were realistic.
  • Real-World Rules: They designed the tests to match how humans actually use predictions. If you are predicting electricity, you care about the next 24 hours, not 100 years. They set the rules based on real needs.

3. The Secret Sauce: The "Pattern" Lens

This is the most creative part.

In the old days, they grouped tests by Category (e.g., "All Weather Data" vs. "All Stock Data").
The authors say: "Wait a minute. A stock market crash and a sudden heatwave might look very different on a chart, but they share the same shape or pattern."

So, TIME doesn't just look at the category; it looks at the DNA of the data. They break every time series down into its structural parts:

  • The Trend: Is it going up, down, or flat?
  • The Rhythm: Is there a repeating beat (like seasons or daily cycles)?
  • The Noise: Is it chaotic or smooth?
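The paper doesn't reproduce its exact decomposition recipe in this summary, but the three ingredients above can be sketched with a naive additive decomposition plus the common "strength" scores (variance explained by a component relative to component + noise). Everything here, including the function names and the synthetic series, is illustrative, not the authors' actual pipeline:

```python
import math
import statistics

def decompose(series, period):
    """Naive additive decomposition: moving-average trend,
    per-phase mean seasonality ("rhythm"), residual noise."""
    n = len(series)
    half = period // 2
    # Centered moving average as the trend (windows shrink at the edges).
    trend = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        trend.append(sum(series[lo:hi]) / (hi - lo))
    detrended = [x - t for x, t in zip(series, trend)]
    # Rhythm: the average shape of one repeating cycle.
    seasonal_means = [statistics.mean(detrended[p::period]) for p in range(period)]
    seasonal = [seasonal_means[i % period] for i in range(n)]
    # Noise: whatever trend and rhythm don't explain.
    remainder = [d - s for d, s in zip(detrended, seasonal)]
    return trend, seasonal, remainder

def strength(component, remainder):
    """Score in [0, 1]: how much variance the component explains
    relative to component + remainder."""
    combined = [c + r for c, r in zip(component, remainder)]
    var_combined = statistics.pvariance(combined)
    if var_combined == 0:
        return 0.0
    return max(0.0, 1.0 - statistics.pvariance(remainder) / var_combined)

# A toy series with a clear upward trend and a 12-step rhythm.
series = [0.5 * t + math.sin(2 * math.pi * t / 12) for t in range(120)]
trend, seasonal, remainder = decompose(series, period=12)
print(round(strength(trend, remainder), 3))     # close to 1: strong trend
print(round(strength(seasonal, remainder), 3))  # well above 0: clear rhythm
```

A real benchmark would use a more robust decomposition (e.g. STL), but the idea is the same: reduce every series to a few numbers describing its trend, rhythm, and noise.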

They give every data stream a "Pattern ID card." Then, they test the robots on specific ID cards.

  • Example: "How good is Robot A at predicting things with Strong Trends but Weak Rhythms?"
  • Example: "How good is Robot B at predicting things that are Chaotic?"

This is like testing a chef not just on "Italian Food," but specifically on "Spicy Pasta with a Cream Sauce." It tells you exactly what the robot is good at and where it struggles.
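One way to picture the "Pattern ID card" is as a simple bucketing rule over those strength scores. The thresholds, labels, and example numbers below are invented for illustration; the paper's actual taxonomy may differ:

```python
def pattern_id(trend_strength, seasonal_strength, threshold=0.6):
    """Hypothetical 'ID card': bucket a series by whether its trend
    and rhythm strengths clear a threshold."""
    trend_tag = "strong-trend" if trend_strength >= threshold else "weak-trend"
    rhythm_tag = "strong-rhythm" if seasonal_strength >= threshold else "weak-rhythm"
    return f"{trend_tag}/{rhythm_tag}"

# Made-up (trend strength, rhythm strength) pairs for three series.
features = {
    "electricity":  (0.85, 0.95),
    "stock_index":  (0.70, 0.10),
    "sensor_noise": (0.05, 0.08),
}

# Group the series by their ID card, regardless of domain.
buckets = {}
for name, (ts, ss) in features.items():
    buckets.setdefault(pattern_id(ts, ss), []).append(name)
print(buckets)
```

Note that the grouping ignores the domain entirely: a stock index and a weather series with the same scores would land in the same bucket, which is exactly the paper's point.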

4. The Results: Who Won?

They tested 12 of the smartest time-series robots (called Foundation Models) on this new gym.

  • The Winners: Newer models like Chronos-2 and TimesFM 2.5 generally did the best.
  • The Surprise: The paper found that a robot's ranking changes depending on the pattern. A robot might be the best at predicting smooth trends but terrible at predicting chaotic spikes. The old tests would have just said "Robot A is #1," hiding this weakness. TIME reveals the nuance.
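The ranking flip is easy to demonstrate with toy numbers. The two models and their errors below are entirely made up; they just show how an overall average can crown one model while per-pattern averages crown another:

```python
from statistics import mean

# Hypothetical forecast errors (lower is better), keyed by model
# and by the pattern bucket each task falls into.
errors = {
    "ModelA": {"smooth-trend": [0.10, 0.12], "chaotic": [0.90, 0.95]},
    "ModelB": {"smooth-trend": [0.30, 0.28], "chaotic": [0.40, 0.45]},
}

# Old-style "blind grading": one number per model.
overall = {m: mean(e for tasks in buckets.values() for e in tasks)
           for m, buckets in errors.items()}
best_overall = min(overall, key=overall.get)

# Pattern-level grading: a winner per bucket.
per_pattern = {p: min(errors, key=lambda m: mean(errors[m][p]))
               for p in ["smooth-trend", "chaotic"]}

print(best_overall)   # the single-number winner
print(per_pattern)    # a different winner per pattern
```

Here ModelB wins on the overall average, yet ModelA is clearly better on smooth trends. A single leaderboard number would hide that trade-off; the per-pattern view surfaces it.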

The Big Takeaway

TIME is a better way to judge AI. It stops robots from cheating by memorizing old answers, ensures the tests are clean and realistic, and gives us a detailed report card that tells us exactly what kind of future the robot can predict.

Instead of just asking, "Is this robot smart?" TIME asks, "Is this robot smart at this specific type of problem?" This helps humans choose the right tool for the job, whether they are managing a power grid, a hospital, or a stock portfolio.
