It's TIME: Towards the Next Generation of Time Series Forecasting Benchmarks

This paper introduces TIME, a next-generation, task-centric benchmark comprising 50 fresh datasets and 98 forecasting tasks. It addresses critical limitations in existing evaluations by ensuring data integrity, aligning tasks with real-world requirements, and proposing a novel pattern-level perspective to rigorously assess the zero-shot generalization of time series foundation models.

Zhongzheng Qiao, Sheng Pan, Anni Wang, Viktoriya Zhukova, Yong Liu, Xudong Jiang, Qingsong Wen, Mingsheng Long, Ming Jin, Chenghao Liu

Published 2026-03-05

Imagine you are trying to teach a robot how to predict the future. Specifically, you want it to predict things like tomorrow's weather, next month's stock prices, or how much electricity a city will need.

For a long time, scientists have been testing these robots using a set of "practice exams" called benchmarks. But according to this paper, those old exams are broken. They are like using a driving test from 1990 to judge a self-driving car in 2026. The roads are different, the cars are different, and the test doesn't actually tell you if the robot can handle real life.

The authors introduce a new, upgraded system called TIME (Towards the Next Generation of Time Series Forecasting Benchmarks). Here is how they fixed the problem, explained simply:

1. The Problem: The "Old Textbook" Trap

The old tests had four big problems:

  • Recycled Data: They kept using the same old datasets (like old weather logs from 10 years ago). It's like a student memorizing the answers to last year's exam. If the robot just memorized the data, it would get a perfect score but fail in the real world.
  • Dirty Data: Some of the old data was messy, full of errors or missing pieces, like a recipe with torn pages.
  • Fake Scenarios: The tests asked the robot to predict things in ways that don't make sense in real life. For example, asking a robot to predict the stock market for next Saturday, when the market is closed on weekends.
  • Blind Grading: The old tests just gave a single number (like "85% accuracy"). But why did it get an 85%? Did it guess the trend right but miss the spikes? The old tests didn't tell you.

2. The Solution: The "TIME" Benchmark

The authors built a brand-new testing ground called TIME. Think of it as a gym for time-traveling robots.

  • Fresh Ingredients: Instead of old data, they gathered 50 brand-new datasets from real-world sources (like new traffic sensors, fresh energy grids, and recent economic reports). This ensures the robot hasn't seen these specific numbers before, so it can't cheat by memorizing.
  • The Human-in-the-Loop: They didn't just dump the data in. They used a mix of AI and human experts to clean the data, like a chef tasting a soup and removing the bad spices. They made sure the data was clean and the questions asked were realistic.
  • Real-World Rules: They designed the tests to match how humans actually use predictions. If you are predicting electricity, you care about the next 24 hours, not 100 years. They set the rules based on real needs.

3. The Secret Sauce: The "Pattern" Lens

This is the most creative part.

In the old days, they grouped tests by Category (e.g., "All Weather Data" vs. "All Stock Data").
The authors say: "Wait a minute. A stock market crash and a sudden heatwave might look very different on a chart, but they share the same shape or pattern."

So, TIME doesn't just look at the category; it looks at the DNA of the data. They break every time series down into its structural parts:

  • The Trend: Is it going up, down, or flat?
  • The Rhythm: Is there a repeating beat (like seasons or daily cycles)?
  • The Noise: Is it chaotic or smooth?
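The paper doesn't reproduce its exact decomposition recipe in this summary, but the three ingredients above can be sketched with a naive additive decomposition plus the common "strength" scores (variance explained by a component relative to component + noise). Everything here, including the function names and the synthetic series, is illustrative, not the authors' actual pipeline:

```python
import math
import statistics

def decompose(series, period):
    """Naive additive decomposition: moving-average trend,
    per-phase mean seasonality ("rhythm"), residual noise."""
    n = len(series)
    half = period // 2
    # Centered moving average as the trend (windows shrink at the edges).
    trend = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        trend.append(sum(series[lo:hi]) / (hi - lo))
    detrended = [x - t for x, t in zip(series, trend)]
    # Rhythm: the average shape of one repeating cycle.
    seasonal_means = [statistics.mean(detrended[p::period]) for p in range(period)]
    seasonal = [seasonal_means[i % period] for i in range(n)]
    # Noise: whatever trend and rhythm don't explain.
    remainder = [d - s for d, s in zip(detrended, seasonal)]
    return trend, seasonal, remainder

def strength(component, remainder):
    """Score in [0, 1]: how much variance the component explains
    relative to component + remainder."""
    combined = [c + r for c, r in zip(component, remainder)]
    var_combined = statistics.pvariance(combined)
    if var_combined == 0:
        return 0.0
    return max(0.0, 1.0 - statistics.pvariance(remainder) / var_combined)

# A toy series with a clear upward trend and a 12-step rhythm.
series = [0.5 * t + math.sin(2 * math.pi * t / 12) for t in range(120)]
trend, seasonal, remainder = decompose(series, period=12)
print(round(strength(trend, remainder), 3))     # close to 1: strong trend
print(round(strength(seasonal, remainder), 3))  # well above 0: clear rhythm
```

A real benchmark would use a more robust decomposition (e.g. STL), but the idea is the same: reduce every series to a few numbers describing its trend, rhythm, and noise.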

They give every data stream a "Pattern ID card." Then, they test the robots on specific ID cards.

  • Example: "How good is Robot A at predicting things with Strong Trends but Weak Rhythms?"
  • Example: "How good is Robot B at predicting things that are Chaotic?"

This is like testing a chef not just on "Italian Food," but specifically on "Spicy Pasta with a Cream Sauce." It tells you exactly what the robot is good at and where it struggles.
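One way to picture the "Pattern ID card" is as a simple bucketing rule over those strength scores. The thresholds, labels, and example numbers below are invented for illustration; the paper's actual taxonomy may differ:

```python
def pattern_id(trend_strength, seasonal_strength, threshold=0.6):
    """Hypothetical 'ID card': bucket a series by whether its trend
    and rhythm strengths clear a threshold."""
    trend_tag = "strong-trend" if trend_strength >= threshold else "weak-trend"
    rhythm_tag = "strong-rhythm" if seasonal_strength >= threshold else "weak-rhythm"
    return f"{trend_tag}/{rhythm_tag}"

# Made-up (trend strength, rhythm strength) pairs for three series.
features = {
    "electricity":  (0.85, 0.95),
    "stock_index":  (0.70, 0.10),
    "sensor_noise": (0.05, 0.08),
}

# Group the series by their ID card, regardless of domain.
buckets = {}
for name, (ts, ss) in features.items():
    buckets.setdefault(pattern_id(ts, ss), []).append(name)
print(buckets)
```

Note that the grouping ignores the domain entirely: a stock index and a weather series with the same scores would land in the same bucket, which is exactly the paper's point.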

4. The Results: Who Won?

They tested 12 of the smartest time-series robots (called Foundation Models) on this new gym.

  • The Winners: Newer models like Chronos-2 and TimesFM 2.5 generally did the best.
  • The Surprise: The paper found that a robot's ranking changes depending on the pattern. A robot might be the best at predicting smooth trends but terrible at predicting chaotic spikes. The old tests would have just said "Robot A is #1," hiding this weakness. TIME reveals the nuance.
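The ranking flip is easy to demonstrate with toy numbers. The two models and their errors below are entirely made up; they just show how an overall average can crown one model while per-pattern averages crown another:

```python
from statistics import mean

# Hypothetical forecast errors (lower is better), keyed by model
# and by the pattern bucket each task falls into.
errors = {
    "ModelA": {"smooth-trend": [0.10, 0.12], "chaotic": [0.90, 0.95]},
    "ModelB": {"smooth-trend": [0.30, 0.28], "chaotic": [0.40, 0.45]},
}

# Old-style "blind grading": one number per model.
overall = {m: mean(e for tasks in buckets.values() for e in tasks)
           for m, buckets in errors.items()}
best_overall = min(overall, key=overall.get)

# Pattern-level grading: a winner per bucket.
per_pattern = {p: min(errors, key=lambda m: mean(errors[m][p]))
               for p in ["smooth-trend", "chaotic"]}

print(best_overall)   # the single-number winner
print(per_pattern)    # a different winner per pattern
```

Here ModelB wins on the overall average, yet ModelA is clearly better on smooth trends. A single leaderboard number would hide that trade-off; the per-pattern view surfaces it.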

The Big Takeaway

TIME is a better way to judge AI. It stops robots from cheating by memorizing old answers, ensures the tests are clean and realistic, and gives us a detailed report card that tells us exactly what kind of future the robot can predict.

Instead of just asking, "Is this robot smart?" TIME asks, "Is this robot smart at this specific type of problem?" This helps humans choose the right tool for the job, whether they are managing a power grid, a hospital, or a stock portfolio.
