Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling

Imagine you are trying to predict the weather for the next month.

Most current AI models try to do this in one of two ways:

The "One-Step" Walker: They predict tomorrow's weather, then use that prediction to guess the day after, and so on. The problem? If they get tomorrow slightly wrong, that tiny error gets bigger and bigger every day, until by day 30, the prediction is nonsense.
The "Crystal Ball" Gazer: They try to guess the whole month at once. The problem? They often miss the subtle, step-by-step chain reactions that actually drive the weather (like how a breeze today causes a cloud tomorrow).

Timer-S1 is a new, massive AI model from Tsinghua University and ByteDance that solves this by acting like a super-organized, step-by-step storyteller.

Here is the breakdown of how it works, using simple analogies:

1. The Big Brain (The Architecture)

Timer-S1 is huge. It has 8.3 billion parameters (think of these as neurons in a brain), but it's smart enough to only "wake up" about 0.75 billion of them for any single task.

The Analogy: Imagine a massive library with 8.3 million books. Most of the time, you only need to open a few specific books to answer a question. Timer-S1 is a librarian who knows exactly which books to pull off the shelf instantly, making it fast and efficient.

2. The Secret Sauce: "Serial Scaling"

The paper argues that time series (data that changes over time, like stock prices or heart rates) are inherently serial. This means Step 2 depends on Step 1, which depends on Step 0.

The Problem: Old models tried to skip steps or guess the whole future at once, which breaks the chain of logic.
The Timer-S1 Solution: It uses a technique called Serial-Token Prediction (STP).
- The Analogy: Imagine you are building a long tower of blocks.
  - Old models try to glue the whole tower together at once (it falls over) or build one block, then rebuild the whole tower from scratch for the next block (too slow).
  - Timer-S1 builds the tower block by block, but it does it all in one single motion. It looks at the base, calculates the next block, then the next, then the next, all while keeping the foundation of the original data in its mind. It doesn't "roll" the prediction forward (which causes errors); it just extends the story logically in one go.

3. The Training Data (TimeBench)

To learn how to tell these stories, the model needed to read everything.

The Analogy: The researchers created a library called TimeBench containing one trillion time points. That's like reading every single second of every stock market, weather station, and heart monitor on Earth for years.
The Twist: Real-world data is messy. Sometimes it's too high, sometimes too low. To stop the AI from getting biased (like thinking "the sun always rises in the East" and never checking), they used Data Augmentation.
- They would "flip" the data upside down or change the speed (resampling) to teach the model that the pattern matters, not just the specific numbers. It's like teaching a child to recognize a dog whether it's black, white, running, or sleeping.

4. The Two-Stage Training (Pre-training & Post-training)

They didn't just train the model once; they did it in two phases, like a student getting a general degree and then a specialized certification.

Phase 1 (Pre-training): The model reads the whole library (TimeBench) to learn general patterns. It learns how time flows in general.
Phase 2 (Post-training): The model gets a "refresher course" specifically on short-term accuracy.
- Why? Because if you can't predict the next hour correctly, you definitely can't predict the next month. This stage fine-tunes the model to be extra sharp on the immediate future, which helps it stay accurate for the long term.

5. The Results

When they tested Timer-S1 against the world's best models on a giant leaderboard called GIFT-Eval:

It won. It had the lowest error rates for both specific numbers (MASE) and probability ranges (CRPS).
The Analogy: If other models were like a GPS that gets lost after 10 miles, Timer-S1 is like a GPS that can navigate a cross-country road trip without getting confused, even when the road conditions change.

Summary

Timer-S1 is a billion-parameter AI that treats time series forecasting like a serial story. Instead of guessing the whole future at once or stumbling forward step-by-step with accumulating errors, it uses a special "serial" architecture to calculate the entire future in one smooth, logical flow. It learned from a trillion data points and was fine-tuned to be extra careful with the immediate future, making it the most accurate time-series predictor we have today.

1. Problem Statement

Time series forecasting is a critical task across domains like finance, healthcare, and climate science. While recent advances in Time Series Foundation Models (TSFMs) have shown promise, they face a significant scalability bottleneck:

Serial Nature vs. Parallel/Iterative Approaches: Time series forecasting is inherently a serial problem where future predictions depend on all preceding estimations.
- Parallel Forecasting (predicting multiple steps at once) fails to capture the necessary recurrent dependencies and serial computations.
- Autoregressive (Rolling) Forecasting respects the serial nature but incurs high computational overhead and severe error accumulation as the model iteratively rolls over predictions for long horizons.
Data Heterogeneity: Time series data exhibits significant distributional shifts across domains, varying frequencies, and non-stationary dynamics, making it difficult for models to generalize without massive, high-quality pre-training.
Scaling Limitations: Existing TSFMs often struggle to scale effectively to billion-parameter sizes while maintaining inference efficiency and long-term accuracy.

2. Methodology: Serial Scaling

The authors propose Timer-S1, a 8.3B parameter Mixture-of-Experts (MoE) model. The core innovation is the concept of "Serial Scaling," which involves scaling the model across three dimensions: Architecture, Data, and Training Pipeline.

A. Architecture: Serial-Token Prediction (STP)

Timer-S1 introduces a novel architecture designed to perform serial computations efficiently without the overhead of iterative rolling.

TimeMoE Blocks (Main Backbone): The model uses a decoder-only Transformer backbone enhanced with sparse TimeMoE blocks. These blocks use a Mixture-of-Experts (32 experts, 2 activated per token) to handle data heterogeneity and capture diverse temporal patterns.
TimeSTP Blocks (Serial Forecasting): Instead of standard next-token prediction or multi-token prediction, Timer-S1 employs TimeSTP (Time Serial-Token Prediction) blocks.
- Mechanism: Each TimeSTP block takes the output of the previous block and concatenates it with the initial input embeddings. It then iteratively refines the representation to predict the next patch (shifted by one token).
- Serial Computation: To predict a horizon of $H$ , the model passes through $H$ TimeSTP blocks sequentially. This introduces progressive serial computations within a single forward pass, avoiding the need for iterative rolling (which causes error accumulation) while respecting the serial nature of the task.
- Inference Efficiency: Unlike LLMs where auxiliary blocks are discarded after training, TimeSTP blocks are retained during inference, allowing the model to generate multi-step predictions in one go.

B. Data: TimeBench

To support billion-scale training, the authors curated TimeBench, a corpus containing one trillion time points.

Composition: Includes real-world data from finance, IoT, climate, and healthcare, plus synthetic data (canonical signals, KernelSynth).
Augmentation: To mitigate predictive bias (e.g., models favoring specific frequencies or trends), they apply:
- Resampling: Varying temporal resolutions via down-sampling and Fourier interpolation.
- Value-Flipping: Multiplying series by -1 to prevent the model from latching onto persistent directional trends.
Quality Control: Rigorous preprocessing (imputation, outlier removal) and statistical filtering (ADF test, forecastability metrics) ensure high-quality training signals.

C. Training Pipeline: Multi-Stage Strategy

Timer-S1 utilizes a decoupled, multi-stage training approach to optimize different capabilities:

Pre-Training (Dense Supervision): Trained on TimeBench using a standard Serial-Token Prediction (STP) objective with uniform weighting across all horizons. This maximizes sample efficiency and trains both the TimeMoE (context) and TimeSTP (prediction) modules.
Post-Training (Continued Pre-Training):
- Weighted STP (wSTP): A continued pre-training stage focusing on short-term forecasting. It applies a weight decay ( $1/\sqrt{j}$ ) to deeper TimeSTP blocks, prioritizing the learning of shallow blocks (short-term) which are foundational for long-term accuracy.
- Long-Context Extension (LCE): The context window is extended from 2,880 to 11,520 using RoPE (Rotary Position Embedding) scaling, enabling the model to handle longer historical sequences.

3. Key Contributions

Serial-Token Prediction (STP): A generic training objective and architectural block that performs adaptive serial computations. It bridges the gap between the serial nature of forecasting and the efficiency of single-pass inference, eliminating the error accumulation of rolling autoregression.
Billion-Scale MoE Architecture: The successful scaling of a time series foundation model to 8.3B total parameters (0.75B activated per token) using a sparse MoE design, demonstrating that TSFMs can scale similarly to LLMs.
TimeBench Dataset: The creation of a trillion-scale, high-quality, and augmented dataset that addresses data scarcity and bias issues in time series pre-training.
Multi-Stage Training Paradigm: A novel pipeline separating dense pre-training (for general patterns) and weighted post-training (for short-term precision and long-context adaptation), proving that distinct objectives are needed for different forecasting horizons.

4. Results

Timer-S1 was evaluated on the GIFT-Eval leaderboard, a comprehensive benchmark covering 24 datasets and 177 million data points.

State-of-the-Art Performance: Timer-S1 achieved the best scores among pre-trained models:
- MASE (Mean Absolute Scaled Error): 0.693 (Best overall).
- CRPS (Continuous Ranked Probability Score): 0.485 (Best overall).
Long-Term Superiority: The model showed particularly strong gains on medium- and long-term forecasting tasks compared to competitors like Chronos-2, TimesFM-2.5, and Moirai 2.0, validating the effectiveness of the serial computation approach.
Ablation Studies:
- Removing TimeSTP blocks (forcing rolling autoregression) significantly degraded performance.
- Using "Shift-Token" (future input during training) also failed, highlighting the importance of the train-test gap handling in TimeSTP.
- Data augmentation improved robustness against frequency shifts and trend biases.
Scaling Law: Performance improved consistently as the number of TimeMoE and TimeSTP blocks increased, confirming the scaling law for time series foundation models.

5. Significance

Paradigm Shift: Timer-S1 challenges the dominance of purely parallel or iterative rolling approaches in time series forecasting, establishing serial computation as a critical component for scaling.
Efficiency: By generating multi-step predictions in a single forward pass without iterative rolling, Timer-S1 reduces inference latency and computational cost for long horizons.
Foundation for AGI: The model's ability to generalize across diverse domains without task-specific training brings time series forecasting closer to the "train once, apply anywhere" vision required for Agentic AI systems.
Open Research: The release of the model and the TimeBench dataset provides a robust baseline for future research into scaling laws, data curation, and architectural innovations in time series foundation models.

In summary, Timer-S1 demonstrates that by respecting the serial nature of time series through Serial-Token Prediction and scaling via MoE architectures and trillion-scale data, it is possible to achieve superior long-term forecasting performance with high efficiency.