Adapting Time Series Foundation Models through Data Mixtures

The paper proposes MixFT, a method that improves fine-tuning of time series foundation models. It uses a Bayesian mixture model to re-partition the fine-tuning datasets into homogeneous sub-domains, trains a specialized module for each sub-domain, and achieves better zero-shot forecasting than existing per-dataset or global fine-tuning approaches.

Thomas L. Lee, Edoardo M. Ponti, Amos Storkey

Published 2026-03-04

Imagine you have a super-smart weather forecaster (let's call him "The Foundation Model"). This guy has read every weather book in the library and can predict the weather for almost any city on Earth just by looking at the sky for a few minutes. He's great at "zero-shot" forecasting, meaning he can guess the weather for a city he's never visited before, just by using his general knowledge.

However, there's a problem. If you ask him to predict the weather for a very specific, weird micro-climate (like a valley that only gets fog at 3 PM on Tuesdays), he might struggle. His general knowledge isn't quite specific enough.

Usually, to fix this, a human expert would say, "Okay, let's give him a crash course on this specific city's weather data." This is called fine-tuning.

The Old Way: The "One-Size-Fits-All" vs. "One-Per-City" Problem

The paper discusses two common ways to do this crash course:

  1. The "Shared" Approach: You feed the forecaster all the data from every city you have, and you try to teach him one single set of rules to handle them all.
    • The Analogy: It's like trying to teach a chef to make Italian, Chinese, and Mexican food all at once by mixing all the ingredients into one giant pot. The result? A muddy, average dish that isn't great at any of them.
  2. The "Per-Dataset" Approach: You train a separate, tiny specialist assistant (a "LoRA module", a small set of extra low-rank weights added to the frozen base model) for each city. When you need a forecast for City A, you use Assistant A. For City B, you use Assistant B.
    • The Analogy: This is better. You have a specialist for Italian, one for Chinese, and one for Mexican. But here's the catch: City A isn't just one thing. Maybe City A has a rainy season (like the UK) and a dry season (like the Sahara). If you train one assistant on the whole city, they get confused. They try to be an expert on both rain and drought simultaneously, and they end up being mediocre at both.
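The contrast between the two setups can be written down in a few lines. This is a hypothetical sketch, not the paper's code: the dataset names and adapter structure are illustrative stand-ins.

```python
# "Shared" setup: one adapter, updated on every dataset at once.
shared_adapter = {"params": "updated on electricity + traffic + weather"}

# "Per-dataset" setup: one small adapter (e.g. a LoRA module) per source
# dataset, selected by the dataset label at inference time.
per_dataset_adapters = {
    "electricity": {"params": "updated on electricity only"},
    "traffic": {"params": "updated on traffic only"},
    "weather": {"params": "updated on weather only"},
}

def pick_adapter(dataset_name):
    # Routing happens purely by source label -- which is the weakness MixFT
    # targets: one label can hide several very different pattern types.
    return per_dataset_adapters[dataset_name]
```

The per-dataset router above never looks at the data itself, only at where it came from; that assumption is exactly what MixFT drops.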

The New Idea: MixFT (The "Smart Sorter")

The authors of this paper, Thomas Lee and his team, realized that the problem isn't the cities (the datasets); it's the types of weather patterns hidden inside them.

They propose a method called MixFT. Here is how it works, using a simple analogy:

Imagine you have a huge box of mixed-up socks.

  • The Old Way (Per-Dataset): You sort the socks by the box they came in. "Box 1 has red socks, Box 2 has blue socks." But wait, Box 1 actually has red socks and some blue socks mixed in because the laundry was messy.
  • The MixFT Way: You ignore the boxes. Instead, you look at the socks themselves and sort them by texture and pattern. You create a pile for "Striped Socks," a pile for "Polka Dots," and a pile for "Solid Colors."

How MixFT does this with data:

  1. The Magic Scanner (Bayesian Mixture Model): MixFT uses a smart mathematical tool to look at the time series data and ask, "What kind of pattern is this?" It doesn't care about the dataset label; it cares about the shape of the data.
  2. Re-Grouping: It splits the data into "sub-domains." For example, it might find that some data looks like "spiky, sudden changes" (like a server crashing) and other data looks like "smooth, slow waves" (like temperature changes).
  3. Specialist Training: It trains a tiny specialist assistant (LoRA) specifically for "Spiky Data" and another for "Smooth Data."
  4. The Forecast: When you want to predict the future, MixFT looks at your new data, asks, "Is this spiky or smooth?" and then picks the right specialist assistant to do the job.
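The four steps above can be sketched end to end. The paper fits a Bayesian mixture model over the series themselves; the sketch below substitutes a crude "spikiness" feature and plain 2-means clustering as a simplified stand-in, so every name and threshold here is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pool mixing two pattern types, regardless of source dataset:
# "smooth" sine-like series and "spiky" impulse-like series.
smooth = [np.sin(np.linspace(0, 4 * np.pi, 64)) + 0.1 * rng.normal(size=64)
          for _ in range(20)]
spiky = [np.where(rng.random(64) < 0.1, 5.0, 0.0) + 0.1 * rng.normal(size=64)
         for _ in range(20)]
pool = smooth + spiky

# Step 1: describe each series by its shape, ignoring its dataset label.
# (Here a single hand-picked feature; MixFT learns this with a mixture model.)
def spikiness(x):
    return np.abs(np.diff(x)).max()

feats = np.array([[spikiness(x)] for x in pool])

# Step 2: re-group the pool into sub-domains (2-means on the feature).
centers = np.array([feats.min(), feats.max()], dtype=float)
for _ in range(10):
    labels = np.argmin(np.abs(feats - centers), axis=1)
    centers = np.array([feats[labels == k].mean() for k in (0, 1)])

# Step 3 would train one specialist adapter (LoRA) per sub-domain.
# Step 4: at forecast time, route a new series to the nearest sub-domain.
def route(x):
    return int(np.argmin(np.abs(spikiness(x) - centers)))

new_series = np.sin(np.linspace(0, 4 * np.pi, 64))
print("sub-domain:", route(new_series))  # a smooth series lands in the smooth cluster
```

The point of the sketch is the routing decision: the specialist is chosen by what the data looks like, not by which file it was loaded from.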

Why is this better?

  • Less Confusion: The "Spiky" assistant doesn't have to worry about "Smooth" data. It can focus entirely on handling spikes. This makes it much more accurate.
  • Better Matching: When a new time series comes in, MixFT can instantly say, "Ah, this looks like the 'Spiky' group," and use the expert who knows exactly how to handle spikes.
  • Flexibility: Even if a single dataset (like a whole city's weather) has both rain and drought, MixFT can split that dataset internally. It realizes, "Okay, the morning data is 'Smooth,' but the afternoon data is 'Spiky.'" It treats them as two different sub-domains.

The Result

In their experiments, MixFT outperformed both baselines: the single shared model trained on everything at once, and the setup with one adapter per dataset.

The Bottom Line:
Instead of organizing your learning materials by "Folder Name" (Dataset), MixFT organizes them by "Topic" (Sub-domain). By training specialists on specific types of patterns rather than specific sources of data, the AI becomes a much sharper, more accurate forecaster for the real world.

It's the difference between hiring a generalist who knows a little about everything, versus hiring a team of specialists who know exactly how to handle every specific situation you throw at them.
