Adapting Time Series Foundation Models through Data Mixtures

The paper proposes MixFT, a method that improves fine-tuning of time series foundation models. It uses a Bayesian mixture model to re-partition the fine-tuning datasets into homogeneous sub-domains, trains a specialized module for each sub-domain, and achieves better zero-shot forecasting than existing per-dataset or global fine-tuning approaches.

Thomas L. Lee, Edoardo M. Ponti, Amos Storkey

Published 2026-03-04

Imagine you have a super-smart weather forecaster (let's call him "The Foundation Model"). This guy has read every weather book in the library and can predict the weather for almost any city on Earth just by looking at the sky for a few minutes. He's great at "zero-shot" forecasting, meaning he can guess the weather for a city he's never visited before, just by using his general knowledge.

However, there's a problem. If you ask him to predict the weather for a very specific, weird micro-climate (like a valley that only gets fog at 3 PM on Tuesdays), he might struggle. His general knowledge isn't quite specific enough.

Usually, to fix this, a human expert would say, "Okay, let's give him a crash course on this specific city's weather data." This is called fine-tuning.

The Old Way: The "One-Size-Fits-All" vs. "One-Per-City" Problem

The paper discusses two common ways to do this crash course:

  1. The "Shared" Approach: You feed the forecaster all the data from every city you have, and you try to teach him one single set of rules to handle them all.
    • The Analogy: It's like trying to teach a chef to make Italian, Chinese, and Mexican food all at once by mixing all the ingredients into one giant pot. The result? A muddy, average dish that isn't great at any of them.
  2. The "Per-Dataset" Approach: You train a separate, tiny specialist assistant (a "LoRA module", a small set of extra low-rank weights added to the frozen base model) for each city. When you need a forecast for City A, you use Assistant A. For City B, you use Assistant B.
    • The Analogy: This is better. You have a specialist for Italian, one for Chinese, and one for Mexican. But here's the catch: City A isn't just one thing. Maybe City A has a rainy season (like the UK) and a dry season (like the Sahara). If you train one assistant on the whole city, they get confused. They try to be an expert on both rain and drought simultaneously, and they end up being mediocre at both.
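The contrast between the two setups can be written down in a few lines. This is a hypothetical sketch, not the paper's code: the dataset names and adapter structure are illustrative stand-ins.

```python
# "Shared" setup: one adapter, updated on every dataset at once.
shared_adapter = {"params": "updated on electricity + traffic + weather"}

# "Per-dataset" setup: one small adapter (e.g. a LoRA module) per source
# dataset, selected by the dataset label at inference time.
per_dataset_adapters = {
    "electricity": {"params": "updated on electricity only"},
    "traffic": {"params": "updated on traffic only"},
    "weather": {"params": "updated on weather only"},
}

def pick_adapter(dataset_name):
    # Routing happens purely by source label -- which is the weakness MixFT
    # targets: one label can hide several very different pattern types.
    return per_dataset_adapters[dataset_name]
```

The per-dataset router above never looks at the data itself, only at where it came from; that assumption is exactly what MixFT drops.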

The New Idea: MixFT (The "Smart Sorter")

The authors of this paper, Thomas Lee and his team, realized that the problem isn't the cities (the datasets); it's the types of weather patterns hidden inside them.

They propose a method called MixFT. Here is how it works, using a simple analogy:

Imagine you have a huge box of mixed-up socks.

  • The Old Way (Per-Dataset): You sort the socks by the box they came in. "Box 1 has red socks, Box 2 has blue socks." But wait, Box 1 actually has red socks and some blue socks mixed in because the laundry was messy.
  • The MixFT Way: You ignore the boxes. Instead, you look at the socks themselves and sort them by texture and pattern. You create a pile for "Striped Socks," a pile for "Polka Dots," and a pile for "Solid Colors."

How MixFT does this with data:

  1. The Magic Scanner (Bayesian Mixture Model): MixFT uses a smart mathematical tool to look at the time series data and ask, "What kind of pattern is this?" It doesn't care about the dataset label; it cares about the shape of the data.
  2. Re-Grouping: It splits the data into "sub-domains." For example, it might find that some data looks like "spiky, sudden changes" (like a server crashing) and other data looks like "smooth, slow waves" (like temperature changes).
  3. Specialist Training: It trains a tiny specialist assistant (LoRA) specifically for "Spiky Data" and another for "Smooth Data."
  4. The Forecast: When you want to predict the future, MixFT looks at your new data, asks, "Is this spiky or smooth?" and then picks the right specialist assistant to do the job.
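The four steps above can be sketched end to end. The paper fits a Bayesian mixture model over the series themselves; the sketch below substitutes a crude "spikiness" feature and plain 2-means clustering as a simplified stand-in, so every name and threshold here is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pool mixing two pattern types, regardless of source dataset:
# "smooth" sine-like series and "spiky" impulse-like series.
smooth = [np.sin(np.linspace(0, 4 * np.pi, 64)) + 0.1 * rng.normal(size=64)
          for _ in range(20)]
spiky = [np.where(rng.random(64) < 0.1, 5.0, 0.0) + 0.1 * rng.normal(size=64)
         for _ in range(20)]
pool = smooth + spiky

# Step 1: describe each series by its shape, ignoring its dataset label.
# (Here a single hand-picked feature; MixFT learns this with a mixture model.)
def spikiness(x):
    return np.abs(np.diff(x)).max()

feats = np.array([[spikiness(x)] for x in pool])

# Step 2: re-group the pool into sub-domains (2-means on the feature).
centers = np.array([feats.min(), feats.max()], dtype=float)
for _ in range(10):
    labels = np.argmin(np.abs(feats - centers), axis=1)
    centers = np.array([feats[labels == k].mean() for k in (0, 1)])

# Step 3 would train one specialist adapter (LoRA) per sub-domain.
# Step 4: at forecast time, route a new series to the nearest sub-domain.
def route(x):
    return int(np.argmin(np.abs(spikiness(x) - centers)))

new_series = np.sin(np.linspace(0, 4 * np.pi, 64))
print("sub-domain:", route(new_series))  # a smooth series lands in the smooth cluster
```

The point of the sketch is the routing decision: the specialist is chosen by what the data looks like, not by which file it was loaded from.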

Why is this better?

  • Less Confusion: The "Spiky" assistant doesn't have to worry about "Smooth" data. It can focus entirely on handling spikes. This makes it much more accurate.
  • Better Matching: When a new time series comes in, MixFT can instantly say, "Ah, this looks like the 'Spiky' group," and use the expert who knows exactly how to handle spikes.
  • Flexibility: Even if a single dataset (like a whole city's weather) has both rain and drought, MixFT can split that dataset internally. It realizes, "Okay, the morning data is 'Smooth,' but the afternoon data is 'Spiky.'" It treats them as two different sub-domains.

The Result

In their experiments, MixFT outperformed both baselines: the single shared model trained on everything at once, and the setup with one adapter per dataset.

The Bottom Line:
Instead of organizing your learning materials by "Folder Name" (Dataset), MixFT organizes them by "Topic" (Sub-domain). By training specialists on specific types of patterns rather than specific sources of data, the AI becomes a much sharper, more accurate forecaster for the real world.

It's the difference between hiring a generalist who knows a little about everything, versus hiring a team of specialists who know exactly how to handle every specific situation you throw at them.
