MoHETS: Long-term Time Series Forecasting with Mixture-of-Heterogeneous-Experts

MoHETS is an encoder-only Transformer that achieves state-of-the-art long-term multivariate time series forecasting by integrating sparse Mixture-of-Heterogeneous-Experts layers. These layers model diverse temporal dynamics, non-stationary regimes, and multi-scale structures while improving parameter efficiency and generalization across arbitrary forecast horizons.

Evandro S. Ortigossa, Guy Lutsker, Eran Segal

Published 2026-03-16

Imagine you are trying to predict the weather for the next month. You have a massive amount of historical data: temperature, humidity, wind speed, and pressure. But the data is messy. Sometimes the weather follows a smooth, slow trend (like a gradual warming in spring). Sometimes it has sharp, repeating patterns (like rain every afternoon at 3 PM). And sometimes, it's chaotic and unpredictable because of a sudden storm front.

Traditional AI models trying to solve this are like a single, overworked chef trying to cook a complex banquet. They use the same knife, the same stove, and the same recipe for everything. They chop the vegetables, sear the steak, and bake the cake all with the same tool. It works okay, but it's inefficient, and they often miss the subtle nuances of each dish.

MoHETS is a new, smarter kitchen. Instead of one chef, it employs a team of specialized experts, and it has a smart manager who knows exactly which expert to call for each specific task.

Here is how MoHETS works, broken down into simple concepts:

1. The "Mixture of Heterogeneous Experts" (The Specialized Team)

Many modern AI models use a "Mixture of Experts" (MoE), but they usually hire a team where every expert is identical (like a team of 100 chefs who all follow the exact same recipe).

MoHETS changes the game by hiring a diverse team:

  • The "Continuity Chef" (Shared Expert): This expert is always on duty. They are great at seeing the big picture, like the slow, steady trend of a season changing. They use a special tool (a Convolution) that slides over the data to smooth out the long-term flow.
  • The "Rhythm Detectives" (Routed Experts): These are the specialists. When the data has a repeating pattern (like a heartbeat or daily traffic spikes), the manager sends that specific chunk of data to a Fourier Expert. Think of this expert as a musician who listens to the data and instantly recognizes the "beat" or frequency, separating the rhythm from the noise.

The Magic: A smart "Router" (the manager) looks at a piece of data. If it looks like a slow trend, it sends it to the Continuity Chef. If it looks like a repeating pattern, it sends it to the Rhythm Detective. They don't all try to do everything; they only do what they are best at.
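This division of labor can be sketched in a few lines of NumPy. This is a toy stand-in, not the paper's implementation: the real router and experts are learned networks, and the `threshold`, kernel size, and `keep` values here are illustrative assumptions.

```python
import numpy as np

def smoothing_expert(patch, k=5):
    """Shared 'Continuity Chef': a moving-average convolution over the patch."""
    kernel = np.ones(k) / k
    return np.convolve(patch, kernel, mode="same")

def fourier_expert(patch, keep=3):
    """Routed 'Rhythm Detective': keep the strongest frequencies, drop the rest."""
    spec = np.fft.rfft(patch)
    weakest = np.argsort(np.abs(spec))[:-keep]  # all but the `keep` biggest coefficients
    spec[weakest] = 0
    return np.fft.irfft(spec, n=len(patch))

def router_score(patch):
    """Toy router signal: how concentrated the patch's energy is in a few frequencies."""
    spec = np.abs(np.fft.rfft(patch - patch.mean()))
    return spec.max() / (spec.sum() + 1e-9)

def mohets_layer(patch, threshold=0.5):
    out = smoothing_expert(patch)         # the shared expert is always on duty
    if router_score(patch) > threshold:   # strongly periodic -> also call the Fourier expert
        out = out + fourier_expert(patch)
    return out
```

A strongly periodic patch (a sine wave) scores high and gets the Fourier expert; a smooth trend scores low and only passes through the shared smoothing expert.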

2. The "Patch" Strategy (Cutting the Cake)

Instead of looking at the data one second at a time (which is like trying to read a book one letter at a time), MoHETS cuts the timeline into chunks or "patches" (like cutting a cake into slices).

  • This makes the model faster and helps it see local patterns more clearly.
  • It treats each slice as a single "word" in a sentence, making it easier for the AI to understand the story.
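Slicing the timeline into patch-tokens is simple to write down. A minimal sketch, assuming a PatchTST-style setup with overlapping patches; the concrete `patch_len` and `stride` values are illustrative, not MoHETS's published hyperparameters:

```python
import numpy as np

def make_patches(series, patch_len=16, stride=8):
    """Cut a 1-D series into overlapping patches: the 'words' the Transformer reads."""
    n = (len(series) - patch_len) // stride + 1
    return np.stack([series[i * stride : i * stride + patch_len] for i in range(n)])

x = np.arange(96.0)        # a 96-step look-back window
patches = make_patches(x)  # each row is one patch-token
```

A 96-step window becomes 11 tokens of length 16 instead of 96 individual time points, which is why attention over patches is both cheaper and better at seeing local shapes.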

3. The "External Context" (The Weather Report)

Sometimes, the data you are predicting depends on things outside the data itself. For example, electricity usage goes up when it's hot, but why is it hot? Maybe it's a holiday, or maybe a heatwave is coming.

  • MoHETS has a special "Cross-Attention" feature. Imagine the model has a second screen showing a calendar and a weather report.
  • When the model is making a prediction, it glances at this second screen. If it sees "Christmas" on the calendar, it knows to expect different traffic patterns, even if the historical data doesn't explicitly say "Christmas." This helps it handle weird, non-repeating events.
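The "glance at the second screen" is cross-attention: the series patches form the queries, and the calendar/weather embeddings form the keys and values. A single-head sketch with random matrices standing in for learned projection weights (the real model learns these, and its head count and dimensions will differ):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(series_tokens, context_tokens, d=16, seed=0):
    """Series patches (queries) attend over external context tokens (keys/values)."""
    rng = np.random.default_rng(seed)
    Wq = rng.standard_normal((series_tokens.shape[-1], d)) / np.sqrt(d)
    Wk = rng.standard_normal((context_tokens.shape[-1], d)) / np.sqrt(d)
    Wv = rng.standard_normal((context_tokens.shape[-1], d)) / np.sqrt(d)
    Q, K, V = series_tokens @ Wq, context_tokens @ Wk, context_tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))  # how hard each patch "glances" at each context token
    return attn @ V                       # context-enriched patch representations
```

Note the asymmetry: the context tokens are never predicted, only consulted, which is exactly the "second screen" behavior described above.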

4. The "Lightweight Decoder" (The Efficient Delivery)

Once the experts have done their work, the model needs to turn its internal thoughts back into a prediction.

  • Old models used a single heavy machine (a "Linear Head") to do this. Its size is baked in by the forecast length, so it is parameter-hungry and has to be rebuilt and retrained for every new horizon.
  • MoHETS uses a lightweight, flexible conveyor belt (a Convolutional Decoder). It's fast, efficient, and can deliver predictions for any horizon (96 steps, 720 steps, etc.) without needing to be retrained.
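Why a convolution makes the horizon flexible can be shown concretely. In this toy sketch, linear interpolation stands in for a learned upsampling step and a fixed averaging kernel stands in for learned convolution weights; the point is only that the same small kernel serves any output length, whereas a linear head's weight matrix hard-codes one:

```python
import numpy as np

def conv_decoder(latents, horizon, k=5):
    """Horizon-agnostic convolutional head (toy version)."""
    pooled = latents.mean(axis=-1)              # one value per patch-token
    src = np.linspace(0, 1, len(pooled))
    dst = np.linspace(0, 1, horizon)
    stretched = np.interp(dst, src, pooled)     # stretch latent sequence to the horizon
    kernel = np.ones(k) / k                     # shared kernel, reused at every horizon
    return np.convolve(stretched, kernel, mode="same")

z = np.random.default_rng(1).standard_normal((11, 16))  # 11 patch-tokens, dim 16
short = conv_decoder(z, horizon=96)
long = conv_decoder(z, horizon=720)   # same weights, much longer forecast
```

A linear head for a 720-step forecast would need a weight matrix with 720 output columns; ask it for 96 steps instead and you must retrain. Here, only `horizon` changes.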

Why Does This Matter?

In the real world, time series data (like stock prices, energy grids, or hospital patient monitors) is rarely simple. It's a mix of slow trends, fast rhythms, and sudden shocks.

  • Old Models: Tried to force a square peg into a round hole, using the same math for everything.
  • MoHETS: Recognizes that different parts of the data need different tools. It combines the best of signal processing (math for rhythms) with deep learning (AI for patterns).

The Result: In tests, MoHETS was significantly more accurate than the current state-of-the-art models. It reduced errors by about 12% on average. It's like upgrading from a general-purpose Swiss Army knife to a fully stocked, professional workshop where every tool is perfectly suited for the job at hand.

In a nutshell: MoHETS is a time-traveling prediction engine that doesn't just "guess" the future; it hires a team of specialists, cuts the timeline into manageable chunks, checks the calendar for context, and delivers a highly accurate forecast, all while using less computing power than its competitors.
