MoHETS: Long-term Time Series Forecasting with Mixture-of-Heterogeneous-Experts

MoHETS is an encoder-only Transformer that achieves state-of-the-art long-term multivariate time series forecasting by integrating sparse Mixture-of-Heterogeneous-Experts layers. These layers model diverse temporal dynamics, non-stationary regimes, and multi-scale structures while improving parameter efficiency and generalization across arbitrary forecast horizons.

Evandro S. Ortigossa, Guy Lutsker, Eran Segal

Published 2026-03-16

Imagine you are trying to predict the weather for the next month. You have a massive amount of historical data: temperature, humidity, wind speed, and pressure. But the data is messy. Sometimes the weather follows a smooth, slow trend (like a gradual warming in spring). Sometimes it has sharp, repeating patterns (like rain every afternoon at 3 PM). And sometimes, it's chaotic and unpredictable because of a sudden storm front.

Traditional AI models trying to solve this are like a single, overworked chef trying to cook a complex banquet. They use the same knife, the same stove, and the same recipe for everything. They chop the vegetables, sear the steak, and bake the cake all with the same tool. It works okay, but it's inefficient, and they often miss the subtle nuances of each dish.

MoHETS is a new, smarter kitchen. Instead of one chef, it employs a team of specialized experts, and it has a smart manager who knows exactly which expert to call for each specific task.

Here is how MoHETS works, broken down into simple concepts:

1. The "Mixture of Heterogeneous Experts" (The Specialized Team)

Many modern AI models use a "Mixture of Experts" (MoE), but they usually hire a team where every expert is identical (like a team of 100 chefs who all follow the exact same recipe).

MoHETS changes the game by hiring a diverse team:

  • The "Continuity Chef" (Shared Expert): This expert is always on duty. They are great at seeing the big picture, like the slow, steady trend of a season changing. They use a special tool (a Convolution) that slides over the data to smooth out the long-term flow.
  • The "Rhythm Detectives" (Routed Experts): These are the specialists. When the data has a repeating pattern (like a heartbeat or daily traffic spikes), the manager sends that specific chunk of data to a Fourier Expert. Think of this expert as a musician who listens to the data and instantly recognizes the "beat" or frequency, separating the rhythm from the noise.

The Magic: A smart "Router" (the manager) looks at a piece of data. If it looks like a slow trend, it sends it to the Continuity Chef. If it looks like a repeating pattern, it sends it to the Rhythm Detective. They don't all try to do everything; they only do what they are best at.
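This division of labor can be sketched in a few lines of NumPy. This is a toy stand-in, not the paper's implementation: the real router and experts are learned networks, and the `threshold`, kernel size, and `keep` values here are illustrative assumptions.

```python
import numpy as np

def smoothing_expert(patch, k=5):
    """Shared 'Continuity Chef': a moving-average convolution over the patch."""
    kernel = np.ones(k) / k
    return np.convolve(patch, kernel, mode="same")

def fourier_expert(patch, keep=3):
    """Routed 'Rhythm Detective': keep the strongest frequencies, drop the rest."""
    spec = np.fft.rfft(patch)
    weakest = np.argsort(np.abs(spec))[:-keep]  # all but the `keep` biggest coefficients
    spec[weakest] = 0
    return np.fft.irfft(spec, n=len(patch))

def router_score(patch):
    """Toy router signal: how concentrated the patch's energy is in a few frequencies."""
    spec = np.abs(np.fft.rfft(patch - patch.mean()))
    return spec.max() / (spec.sum() + 1e-9)

def mohets_layer(patch, threshold=0.5):
    out = smoothing_expert(patch)         # the shared expert is always on duty
    if router_score(patch) > threshold:   # strongly periodic -> also call the Fourier expert
        out = out + fourier_expert(patch)
    return out
```

A strongly periodic patch (a sine wave) scores high and gets the Fourier expert; a smooth trend scores low and only passes through the shared smoothing expert.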

2. The "Patch" Strategy (Cutting the Cake)

Instead of looking at the data one second at a time (which is like trying to read a book one letter at a time), MoHETS cuts the timeline into chunks or "patches" (like cutting a cake into slices).

  • This makes the model faster and helps it see local patterns more clearly.
  • It treats each slice as a single "word" in a sentence, making it easier for the AI to understand the story.
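Slicing the timeline into patch-tokens is simple to write down. A minimal sketch, assuming a PatchTST-style setup with overlapping patches; the concrete `patch_len` and `stride` values are illustrative, not MoHETS's published hyperparameters:

```python
import numpy as np

def make_patches(series, patch_len=16, stride=8):
    """Cut a 1-D series into overlapping patches: the 'words' the Transformer reads."""
    n = (len(series) - patch_len) // stride + 1
    return np.stack([series[i * stride : i * stride + patch_len] for i in range(n)])

x = np.arange(96.0)        # a 96-step look-back window
patches = make_patches(x)  # each row is one patch-token
```

A 96-step window becomes 11 tokens of length 16 instead of 96 individual time points, which is why attention over patches is both cheaper and better at seeing local shapes.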

3. The "External Context" (The Weather Report)

Sometimes, the data you are predicting depends on things outside the data itself. For example, electricity usage goes up when it's hot, but why is it hot? Maybe it's a holiday, or maybe a heatwave is coming.

  • MoHETS has a special "Cross-Attention" feature. Imagine the model has a second screen showing a calendar and a weather report.
  • When the model is making a prediction, it glances at this second screen. If it sees "Christmas" on the calendar, it knows to expect different traffic patterns, even if the historical data doesn't explicitly say "Christmas." This helps it handle weird, non-repeating events.
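The "glance at the second screen" is cross-attention: the series patches form the queries, and the calendar/weather embeddings form the keys and values. A single-head sketch with random matrices standing in for learned projection weights (the real model learns these, and its head count and dimensions will differ):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(series_tokens, context_tokens, d=16, seed=0):
    """Series patches (queries) attend over external context tokens (keys/values)."""
    rng = np.random.default_rng(seed)
    Wq = rng.standard_normal((series_tokens.shape[-1], d)) / np.sqrt(d)
    Wk = rng.standard_normal((context_tokens.shape[-1], d)) / np.sqrt(d)
    Wv = rng.standard_normal((context_tokens.shape[-1], d)) / np.sqrt(d)
    Q, K, V = series_tokens @ Wq, context_tokens @ Wk, context_tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))  # how hard each patch "glances" at each context token
    return attn @ V                       # context-enriched patch representations
```

Note the asymmetry: the context tokens are never predicted, only consulted, which is exactly the "second screen" behavior described above.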

4. The "Lightweight Decoder" (The Efficient Delivery)

Once the experts have done their work, the model needs to turn its internal thoughts back into a prediction.

  • Old models used a single heavy machine (a "Linear Head") to do this. Its size is baked in by the forecast length, so it is parameter-hungry and has to be rebuilt and retrained for every new horizon.
  • MoHETS uses a lightweight, flexible conveyor belt (a Convolutional Decoder). It's fast, efficient, and can deliver predictions for any horizon (96 steps, 720 steps, etc.) without needing to be retrained.
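Why a convolution makes the horizon flexible can be shown concretely. In this toy sketch, linear interpolation stands in for a learned upsampling step and a fixed averaging kernel stands in for learned convolution weights; the point is only that the same small kernel serves any output length, whereas a linear head's weight matrix hard-codes one:

```python
import numpy as np

def conv_decoder(latents, horizon, k=5):
    """Horizon-agnostic convolutional head (toy version)."""
    pooled = latents.mean(axis=-1)              # one value per patch-token
    src = np.linspace(0, 1, len(pooled))
    dst = np.linspace(0, 1, horizon)
    stretched = np.interp(dst, src, pooled)     # stretch latent sequence to the horizon
    kernel = np.ones(k) / k                     # shared kernel, reused at every horizon
    return np.convolve(stretched, kernel, mode="same")

z = np.random.default_rng(1).standard_normal((11, 16))  # 11 patch-tokens, dim 16
short = conv_decoder(z, horizon=96)
long = conv_decoder(z, horizon=720)   # same weights, much longer forecast
```

A linear head for a 720-step forecast would need a weight matrix with 720 output columns; ask it for 96 steps instead and you must retrain. Here, only `horizon` changes.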

Why Does This Matter?

In the real world, time series data (like stock prices, energy grids, or hospital patient monitors) is rarely simple. It's a mix of slow trends, fast rhythms, and sudden shocks.

  • Old Models: Tried to force a square peg into a round hole, using the same math for everything.
  • MoHETS: Recognizes that different parts of the data need different tools. It combines the best of signal processing (math for rhythms) with deep learning (AI for patterns).

The Result: In tests, MoHETS was significantly more accurate than the current state-of-the-art models. It reduced errors by about 12% on average. It's like upgrading from a general-purpose Swiss Army knife to a fully stocked, professional workshop where every tool is perfectly suited for the job at hand.

In a nutshell: MoHETS is a time-traveling prediction engine that doesn't just "guess" the future; it hires a team of specialists, cuts the timeline into manageable chunks, checks the calendar for context, and delivers a highly accurate forecast, all while using less computing power than its competitors.
