The Big Problem: The "Over-Enthusiastic Student"
Imagine you are trying to teach a group of students (a computer model) to predict the weather. You give them data from 10 different weather stations (channels) over the last few days.
Most modern AI models are like over-enthusiastic students. They are so smart and eager to please that they try to memorize everything.
- If it rained heavily on Tuesday because of a freak storm, the student thinks, "Aha! It always rains on Tuesdays!"
- They memorize the noise and the weird outliers instead of learning the actual pattern.
- When you test them on a new day, they fail miserably because they were too busy memorizing the "extreme values" (the weird accidents) rather than understanding the general rules.
In technical terms, this is called overfitting. The paper argues that standard AI models (specifically MLPs, or multi-layer perceptrons) get especially confused when they try to model how different weather stations relate to each other, particularly when the data contains sharp spikes or drops.
The Solution: The "Simplex" Rule
The authors of this paper came up with a clever trick to stop the students from over-memorizing. They introduced a new rule called Simplex-MLP.
The Analogy: The "Budget Constraint"
Imagine you are a chef trying to create a soup. You have 10 different ingredients (the channels).
- Standard Model: You can throw in as much of anything as you want. You might dump 99% of the pot into "Salt" just because it tasted salty once. The soup becomes unbalanced and weird.
- Simplex-MLP: The authors impose a rule on the chef: "You must use exactly 100% of your ingredients, and you cannot use negative amounts."
- If you use 50% Salt, you only have 50% left for everything else.
- This forces the chef to find a balanced, simple recipe that works for the whole group, rather than obsessing over one single ingredient.
By forcing the math to stay within this "Standard Simplex" (a shape where all parts add up to 1), the model is physically prevented from getting obsessed with extreme outliers. It learns the general relationship between the channels instead of memorizing the noise.
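The "budget" idea above can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's actual implementation: here a softmax maps unconstrained scores onto the standard simplex, so the mixing weights are automatically non-negative and sum to 1.

```python
import numpy as np

def softmax(z):
    """Map arbitrary real scores onto the standard simplex."""
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
raw_scores = rng.normal(size=10)       # unconstrained preference for each "ingredient"
weights = softmax(raw_scores)          # the "100% budget": non-negative, sums to 1

channels = rng.normal(size=10)         # one reading per weather station
mixed = weights @ channels             # a balanced blend of all channels

print(weights.min() >= 0)              # True: no negative amounts
print(np.isclose(weights.sum(), 1.0))  # True: the budget is exactly 100%
```

Because every weight lives between 0 and 1, an extreme outlier in one channel can only influence the blend in proportion to that channel's share of the budget. It cannot blow up the output the way an unconstrained weight could.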
The Two-Step Cooking Process (FSMLP)
The paper proposes a full framework called FSMLP (Frequency Simplex MLP). Think of it as a two-step cooking process to make the perfect soup:
Step 1: The "Channel Mixer" (SCWM)
- This step uses our new Simplex Rule. It looks at all the weather stations together and figures out how they influence each other, but it does so carefully, ensuring no single station dominates the prediction. It's like blending the ingredients to get the right flavor balance.
Step 2: The "Time Traveler" (FTM)
- This step looks at the history of the data. But instead of looking at the data second-by-second (which is noisy), it looks at the rhythms and patterns (like the beat of a song).
- The Analogy: Imagine listening to a song.
- Time Domain: Listening to every single note as it happens. Hard to hear the melody if there's static.
- Frequency Domain: Looking at the sheet music to see the repeating patterns and the tempo.
- By analyzing the "rhythm" (frequency) of the weather data, the model can spot long-term patterns (like "it rains every 3 days") much better than just looking at the raw numbers.
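The frequency-domain intuition can be seen with a toy example (assumed for illustration, not taken from the paper): a "rains every 3 days" cycle buried in noise is hard to see in the raw numbers but jumps out as a single spike in the spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
days = np.arange(90)
rhythm = np.sin(2 * np.pi * days / 3)        # a clean 3-day cycle
noisy = rhythm + 0.5 * rng.normal(size=90)   # the same cycle buried in "static"

spectrum = np.abs(np.fft.rfft(noisy))        # strength of each rhythm
freqs = np.fft.rfftfreq(90, d=1.0)           # cycles per day

dominant = freqs[spectrum[1:].argmax() + 1]  # skip the zero-frequency (mean) term
print(f"dominant period: {1 / dominant:.1f} days")  # ≈ 3.0 days
```

This is the essence of the "sheet music" analogy: the Fourier transform turns a noisy day-by-day record into a short list of rhythms, where the repeating pattern is one dominant peak.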
Why Is This Better?
The paper tested this new method against the current "champions" of AI forecasting (like Autoformer, TimesNet, and PatchTST) on seven different real-world datasets (traffic, electricity, weather, etc.).
The Results:
- Less Overfitting: The "Simplex" rule stopped the model from memorizing the weird spikes. It generalized better.
- Faster: Because the math is simpler and more structured, the model runs faster and uses less computer memory.
- More Accurate: It predicted the future better, especially for long-term forecasts (predicting 720 hours into the future).
Summary in One Sentence
FSMLP is a new AI forecasting method that stops models from obsessing over weird data spikes by forcing them to use a "balanced budget" of information (Simplex) and by listening to the "rhythms" of the data (Frequency) instead of just the noise.
It's like teaching a student to look at the big picture and the repeating patterns, rather than memorizing every single mistake they made in the past.