ms-Mamba: Multi-scale Mamba for Time-Series Forecasting

This paper introduces ms-Mamba, a novel multi-scale architecture that employs Mamba blocks with varying sampling rates to capture temporal information at different scales, achieving state-of-the-art forecasting performance with greater efficiency than existing Transformer and Mamba-based models.

Yusuf Meric Karadag, Ismail Talaz, Ipek Gursel Dino, Sinan Kalkan

Published 2026-03-06

Imagine you are trying to predict the weather for the next week. You look at the temperature, but you realize the data is tricky. Sometimes the temperature changes every hour (a sudden storm), sometimes it shifts over a day (day vs. night), and sometimes it follows a pattern over months (seasons).

If you only look at the data through a single pair of glasses, you might miss the big picture. If your glasses are too zoomed-in, you see every tiny fluctuation but miss the trend. If they are too zoomed-out, you see the season but miss the sudden storm.

This is the problem the paper "ms-Mamba" tries to solve.

The Problem: The "One-Size-Fits-All" Glasses

For a long time, the models computers use to predict the future (time-series forecasting) looked at data at just one speed.

  • The Old Way (RNNs): Like reading a book one word at a time. Good for stories, but slow and forgetful.
  • The Transformer Way: Like reading the whole book at once to see connections. Very smart, but it gets overwhelmed and slow if the book is too long.
  • The Mamba Way: A new, super-fast model that remembers things well and runs quickly. But, like the others, it usually looks at the data at just one single speed.

The authors realized: Real life isn't one speed. A stock market crash happens in seconds, but a housing market trend takes years. A single-speed model is like trying to watch a movie at 1x speed when you need to see both the slow-motion drama and the fast-paced action scenes simultaneously. It has to compromise, and it loses accuracy.

The Solution: The "Multi-Scale Mamba" (ms-Mamba)

The authors built a new model called ms-Mamba. Think of it as giving the computer three different pairs of glasses to wear at the same time.

  1. Glasses A (High Speed): Looks at the data very closely, catching every tiny, rapid change (like a sudden spike in solar power).
  2. Glasses B (Medium Speed): Looks at the data over a few hours or days, catching daily patterns.
  3. Glasses C (Slow Speed): Looks at the data over weeks or months, catching the big, slow trends.

Instead of forcing the computer to choose one speed, ms-Mamba runs three "Mamba" brains in parallel. Each brain looks at the same data but at a different "sampling rate" (a different speed). Then, it combines the insights from all three brains to make a single, super-accurate prediction.
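To make that concrete, here is a minimal sketch of the multi-scale idea in plain Python. This is not the paper's actual code: a real ms-Mamba block would run a learned Mamba state-space model at each scale, while here a simple moving average stands in for each "brain", and the sampling rates and fusion-by-averaging are illustrative choices, not the paper's.

```python
import math

def downsample(series, rate):
    """Keep every `rate`-th point: a coarser 'pair of glasses' on the data."""
    return series[::rate]

def toy_brain(series, window):
    """Stand-in for one Mamba block: a causal moving average over the series."""
    out = []
    for i in range(len(series)):
        lo = max(0, i - window + 1)
        out.append(sum(series[lo:i + 1]) / (i + 1 - lo))
    return out

def ms_forecast(series, rates=(1, 4, 24)):
    """Run one 'brain' per sampling rate in parallel, then fuse each
    scale's latest value into a single next-step prediction."""
    votes = []
    for rate in rates:
        coarse = downsample(series, rate)      # fast / medium / slow glasses
        processed = toy_brain(coarse, window=3)
        votes.append(processed[-1])            # each scale's view of "now"
    return sum(votes) / len(votes)             # naive fusion of the views

# Toy hourly signal: a slow upward trend plus a fast wiggle.
signal = [0.1 * t + math.sin(t) for t in range(100)]
prediction = ms_forecast(signal)
```

The fast brain (rate 1) tracks the wiggle, the slow brain (rate 24) tracks the trend, and the fused prediction reflects both, which no single rate would capture alone.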

How It Works (The Analogy)

Imagine a team of detectives trying to solve a mystery (predicting the future):

  • Detective 1 is a speedster who notices every footstep and whisper (high frequency).
  • Detective 2 is a strategist who notices the daily routine of the suspects (medium frequency).
  • Detective 3 is a historian who notices the suspect's habits over the last decade (low frequency).

In the old models, you only had one detective. If you sent the speedster, you missed the long-term plan. If you sent the historian, you missed the immediate clue.
ms-Mamba sends all three. They talk to each other, combine their notes, and produce a prediction that is smarter than any single detective could be alone.

Why Is This a Big Deal?

The paper tested this new model on 13 different real-world datasets, including:

  • Solar Energy: Predicting how much power the sun will generate (which changes instantly with clouds but also follows the seasons).
  • Traffic: Predicting traffic jams (which happen in rush hour but also follow weekly patterns).
  • Electricity: Predicting power usage.

The Results:

  1. It's More Accurate: ms-Mamba beat the current "champion" models (including the famous S-Mamba and Transformer models) in almost every test. On the Solar Energy dataset, it made significantly fewer mistakes.
  2. It's Cheaper: Even though it uses three "brains" instead of one, it actually uses less computer memory and power than the competitors. It's like getting a Ferrari engine that is also more fuel-efficient.
  3. It's Faster: It predicts the future faster than the heavy, slow models.

The Bottom Line

The real world is messy and unfolds at many speeds at once. The old AI models tried to force everything into a single speed, which led to mistakes. ms-Mamba is like a smart observer who knows how to look at the world through different lenses simultaneously. By doing so, it sees the whole picture clearly, predicts the future better, and does it all without needing a supercomputer to run it.

In short: It's a smarter, faster, and more efficient way for AI to understand the rhythm of time.