DMS2F-HAD: A Dual-branch Mamba-based Spatial-Spectral Fusion Network for Hyperspectral Anomaly Detection

The paper proposes DMS2F-HAD, a dual-branch Mamba-based network that efficiently fuses spatial and spectral features to achieve state-of-the-art accuracy and significantly faster inference speeds for hyperspectral anomaly detection across multiple benchmark datasets.

Aayushma Pant, Lakpa Tamang, Tsz-Kwan Lee, Sunil Aryal

Published 2026-03-12

Imagine you are a security guard looking at a massive, high-resolution photo of a busy city square. Your job is to spot one specific thing that doesn't belong—maybe a bright red balloon in a sea of gray concrete, or a rare bird in a flock of pigeons.

This is exactly what Hyperspectral Anomaly Detection (HAD) tries to do, but with a twist: instead of just seeing colors (Red, Green, Blue), the camera sees hundreds of different "colors" (spectral bands) that reveal the chemical makeup of every object. It's like having X-ray vision that can tell the difference between a plastic roof and a real tree, even if they look the same to your eyes.
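The "twist" can be made concrete with a toy sketch (the spectra below are hypothetical, invented for illustration, not real measurements): two materials that look identical to an RGB camera can differ sharply once you look across hundreds of bands.

```python
import numpy as np

# Toy illustration: two materials that share the same visible (green) peak
# but differ in the near-infrared, where vegetation reflects strongly.
bands = np.linspace(400, 2500, 200)              # wavelengths in nm, 200 bands

green_peak = np.exp(-((bands - 550) / 80) ** 2)  # what an RGB camera mostly sees
plastic_roof = green_peak
real_tree = green_peak + 0.6 * np.exp(-((bands - 1300) / 200) ** 2)

visible = bands <= 700                           # roughly the RGB range
print("max difference in visible bands:", np.abs(real_tree - plastic_roof)[visible].max())
print("max difference across all bands:", np.abs(real_tree - plastic_roof).max())
```

To the eye (the visible slice) the two spectra are essentially identical; across the full range they are clearly different, which is the signal HAD methods exploit.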

The problem? The photo is huge, noisy, and full of distractions. Finding that one "odd" thing is like finding a needle in a haystack, but the haystack is on fire and moving.

Here is how the authors of this paper, DMS2F-HAD, solved this problem using a clever new approach.

The Old Ways: The "Blind" and the "Exhausted"

Before this new method, computers tried to solve this in two main ways, both of which had flaws:

  1. The "Local" Detective (CNNs): These were like a detective who examines only a tiny 3x3-pixel patch of the photo at a time. They are fast, but they miss the big picture: if an anomaly depends on context far away or on a long-range pattern, they miss it.
  2. The "Over-Thinker" (Transformers): These are like a genius detective who reads the entire photo, cross-referencing every single pixel with every other pixel to find patterns. They are very accurate, but so slow and compute-hungry—the work grows with the square of the sequence length—that they can't be used in real time. They get exhausted trying to remember everything.

The New Solution: The "Specialized Duo" (DMS2F-HAD)

The authors created a new system called DMS2F-HAD. Think of it not as one giant brain, but as a highly efficient team of two specialized detectives working together, guided by a smart manager.

1. The Two Branches (The Specialists)

Instead of one model trying to do everything, they split the work:

  • The Spatial Detective (The Eye): This branch looks at the shape and texture of the image. Is that a building? Is that a road? It uses a new technology called Mamba (think of it as a super-fast, efficient reading machine) to scan the image quickly without getting tired.
  • The Spectral Detective (The Chemist): This branch looks at the chemical signature of the light. Does this pixel look like grass or metal? It also uses Mamba to scan the long list of colors (spectral bands) to find chemical oddities.

The Magic of Mamba: Imagine reading a book. The old "Over-Thinker" (Transformer) tries to cross-reference every word with every previous word, which gets slower and slower as the book grows longer. Mamba is like a reader who sweeps through the book in a single pass, keeping a compact running summary of exactly what matters. Its cost grows linearly with length, not quadratically.
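The single-pass idea can be sketched with a minimal linear state-space scan, the model family Mamba builds on. This is a didactic toy, not the paper's (selective, learned) Mamba block: the model carries only a fixed-size state through the sequence, so cost is one update per token and memory per step stays constant.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal linear state-space scan:
        h_t = A @ h_{t-1} + B * x_t,   y_t = C @ h_t
    One pass over the sequence (O(length)), with a fixed-size state h."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                 # single sweep, no all-pairs comparisons
        h = A @ h + B * x_t       # update the compact running summary
        ys.append(C @ h)          # emit an output from the current state
    return np.array(ys)

# Tiny example: a 2-dimensional state that decays by half each step,
# so an input impulse fades gradually through the outputs.
A = 0.5 * np.eye(2)
B = np.ones(2)
C = np.ones(2)
print(ssm_scan([1.0, 0.0, 0.0], A, B, C))
```

Contrast this with self-attention, which compares every position with every other position and therefore does quadratic work on the same sequence.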

2. The Smart Manager (Adaptive Gated Fusion)

Once the two detectives have their notes, they need to combine them.

  • Old Way: Just adding their notes together (like mixing paint). Sometimes the "Chemist" is right, and sometimes the "Eye" is right. Mixing them blindly creates a muddy mess.
  • New Way (Gated Fusion): The system has a Smart Manager (a "gate") that looks at every single pixel and asks: "Right here, do I need to trust the shape more, or the chemical signature more?"
    • In a city with lots of buildings, the manager trusts the Spatial Detective more.
    • In a field of grass, the manager trusts the Spectral Detective more.
    • This dynamic decision-making prevents false alarms.
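The "Smart Manager" can be sketched as a sigmoid gate computed per pixel from both branches' features. This is a generic gated-fusion sketch under assumed shapes, not the paper's exact module; `w` and `b` stand in for learned parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(spatial, spectral, w, b):
    """Per-pixel gate: g near 1 -> trust the spatial branch,
    g near 0 -> trust the spectral branch. w, b are placeholders
    for learned parameters."""
    stacked = np.concatenate([spatial, spectral], axis=-1)  # (H, W, 2C)
    g = sigmoid(stacked @ w + b)                            # (H, W, C), in (0, 1)
    return g * spatial + (1.0 - g) * spectral               # per-pixel weighted mix

# Sanity check: with zero weights the gate is exactly 0.5,
# i.e. a plain average of the two branches.
H, W, C = 2, 2, 3
spatial = np.random.rand(H, W, C)
spectral = np.random.rand(H, W, C)
fused = gated_fusion(spatial, spectral, np.zeros((2 * C, C)), np.zeros(C))
print(np.allclose(fused, (spatial + spectral) / 2))   # True
```

The key design point is that the gate is computed from the features themselves, so the blend changes pixel by pixel instead of being one fixed mixing ratio for the whole image.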

3. The "Reconstruction" Trick (How it finds the anomaly)

The system works by trying to rebuild the image from scratch.

  • It learns what a "normal" background looks like (grass, roads, roofs).
  • It tries to draw the image again based on what it learned.
  • The Catch: Because it only learned "normal" things, when it tries to draw a "weird" object (like a hidden tank or a strange vehicle), it fails. It draws a blurry mess or the wrong color.
  • The computer then compares the Original Photo vs. the Reconstructed Photo. Wherever the difference is huge, BINGO! That's the anomaly.
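The compare-and-flag step boils down to a per-pixel reconstruction error, as in reconstruction-based detectors generally (a minimal sketch with made-up data, not the paper's scoring function):

```python
import numpy as np

def anomaly_map(original, reconstructed):
    """Per-pixel reconstruction error across all spectral bands.
    Background reconstructs well (small error); anomalies do not."""
    return np.linalg.norm(original - reconstructed, axis=-1)

# Toy cube: 4x4 pixels, 5 bands, perfectly reconstructed everywhere
# except one pixel the model "fails to redraw".
original = np.ones((4, 4, 5))
reconstructed = original.copy()
reconstructed[2, 3] = 0.2

errors = anomaly_map(original, reconstructed)
print(np.unravel_index(errors.argmax(), errors.shape))   # (2, 3): the anomaly
```

In practice a threshold (or a ranking-based metric like AUC) is applied to this error map to decide which pixels count as anomalies.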

Why This Matters (The Results)

The paper tested this on 14 different real-world datasets (from deserts to cities). Here is the verdict:

  • Accuracy: It found the "needles" more reliably than nearly every competing method (98.78% average detection accuracy).
  • Speed: It is 4.6 times faster than the previous "Over-Thinker" models.
  • Efficiency: It uses 3.3 times less memory than the previous best Mamba model.

The Bottom Line

Imagine you need to find a specific person in a crowd of 10,000 people, but you can only use a flashlight that runs on a single AA battery.

  • The old methods either used a flashlight that was too weak (missed the person) or a flashlight that drained the battery in 5 seconds (too slow).
  • DMS2F-HAD is like a flashlight that is incredibly bright, scans the whole crowd instantly, and the battery lasts all day.

This makes it perfect for real-world use, like putting this technology on a drone or a satellite to instantly spot fires, illegal dumping, or enemy vehicles without needing a supercomputer to do the math.