SpectralMamba-UNet: Frequency-Disentangled State Space Modeling for Texture-Structure Consistent Medical Image Segmentation

Imagine you are trying to paint a highly detailed portrait of a human heart for a doctor. You need to get two things right at the same time:

The Big Picture: The overall shape and size of the heart (the "structure").
The Tiny Details: The thin, jagged edges of the blood vessels and the texture of the muscle (the "texture").

For a long time, computer programs trying to do this (called AI models) had a hard time. If they focused too much on the big picture, the edges became blurry. If they focused too much on the edges, the whole shape looked messy and disconnected.

This paper introduces a new AI model called SpectralMamba-UNet. Think of it as a "super-painter" that has learned a secret trick: it separates the big picture from the tiny details, paints them separately, and then stitches them back together perfectly.

Here is how it works, using simple analogies:

1. The Problem: The "Blurry vs. Messy" Dilemma

Imagine looking at a photo through a foggy window.

Old AI models tried to look at the whole photo at once. They were good at seeing the general shape of the house (low frequency), but the fog made the window panes and brick textures (high frequency) look blurry.
Alternatively, if they tried to focus only on the bricks, they might lose track of where the house actually ends, making the roof look like it's floating in space.

In medical terms, this meant the AI might miss a tumor's edge or confuse one organ for another.

2. The Solution: The "Frequency Disentanglement" Trick

The authors realized that images are made of different "frequencies," just like a song is made of different musical notes.

Low Frequencies: These are the slow, deep bass notes. In an image, this is the smooth, overall shape of an organ (like the roundness of a liver).
High Frequencies: These are the sharp, fast treble notes. In an image, this is the sharp edge of a bone or the fine texture of skin.

The Innovation: Instead of trying to listen to the whole song at once, SpectralMamba-UNet works like an audio crossover that splits the sound into two channels:

Channel A (The Bass): Focuses only on the big shapes.
Channel B (The Treble): Focuses only on the sharp edges.

3. The Three Secret Tools

The paper describes three specific tools (modules) that make this separation work:

A. The "Splitter" (Spectral Decomposition & Modeling - SDM)

Think of this as a kitchen sieve. When you pour a mix of flour and rocks through a sieve, the flour (low frequency) goes one way, and the rocks (high frequency) stay behind.

The AI takes the medical image, runs it through a mathematical "sieve" (called a Discrete Cosine Transform), and separates the smooth shapes from the sharp edges.
It then uses an efficient engine called Mamba (a type of model that is good at tracking long sequences) to analyze the "flour" and the "rocks" separately. This way the big shape is understood and the edges are preserved without the two interfering with each other.

B. The "Volume Knob" (Spectral Channel Reweighting - SCR)

Sometimes, the "flour" is more important, and sometimes the "rocks" are.

Imagine you are mixing a cocktail. You don't always want the same amount of ice and juice.
This tool acts like a smart volume knob. It looks at the separated parts and asks, "Is the edge of this organ important right now? Or is the overall shape more important?" It then turns up the volume on the most critical parts and turns down the noise.

C. The "Master Builder" (Spectral-Guided Fusion - SGF)

Now that the AI has painted the big shape and the sharp edges separately, it needs to put them back together.

If you just tape two pictures together, you might see a seam.
This tool is the master builder who knows exactly how to blend the two layers. It takes the "big shape" info and the "sharp edge" info and fuses them together so smoothly that you can't tell where one ended and the other began. It makes sure the final image looks natural and consistent.

4. Why Does This Matter?

The researchers tested this new "super-painter" on five public datasets: CT scans of abdominal organs, MRI scans of the heart, photographs of eye vessels, scans of the fat around the heart, and scans of brain aneurysms.

The Result: It achieved the best score on most of the reported measures and stayed competitive on the rest.
The Real-World Impact:
- For a heart scan, it can see the thin walls of the heart chambers much more clearly.
- For a brain scan, it can spot tiny, dangerous aneurysms that other AI might miss because they look like noise.
- For eye scans, it can trace the tiny, winding blood vessels without breaking the line.

The Bottom Line

SpectralMamba-UNet is like giving a doctor a pair of glasses that can zoom in on the tiny details without losing the big picture. By teaching the AI to separate "structure" from "texture" and then recombine them intelligently, it creates much more accurate maps of the human body. This helps doctors diagnose diseases faster and plan treatments with greater confidence.

1. Problem Statement

Medical image segmentation requires a delicate balance between modeling global anatomical structures (context) and preserving fine-grained boundary details (texture).

Limitations of CNNs: Traditional Convolutional Neural Networks (e.g., U-Net) have limited receptive fields, which makes it difficult to capture long-range dependencies and global context and often leads to structurally inconsistent predictions when anatomy varies widely.
Limitations of Transformers and SSMs: While Vision Transformers (ViTs) and State Space Models (SSMs like Vision Mamba) excel at long-range dependency modeling, they typically rely on 1D serialization (patch tokenization or flattening). This process disrupts local spatial continuity and weakens the representation of high-frequency information (e.g., organ boundaries and tissue edges).
The Core Issue: Existing methods treat all spatial frequencies uniformly. This "entanglement" forces a trade-off: aggressive global modeling smooths out critical boundary cues, while preserving local details often sacrifices contextual consistency. Furthermore, recent studies indicate that SSMs are particularly vulnerable to losing high-frequency components during long-sequence processing.
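
To make the serialization point concrete, the following is a minimal sketch (not taken from the paper) of what row-major flattening does to neighbouring pixels. The 4x6 grid size and the helper name `sequence_gap` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical 4x6 feature map; each entry stores its position in the
# row-major (raster) 1D sequence that flattening would produce.
H, W = 4, 6
index = np.arange(H * W).reshape(H, W)

def sequence_gap(p, q):
    """Distance in the flattened sequence between two (row, col) pixels."""
    return abs(int(index[p]) - int(index[q]))

horizontal_gap = sequence_gap((1, 2), (1, 3))  # left/right neighbours
vertical_gap = sequence_gap((1, 2), (2, 2))    # up/down neighbours

print(horizontal_gap, vertical_gap)  # 1 6
```

Two pixels that touch vertically end up W positions apart in the sequence, so a boundary running across rows is no longer contiguous for a 1D sequence model.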

2. Methodology: SpectralMamba-UNet

The authors propose SpectralMamba-UNet, a novel frequency-disentangled framework that explicitly separates structural (low-frequency) and textural (high-frequency) information in the spectral domain using a U-shaped encoder-decoder architecture.

Core Components

The framework integrates three key modules:

1. Spectral Decomposition and Modeling (SDM)

Mechanism: Intermediate feature maps in the encoder are projected into the frequency domain using the 2D Discrete Cosine Transform (DCT).
Disentanglement: The spectral coefficients are split into Low-Frequency ( $F_{low}$ ) and High-Frequency ( $F_{high}$ ) components using a fixed binary mask (ratio $\alpha = 0.125$ ).
- Low Frequency: Captures global anatomical layouts.
- High Frequency: Encodes fine-grained variations, edges, and textures.
Modeling: Each frequency band is processed independently by separate Mamba blocks (State Space Models) to model long-range dependencies within that specific band.
Reconstruction: The processed spectral maps are transformed back to the spatial domain via Inverse DCT (IDCT) and fused with the original features via residual addition.
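
The SDM steps above can be sketched roughly as follows for a single feature channel. This is an illustration under stated assumptions, not the authors' implementation: the low band is assumed to be the top-left corner of the DCT coefficient grid with side ratio $\alpha$, the per-band Mamba blocks are replaced by an identity placeholder, and the two bands are assumed to be summed before the inverse transform. The function names are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix D: D @ x is the DCT of x, D.T @ X inverts it."""
    k = np.arange(n)[:, None]   # frequency index
    i = np.arange(n)[None, :]   # sample index
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    D[0, :] = np.sqrt(1.0 / n)
    return D

def split_bands(feature, alpha=0.125):
    """Split one HxW feature map into low- and high-frequency DCT coefficients."""
    H, W = feature.shape
    Dh, Dw = dct_matrix(H), dct_matrix(W)
    coeffs = Dh @ feature @ Dw.T                     # 2D DCT
    mask = np.zeros((H, W))
    # Assumption: the low band is the top-left (alpha*H) x (alpha*W) corner.
    mask[: max(1, int(alpha * H)), : max(1, int(alpha * W))] = 1.0
    return coeffs * mask, coeffs * (1.0 - mask), (Dh, Dw)

def to_spatial(coeffs, bases):
    """2D inverse DCT using the same orthonormal bases."""
    Dh, Dw = bases
    return Dh.T @ coeffs @ Dw

rng = np.random.default_rng(0)
feature = rng.standard_normal((32, 32))   # stand-in for one encoder channel
f_low, f_high, bases = split_bands(feature)

# Placeholder for the two per-band Mamba blocks (identity in this sketch).
f_low_tilde, f_high_tilde = f_low, f_high

# Assumption: the processed bands are summed before the inverse DCT.
enhanced = to_spatial(f_low_tilde + f_high_tilde, bases)
output = feature + enhanced               # residual addition
```

Because the mask is binary and the transform is orthonormal, the two bands partition the coefficients exactly, so with identity processing the inverse transform recovers the input feature map.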

2. Spectral Channel Reweighting (SCR)

Purpose: To adaptively balance the importance of low- and high-frequency components, as their significance varies across different anatomical structures and scales.
Mechanism: The module applies Global Average Pooling (GAP) and Global Max Pooling (GMP) to the enhanced spectral representations ( $\tilde{F}_{low}, \tilde{F}_{high}$ ). These are passed through a shared Multi-Layer Perceptron (MLP) and a sigmoid activation to generate channel-wise weights ( $W_{low}, W_{high}$ ).
Function: These weights encode frequency-specific channel importance and are propagated to the decoder for modulation.
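
A rough sketch of the reweighting step is given below. The summary does not specify the MLP shape or how the two pooled descriptors are combined, so the two-layer ReLU MLP, the reduction ratio `r`, the summation of the two MLP outputs, and the reuse of the same MLP weights for both bands are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def shared_mlp(v, w1, w2):
    """Two-layer MLP with ReLU, applied to a length-C channel descriptor."""
    return w2 @ np.maximum(w1 @ v, 0.0)

def channel_weights(feat, w1, w2):
    """feat: (C, H, W). Returns per-channel weights in (0, 1), shape (C,)."""
    gap = feat.mean(axis=(1, 2))   # global average pooling
    gmp = feat.max(axis=(1, 2))    # global max pooling
    # Assumption: both descriptors pass through the same MLP and are summed.
    return sigmoid(shared_mlp(gap, w1, w2) + shared_mlp(gmp, w1, w2))

rng = np.random.default_rng(0)
C, H, W, r = 8, 16, 16, 4          # r: hypothetical channel reduction ratio
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1

f_low_tilde = rng.standard_normal((C, H, W))    # stand-in for enhanced low band
f_high_tilde = rng.standard_normal((C, H, W))   # stand-in for enhanced high band
w_low = channel_weights(f_low_tilde, w1, w2)
w_high = channel_weights(f_high_tilde, w1, w2)
```

Each band yields one weight per channel, which is what allows the decoder to scale channels differently depending on whether structure or texture dominates.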

3. Spectral-Guided Fusion (SGF)

Purpose: To address the redundancy and lack of spectral awareness in standard skip connections within U-shaped architectures.
Mechanism: During the decoding phase, the upsampled decoder features and encoder skip features are concatenated. The previously learned channel weights ( $W_{low}, W_{high}$ ) are used to apply frequency-conditioned gating (element-wise multiplication) to the concatenated features.
Result: This ensures that the fusion process is aware of spectral characteristics, promoting frequency-consistent integration of multi-scale features.
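
The gating step might look like the sketch below. How the two weight vectors are mapped onto the concatenated channels is not stated in this summary; here $W_{low}$ is assumed to gate the decoder half and $W_{high}$ the skip half, which is one plausible reading and not a confirmed detail of the method. The function name is hypothetical.

```python
import numpy as np

def spectral_guided_fusion(decoder_up, skip, w_low, w_high):
    """decoder_up, skip: (C, H, W); w_low, w_high: (C,). Returns (2C, H, W)."""
    fused = np.concatenate([decoder_up, skip], axis=0)       # channel concat
    # Assumption: W_low gates the decoder channels, W_high the skip channels.
    gate = np.concatenate([w_low, w_high])[:, None, None]    # (2C, 1, 1)
    return fused * gate                                      # element-wise gating

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
decoder_up = rng.standard_normal((C, H, W))
skip = rng.standard_normal((C, H, W))
w_low = rng.uniform(size=C)
w_high = rng.uniform(size=C)
gated = spectral_guided_fusion(decoder_up, skip, w_low, w_high)
```

In a full decoder the gated tensor would then go through the usual convolution block; that part is omitted here.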

3. Key Contributions

Frequency-Disentangled SSM Framework: Presents what the authors describe as the first framework to integrate frequency disentanglement with State Space Modeling (Mamba) specifically for medical image segmentation, enabling separate modeling of global structures and fine boundaries.
Novel Architectural Modules: Proposes a coherent pipeline comprising SDM (for band-wise feature analysis), SCR (for adaptive frequency-aware channel reweighting), and SGF (for frequency-guided decoder fusion).
Generalizability: Demonstrates consistent performance improvements across five diverse medical datasets with varying modalities (CT, MRI, Fundus) and segmentation targets (organs, vessels, lesions).

4. Experimental Results

The model was evaluated on five public benchmarks: Synapse (multi-organ CT), ACDC (cardiac MRI), DRIVE (retinal vessels), EAT (epicardial adipose tissue), and IA (intracranial aneurysm).

Quantitative Performance:
- Synapse (Multi-organ): Achieved the best HD95 (15.31) and a competitive mDSC (81.10%) against strong baselines such as VM-UNet and Swin-Transformer. Notably, it showed a large improvement (+10.89% DSC) on the Pancreas compared to VM-UNet.
- ACDC (Cardiac): Attained the highest mean DSC (92.89%), with superior performance on thin structures like the Myocardium (91.39%).
- DRIVE (Vessels): Achieved the best DSC (83.61%) and lowest HD95 (2.26), indicating superior boundary localization for tubular structures.
- Overall: Matched or outperformed CNN-based (Res-UNet), Transformer-based (TransUNet, Swin), and Mamba-based (VM-UNet) baselines, with the best score on most reported metrics and a competitive mDSC on Synapse.
Qualitative Analysis:
- Visual comparisons show that SpectralMamba-UNet produces sharper boundaries and better topological consistency (e.g., connected retinal vessels) compared to baselines, which often suffer from fragmented structures or smoothed edges.
Ablation Studies:
- Spectral Decomposition (+Freq): Significantly improved boundary metrics (e.g., HD95 on IA dropped from 34.28 to 22.76).
- Spatial Mamba: Enhanced structural continuity.
- SCR & SGF: Further refined performance by adaptively weighting channels and guiding fusion.
- Full Model: The combination of all components yielded the best results, supporting the view that spectral modeling and state space dependency learning are complementary.
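
For readers unfamiliar with the two metrics reported above, here is a minimal 2D sketch of DSC (overlap) and HD95 (boundary distance, lower is better). It is a brute-force illustration and not the evaluation code used in the paper: it assumes non-empty binary masks, 4-connectivity for boundary extraction, and unit pixel spacing, and published toolkits may differ in these details.

```python
import numpy as np

def dice(pred, target):
    """Dice similarity coefficient between two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    denom = pred.sum() + target.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(pred, target).sum() / denom

def boundary_points(mask):
    """Coordinates of foreground pixels with at least one background 4-neighbour."""
    mask = mask.astype(bool)
    p = np.pad(mask, 1, constant_values=False)
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    return np.argwhere(mask & ~interior)

def hd95(pred, target):
    """95th-percentile symmetric Hausdorff distance (assumes non-empty masks)."""
    a, b = boundary_points(pred), boundary_points(target)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise distances
    forward = np.percentile(d.min(axis=1), 95)    # pred boundary -> target boundary
    backward = np.percentile(d.min(axis=0), 95)   # target boundary -> pred boundary
    return max(forward, backward)

target = np.zeros((20, 20), dtype=bool)
target[5:15, 5:15] = True
pred = np.zeros((20, 20), dtype=bool)
pred[5:15, 6:16] = True                           # same square, shifted one pixel

dsc_value = dice(pred, target)
hd95_value = hd95(pred, target)
```

In this toy case a one-pixel shift of a 10x10 square gives a DSC of 0.9 and an HD95 of one pixel, which shows why HD95 is the more direct indicator of boundary localization.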

5. Significance

This work addresses a critical gap in medical image segmentation: the inability of current long-range dependency models to preserve high-frequency details. By explicitly disentangling frequency components and processing them with specialized Mamba blocks, SpectralMamba-UNet achieves a rare balance between global anatomical consistency and local boundary precision.

The approach suggests that integrating frequency-domain analysis with linear-complexity state space models is a highly promising direction for medical imaging, offering a scalable solution that avoids the computational overhead of Transformers while overcoming the locality bias of CNNs. The method's success across diverse modalities (CT, MRI, Fundus) highlights its strong potential for general clinical application.
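
As a rough illustration of the linear-complexity claim, the sketch below runs a plain diagonal linear state space recurrence over a sequence in a single pass, so the cost grows linearly with sequence length. It is deliberately simplified and is not Mamba's selective scan, whose parameters depend on the input; the parameter values are hypothetical.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Diagonal linear SSM: h_t = a * h_{t-1} + b * x_t, y_t = c . h_t.

    x: (L,) input sequence; a, b, c: (N,) per-state parameters.
    One pass over the sequence, O(L * N) work.
    """
    h = np.zeros_like(a)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = a * h + b * x_t      # state update
        y[t] = c @ h             # readout
    return y

a = np.array([0.5, 0.9])         # hypothetical decay rates
b = np.array([1.0, 1.0])
c = np.array([1.0, 0.5])

impulse = np.zeros(6)
impulse[0] = 1.0
response = ssm_scan(impulse, a, b, c)
```

The impulse response decays geometrically at the rates set by `a`, which is the mechanism by which earlier inputs continue to influence later outputs without any pairwise attention computation.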