MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention

Imagine you are a detective trying to find tiny, hairline cracks in a massive, aging bridge. These cracks are tricky: some are long and winding like rivers, others are short and jagged like lightning bolts, and they are often hidden against a noisy, textured background (like peeling paint or rough concrete).

To solve this case, you need a team of specialists. If you only have one type of detective, you might miss the clues. This paper introduces MixerCSeg, a new "detective team" designed specifically to find these cracks better and faster than anyone else.

Here is how the team works, broken down into simple concepts:

1. The Problem: The "One-Size-Fits-All" Failure

Previous AI models tried to solve this with just one type of brain:

The Local Detective (CNN): Great at seeing small details (like the texture of the crack), but blind to the big picture. They can't see that a short crack connects to a long one far away.
The Global Detective (Transformer): Great at seeing the whole picture and connecting distant dots, but they are slow, expensive to run, and sometimes miss the tiny, fine details.
The New Kid (Mamba): A fast, efficient detective that scans things in a line. It's good, but it sometimes struggles to see the "whole room" in a single glance because it processes information sequentially (one step at a time).

The Mistake: Previous models just stacked these detectives on top of each other (like putting a local detective, then a global one, then a Mamba one in a line). This is inefficient and doesn't let them talk to each other properly.

2. The Solution: The "TransMixer" (The Coordinated Team)

The authors created a new architecture called MixerCSeg. Instead of stacking the detectives, they created a coordinated team where everyone works simultaneously but focuses on what they do best.

The core of this team is the TransMixer. Imagine the AI looks at an image and splits the "clues" (data) into two piles:

The Global Pile: These clues are sent to the Transformer specialist to figure out the big connections (e.g., "This crack on the left connects to that one on the right").
The Local Pile: These clues are sent to the CNN specialist to zoom in and sharpen the edges of the crack.

The Magic Trick: The system uses a special "Mamba" mechanism to automatically decide which clues belong to the Global pile and which belong to the Local pile. It's like a smart manager who instantly knows, "You, look at the big picture; you, zoom in on this texture." This happens inside a single step, which avoids the cost of running several separate models one after another.

3. The Special Tool: DEGConv (The "Directional Flashlight")

Cracks are weird. They don't just go straight; they branch out, curve, and twist. Standard tools often get confused by these shapes.

The team invented a special tool called DEGConv (Direction-guided Edge Gated Convolution).

The Analogy: Imagine trying to trace a crack with a flashlight. A normal flashlight shines light everywhere. This new flashlight is directional. It knows the crack is going "North-East," so it shines its beam exactly that way to highlight the edge.
How it works: It looks at the direction of the crack at every tiny point and uses that knowledge to "gate" (open or close) the flow of information. If the crack turns, the tool turns with it. This makes the AI incredibly sensitive to the exact shape of the crack without needing a supercomputer to do the math.

4. The Refiner: SRF (The "High-Res Polisher")

When the AI builds its map of the cracks, it starts with a rough, low-resolution sketch. If you just stretch that sketch to make it big, the edges look blurry and jagged.

The SRF module is like a high-definition polish. It takes the rough, low-res sketch and uses the sharp, high-res details from the beginning of the process to "fill in the gaps." It ensures that the final map of the crack is pixel-perfect, with sharp, clean edges, without adding extra weight to the system.

The Results: Fast, Light, and Accurate

The best part? This "super-team" is surprisingly lightweight.

Efficiency: It uses very little computer power (only 2.05 GFLOPs and 2.54M parameters). To put that in perspective, it needs almost 89% fewer operations than the next-best model, which makes running it on small, low-power devices realistic instead of requiring a powerful server.
Performance: Despite being small, it outperforms the previous "State-of-the-Art" models on all four benchmark datasets it was tested on. It finds more cracks, draws them more accurately, and handles messy backgrounds better.

Summary

MixerCSeg is like a highly efficient construction crew. Instead of hiring one giant crane (Transformer) that's slow, or a thousand tiny hand-tools (CNN) that miss the big picture, they hired a specialized team that communicates instantly. They use a smart manager (Mamba) to split the work, a directional flashlight (DEGConv) to trace the tricky shapes, and a high-def polisher (SRF) to make the final result perfect.

The result? A system that can spot dangerous cracks in roads and bridges faster, cheaper, and more accurately than ever before.

1. Problem Statement

Road crack segmentation is critical for infrastructure maintenance but remains a significant challenge due to:

Morphological Diversity: Cracks vary widely in shape, width, and continuity.
Low Contrast: Cracks often blend into complex backgrounds with uneven textures.
Limitations of Existing Architectures:
- CNNs: Efficient at extracting local features but struggle with long-range dependencies due to limited receptive fields.
- Transformers: Excellent at modeling global dependencies via self-attention but suffer from high computational complexity ( $O(N^2)$ ) and reduced inference efficiency.
- Mamba (State-Space Models): Offers linear computational complexity ( $O(N)$ ) and long-range modeling, but its progressive scanning mechanism can limit how much global context is captured in a single forward pass, and it lacks the explicit local-texture modeling of CNNs.
Hybrid Model Flaws: Existing hybrid approaches often simply stack different architectures (e.g., Mamba + Transformer) sequentially or in parallel without deeply analyzing their internal interaction logic, failing to fully leverage their complementary strengths.
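
To make the gap between the $O(N^2)$ and $O(N)$ costs listed above concrete, the following minimal sketch counts token pairs for full self-attention against steps for a single linear scan. The image size and patch sizes are illustrative assumptions, not values from the paper, and the function names are hypothetical.

```python
def tokens_for_image(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens for an image."""
    return (height // patch) * (width // patch)


def attention_pairs(num_tokens: int) -> int:
    """Token pairs scored by full self-attention: O(N^2)."""
    return num_tokens * num_tokens


def scan_steps(num_tokens: int) -> int:
    """Recurrent updates made by one linear scan: O(N)."""
    return num_tokens


# Illustrative sizes only (assumed, not taken from the paper).
for patch in (16, 8, 4):
    n = tokens_for_image(512, 512, patch)
    print(f"patch={patch:2d} tokens={n:6d} "
          f"attention_pairs={attention_pairs(n):12d} scan_steps={scan_steps(n):6d}")
```

Halving the patch size quadruples the token count, which multiplies the attention pair count by 16 but the scan length only by 4. This is the scaling pressure that motivates restricting self-attention to a subset of the features.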

2. Methodology

The authors propose MixerCSeg, a lightweight hybrid architecture designed to act as a "coordinated team of specialists." The model consists of three core components:

A. TransMixer (The Core Encoder)

Instead of stacking modules, TransMixer decouples the feature representation within a single Mamba block based on the inherent attention behavior of Mamba.
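
For context, the "attention behavior" of Mamba can be read off the standard selective state-space recurrence. The notation below follows the usual Mamba formulation and is an assumption here, since this summary does not reproduce the paper's own equations; $\Delta_t$ denotes the input-dependent step size that the summary writes as $\Delta t$.

$$
h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t, \qquad y_t = C_t h_t, \qquad \bar{A}_t = \exp(\Delta_t A)
$$

Unrolling the recurrence expresses each output as a weighted sum over earlier inputs:

$$
y_t = \sum_{s \le t} C_t \Big( \prod_{k=s+1}^{t} \bar{A}_k \Big) \bar{B}_s x_s
$$

The coefficient multiplying $x_s$ plays the role of an attention weight from position $t$ to position $s$, and the product of $\bar{A}_k$ terms shows that the step sizes $\Delta_k$ control how quickly the influence of earlier tokens decays.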

Mechanism: It analyzes the hidden state transition factor ( $\Delta t$ ) in the Mamba block. Tokens are sorted along the channel dimension and split into:
- Global Tokens ( $d_g$ ): Processed via Self-Attention to explicitly model long-range dependencies.
- Local Tokens ( $d_l$ ): Processed via a Local Refinement Module (using convolutional operations) to enhance fine-grained texture details.
Benefit: This creates a natural division of labor where global context and local texture are handled by the most suitable mechanism within a unified flow, avoiding the redundancy of simple stacking.
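
The channel split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: channels are ranked by their mean step size over all tokens, and which end of the ranking goes to the global branch is a guess exposed as the hypothetical flag `high_delta_is_global`. The self-attention and local refinement branches themselves are not shown.

```python
def split_channels_by_delta(delta, gamma=0.5, high_delta_is_global=False):
    """Rank channels by mean step size and split them into two groups.

    delta: list of N token rows, each holding D per-channel step sizes.
    gamma: fraction of channels sent to the global branch (d_g = gamma * D).
    Returns (global_channel_ids, local_channel_ids).
    """
    num_tokens = len(delta)
    num_channels = len(delta[0])
    # Mean step size of each channel across all tokens.
    mean_delta = [sum(row[c] for row in delta) / num_tokens
                  for c in range(num_channels)]
    # Assumption: the ranking direction is a free choice in this sketch.
    order = sorted(range(num_channels), key=lambda c: mean_delta[c],
                   reverse=high_delta_is_global)
    d_g = int(gamma * num_channels)
    return order[:d_g], order[d_g:]


def gather_channels(tokens, channel_ids):
    """Select the given channels from every token row."""
    return [[row[c] for c in channel_ids] for row in tokens]


# Toy input: 2 tokens, 4 channels (made-up values).
delta = [[0.9, 0.1, 0.5, 0.3],
         [0.7, 0.1, 0.5, 0.3]]
tokens = [[1.0, 2.0, 3.0, 4.0],
          [5.0, 6.0, 7.0, 8.0]]

global_ids, local_ids = split_channels_by_delta(delta, gamma=0.5)
global_tokens = gather_channels(tokens, global_ids)  # would go to self-attention
local_tokens = gather_channels(tokens, local_ids)    # would go to local refinement
```

Because only $d_g$ of the channels reach the self-attention branch, the quadratic cost applies to a narrower feature slice, while the remaining $d_l$ channels are handled by cheaper convolutional operations.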

B. Direction-guided Edge Gated Convolution (DEGConv)

Designed to address the irregular geometries and branching nature of cracks.

Spatial Block Processing: The feature map is partitioned into non-overlapping local views.
Direction Embedding: For each view, gradients are computed, and a directional histogram is generated using an arctangent function to capture the dominant orientation of cracks within that block.
Gating Mechanism: The generated directional embedding is added to the original features. An EdgeConv (using strip convolutions $1\times k$ and $k\times 1$) extracts directional features, which are then used to generate gating weights. This dynamically regulates information flow to preserve critical edge details while suppressing noise.
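
The direction embedding step can be illustrated with a small sketch that builds a magnitude-weighted orientation histogram for one cell. The central-difference gradients, the folding of angles into $[0, \pi)$, and the number of bins are assumptions made for illustration; the paper's exact operators may differ, and the EdgeConv gating stage is not shown.

```python
import math


def dominant_direction(cell, num_bins=8):
    """Return (dominant_bin, histogram) of gradient orientations in a cell.

    cell: 2-D list of floats (one local view of a feature map).
    Orientations are folded into [0, pi), so a gradient and its opposite
    fall into the same bin.
    """
    height, width = len(cell), len(cell[0])
    hist = [0.0] * num_bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Central-difference gradients (an assumed choice).
            gx = (cell[y][x + 1] - cell[y][x - 1]) / 2.0
            gy = (cell[y + 1][x] - cell[y - 1][x]) / 2.0
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            angle = math.atan2(gy, gx) % math.pi
            bin_index = min(int(angle / math.pi * num_bins), num_bins - 1)
            hist[bin_index] += magnitude
    dominant = max(range(num_bins), key=lambda b: hist[b])
    return dominant, hist


# Toy 8x8 cells: a dark one-pixel line on a bright background.
vertical_crack = [[0.0 if x == 4 else 1.0 for x in range(8)] for y in range(8)]
horizontal_crack = [[0.0 if y == 4 else 1.0 for x in range(8)] for y in range(8)]

v_bin, _ = dominant_direction(vertical_crack)
h_bin, _ = dominant_direction(horizontal_crack)
```

Note that the gradient points across the crack, not along it: the vertical line produces horizontal gradients (bin 0), and the horizontal line produces vertical gradients (the middle bin). An embedding derived from the dominant bin therefore encodes crack orientation up to a fixed 90-degree offset.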

C. Spatial Refinement Multi-Level Fusion (SRF)

A decoder module designed to fuse multi-scale features without increasing complexity.

Process: High-resolution features (rich in spatial detail) guide the upsampling and fusion of lower-resolution semantic features.
Refinement: A spatial attention map is generated from the high-resolution features to perform weighted refinement on the upsampled lower-resolution features, improving boundary alignment and reducing the misalignment common in standard upsampling.
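
A minimal single-channel sketch of this idea follows. Several details are assumptions chosen to keep the example small: nearest-neighbour upsampling, a plain sigmoid of the high-resolution values as the spatial attention map (a real module would likely use learned convolutions), and additive fusion. The function names are hypothetical.

```python
import math


def upsample_nearest(feature, factor=2):
    """Nearest-neighbour upsampling of a 2-D list by an integer factor."""
    out_h = len(feature) * factor
    out_w = len(feature[0]) * factor
    return [[feature[y // factor][x // factor] for x in range(out_w)]
            for y in range(out_h)]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def refine_and_fuse(high_res, low_res, factor=2):
    """Weight the upsampled low-res map by attention from the high-res map.

    Assumption: attention = sigmoid(high_res), fusion = high + attention * up.
    """
    upsampled = upsample_nearest(low_res, factor)
    attention = [[sigmoid(v) for v in row] for row in high_res]
    return [[h + a * u for h, a, u in zip(h_row, a_row, u_row)]
            for h_row, a_row, u_row in zip(high_res, attention, upsampled)]


# Toy maps: a 4x4 high-resolution map and a 2x2 low-resolution map.
high_res = [[0.0] * 4 for _ in range(4)]
low_res = [[2.0, 4.0],
           [6.0, 8.0]]
fused = refine_and_fuse(high_res, low_res)
```

The attention map has the resolution of the fine features, so the coarse semantic values are reweighted pixel by pixel instead of being pasted in as uniform blocks.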

3. Key Contributions

TransMixer Architecture: A novel feature encoding structure that decouples tokens into global and local pathways based on Mamba's latent attention behavior, effectively combining CNN, Transformer, and Mamba strengths without simple stacking.
DEGConv Module: A lightweight module that integrates spatial block processing and directional priors to enhance edge sensitivity and texture modeling for irregular cracks with minimal computational overhead.
SRF Module: A fusion strategy that refines multi-scale details using high-resolution guidance, improving segmentation boundary accuracy without adding significant complexity.
State-of-the-Art Efficiency: The model achieves superior performance with extremely low resource requirements (2.05 GFLOPs and 2.54M parameters).

4. Experimental Results

The model was evaluated on four benchmark datasets: DeepCrack, Crack500, CamCrack789, and CrackMap.

Performance: MixerCSeg achieved State-of-the-Art (SOTA) results across all datasets.
- On DeepCrack, it achieved an mIoU of 91.51% and F1-score of 92.05%, outperforming the second-best model (SCSegamba) by 1.43% in mIoU.
- It significantly outperformed other hybrid models like MambaVision and RestorMixer.
Efficiency:
- Parameters: 2.54 M (9.3% fewer than SCSegamba).
- FLOPs: 2.05 G (88.7% lower than SCSegamba).
- Memory: 1190 MiB (significantly lower than competitors like RestorMixer at 10,384 MiB).
Ablation Studies:
- Removing TransMixer, DEGConv, or SRF individually resulted in performance drops, confirming the necessity of each component.
- Hyperparameter analysis showed that a global channel ratio ( $\gamma$ ) of 0.5 and a cell size of $8\times8$ in DEGConv provided optimal balance between noise robustness and detail capture.
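
The relative efficiency figures reported above can be sanity-checked with simple arithmetic. The sketch below back-computes the baseline values implied by the stated percentages; these are derived numbers, not figures quoted from the paper, and the helper name is hypothetical.

```python
def implied_baseline(value: float, percent_reduction: float) -> float:
    """Baseline that `value` would be `percent_reduction` percent below."""
    return value / (1.0 - percent_reduction / 100.0)


# Figures stated in the summary.
mixercseg_params_m = 2.54      # millions of parameters
mixercseg_gflops = 2.05
mixercseg_memory_mib = 1190
restormixer_memory_mib = 10384

# Baselines implied by "9.3% fewer" and "88.7% lower" (derived, not quoted).
implied_params_m = implied_baseline(mixercseg_params_m, 9.3)
implied_gflops = implied_baseline(mixercseg_gflops, 88.7)
memory_ratio = restormixer_memory_mib / mixercseg_memory_mib

print(f"implied SCSegamba parameters: {implied_params_m:.2f} M")
print(f"implied SCSegamba FLOPs:      {implied_gflops:.2f} G")
print(f"RestorMixer / MixerCSeg memory ratio: {memory_ratio:.1f}x")
```

Under these figures the comparison model would have roughly 2.80M parameters and about 18.1 GFLOPs, so the saving comes almost entirely from computation, not model size, and the memory footprint is close to nine times smaller than RestorMixer's.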

5. Significance

MixerCSeg represents a significant advancement in efficient deep learning for infrastructure monitoring. By moving beyond the "stacking" paradigm of hybrid models, it demonstrates that decoupling architectural strengths based on the underlying mathematical properties of the base model (Mamba) yields superior results.

Its primary significance lies in its extreme efficiency: it delivers high-precision, pixel-level segmentation suitable for real-time deployment on edge devices (due to low FLOPs and memory usage) while maintaining the ability to model complex, long-range crack structures that pure CNNs miss and pure Transformers find computationally prohibitive. This makes it highly practical for large-scale road health monitoring systems.