FermatSyn: SAM2-Enhanced Bidirectional Mamba with Isotropic Spiral Scanning for Multi-Modal Medical Image Synthesis

Imagine you are a doctor trying to plan a surgery or radiation treatment for a patient. To do this safely, you usually need a full set of "maps" of the patient's brain: an MRI, a CT scan, and maybe a few other types of images. Each map shows different things (like soft tissue vs. bone).

The Problem: Getting all these maps is hard. It takes a long time, it's expensive, and sometimes patients can't handle the radiation or the noise of the machine. So, doctors often have to work with an incomplete set of maps.

The Goal: We want a computer program that can look at the maps we do have and "imagine" (synthesize) the missing ones as accurately as possible. It needs to get the big picture right (the shape of the brain) and the tiny details right (the exact edge of a tumor).

The Solution: The authors of this paper created FermatSyn, a new AI system designed to be the ultimate "medical image translator." Here is how it works, explained with simple analogies:

1. The "Expert Guide" (SAM2-Based Prior Encoder)

Imagine trying to draw a perfect map of a city without ever having seen one. You might get the streets wrong.

What FermatSyn does: It uses a pre-trained AI called SAM2 (Segment Anything Model) as a "guide." Think of SAM2 as a super-experienced cartographer who has already memorized the general layout of human anatomy.
The Trick: Instead of teaching the whole guide from scratch (which is slow and expensive), FermatSyn just gives the guide a few "sticky notes" (called LoRA+) with specific instructions on how to handle medical images. This allows the system to instantly understand the "big picture" anatomy (like where the skull ends and the brain begins) without losing its general knowledge.

2. The "High-Definition Zoom" (HRDM & CIN)

When you shrink a photo to make it smaller, you often lose the fine details (like the texture of skin or the edge of a tumor). Most AI models do this too much, making the final image look blurry or "blocky."

What FermatSyn does: It has a special module called HRDM that acts like a "detail preservationist." It uses multiple lenses to look at the image at different zoom levels simultaneously.
The Analogy: Imagine a team of artists. One paints the broad landscape (the whole brain), while another uses a fine brush to paint the tiny cracks in the pavement (lesions). A special "bridge" (the CIN) then stitches these two paintings together perfectly, ensuring the tiny details don't get lost in the big picture.

3. The "Sunflower Spiral" Scan (Fermat Spiral Scanning)

This is the paper's most distinctive contribution.

The Old Way (Raster Scan): Traditional AI looks at an image like you read a book: left-to-right, top-to-bottom. This creates a bias; the AI gets really good at seeing things in a straight line but gets confused when things curve or turn corners.
The "Rectangular Spiral" Way: Some newer models try to spiral inward like a square snake. But this creates "corners" where the AI gets confused and sees things differently depending on which way it's looking.
The Fermat Way: FermatSyn uses a Fermat Spiral, inspired by how sunflowers arrange their seeds.
- The Analogy: Imagine a gardener planting seeds in a sunflower. They don't plant them in rows or square boxes; they plant them in a perfect, continuous spiral that covers every inch of the flower head evenly.
- Why it matters: By scanning the image in this "sunflower pattern," the AI sees the image from every direction almost equally. It doesn't have a "favorite" direction. This reduces the "corner artifacts" and keeps neighboring slices of the brain more consistent with one another.

4. The "Two-Way Street" (Bidirectional Mamba)

Once the image is scanned in this perfect spiral, the system processes it using a Mamba model.

The Analogy: Think of reading a sentence. If you only read from left to right, you might miss the context of the end of the sentence. FermatSyn reads the image "forward" (from the center out) and "backward" (from the outside in) at the same time. This ensures the AI understands the relationship between the center of the tumor and the edge of the brain simultaneously.

The Results: Why Should We Care?

The authors tested FermatSyn on real brain tumor data.

Better Quality: The synthesized images were sharper and more accurate than those from the earlier methods it was compared against.
Real-World Use: They fed these "fake" images to a tumor-finding model (the "robot") that had been trained on real scans. It performed about as well on the fake images as it did on real ones, and a model trained only on fake images came close to one trained on real data.
The Bottom Line: In the future, if a patient can't get a full set of scans, doctors may be able to use FermatSyn to generate the missing maps. The results suggest these generated maps are accurate enough to support downstream tasks such as outlining tumors.

In short: FermatSyn is like a master architect who uses a sunflower's pattern to build a faithful picture of the brain, capturing fine detail evenly no matter which direction you look from.

1. Problem Statement

Multi-modal medical image synthesis aims to generate missing imaging modalities (e.g., generating CT from MRI or T1c from T2) to address data scarcity caused by long scan times, patient contraindications, and radiation hazards. However, existing methods face three critical limitations:

Underutilized Structural Priors: Current frameworks lack mechanisms to inject domain-specific anatomical knowledge, leading to cross-modal images that are structurally implausible.
Deficient Local Fidelity: Aggressive downsampling in existing architectures discards high-frequency details essential for detecting small lesions and reconstructing precise boundaries.
Directional Bias in Serialization: State-of-the-art Mamba-based models use raster or rectangular-spiral scanning. These introduce "path-dependent" artifacts and uneven spatial coverage (e.g., corner hotspots in rectangular spirals), causing directional bias that corrupts spatial coherence and lesion boundary recognition.

2. Methodology: FermatSyn

The proposed FermatSyn framework addresses these gaps through a unified architecture comprising three core innovations:

A. SAM2-Enhanced Hybrid Encoder

To inject anatomical priors without prohibitive computational cost, the authors employ a SAM2-based Prior Encoder:

Mechanism: A frozen SAM2 Vision Transformer (ViT) backbone is used as a global feature extractor.
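Efficiency: Instead of full fine-tuning, LoRA+ (a variant of Low-Rank Adaptation) is applied to the MLP and MHSA layers. This adds a trainable low-rank update to each frozen weight matrix, $W_{tuned} = W_0 + \alpha P_A P_B$, where $W_0$ is the pretrained weight, $P_A$ and $P_B$ are the low-rank factors, and $\alpha$ is a scaling factor. The update adapts the segmentation model for synthesis tasks while preserving its ability to delineate organ boundaries and tissue interfaces.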
Detail Preservation: A Hierarchical Residual Downsampling Module (HRDM) runs in parallel to the SAM2 encoder. It uses multi-scale dilated convolutions and depthwise separable convolutions to preserve high-frequency texture and fine details often lost in pooling.
Integration: A Cross-scale Integration Network (CIN) bridges the semantic gap between the global SAM2 features and local HRDM features. It splits channels into even/odd sets to process low/high-frequency statistics separately before fusing them via convolution.
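
To make the low-rank update $W_{tuned} = W_0 + \alpha P_A P_B$ concrete, here is a minimal pure-Python sketch. It is an illustration under assumptions, not the authors' implementation: the helper names (matmul, lora_merge, trainable_fraction), the toy 2x2 weights, and the 768-wide layer with rank 8 are all hypothetical. It also shows only the merged-weight arithmetic and does not cover how LoRA+ trains the two factors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_merge(w0, p_a, p_b, alpha):
    """Return W_tuned = W_0 + alpha * (P_A @ P_B); W_0 itself is not modified."""
    delta = matmul(p_a, p_b)
    return [[w0[i][j] + alpha * delta[i][j] for j in range(len(w0[0]))]
            for i in range(len(w0))]


def trainable_fraction(d_out, d_in, rank):
    """Parameters in the two low-rank factors divided by those in the full matrix."""
    return rank * (d_out + d_in) / (d_out * d_in)


# Hypothetical toy values, chosen only to keep the arithmetic easy to follow.
w0 = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (2x2)
p_a = [[1.0], [2.0]]            # 2x1 factor, rank 1
p_b = [[3.0, 4.0]]              # 1x2 factor
w_tuned = lora_merge(w0, p_a, p_b, alpha=0.5)
print(w_tuned)

# Hypothetical layer size: a 768x768 matrix adapted with rank 8.
print(trainable_fraction(768, 768, 8))
```

The second call illustrates why this kind of adaptation is cheap: for the assumed layer size, the two factors hold only a small fraction of the parameters in the full weight matrix, while the pretrained weight stays frozen.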

B. Isotropic Fermat Spiral Scanning

To eliminate directional bias, the paper introduces a novel scanning strategy for the State Space Model (SSM):

The Problem with Rectangular Spirals: Previous methods (e.g., I2I-Mamba) use rectangular rings, leading to uneven nearest-neighbor spacing and "corner hotspots" (high activation variance).
The Fermat Solution: The authors parametrize a Fermat spiral using a golden-angle step ($\phi_g \approx 137.508^\circ$).
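- Formula: $r_k = \alpha\sqrt{k}$, $\theta_k = k \cdot \phi_g$, where $k$ is the point index and $\alpha$ is a radial scale factor (distinct from the LoRA scaling factor above).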
- Benefit: Because the golden angle is an irrational fraction of a full turn, no two points share the same angular direction. This yields dense, near-uniform packing and near-isotropic spatial coverage.
Continuity Constraint: A grid-matching objective balances global isotropy with local path continuity, ensuring the serialized sequence preserves spatial relationships while maintaining the benefits of the spiral topology.
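
The spiral parametrization above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the paper's algorithm: the function names are hypothetical, the default scale alpha=0.3 is an arbitrary choice, and the nearest-pixel snapping with a fallback for missed pixels is a simple stand-in for the grid-matching objective, whose details are not given here.

```python
import math

# Golden angle in radians, roughly 137.508 degrees.
GOLDEN_ANGLE = math.pi * (3.0 - math.sqrt(5.0))


def fermat_points(n, alpha=1.0):
    """Continuous Fermat-spiral points: r_k = alpha*sqrt(k), theta_k = k*golden angle."""
    pts = []
    for k in range(n):
        r = alpha * math.sqrt(k)
        theta = k * GOLDEN_ANGLE
        pts.append((r * math.cos(theta), r * math.sin(theta)))
    return pts


def fermat_scan_order(height, width, alpha=0.3):
    """Visit order for an H x W grid, from the centre outwards along the spiral.

    Assumption: each spiral point is snapped to its nearest pixel and only the
    first visit to a pixel is kept. This is a stand-in for the paper's
    grid-matching objective, not a reproduction of it.
    """
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    r_max = math.hypot(cy, cx) + 1.0           # reach slightly past the corner pixels
    n_points = int((r_max / alpha) ** 2) + 1   # invert r = alpha*sqrt(k)
    order, seen = [], set()
    for x, y in fermat_points(n_points, alpha):
        row = int(math.floor(cy + y + 0.5))
        col = int(math.floor(cx + x + 0.5))
        if 0 <= row < height and 0 <= col < width and (row, col) not in seen:
            seen.add((row, col))
            order.append((row, col))
    # Fallback: append any pixel the spiral never hit, nearest to the centre first.
    missed = [(r, c) for r in range(height) for c in range(width) if (r, c) not in seen]
    missed.sort(key=lambda p: math.hypot(p[0] - cy, p[1] - cx))
    return order + missed


order = fermat_scan_order(8, 8)
print(len(order), order[:6])
```

The returned list is a permutation of all pixel coordinates, so it can be used to flatten a feature map into a sequence and, by inverting the permutation, to restore the 2D layout afterwards.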

C. Bidirectional Fermat-Scan Mamba (BFS-Mamba)

The serialized features are processed by a Bidirectional Mamba module:

Architecture: It employs symmetric forward and backward SSM paths to model long-range dependencies in both directions.
Fusion: The outputs of the forward and backward passes are fused via a $1\times1$ convolution and a residual connection.
Decoder: A convolutional decoder reconstructs the target modality from the fused features.
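
The bidirectional idea can be illustrated with a toy example. The sketch below assumes a fixed scalar linear recurrence in place of Mamba's input-dependent selective SSM, and two scalar weights in place of the $1\times1$ convolution (for a single channel the two are equivalent). The function names and constants are hypothetical.

```python
def linear_scan(xs, a=0.5, b=1.0):
    """Toy state-space recurrence h_t = a*h_{t-1} + b*x_t, returning every h_t."""
    h, out = 0.0, []
    for x in xs:
        h = a * h + b * x
        out.append(h)
    return out


def bidirectional_scan(xs, w_fwd=0.5, w_bwd=0.5):
    """Fuse forward and backward scans, then add the input back as a residual."""
    fwd = linear_scan(xs)
    bwd = linear_scan(xs[::-1])[::-1]   # scan the reversed sequence, then restore order
    return [w_fwd * f + w_bwd * g + x for f, g, x in zip(fwd, bwd, xs)]


tokens = [0.0, 1.0, 0.0]           # an impulse in the middle of the sequence
print(linear_scan(tokens))         # forward only: positions before the impulse see nothing
print(bidirectional_scan(tokens))  # both neighbours of the impulse receive context
```

With a forward-only scan, positions earlier in the serialized order never receive information from later ones. Adding the backward pass gives every position context from both ends of the spiral.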

3. Key Contributions

SAM2-Prior Injection: First integration of a frozen SAM2 ViT with LoRA+ fine-tuning to provide domain-aware anatomical priors for medical image synthesis.
Isotropic Scanning Strategy: Introduction of the Fermat Spiral Scanning strategy, which yields near-uniform spatial coverage and reduces directional bias (operator footprint standard deviation reduced by 29% compared to rectangular spirals).
Hybrid Architecture: A novel combination of HRDM (for high-frequency detail) and CIN (for cross-scale fusion) within a Bidirectional Mamba framework.
Clinical Validation: Demonstration that synthesized images show no statistically significant difference from real images on a downstream clinical task (segmentation).

4. Experimental Results

The method was evaluated on SynthRAD2023 (MRI-CT), BraTS2019, BraTS-MEN, and BraTS-MET datasets.

Quantitative Performance:
- Intra-modal Synthesis: FermatSyn achieved state-of-the-art (SOTA) results. For example, in T2w $\to$ T1c synthesis, it reached 29.78 dB PSNR and 0.896 SSIM, outperforming the previous SOTA (I2I-Mamba) by 2.46 dB and 1.5% respectively.
- Cross-modal Synthesis: In MRI $\to$ CT, it achieved 0.931 SSIM and 31.48 dB PSNR, significantly reducing FID scores compared to GANs, Transformers, and Diffusion models.
- 3D Consistency: It showed superior inter-slice coherence, with Hausdorff Distance (HD) improvements of ~13% over I2I-Mamba.
Downstream Clinical Utility:
- A U-Net trained on real images was evaluated on FermatSyn-synthesized images and showed no statistically significant difference ($p > 0.05$) from its performance on real images across all tumor sub-regions (WT, ET, TC).
- Conversely, models trained only on synthesized data achieved performance within 2.1% of the real-image baseline, confirming the synthesized data's high fidelity for data augmentation.
Efficiency: Inference time is 31 ms per slice ($256\times256$), somewhat slower than I2I-Mamba (24 ms) but much faster than Diffusion models (148 ms).
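
For readers unfamiliar with the PSNR figures quoted above, the metric is conventionally defined as $10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$ in decibels. The sketch below applies that standard definition. It assumes intensities normalized to the range [0, 1] (the data_range parameter), and the tiny 2x2 images are made-up values for illustration only, not data from the paper. The paper's own normalization and evaluation protocol may differ.

```python
import math


def psnr(reference, synthesized, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    ref = [v for row in reference for v in row]
    syn = [v for row in synthesized for v in row]
    if len(ref) != len(syn):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - s) ** 2 for r, s in zip(ref, syn)) / len(ref)
    if mse == 0:
        return math.inf   # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


# Made-up 2x2 images; two of the four pixels are off by 0.1.
real = [[0.0, 0.5], [1.0, 0.5]]
fake = [[0.1, 0.5], [0.9, 0.5]]
print(round(psnr(real, fake), 2))
```

Because the scale is logarithmic, each additional 10 dB corresponds to a tenfold reduction in mean squared error, which is why gains of a few dB are treated as meaningful.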

5. Significance

FermatSyn advances medical image synthesis by addressing the trade-off between global anatomical consistency and local textural fidelity.

Theoretical Impact: It challenges the standard raster/rectangular-spiral serialization in Mamba models, providing evidence that phyllotaxis-inspired (Fermat) scanning gives a more isotropic receptive field, which is crucial for 2D medical image analysis.
Clinical Impact: By generating high-fidelity synthetic data that preserves pathological features (tumor boundaries, core regions), FermatSyn offers a viable solution for alleviating data scarcity in radiotherapy planning and surgical navigation without compromising diagnostic quality.
Future Direction: The work paves the way for extending isotropic scanning to 3D volumetric synthesis and applying knowledge distillation for real-time deployment.