The Problem: The "Blurry Slice" Dilemma
Imagine you are trying to understand a 3D object, like a loaf of bread. Your camera takes incredibly sharp photos of the top and sides (the crust), but the photos of the slices inside are blurry and spaced far apart.
In the world of Volume Electron Microscopy (VEM), scientists use powerful microscopes to see the tiny structures inside cells (like neurons or mitochondria). However, the machines they use are like that imperfect camera:
- Lateral (Side-to-Side): Super sharp and detailed.
- Axial (Top-to-Bottom): Blurry, low-resolution, and "chunky."
This creates a "stretched" or "anisotropic" image. It's like looking at a high-definition video that has been stretched vertically; the characters look tall and thin, and the details between the frames are missing. Scientists need these images to be "isotropic" (equal in all directions) to study how cells work in 3D, but fixing this manually is impossible.
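To make "anisotropic" concrete, here is a minimal NumPy sketch. The voxel sizes are illustrative placeholders, not numbers from the paper, and the naive slice-repeating "fix" at the end is exactly what real reconstruction methods try to improve on:

```python
import numpy as np

# Illustrative voxel sizes (assumed, not from the paper): VEM data is often
# sharp laterally (e.g. ~8 nm/pixel) but coarse axially (e.g. ~40 nm/slice),
# so each "voxel" is a tall, stretched box rather than a cube.
lateral_nm = 8.0   # x and y resolution
axial_nm = 40.0    # z resolution (spacing between slices)

anisotropy = axial_nm / lateral_nm  # 5x stretched along z

# A toy volume: 4 real slices, each 20x20 pixels.
volume = np.random.rand(4, 20, 20)

# Naive "fix": repeat each slice 5 times so the grid becomes isotropic.
# This makes the voxels cubes but adds no new detail -- the whole point of
# isotropic reconstruction is to fill these gaps with real structure.
isotropic_grid = np.repeat(volume, int(anisotropy), axis=0)
print(volume.shape, "->", isotropic_grid.shape)  # (4, 20, 20) -> (20, 20, 20)
```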
The Old Ways: Why They Failed
Before this paper, scientists tried two main ways to fix the blur:
- The "Stacking" Method (2D Models): They treated every slice as a separate 2D picture and tried to fix them one by one.
- The Flaw: It's like trying to fix a movie by editing each frame individually without looking at the previous or next frame. The result? The characters might look great in one frame but jump weirdly to the next. The 3D connection is broken.
- The "Heavyweight" Method (3D Transformers): They used massive AI models that looked at the whole 3D block at once.
- The Flaw: These models are like trying to lift a giant boulder with a single finger. They are so computationally heavy that they require supercomputers and take forever to run, making them impractical for large datasets.
The Solution: VEMamba (The "Smart Scanner")
The authors, Longmi Gao and Pan Gao, built VEMamba. Think of it as a smart, efficient 3D scanner that uses a new type of AI architecture called Mamba.
Here is how VEMamba works, broken down into three simple concepts:
1. The "Re-Ordering" Trick (ALCSSM)
Imagine you have a giant 3D block of cheese, and you want to describe every crumb inside it to a robot.
- Old way: You try to describe the whole block at once (too hard) or just the top layer (misses the inside).
- VEMamba's way: It uses a technique called Axial-Lateral Chunking Selective Scan.
- Imagine slicing the cheese into thin strips, but instead of just cutting horizontally, it cuts vertically and horizontally at the same time, weaving a path through the cheese like a snake.
- It turns the complex 3D block into a simple, long 1D line (a sequence) that the AI can read easily.
- The Magic: By scanning in multiple directions (up-down, left-right, and diagonally), the AI learns how the top slice connects to the bottom slice and how the left side connects to the right side. It forces the AI to understand the 3D consistency without getting overwhelmed.
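The "re-ordering" idea can be sketched with NumPy. This is a simplified illustration of unrolling a 3D block into 1D sequences along different directions, not the paper's exact eight-path chunked scan; the scan names are made up for clarity:

```python
import numpy as np

# A toy 3D block: z = axial slices, y/x = lateral. Each scan below is one
# way of "unrolling" the volume into a 1D sequence that a Mamba-style
# state-space model can read token by token.
vol = np.arange(2 * 3 * 4).reshape(2, 3, 4)  # shape (z, y, x)

scans = {}
# Lateral-first scan: sweep across each slice, then move to the next slice.
scans["lateral_fwd"] = vol.reshape(-1)
# Axial-first scan: walk down through all slices at each (y, x) position,
# so axially-adjacent voxels become neighbors in the sequence.
scans["axial_fwd"] = vol.transpose(1, 2, 0).reshape(-1)
# Reversed versions give the model context from the opposite direction.
scans["lateral_bwd"] = scans["lateral_fwd"][::-1]
scans["axial_bwd"] = scans["axial_fwd"][::-1]

# In the axial-first order, voxel (z=0, y=0, x=0) and (z=1, y=0, x=0) sit
# right next to each other -- the sequence itself encodes 3D adjacency.
print(scans["axial_fwd"][:4])  # [ 0 12  1 13]
```

The design point: a 1D sequence model only sees its neighbors, so choosing *which* voxels end up adjacent in the sequence decides what 3D relationships the model can learn cheaply.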
2. The "Smart Mixer" (DWAM)
After the AI scans the cheese in all those different directions, it has eight different "opinions" on what the 3D structure looks like.
- The Problem: Some opinions are better than others depending on the part of the image.
- The Solution: VEMamba uses a Dynamic Weights Aggregation Module.
- Think of this as a conductor in an orchestra. It listens to all eight different "scans" (instruments) and decides, "Okay, for this specific part of the cell, the vertical scan is most important, but for this part, the horizontal scan is better."
- It mixes them together perfectly to create one super-clear 3D picture.
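The "conductor" idea boils down to a per-voxel weighted average. Here is a minimal sketch, assuming softmax-normalized weights over the eight scan outputs; the shapes and the random stand-in for the learned weighting network are illustrative, not the paper's exact DWAM:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend we have 8 candidate feature maps, one per scan direction,
# each covering a tiny 4x4x4 volume.
num_scans = 8
features = rng.standard_normal((num_scans, 4, 4, 4))

# A dynamic aggregator predicts a weight for every scan at every voxel
# (here random numbers stand in for a small learned network), then
# softmax-normalizes them so each location chooses its own best mix.
logits = rng.standard_normal((num_scans, 4, 4, 4))
weights = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)

fused = (weights * features).sum(axis=0)  # one volume, blended per voxel
print(fused.shape)  # (4, 4, 4)
```

Because the weights vary per voxel, the vertical scan can dominate in one region of the cell while the horizontal scan dominates in another, which is exactly the "conductor" behavior described above.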
3. The "Realistic Practice" (MoCo & Degradation)
AI models often fail in the real world because they are trained on "perfect" fake data.
- The Problem: If you create training pairs by simply shrinking sharp images (downsampling), the model learns to undo "mathematical blur," not the messy blur, noise, and artifacts a real microscope produces.
- The Solution: The authors created a Degradation Simulation.
- They intentionally messed up their training data with realistic noise, blurring, and artifacts that happen in real microscopes.
- They used a technique called Momentum Contrast (MoCo). Imagine a student (the AI) practicing with a teacher who constantly changes the difficulty of the test. The student learns to recognize the types of mistakes (degradation) and how to fix them, rather than just memorizing the answers. This makes the model robust enough to handle real-world messy data.
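A degradation simulation like this can be sketched in a few lines of NumPy. This is a toy pipeline under assumed parameters (blur width, subsampling factor, noise level), not the paper's exact degradation model: blur along the axial axis, throw away slices, then add sensor-style noise:

```python
import numpy as np

rng = np.random.default_rng(1)

def degrade(volume, factor=4, blur_sigma=1.0, noise_std=0.05):
    """Toy degradation: axial blur -> axial subsampling -> noise.
    Parameters are illustrative, not taken from the paper."""
    z = volume.shape[0]
    # 1. Axial Gaussian blur: each output slice mixes its axial neighbors,
    #    mimicking how a thick physical slice averages over depth.
    offsets = np.arange(-2, 3)
    kernel = np.exp(-offsets**2 / (2 * blur_sigma**2))
    kernel /= kernel.sum()
    blurred = np.zeros_like(volume)
    for o, w in zip(offsets, kernel):
        blurred += w * volume[np.clip(np.arange(z) + o, 0, z - 1)]
    # 2. Axial subsampling: keep every `factor`-th slice, as the scope does.
    sub = blurred[::factor]
    # 3. Sensor-style additive noise.
    return sub + rng.normal(0, noise_std, sub.shape)

clean = rng.random((16, 8, 8))
low_quality = degrade(clean)
print(clean.shape, "->", low_quality.shape)  # (16, 8, 8) -> (4, 8, 8)
```

Training on (clean, degrade(clean)) pairs like these, rather than on plain downsampled pairs, is what lets the model learn to undo realistic microscope damage; MoCo then helps it build representations that recognize *which kind* of degradation it is looking at.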
The Results: Fast, Cheap, and Clear
When they tested VEMamba:
- Quality: It produced 3D images that were sharper and more accurate than previous methods. It didn't just fill in the gaps; it reconstructed the actual biological structures (like mitochondria) with high fidelity.
- Speed & Cost: It was much faster and used fewer computer resources than the "heavyweight" models. It's like getting a Ferrari's speed with a Toyota's fuel efficiency.
- Downstream Success: When they used the reconstructed images to outline and identify cell parts (segmentation), the results were nearly as good as if they had started from perfect, expensive, isotropic microscope data.
Summary Analogy
If reconstructing a 3D cell from blurry slices was like rebuilding a shredded document:
- Old 2D methods tried to glue the pieces back together one by one, often getting the order wrong.
- Old 3D methods tried to read the whole shredded pile at once, which took a lifetime.
- VEMamba is like a super-smart robot that sorts the shreds into a specific order (reordering), reads them in a way that connects the top to the bottom (consistency), and uses a smart filter to ignore the coffee stains and tears (degradation learning), resulting in a perfect, readable document in record time.
The Bottom Line: VEMamba gives scientists a way to see the 3D world of cells clearly, quickly, and without needing a supercomputer, paving the way for better medical and biological discoveries.