Multiscale Structure-Guided Latent Diffusion for Multimodal MRI Translation

This paper proposes MSG-LDM, a multiscale structure-guided latent diffusion framework that employs style-structure disentanglement and specialized loss functions to achieve high-fidelity, anatomically consistent multimodal MRI translation by effectively separating modality-specific styles from shared structural representations.

Jianqiang Lin (Northeastern University, Shenyang, China, Key Laboratory of Intelligent Computing in Medical Image, Shenyang, China), Zhiqiang Shen (Northeastern University, Shenyang, China, Key Laboratory of Intelligent Computing in Medical Image, Shenyang, China), Peng Cao (Northeastern University, Shenyang, China, National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Shenyang, China), Jinzhu Yang (Northeastern University, Shenyang, China, National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Shenyang, China), Osmar R. Zaiane (University of Alberta, Edmonton, Canada), Xiaoli Liu (AiShiWeiLai AI Research, Beijing, China)

Published 2026-03-16

Imagine you are trying to paint a perfect portrait of a patient's brain, but you only have a few scattered clues. In the medical world, doctors use different types of MRI scans (like T1, T2, FLAIR) to see different things. Sometimes, a patient is too sick or the machine is too expensive to get all the scans. This leaves the doctor with a "missing piece" puzzle, making it hard to diagnose tumors or plan surgery.

For a long time, computers tried to guess the missing scans using AI, but the results were often blurry, distorted, or looked like a "bad copy" of the real thing.

This paper introduces a new AI called MSG-LDM. Think of it as a super-smart art restorer that doesn't just guess; it understands the skeleton of the brain before it paints the skin.

Here is how it works, broken down into simple analogies:

1. The Problem: The "Blurry Copy"

Imagine you have a photo of a house, but you want to see what it looks like at night (a different "modality"). Old AI methods would try to guess the night version by just smudging the day photo. The result? The windows might end up in the wrong place, or the roof might look melted. The AI got the vibe right but lost the structure.

2. The Solution: Separating "Skeleton" from "Skin"

The authors realized that every MRI scan has two parts:

  • The Skeleton (Structure): The shape of the brain, the location of the tumor, the boundaries of organs. This is the same no matter which type of scan you take.
  • The Skin (Style): The brightness, contrast, and texture that change depending on the machine or the scan type.

MSG-LDM uses a trick called Style-Structure Disentanglement.

  • Analogy: Imagine a chef making a cake. The "skeleton" is the cake batter and the shape of the pan. The "style" is the frosting and sprinkles.
  • Old AI tried to guess the whole cake at once and often messed up the shape.
  • MSG-LDM first builds the perfect cake batter (the structure) using the clues it does have. Then, it adds the specific frosting (the style) needed for the missing scan. This ensures the brain's shape stays perfect, even if the "look" changes.

3. The Secret Sauce: "High-Frequency Injection"

One of the biggest issues with AI is that it gets the big picture right but misses the tiny details (like the sharp edge of a tumor).

  • The Analogy: Think of a low-resolution photo where the edges are fuzzy.
  • The paper introduces a High-Frequency Injection Block. Imagine this as a magnifying glass that the AI uses while it's building the skeleton. It specifically looks for sharp edges and fine textures and forces them into the drawing. It tells the AI: "Don't just guess the general shape; make sure the tumor's edge is razor-sharp."
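The principle behind the magnifying glass can be shown with a hand-written high-pass filter. The paper's High-Frequency Injection Block is learned; the Laplacian kernel and the `weight` parameter below are illustrative assumptions, not the authors' implementation.

```python
# Toy "high-frequency injection": extract the high-frequency residual with a
# high-pass filter and add it back, steepening boundaries.

def laplacian(img):
    """4-neighbour Laplacian high-pass: large where intensity changes fast."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = (4 * img[y][x] - img[y-1][x] - img[y+1][x]
                         - img[y][x-1] - img[y][x+1])
    return out

def inject_high_freq(blurry, weight=0.5):
    """Add a weighted copy of the high-frequency residual back in."""
    hf = laplacian(blurry)
    return [[blurry[y][x] + weight * hf[y][x] for x in range(len(blurry[0]))]
            for y in range(len(blurry))]

# A soft edge between dark (0.2) and bright (0.8) tissue...
soft = [[0.2, 0.2, 0.5, 0.8, 0.8] for _ in range(5)]
sharp = inject_high_freq(soft)
# ...becomes steeper: values on either side of the boundary are pushed apart.
assert sharp[2][1] < soft[2][1] and sharp[2][3] > soft[2][3]
```

The dark side of the edge gets darker and the bright side brighter, which is exactly the "razor-sharp tumor boundary" effect described above.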

4. The Multi-Scale Approach: Zooming In and Out

The AI doesn't just look at the brain from one distance. It examines it at multiple scales.

  • Analogy: Imagine looking at a digital map.
    • Low Scale: You zoom out to see the whole city (the big anatomical layout).
    • High Scale: You zoom in to see the individual streets and houses (the fine details).
  • MSG-LDM builds the brain by combining these views. It makes sure the brain is in the right place (low scale) and that the tiny blood vessels are drawn correctly (high scale).
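The zoom-in/zoom-out idea is just an image pyramid. The sketch below builds one by repeated 2x2 averaging; how MSG-LDM actually fuses the levels inside its network is not shown here, this only illustrates what "multiple scales" means.

```python
# Toy multi-scale view: repeatedly downsample an image by averaging,
# giving a pyramid from fine (streets) to coarse (city layout).

def downsample(img):
    """Average each non-overlapping 2x2 block into one pixel."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2*y][2*x] + img[2*y][2*x+1] +
              img[2*y+1][2*x] + img[2*y+1][2*x+1]) / 4
             for x in range(w)] for y in range(h)]

def pyramid(img, levels=3):
    """Stack of progressively coarser views of the same image."""
    out = [img]
    for _ in range(levels - 1):
        out.append(downsample(out[-1]))
    return out

img = [[float((x + y) % 2) for x in range(8)] for y in range(8)]  # checkerboard
levels = pyramid(img)
# Sizes shrink 8 -> 4 -> 2; the fine checkerboard detail averages away to 0.5.
assert [len(l) for l in levels] == [8, 4, 2]
assert levels[1][0][0] == 0.5
```

Notice that the fine checkerboard pattern vanishes at the coarse level: coarse scales carry only layout, so fine scales are needed to recover detail, which is why the model combines both.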

5. The "Teacher" (Loss Functions)

How does the AI know it's doing a good job? The authors gave it two strict "teachers" (mathematical rules):

  • Style Consistency Teacher: Tells the AI, "If you are making a T1 scan, it must look like a T1 scan, not a T2 scan." This prevents the AI from getting confused about which "skin" to put on.
  • Structure-Aware Teacher: Tells the AI, "The edges must be sharp, and the shapes must match the real anatomy." It checks the "fingerprint" of the image to ensure no details are lost.
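A minimal sketch of what these two "teachers" penalize, assuming a style score based on global intensity statistics and a structure score based on edge maps. The paper's losses operate on learned features; `style_loss` and `structure_loss` here are hand-rolled stand-ins for illustration only.

```python
# Toy versions of the two teachers: a style penalty comparing intensity
# statistics, and a structure penalty comparing edge maps.

def style_loss(img, ref):
    """Penalise mismatched global statistics (mean intensity here)."""
    mean = lambda im: sum(v for row in im for v in row) / (len(im) * len(im[0]))
    return (mean(img) - mean(ref)) ** 2

def edges(img):
    """Horizontal gradient magnitude as a crude edge map."""
    return [[abs(row[x+1] - row[x]) for x in range(len(row) - 1)] for row in img]

def structure_loss(img, ref):
    """Penalise edges drawn in the wrong place (anatomy mismatch)."""
    e1, e2 = edges(img), edges(ref)
    return sum((a - b) ** 2 for r1, r2 in zip(e1, e2) for a, b in zip(r1, r2))

real = [[0.1, 0.1, 0.9, 0.9]]
good = [[0.1, 0.1, 0.9, 0.9]]        # same anatomy, same style
shifted = [[0.1, 0.9, 0.9, 0.9]]     # boundary moved: structure error

assert structure_loss(good, real) == 0.0
assert structure_loss(shifted, real) > 0.0
```

A correct image scores zero on both; an image with the tumor boundary in the wrong place is caught by the structure teacher even if its overall brightness looks plausible.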

The Result

When the researchers tested this new method on real brain tumor data (BraTS2020) and white matter hyperintensity data (WMH), the results were impressive.

  • Better Accuracy: The AI generated missing scans that were much closer to the real thing than previous methods.
  • Sharper Details: The boundaries of tumors were clearer, which is crucial for surgeons.
  • Robustness: It worked well even when many scans were missing, not just one.

In a nutshell:
MSG-LDM is like a master architect who, when given a few blueprints, can reconstruct the entire building perfectly. It ignores the confusing "decoration" (style) to focus on the solid "foundation" (structure), and then adds the right decorations back in, ensuring the final building is safe, accurate, and detailed. This helps doctors see the full picture of a patient's brain, even when the data is incomplete.
