Using Unsupervised Domain Adaptation Semantic Segmentation for Pulmonary Embolism Detection in Computed Tomography Pulmonary Angiogram (CTPA) Images

This paper proposes an unsupervised domain adaptation framework for pulmonary embolism detection in CTPA images. It combines a Transformer backbone with a Mean-Teacher architecture and three specialized modules (Prototype Alignment, Global and Local Contrastive Learning, and Attention-Based Auxiliary Local Prediction) to address domain shift and annotation scarcity, achieving substantial performance improvements on both cross-center and cross-modality tasks.

Wen-Liang Lin, Yun-Chien Cheng

Published 2026-02-24

Imagine you are a master detective trained in New York City to spot a specific type of tiny, dangerous clue (a pulmonary embolism) hidden inside complex maps (CT scans of lungs). You are incredibly good at your job in New York.

But then, you are sent to Tokyo to do the same job. The maps in Tokyo look slightly different: the paper texture is different, the ink colors are slightly off, and the lighting in the room is different. Even though the clue is the same, your New York-trained brain gets confused. You start missing clues or seeing them where they don't exist.

This is the problem doctors face with AI. An AI trained on scans from one hospital often fails when used at another hospital because of these subtle differences (called "Domain Shift"). Furthermore, teaching the AI to recognize the new style usually requires a human expert to manually draw every single clue on thousands of new maps, which is too expensive and slow.

This paper presents a clever solution: An AI that teaches itself how to adapt without needing a human teacher for the new city.

Here is how they did it, explained with simple analogies:

1. The "Mean Teacher" Strategy (The Wise Mentor)

Instead of hiring a new human teacher for the Tokyo hospital, the AI uses a "Mean Teacher" system.

  • The Student: The AI trying to learn the new city.
  • The Teacher: A slightly older, more stable version of the Student. Its weights are a running average (the "mean") of the Student's past weights, which is where the name comes from.
  • How it works: The Teacher guesses where the clues are in the new maps. These guesses are called "pseudo-labels." The Student tries to match the Teacher's guesses. Over time, the Teacher becomes smarter, and the Student learns from it. It's like a student learning by mimicking a mentor who is slowly getting better at the job.
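The "slowly getting better" part can be made concrete: the Teacher's weights are an exponential moving average of the Student's. Here is a minimal NumPy sketch of that update; the momentum value and the one-dimensional toy "weights" are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def ema_update(teacher_w, student_w, momentum=0.99):
    """Teacher weights become an exponential moving average of the
    Student's. (Illustrative helper; momentum=0.99 is an assumption.)"""
    return momentum * teacher_w + (1 - momentum) * student_w

# Toy scenario: the Student improves a little each step,
# and the Teacher smoothly trails behind it.
student = np.array([0.0])
teacher = np.array([0.0])
for step in range(100):
    student = student + 0.01            # stand-in for a gradient step
    teacher = ema_update(teacher, student)
# The Teacher lags the Student but follows the same trend, which is
# what makes its pseudo-labels more stable than the Student's own guesses.
```

Because the average smooths out the Student's step-to-step noise, the Teacher's pseudo-labels change slowly and give the Student a consistent target to mimic.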

2. The Three Secret Weapons

To make sure the Student doesn't get confused by the "noise" in the new maps, the researchers added three special tools:

A. Prototype Alignment (The "Group Hug")

  • The Problem: In the new city, the "clue" might look a bit gray instead of red, and the "background" might look a bit blue instead of white. The AI gets confused about what belongs to which group.
  • The Solution: Imagine the AI creates a "center point" (a prototype) for what a "clue" looks like and a "center point" for what "background" looks like. This tool drags the New York "clue center" and the Tokyo "clue center" closer together until they are hugging. It forces the AI to realize, "Oh, even though the colors are different, this is still the same type of object."
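The "center point" idea translates directly into code: a prototype is just the mean feature vector of all pixels assigned to a class, and the alignment loss penalizes the distance between the source and target prototypes. This is a minimal NumPy sketch under assumed shapes (100 pixels, 8-dimensional features) and random stand-in data, not the paper's actual features.

```python
import numpy as np

def class_prototype(features, mask):
    """Mean feature vector of the pixels belonging to one class.
    features: (N, D) pixel embeddings; mask: (N,) boolean membership."""
    return features[mask].mean(axis=0)

rng = np.random.default_rng(0)
src_feats = rng.normal(0.0, 1.0, (100, 8))   # source-domain pixel features
tgt_feats = rng.normal(0.5, 1.0, (100, 8))   # target features (shifted style)
src_mask = rng.random(100) > 0.5             # pixels labeled "clue" (source)
tgt_mask = rng.random(100) > 0.5             # pseudo-labeled "clue" (target)

p_src = class_prototype(src_feats, src_mask)
p_tgt = class_prototype(tgt_feats, tgt_mask)

# Alignment loss: squared distance between the two class centers.
# Minimizing it drags the domains' "clue" prototypes toward each other.
proto_loss = np.sum((p_src - p_tgt) ** 2)
```

In training, this loss is added to the usual segmentation loss, so the network is rewarded for mapping "clue" pixels from both hospitals to the same region of feature space.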

B. Global and Local Contrastive Learning (The "Big Picture vs. The Details")

  • The Problem: The AI needs to understand both the whole map (the layout of the lungs) and the tiny details (the shape of the tiny clot).
  • The Solution:
    • Global: It looks at the whole map to understand the "skeleton" or layout. It learns that "a heart is always in the middle," regardless of the image style.
    • Local: It zooms in on tiny patches to learn the texture of the clot.
    • The Trick: It uses a "Momentum Queue" (like a memory bank) to remember thousands of examples it has seen before. This helps it learn the difference between "clue" and "not clue" without needing a massive computer to hold everything at once.
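The memory-bank trick above is typically implemented as an InfoNCE-style contrastive loss over a fixed-size queue of past feature vectors. The sketch below is a minimal NumPy version with made-up sizes (16-dimensional features, a queue of 8 negatives) and a guessed temperature; it shows the mechanism, not the paper's exact loss.

```python
import numpy as np
from collections import deque

def info_nce(query, positive, queue, temperature=0.1):
    """InfoNCE: pull the query toward its positive, push it away from
    the queued negatives. Temperature 0.1 is an assumed value."""
    def norm(v):
        return v / np.linalg.norm(v)
    q, p = norm(query), norm(positive)
    negs = np.stack([norm(n) for n in queue])
    logits = np.concatenate([[q @ p], negs @ q]) / temperature
    logits -= logits.max()                     # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                   # positive sits at index 0

queue = deque(maxlen=8)                        # the fixed-size "memory bank"
rng = np.random.default_rng(0)
for _ in range(8):
    queue.append(rng.normal(size=16))          # features remembered earlier

q = rng.normal(size=16)
loss_close = info_nce(q, q + 0.01 * rng.normal(size=16), queue)  # true pair
loss_far = info_nce(q, rng.normal(size=16), queue)               # mismatch
# A genuinely matching pair yields a much lower loss than a random one.
```

Because the queue only stores small feature vectors rather than whole images, thousands of negatives fit in memory, which is exactly why this works "without needing a massive computer."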

C. Attention-Based Auxiliary Local Prediction (The "Flashlight" vs. "Random Cropping")

  • The Problem: This is the most important part for tiny clues. Imagine you are looking for a needle in a haystack. If you randomly grab handfuls of hay to look at, 99% of the time, you'll just grab empty hay. You'll never find the needle. This is what happens when AI randomly crops images; it usually misses the tiny embolisms.
  • The Solution: The researchers gave the AI a Flashlight.
    • Because the AI uses a "Transformer" (a type of AI that pays attention to relationships), it naturally knows where to look. It creates a "heat map" showing where the important stuff is.
    • Instead of randomly grabbing a piece of the image, the AI uses its Flashlight to shine only on the areas where the "clue" is likely hiding. It then studies those specific spots intensely. This ensures the AI never wastes time looking at empty background space.
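The "Flashlight" boils down to one substitution: instead of sampling a crop at a random location, center the crop on the peak of the attention heat map. A minimal NumPy sketch, where the attention map is a hand-made stand-in for the Transformer's real attention:

```python
import numpy as np

def attention_guided_crop(image, attention, crop=4):
    """Crop the window centered on the attention peak rather than a
    random location. (Illustrative stand-in for the paper's module.)"""
    h, w = attention.shape
    y, x = np.unravel_index(np.argmax(attention), attention.shape)
    y = int(np.clip(y, crop // 2, h - crop // 2))   # keep window in bounds
    x = int(np.clip(x, crop // 2, w - crop // 2))
    return image[y - crop // 2: y + crop // 2,
                 x - crop // 2: x + crop // 2]

rng = np.random.default_rng(0)
image = rng.random((16, 16))
attention = np.zeros((16, 16))
attention[10, 5] = 1.0                 # the "clue" lights up here
patch = attention_guided_crop(image, attention)
# patch is the 4x4 window around (10, 5), never an empty background region.
```

A random crop on a 16x16 image would land on the 4x4 "clue" region only a small fraction of the time; the attention-guided version hits it every time, which is the whole point for tiny embolisms.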

3. The Results: A Detective Who Adapts

The researchers tested this system in two ways:

  1. Cross-Center: Moving from Hospital A (FUMPE) to Hospital B (CAD-PE).
    • Before: The AI was terrible (IoU score of 0.11). It was like a detective who couldn't read the new maps at all.
    • After: The AI became excellent (IoU score of 0.41). It successfully adapted to the new hospital's style.
  2. Cross-Modality: Moving from CT scans to MRI scans (completely different types of images).
    • Result: The AI achieved a 69.9% success rate without ever seeing a single labeled MRI scan from a human expert.

Why This Matters

Most advanced AI systems today are like expensive supercomputers that need massive power and huge teams of experts to train. This new method is like a smart, self-taught detective.

  • It works on standard hospital computers (not just super-servers).
  • It doesn't need expensive human experts to label every new image.
  • It specifically solves the problem of finding tiny things that are easily missed.

In short, this paper gives us a way to take a smart medical AI trained in one place and instantly make it useful in a completely different place, saving time, money, and potentially lives by catching dangerous blood clots that might otherwise be missed.
