Imagine you are trying to take a perfect photo of a car driving through a thick fog at night. You have two cameras:
- The Thermal Camera (Infrared): It sees the heat of the car engine and the driver perfectly, even in the dark. But the picture is blurry, and you can't see the car's color or the texture of the road.
- The Regular Camera (Visible): It sees the road markings, the car's paint, and the trees clearly. But in the fog and darkness, the car itself looks like a dark, invisible blob.
The Problem:
For years, scientists have tried to "stitch" these two pictures together. But most old methods were like a clumsy editor who just mashed the two photos together without understanding what they were looking at. They would accidentally blur the car (the important part) or make the fog look too bright. They suffered from "Semantic Blindness"—they couldn't tell the difference between a critical target (like a person or a car) and the background noise.
The Solution: SGDFuse
The authors of this paper created a new system called SGDFuse. Think of it as hiring a super-smart editor with two special tools to fix the photo.
The Two-Stage Process (The "Chef's Recipe")
Instead of trying to do everything at once, SGDFuse cooks the image in two distinct steps:
Stage 1: The Rough Draft (Structural Foundation)
First, the system takes the thermal and regular photos and blends them into a decent "rough draft." It's like sketching a painting with a pencil. It gets the basic shapes and positions right, but it's not perfect yet.
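To make the "rough draft" idea concrete, here is a toy sketch of coarse fusion as a simple per-pixel blend. This is purely illustrative: the actual first stage in SGDFuse is a learned network, not a fixed weighted average, and the `alpha` weight here is an assumption for demonstration.

```python
import numpy as np

def coarse_fuse(ir, vis, alpha=0.5):
    """Toy 'rough draft' fusion: a per-pixel weighted average of the
    infrared (ir) and visible (vis) images. A real fusion network
    learns how to blend; the fixed alpha is purely illustrative."""
    assert ir.shape == vis.shape
    return alpha * ir + (1 - alpha) * vis

# Two tiny grayscale "images" with values in [0, 1]
ir = np.array([[0.9, 0.1], [0.8, 0.2]])   # hot engine = bright in IR
vis = np.array([[0.2, 0.7], [0.1, 0.6]])  # road texture = bright in visible
draft = coarse_fuse(ir, vis)
```

The draft keeps a trace of both inputs everywhere, which is exactly why it looks "decent but not perfect": nothing in this blend knows that the car matters more than the fog.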
Stage 2: The Masterpiece (The Magic Touch)
This is where the magic happens. The system uses two powerful tools:
- SAM (The "Spotlight"): This is a pre-trained AI that is amazing at finding objects. It looks at the rough draft and draws a glowing "mask" around everything important (the car, the person, the tree). It tells the system, "Hey, pay attention to THIS part! Don't blur this!"
- Diffusion Model (The "Sculptor"): This is a type of AI that creates images by slowly removing noise (like chipping away at a block of marble to reveal a statue). Usually, this sculptor just tries to make things look pretty. But in SGDFuse, the sculptor is holding the "Spotlight" (the SAM mask).
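The "sculptor" metaphor can be shown with a toy denoising loop. This is not the paper's diffusion model; it just fakes the key behavior (many small steps, each removing a little noise) by nudging a noisy image toward a target. A real diffusion model predicts the noise at each step with a neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoise(noisy, clean_estimate, steps=10, step_size=0.3):
    """Illustrative only: mimic iterative denoising by moving the
    image a fraction of the way toward a clean estimate each step,
    like chipping marble away a little at a time."""
    x = noisy.copy()
    for _ in range(steps):
        x = x + step_size * (clean_estimate - x)  # remove a bit of "noise"
    return x

clean = np.array([[0.2, 0.8], [0.5, 0.5]])
noisy = clean + 0.2 * rng.standard_normal(clean.shape)
restored = toy_denoise(noisy, clean)
```

After ten small steps the image sits much closer to the clean target than the noisy input did, which is the whole trick: many gentle corrections instead of one big guess.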
How they work together:
The Diffusion Model starts with a noisy, blurry mess. As it slowly cleans it up, the SAM mask acts as a guardrail. It says, "Make the car's edges sharp because the mask says it's a car. Keep the thermal heat on the engine. But don't waste effort making the fog look detailed."
Because the AI "knows" what is important, it doesn't just guess; it reconstructs the image with high fidelity, keeping the heat of the car and the texture of the road perfectly aligned.
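One simple way to express "pay attention to THIS part" in code is a mask-weighted loss: errors inside the SAM mask (the car, the person) count more than errors in the background (the fog). This is a minimal sketch of the guidance idea, not SGDFuse's actual objective, and the `fg_weight` value is an illustrative assumption.

```python
import numpy as np

def mask_weighted_loss(pred, target, mask, fg_weight=5.0):
    """Toy semantic guidance: squared error, but pixels inside the
    SAM mask (mask == 1) are penalized fg_weight times more than
    background pixels, so the model spends its effort on targets."""
    err = (pred - target) ** 2
    weights = np.where(mask > 0.5, fg_weight, 1.0)
    return float(np.mean(weights * err))

pred = np.zeros((2, 2))
target = np.ones((2, 2))
mask = np.array([[1, 0], [0, 0]])  # top-left pixel is "the car"
loss = mask_weighted_loss(pred, target, mask)
```

With a plain (unweighted) loss, blurring the car costs the same as blurring the fog; with the mask weighting, blurring the car is five times more expensive, so the optimizer sharpens it first.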
Why is this a big deal?
- No More "Blind" Editing: Old methods treated every pixel the same. SGDFuse understands the story of the image. It knows a person is more important than a patch of grass.
- Better for Robots: A self-driving car needs to see the pedestrian in the fog. SGDFuse creates a fused image that helps the car's computer "see" the person far more reliably, leading to safer driving.
- Medical Magic: The paper also tested this on medical scans (like MRI and PET scans). Just like with cars, it helps doctors see tumors (the "heat") clearly against the body tissue (the "texture"), leading to better diagnoses.
The Bottom Line
Think of SGDFuse as upgrading from a photocopier (which just copies and pastes pixels) to a smart artist who understands the scene. By using a "Spotlight" (SAM) to guide a "Sculptor" (Diffusion Model), it creates a final image that is not only beautiful to look at but also incredibly useful for computers trying to make sense of the world.