Fusion and Grouping Strategies in Deep Learning for Local Climate Zone Classification of Multimodal Remote Sensing Data

Imagine you are trying to understand the "personality" of a bustling city. You want to know which neighborhoods are dense skyscrapers, which are quiet parks, and which are industrial factories. In the world of climate science, these distinct areas are called Local Climate Zones (LCZs). Knowing them helps us fight problems like the "Urban Heat Island" effect, where cities get dangerously hotter than the countryside.

To map these zones, scientists use satellites. But looking at a city from space is tricky. It's like trying to describe a person's face using only a black-and-white sketch (which shows shape but no color) or only a color photo that gets blurry in the rain.

This paper is about teaching computers to be the ultimate detective by combining two different types of satellite "eyes":

SAR (Radar): Like a bat using echolocation. It sees through clouds and darkness and tells us about the texture and shape of buildings (roughness, height).
MSI (Optical): Like a human eye. It sees colors and tells us about what things are made of (green grass, blue water, red roofs).

The researchers asked: "How do we best mix these two types of vision so the computer doesn't get confused?"

Here is the breakdown of their journey, explained with simple analogies:

1. The Problem: The "Confused Chef"

Imagine you are a chef trying to make a perfect soup. You have two ingredients: a rough, crunchy vegetable (Radar) and a smooth, colorful fruit (Optical).

If you just throw them in a pot and stir (simple mixing), the soup might taste okay, but you lose the crunch and the flavor.
If you try to taste them separately and then guess the recipe (late mixing), you might miss how they interact.
The goal is to find the perfect way to chop, blend, and season them together so the final dish is delicious.

2. The Four Recipes (The Models)

The team tested four different "recipes" (models) to see which one made the best soup (classification):

Recipe 1 (FM1 - The Hybrid Chef): This chef does two things at once. They chop the ingredients finely (Pixel-level) and blend them into a smooth puree (Feature-level) before mixing them together. This was the most successful recipe. It captured both the texture and the color perfectly.
Recipe 2 (FM2 - The Over-Thinker): This chef uses a super-complex attention system (like a chef who constantly tastes every single grain of salt while cooking). While smart, it made the cooking far slower, and the soup didn't taste better than the first recipe's.
Recipe 3 (FM3 - The Blur Artist): This chef smears the ingredients through different sized sieves (Multi-scale Gaussian smoothing) to see the big picture and the tiny details at the same time. It was good, but not quite as good as the Hybrid Chef.
Recipe 4 (FM4 - The Late Decision): This chef cooks the Radar and Optical ingredients in two separate pots and only tries to combine them at the very end. This was the least effective. By the time they combined them, the flavors had already been lost.

3. The Secret Weapons: Grouping and Merging

Even with the best recipe, the chef can get confused if the ingredients look too similar. The team added two "tricks" to help:

Trick A: Band Grouping (Sorting the Spice Rack):
Satellites have many "bands" (like many different spices). Some spices taste almost identical. Instead of using 18 different spices, the team grouped them into 7 logical categories (e.g., "All the Red Spices," "All the Earthy Spices"). This stopped the computer from getting overwhelmed by redundant information.
- Analogy: Instead of asking a student to memorize 18 similar shades of blue, you tell them to just remember "Sky Blue," "Ocean Blue," and "Navy Blue."
Trick B: Label Merging (The "Good Enough" Rule):
In the city map, some zones are so similar that even humans argue about them. For example, is that patch of land "Bare Rock" or "Bare Soil"? They look almost the same to a satellite.
The team decided to merge these confusing pairs into one big category called "Bare Surfaces."
- Analogy: Instead of trying to distinguish between a "Shiba Inu" and a "Pomeranian" (which are hard to tell apart), you just call them both "Small Fluffy Dogs." You might lose some detail, but you stop making mistakes, and your overall score goes up!

4. The Results: The Winning Team

When they combined the Hybrid Chef (Recipe 1) with Sorting the Spice Rack and The "Good Enough" Rule, they reached an overall accuracy of 76.6%.

This is a big deal because:

It beat the previous best methods (the "State of the Art").
It was especially good at identifying the rare and difficult neighborhoods (the "underrepresented classes"). In a city, some areas are huge (like a big park), and some are tiny (like a small industrial zone). Previous models ignored the tiny ones. This new method paid attention to them.

The Takeaway

The paper's lesson is that, when training computers to see the world, timing and organization matter more than complexity.

Don't wait until the end to mix your data (Late Fusion); mix it early and often.
Don't overwhelm the computer with too many similar details; group them logically.
Sometimes, admitting that two things are "basically the same" (Merging) leads to a smarter, more accurate overall map.

By using these strategies, we can create better maps of our cities, helping us understand how urbanization changes our climate and how to make our cities more livable.

1. Problem Statement

Local Climate Zones (LCZs) are critical for analyzing urban structures, land use, and the impact of urbanization on local climates. However, classifying the 17 distinct LCZ categories using remote sensing data presents significant challenges:

Data Complexity: LCZ classification relies on multimodal data, specifically Synthetic Aperture Radar (SAR) and Multispectral Imaging (MSI). These modalities have different imaging mechanisms (active vs. passive), leading to data complexity and heterogeneity.
Class Imbalance: Real-world datasets, such as the So2Sat LCZ42 benchmark, suffer from severe class imbalance. Some classes (e.g., "Heavy Industry," "Bush/Scrub") have fewer than 2% of samples, while others (e.g., "Large Low-Rise") have ~15%. This leads to poor performance on underrepresented classes.
Semantic Ambiguity: Many LCZ classes share similar spectral or structural properties (e.g., surface albedo), causing frequent misclassification between adjacent classes in standard 17-class models.
Gap in Analysis: While fusion models exist, there is a lack of comprehensive analysis regarding how specific fusion mechanisms (pixel, feature, decision) and grouping strategies (band grouping, label merging) affect class-wise accuracy, particularly for minority classes.

2. Methodology

The study proposes a comparative analysis of four deep learning fusion models and two data grouping strategies using the So2Sat LCZ42 dataset (Sentinel-1 SAR and Sentinel-2 MSI pairs).

A. Fusion Strategies (Deep Learning Architectures)

The authors implemented four Convolutional Neural Network (CNN)-based models:

FM1 (Baseline Hybrid Fusion): A multi-level fusion model combining pixel-level (concatenating raw inputs) and feature-level (element-wise multiplication of modality-specific features) fusion. The outputs are then fused again (hybrid) before classification.
FM2 (Attention-Based Hybrid Fusion): An enhancement of FM1 incorporating self-attention (to capture long-range dependencies within a modality) and cross-attention (to align features between SAR and MSI modalities) before feature-level fusion.
FM3 (Multi-scale Gaussian Smoothing): An enhancement of FM1 where input images are pre-processed with multi-scale Gaussian filters (kernel sizes 2, 4, 6, 8). This aims to capture structural variations at different scales and reduce noise.
FM4 (Weighted Decision-Level Fusion): A late-fusion approach where separate classifiers (U-Net for SAR, vanilla CNN for MSI) generate predictions, which are combined via a weighted average (with weights $\alpha$ and $1-\alpha$).
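The three fusion operations named in this list (concatenation, element-wise multiplication, weighted averaging) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the toy values, and the choices of alpha are assumptions made for the example, and a real model would apply these operations to tensors inside a CNN. How FM1 combines its pixel-level and feature-level branches in the final hybrid step is not specified here, so it is left out.

```python
def pixel_level_fusion(sar_channels, msi_channels):
    """Early fusion: stack the raw channels of both modalities."""
    return list(sar_channels) + list(msi_channels)


def feature_level_fusion(sar_features, msi_features):
    """Intermediate fusion: element-wise product of modality-specific features."""
    if len(sar_features) != len(msi_features):
        raise ValueError("feature vectors must have the same length")
    return [s * m for s, m in zip(sar_features, msi_features)]


def decision_level_fusion(p_sar, p_msi, alpha):
    """Late fusion: alpha * p_sar + (1 - alpha) * p_msi for every class."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(p_sar) != len(p_msi):
        raise ValueError("both classifiers must score the same classes")
    return [alpha * s + (1.0 - alpha) * m for s, m in zip(p_sar, p_msi)]


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)


# Toy inputs (hypothetical values, for illustration only).
stacked = pixel_level_fusion(["VH", "VV"], ["B2", "B3", "B4"])
fused_features = feature_level_fusion([1.0, 2.0, 0.5], [2.0, 0.5, 4.0])

# Class probabilities from two separate classifiers over three classes.
p_sar = [0.7, 0.2, 0.1]
p_msi = [0.2, 0.5, 0.3]

p_fused = decision_level_fusion(p_sar, p_msi, alpha=0.5)
predicted_balanced = argmax(p_fused)

# Trusting the MSI classifier more (smaller alpha) changes the decision.
predicted_msi_heavy = argmax(decision_level_fusion(p_sar, p_msi, alpha=0.25))
```

With equal weights the SAR classifier's confident vote wins; lowering alpha lets the MSI classifier's preferred class take over, which is the only interaction between modalities that a late-fusion model of this kind allows.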

B. Grouping Strategies

To address data redundancy and semantic similarity, two strategies were applied:

Spectral Band Grouping (SBG): Reduces dimensionality and redundancy by clustering spectrally similar bands.
- SAR: Grouped into 3 groups (VH, VV, CMOE).
- MSI: Grouped into 4 groups (RGB, VRE, SWIR, NIR).
Label Merging (LM): Addresses class imbalance and semantic ambiguity by merging the 17 original LCZ classes into 8 broad categories based on surface albedo and semantics (e.g., merging "Compact High/Mid/Low-rise" into "Compact Built Types").
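Both grouping strategies amount to simple lookups, as the sketch below shows. It rests on stated assumptions: the group names come from the text, but the channel indices assigned to each group are hypothetical (chosen only so that 8 SAR and 10 MSI channels fall into 3 + 4 = 7 groups), and the merge map covers only the merges named in this summary rather than the full 8-category scheme. The toy labels and predictions are made up for illustration.

```python
# Hypothetical channel layout: group names are from the text,
# the index assignment is an assumption for illustration.
SAR_GROUPS = {
    "VH": [0, 1, 4],
    "VV": [2, 3, 5],
    "CMOE": [6, 7],
}
MSI_GROUPS = {
    "RGB": [0, 1, 2],
    "VRE": [3, 4, 5],
    "NIR": [6, 7],
    "SWIR": [8, 9],
}


def group_bands(channels, groups):
    """Split one sample's channel list into named groups of channels."""
    return {name: [channels[i] for i in idx] for name, idx in groups.items()}


# Partial merge map: only the merges mentioned in this summary.
LABEL_MERGE = {
    "Compact High-rise": "Compact Built Types",
    "Compact Mid-rise": "Compact Built Types",
    "Compact Low-rise": "Compact Built Types",
    "Bare Rock": "Bare Surfaces",
    "Bare Soil": "Bare Surfaces",
}


def merge_label(label):
    """Map a fine label to its merged category; unlisted labels pass through."""
    return LABEL_MERGE.get(label, label)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy sample with 8 SAR channels and 10 MSI channels (placeholder values).
sar_grouped = group_bands(list(range(8)), SAR_GROUPS)
msi_grouped = group_bands(list(range(10)), MSI_GROUPS)

# Toy predictions that confuse classes inside the same merged category.
y_true = ["Compact High-rise", "Compact Mid-rise", "Bare Rock", "Bare Soil", "Water"]
y_pred = ["Compact Mid-rise", "Compact Mid-rise", "Bare Soil", "Bare Soil", "Water"]

fine_acc = accuracy(y_true, y_pred)
merged_acc = accuracy([merge_label(t) for t in y_true],
                      [merge_label(p) for p in y_pred])
```

In this toy case every error is a confusion between two classes that share a merged category, so accuracy rises once the labels are merged. The trade-off is that the merged map can no longer distinguish those classes.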

C. Experimental Setup

Dataset: So2Sat LCZ42 (32x32 pixel patches, 400k+ pairs).
Metrics: Overall Accuracy (OA), Precision, Recall, F1-Score, Kappa Coefficient ($\kappa$), and Matthews Correlation Coefficient (MCC) for robustness against imbalance.
Ablation Study: Systematic testing of fusion levels (early, intermediate, hybrid, late) and the impact of SBG and LM on all models.
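For readers unfamiliar with the chance-corrected metrics listed above, the sketch below computes OA, Cohen's kappa, and the multiclass MCC from a confusion matrix using their standard definitions. The function name and the toy matrix are assumptions for illustration; the paper's own evaluation code may differ.

```python
import math


def confusion_metrics(cm):
    """Return (OA, Cohen's kappa, multiclass MCC) for a square confusion matrix.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(k))
    true_counts = [sum(cm[i]) for i in range(k)]                       # row sums
    pred_counts = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # column sums

    # Overall accuracy: share of samples on the diagonal.
    oa = correct / total

    # Kappa: agreement beyond what the class marginals would give by chance.
    expected = sum(t * p for t, p in zip(true_counts, pred_counts)) / total ** 2
    kappa = (oa - expected) / (1.0 - expected) if expected < 1.0 else 0.0

    # Multiclass MCC, written in terms of the same marginals.
    numerator = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    spread_true = total ** 2 - sum(t * t for t in true_counts)
    spread_pred = total ** 2 - sum(p * p for p in pred_counts)
    denominator = math.sqrt(spread_true * spread_pred)
    mcc = numerator / denominator if denominator else 0.0

    return oa, kappa, mcc


# Toy imbalanced 3-class matrix (hypothetical counts).
toy_cm = [
    [80, 5, 5],
    [10, 20, 0],
    [4, 0, 6],
]
oa, kappa, mcc = confusion_metrics(toy_cm)
```

Because kappa and MCC subtract the agreement expected from the class marginals, a model that mostly predicts the majority class scores lower on them than on OA, which is why they are useful on imbalanced data such as So2Sat LCZ42.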

3. Key Contributions

Comprehensive Fusion Analysis: A systematic comparison of pixel, feature, and decision-level fusion strategies specifically for LCZ classification, demonstrating that hybrid fusion (FM1) outperforms simple late fusion.
Novelty in Pre-processing: Introduction of multi-scale Gaussian smoothing (FM3) and spectral band grouping (SBG) as effective pre-processing steps for multimodal remote sensing fusion.
Label Merging Strategy: Proposal of a semantic-based label merging scheme to mitigate misclassification between spectrally similar classes, significantly boosting performance metrics for minority classes.
State-of-the-Art (SOTA) Comparison: The proposed models outperform existing SOTA models (e.g., MsF-LCZ-Net, MSCA-Net) in overall accuracy and, crucially, in the classification of underrepresented classes.

4. Results

Best Performing Model: The combination of FM1 (Hybrid Fusion) + SBG + LM (FM1BL) achieved the highest performance with an Overall Accuracy (OA) of 76.6% and a Kappa coefficient of 0.723.
Fusion Effectiveness:
- Hybrid Fusion (FM1) consistently outperformed late fusion (FM4).
- Attention Mechanisms (FM2) increased computational cost significantly (training time ~27 hours vs. ~3.5 hours for FM1) without yielding proportional accuracy gains, suggesting they were less efficient for this specific task.
- Multi-scale Smoothing (FM3) improved results over raw data but was slightly less effective than the baseline FM1, likely due to static scale settings.
Impact of Grouping:
- Label Merging (LM) was the most impactful strategy, improving OA from ~66% (17-class) to ~76% (8-class) for the best models. Confusions between classes that fall into the same merged category no longer count as errors.
- Band Grouping (SBG) consistently improved performance across all fusion models by reducing spectral redundancy.
Class-Wise Performance:
- The FM1BL model showed superior performance on underrepresented classes (e.g., LCZ 1, 2, 7, E, F) compared to SOTA models.
- While MSCA-Net (MSI-only) had high weighted averages, the proposed multimodal fusion models achieved better macro-averages and MCC scores, indicating better handling of class imbalance.
- Confusion matrices showed that merging classes (e.g., "Bush" and "Low Plants") resolved common confusion points.
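The distinction between weighted and macro averages drawn above can be made concrete with a small sketch. The confusion matrix below is a made-up two-class example, not data from the paper, and the function names are assumptions; it only illustrates why a weighted average can look strong while a minority class is poorly detected.

```python
def per_class_f1(cm):
    """F1 per class from a confusion matrix (rows: true, columns: predicted)."""
    k = len(cm)
    scores = []
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(k)) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_average(scores):
    """Unweighted mean: every class counts equally."""
    return sum(scores) / len(scores)


def weighted_average(scores, supports):
    """Support-weighted mean: large classes dominate the result."""
    return sum(s * n for s, n in zip(scores, supports)) / sum(supports)


# Hypothetical imbalanced case: 90 majority samples, 10 minority samples,
# and only 2 of the 10 minority samples are recognized.
toy_cm = [
    [90, 0],
    [8, 2],
]
supports = [sum(row) for row in toy_cm]
f1_scores = per_class_f1(toy_cm)
macro_f1 = macro_average(f1_scores)
weighted_f1 = weighted_average(f1_scores, supports)
```

Here the weighted F1 stays high because the majority class dominates it, while the macro F1 drops sharply because the poorly detected minority class carries equal weight. That is the sense in which better macro-averages indicate better handling of underrepresented classes.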

5. Significance

Urban Planning & Climate Studies: The study provides a robust framework for generating accurate LCZ maps, which are essential for studying Urban Heat Islands (UHI) and planning sustainable urban environments.
Handling Imbalance: By demonstrating that Label Merging and Hybrid Fusion significantly improve the detection of minority classes, the paper offers a practical solution for real-world remote sensing tasks where data is rarely balanced.
Efficiency vs. Complexity: The findings suggest that for multimodal SAR-MSI fusion, sophisticated attention mechanisms may not always be necessary; well-structured hybrid fusion combined with intelligent data grouping yields better accuracy-to-computation ratios.
Reproducibility: The authors have made their code and processed datasets publicly available, facilitating further research in multimodal remote sensing fusion.

In conclusion, the paper establishes that a hybrid fusion approach (combining pixel and feature levels) augmented by spectral band grouping and semantic label merging is the most effective strategy for LCZ classification, outperforming complex attention-based models and late-fusion baselines, particularly for underrepresented urban and natural classes.