Multi-Level Bidirectional Decoder Interaction for Uncertainty-Aware Breast Ultrasound Analysis

This paper proposes an uncertainty-aware multi-task framework for breast ultrasound analysis that leverages multi-level bidirectional decoder interactions and adaptive feature weighting to overcome task interference and improve simultaneous lesion segmentation and tissue classification performance.

Abdullah Al Shafi, Md Kawsar Mahmud Khan Zunayed, Safin Ahmmed, Sk Imran Hossain, Engelbert Mephu Nguifo

Published 2026-03-03

Imagine you are trying to solve a complex puzzle: identifying a tumor in a breast ultrasound image.

Traditionally, doctors (and computer programs) try to do two things at once:

  1. Draw the outline of the tumor (Segmentation).
  2. Decide if it's dangerous (Classification: Benign vs. Malignant).

For a long time, computer programs tried to do these two jobs like two separate workers sharing a single notebook at the very beginning of the process. They would look at the raw image, take notes, and then go to their own separate desks to finish their specific tasks. The problem? Once they left the "notebook" phase, they stopped talking to each other. If one worker got confused, the other didn't know to help, and they couldn't use each other's insights to fix mistakes.
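The "shared notebook" baseline can be sketched in a few lines: one encoder feeds two heads that never talk to each other again. This is a toy illustration of the idea, not the paper's code; the function names and the stand-in "features" are all hypothetical.

```python
# Toy sketch of the classic shared-encoder baseline: both tasks read the
# same "notebook" (encoder features), then finish alone with no cross-talk.

def shared_encoder(image):
    """Stand-in encoder: summarize the image as two shared features."""
    mean = sum(image) / len(image)          # overall brightness
    spread = max(image) - min(image)        # crude texture proxy
    return [mean, spread]

def segmentation_head(features):
    """Outline the lesion: here, just threshold the shared mean."""
    mean, _ = features
    return [1 if mean > 0.5 else 0]         # crude one-pixel "mask"

def classification_head(features):
    """Diagnose: here, high spread stands in for 'suspicious' texture."""
    _, spread = features
    return "malignant" if spread > 0.6 else "benign"

image = [0.2, 0.9, 0.1, 0.6]
feats = shared_encoder(image)
mask = segmentation_head(feats)     # decided without seeing the diagnosis
label = classification_head(feats)  # decided without seeing the mask
```

Note that `mask` and `label` are computed independently after the encoder: if one head is wrong, the other has no way to notice or compensate.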

This paper introduces a new, smarter way to work together. Here is the breakdown of their solution using simple analogies:

1. The Problem: The "Silent Partners"

Think of the old method like two chefs in a kitchen who only talk to each other while chopping vegetables (the Encoder). Once they start cooking the final dish (the Decoder, where the image is rebuilt), they work in silence.

  • Chef A is trying to carve the perfect shape of the vegetable.
  • Chef B is trying to guess if the vegetable is fresh or rotten.
  • If Chef A sees a weird spot on the edge, Chef B doesn't know about it until it's too late. If Chef B smells something bad, Chef A doesn't know to carve around it.

2. The Solution: The "Multi-Level Conversation"

The authors propose a system where the two chefs talk to each other at every single step of the cooking process, not just at the start.

They built a system with Task Interaction Modules (TIM). Imagine these as walkie-talkies that the chefs use at every stage of plating the dish:

  • From Shape to Smell: Chef A (Segmentation) says, "Hey, look at this jagged edge here; it looks suspicious." Chef B (Classification) uses that info to say, "Okay, that jagged edge makes me think this is malignant."
  • From Smell to Shape: Chef B says, "This texture feels like a benign cyst." Chef A uses that info to say, "Okay, I'll smooth out my carving lines because it's likely harmless."

Why is this better? Because the chefs are talking while they build the final picture, they can correct each other in real time. If the image is blurry (common in ultrasound), they can combine their strengths to figure out what's really there.
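The "walkie-talkie" exchange can be sketched as follows. This is an illustration of bidirectional interaction, not the paper's TIM implementation: real TIMs operate on feature maps inside a neural decoder, and `alpha` here is a hypothetical fixed mixing weight.

```python
# Toy sketch of bidirectional task interaction at every decoder level:
# segmentation and classification features update each other at each stage,
# instead of only sharing the encoder output once at the start.

def interact(seg_feats, cls_feats, alpha=0.3):
    """Each task blends in a fraction of the other task's features."""
    new_seg = [s + alpha * c for s, c in zip(seg_feats, cls_feats)]
    new_cls = [c + alpha * s for s, c in zip(seg_feats, cls_feats)]
    return new_seg, new_cls

# Start with each task "knowing" something different...
seg, cls = [1.0, 0.0], [0.0, 1.0]
# ...then let them talk at every decoder level, not just once.
for level in range(3):
    seg, cls = interact(seg, cls)
# After three exchanges, each task carries information from the other.
```

Because the exchange is symmetric, whatever segmentation learns about classification, classification learns about segmentation in mirror image.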

3. The "Uncertainty Detective" (UPA)

Ultrasound images are often noisy, like a radio with static. Sometimes the picture is clear; other times, it's a mess.

  • The Old Way: The system would force the two chefs to trust each other equally, even when the image was terrible. This led to mistakes.
  • The New Way (Uncertainty Proxy Attention): The system has a "Detective" that checks how confident the chefs are.
    • If the image is clear and the chefs are sure, the Detective says, "Go ahead, trust each other fully!"
    • If the image is fuzzy and the chefs are confused, the Detective says, "Stop! Don't trust the other person's guess right now; stick to your own training."

This prevents the system from "hallucinating" or making up details when the data is bad. It's like a manager who knows when to let the team collaborate and when to let them work alone to avoid spreading errors.
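The "Detective" idea boils down to gating the exchange by confidence. The sketch below is a simplified stand-in for Uncertainty Proxy Attention: the confidence proxy and the linear gate are illustrative choices of mine, not the paper's formulas.

```python
# Toy sketch of confidence-gated fusion: trust the other task's features
# only in proportion to how certain that task is.

def confidence(prob):
    """Confidence proxy from a predicted probability:
    1.0 when certain (prob near 0 or 1), 0.0 at a 50/50 guess."""
    return abs(prob - 0.5) * 2

def gated_fusion(own_feats, peer_feats, peer_prob):
    """Blend in the peer's features, scaled by the peer's confidence."""
    gate = confidence(peer_prob)
    return [(1 - gate) * o + gate * p
            for o, p in zip(own_feats, peer_feats)]

own = [1.0, 1.0]
peer = [0.0, 2.0]
fused_clear = gated_fusion(own, peer, peer_prob=0.95)  # sure peer: trust it
fused_noisy = gated_fusion(own, peer, peer_prob=0.5)   # confused peer: ignore it
```

With a 50/50 peer the gate closes completely and each task falls back on its own features, which is exactly the "stick to your own training" behavior described above.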

4. The "Zoom Lens" (Multi-Scale Context)

Tumors come in all sizes, from tiny peas to large grapefruits.

  • The system uses a Multi-Scale Fusion mechanism. Imagine a photographer with a camera that can instantly switch between a wide-angle lens (to see the whole context) and a macro lens (to see tiny details).
  • This ensures the system doesn't miss a tiny tumor because it was looking too broadly, and doesn't get confused by a large tumor because it was looking too closely.
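The "zoom lens" idea can be illustrated with average pooling at several window sizes. This is a generic stand-in for the paper's Multi-Scale Fusion mechanism: the 1-D signal and the particular window sizes are simplifications for demonstration.

```python
# Toy sketch of multi-scale context: view the same signal through several
# "lenses" and keep all views, so neither tiny detail nor broad context is lost.

def pooled(signal, window):
    """Average-pool with a given window: small = fine detail, large = context."""
    return [sum(signal[i:i + window]) / window
            for i in range(0, len(signal) - window + 1, window)]

def multi_scale_views(signal):
    """Return the same signal at three zoom levels."""
    fine = pooled(signal, 1)     # macro lens: every pixel
    mid = pooled(signal, 2)      # mid zoom
    coarse = pooled(signal, 4)   # wide-angle: overall context
    return fine, mid, coarse

signal = [0.0, 1.0, 0.0, 1.0]
fine, mid, coarse = multi_scale_views(signal)
```

The fine view preserves the pixel-level alternation, while the coarse view collapses it to the overall average: a real fusion module would combine all three so a tiny lesion shows up in the fine view even when the coarse view misses it.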

The Results: Why Should We Care?

The authors tested this new "Teamwork" system on real medical data (thousands of ultrasound images).

  • The Score: Its predicted tumor outlines overlapped the true outlines with a score of 74.5% (a standard segmentation overlap measure, not a per-image success rate), and it diagnosed the tumor type correctly 90.6% of the time.
  • The Comparison: It beat the previous "Silent Partner" systems and even the fancy "Transformer" systems (which are usually very smart) by a significant margin.

The Big Takeaway

This paper proves that in medical AI, communication is key. Instead of building two separate experts who only share a few notes at the start, we should build a team that constantly shares insights, checks each other's confidence, and adapts to the difficulty of the specific image they are looking at.

By letting the "shape finder" and the "diagnostician" talk to each other at every level of the process, the computer becomes a much more reliable assistant for doctors, potentially leading to earlier detection and better patient outcomes.