SAGE: Shape-Adapting Gated Experts for Adaptive Histopathology Image Segmentation

The Big Problem: One Size Does Not Fit All

Imagine you are a doctor looking at a giant, high-resolution map of a city (a Whole Slide Image of a tissue sample). Your job is to find specific buildings (cells) and draw a line around them to see if they are healthy or sick.

The problem is that this "city" is incredibly chaotic.

Some areas are simple, flat suburbs with identical houses (normal tissue).
Other areas are dense, chaotic downtowns with skyscrapers, alleyways, and weird shapes (cancerous tissue).

Current computer programs (AI models) act like a rigid assembly line. They force every single patch of the map to go through the exact same set of machines, regardless of whether it's a simple suburb or a complex downtown.

The Waste: They waste time over-analyzing the simple suburbs.
The Failure: They often get overwhelmed and confused by the complex downtowns because the assembly line isn't flexible enough.

The Solution: SAGE (Shape-Adapting Gated Experts)

The authors propose SAGE, which is like replacing that rigid assembly line with a smart, dynamic traffic control system.

Instead of forcing every car (data) down the same road, SAGE asks: "What kind of car is this, and which road does it need?"

1. The "Dual-Path" Highway

SAGE builds a road with two lanes running side-by-side:

The Main Lane (The Backbone): This is the standard, reliable road that everyone uses. It keeps the basic structure of the map intact.
The Expert Lanes (The Detours): These are special, high-speed lanes designed for specific types of traffic. Some lanes are great for straight roads (Convolutional Neural Networks/CNNs), and others are great for navigating complex intersections (Transformers).

2. The "Smart Gatekeeper" (The Router)

At every intersection, there is a Gatekeeper (the Router). It looks at the incoming data and makes a split-second decision:

"This patch looks simple. Let's keep it on the Main Lane."
"This patch looks weird and complex! Let's send it to the Expert Lane for a closer look."

The Gatekeeper doesn't just pick one lane; it can pick a few "Experts" to work together on the hardest problems. This approach is called a Mixture of Experts (MoE).

3. The "Universal Adapter" (SA-Hub)

Here is the tricky part: The "CNN Experts" speak a different language than the "Transformer Experts." One thinks in grids (like a spreadsheet), and the other thinks in sequences (like a sentence).

If you try to mix them, they can't understand each other. SAGE introduces a Shape-Adapting Hub (SA-Hub). Think of this as a universal translator and adapter.

If an Expert needs a grid, the Hub converts the data into a grid.
When the Expert finishes its work, the Hub converts the result back so it fits perfectly with the Main Lane.
This allows the different types of AI brains to collaborate without getting confused.

How It Works in Real Life (The Analogy)

Imagine you are managing a team of Artists to paint a mural of a forest.

Old Way (Static Models): You hand every artist the same brush and tell them to paint every part of the forest the same way. They struggle to paint the tiny leaves on the trees because they are using a big brush, and they waste time painting the sky with a tiny brush.
SAGE Way:
- You have a Main Artist who paints the general background.
- You have a Specialist Team: One artist is a master of textures (great for leaves), and another is a master of composition (great for the whole forest view).
- A Manager (The Router) looks at the section of the mural being painted.
  - If it's the sky, the Manager says, "Main Artist, you handle this."
  - If it's a complex cluster of leaves, the Manager says, "Bring in the Texture Specialist!"
  - If it's a weird, tangled root system, the Manager says, "Bring in both the Texture Specialist and the Composition Specialist to work together!"
- The Adapter (SA-Hub) ensures that when the Texture Specialist finishes their part, the paint blends perfectly with what the Main Artist did, so there are no ugly seams.

The Results: Why It Matters

The paper tested this system on three histopathology image datasets (EBHI, GlaS, and DigestPath).

Better Accuracy: SAGE reported the highest Dice scores among the models compared on these tests. It drew the lines around the target tissue more accurately than the baselines it was measured against.
Handles the Weird Stuff: It was particularly good at handling "out-of-distribution" data, meaning tissue that looked different or stranger than what it was trained on. In those cases SAGE didn't panic; it just routed the data to the right expert.
Efficiency: It doesn't waste energy. It only uses the "heavy machinery" (the extra experts) when the problem actually needs it.

Summary

SAGE is a smart AI framework that stops treating all medical images the same. Instead of a rigid assembly line, it uses a dynamic traffic system that routes difficult parts of an image to specialized experts and uses a universal translator to make sure they all work together. The result is a system that adapts its effort to the input and is more accurate at finding cancer in complex tissue samples.

1. Problem Statement

The paper addresses the significant challenges in computer-assisted cancer detection using gigapixel Whole Slide Images (WSIs) in digital pathology. The core difficulties include:

Cellular Heterogeneity: Extreme variability in cell size, shape, and tissue morphology (ranging from homogeneous normal tissue to complex malignant patterns).
Limitations of Static Architectures: Current state-of-the-art models (CNN-Transformer hybrids like U-Net variants) rely on static computation graphs. They process all input regions through the same fixed sequence of layers.
- Consequence: This leads to over-processing of simple regions (wasting computation) and under-modeling of complex regions (failing to capture fine details).
- Limitation: Existing hybrid models often lack the ability to dynamically route computation based on input characteristics or to effectively integrate heterogeneous feature types (e.g., CNN local features vs. Transformer global context) in a flexible manner.

2. Methodology: SAGE Framework

The authors propose SAGE (Shape-Adapting Gated Experts), an input-adaptive framework that transforms static backbones into dynamically routed architectures. The core components are:

A. Dual-Path Architecture

Instead of a single deterministic path, SAGE introduces a dual-path design at each backbone layer:

Main Path: Preserves the original backbone transformation (e.g., ConvNeXt or ViT layers) to maintain stability and pretrained inductive biases.
Expert Path: Conditionally activates a sparse subset of "expert" blocks (reused backbone layers) to refine features based on the specific input.

Fusion: The outputs of both paths are fused using a learnable scalar gate ($\alpha_i$), allowing the model to balance stability (main path) and input-specific refinement (expert path).
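The fusion step can be illustrated with a minimal sketch. The summary only states that the gate is a learnable scalar, so the exact form used here is an assumption: the gate is taken as the sigmoid of a learnable parameter and the two paths are blended as a convex combination. The names fuse_dual_path and alpha_logit are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fuse_dual_path(main_out, expert_out, alpha_logit):
    """Blend main-path and expert-path features with one scalar gate.

    ASSUMPTION: alpha = sigmoid(alpha_logit) and the fusion is
    (1 - alpha) * main + alpha * expert. The source only says that
    the gate is a learnable scalar, not how it is applied.
    """
    if len(main_out) != len(expert_out):
        raise ValueError("both paths must produce features of the same size")
    alpha = sigmoid(alpha_logit)
    return [(1.0 - alpha) * m + alpha * e for m, e in zip(main_out, expert_out)]


# Illustrative feature vectors, not values from the paper.
main_features = [1.0, 2.0, 3.0]
expert_features = [3.0, 2.0, 1.0]

# alpha_logit = 0 gives alpha = 0.5, an even blend of both paths.
fused_balanced = fuse_dual_path(main_features, expert_features, alpha_logit=0.0)

# A very negative logit drives alpha toward 0, recovering the main path.
fused_main_only = fuse_dual_path(main_features, expert_features, alpha_logit=-50.0)
```

With a gate of this form, a layer whose gate stays near zero behaves like the original backbone layer, which is one plausible way to keep the pretrained behaviour stable early in training.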

B. Hierarchical Expert Routing

SAGE employs a two-level routing strategy to select experts dynamically:

Group-Level Gating: A lightweight network estimates a preference between Shared Experts (generalizable across domains) and Fine-Grained Experts (specialized for specific input characteristics).
Semantic Affinity Routing (SAR): Computes base logits for all experts using a query-key mechanism.
Prior-Guided Logit Modulation: The group-level gate modulates the SAR logits in log-space. If the gate favors shared experts, shared expert logits are boosted; otherwise, fine-grained logits are boosted.
Top-K Selection: The model selects the top-$K$ experts based on the modulated logits and applies independent sigmoid gating to their weights.
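The four routing steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the affinity is taken to be a plain dot product between a query and per-expert keys, and the log-space modulation is modelled by adding the log of the group preference to each expert's logit. The function name route and all input values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def route(query, expert_keys, is_shared, group_gate_logit, top_k):
    """Return {expert_index: gate_weight} for the top_k selected experts.

    query            : feature summary of the input (list of floats)
    expert_keys      : one key vector per expert
    is_shared        : True for shared experts, False for fine-grained ones
    group_gate_logit : output of a (hypothetical) group-level gating network
    """
    # Semantic affinity routing: base logit = dot product of query and key.
    base_logits = [sum(q * k for q, k in zip(query, key)) for key in expert_keys]

    # Group-level preference for shared experts, in (0, 1).
    p_shared = sigmoid(group_gate_logit)

    # ASSUMPTION: modulation in log-space adds the log-prior of the expert's
    # group, so the favoured group is penalised less than the other one.
    eps = 1e-9
    modulated = [
        logit + math.log((p_shared if shared else 1.0 - p_shared) + eps)
        for logit, shared in zip(base_logits, is_shared)
    ]

    # Top-K selection, then independent (non-normalised) sigmoid gates.
    ranked = sorted(range(len(modulated)), key=lambda i: modulated[i], reverse=True)
    return {i: sigmoid(modulated[i]) for i in ranked[:top_k]}


# Illustrative inputs: experts 0 and 3 are shared, 1 and 2 are fine-grained.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.9, 0.0], [0.0, 1.0], [0.2, 0.0]]
shared_flags = [True, False, False, True]

favour_shared = route(query, keys, shared_flags, group_gate_logit=2.0, top_k=2)
favour_fine = route(query, keys, shared_flags, group_gate_logit=-2.0, top_k=2)
```

In this toy setup the same query selects the two shared experts when the group gate leans toward them, and the two fine-grained experts when it leans the other way, even though the base affinities are unchanged.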

C. Shape-Adapting Hub (SA-Hub)

A critical innovation to handle heterogeneous experts (e.g., mixing CNN blocks and Transformer blocks):

Problem: CNNs output 4D tensors of shape ($B, C, H, W$), while Transformers output 3D token sequences of shape ($B, N, D$).
Solution: The SA-Hub contains an Input Adapter ($S_{in}$) and an Output Adapter ($S_{out}$).
- $S_{in}$ reshapes/flattens features to match the target expert's expected format.
- $S_{out}$ reconstructs the features back to the main path's spatial format.
This allows CNN and Transformer experts to coexist in the same routing layer without requiring a unified native tensor format.
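The shape conversion performed by the two adapters can be illustrated in plain Python. This is a simplified sketch: it handles a single sample (no batch dimension), uses nested lists in place of tensors, and assumes the token dimension equals the channel count so that no projection is needed. The function names are hypothetical.

```python
def grid_to_tokens(grid):
    """(C, H, W) nested lists -> (N, C) token list, N = H * W, row-major."""
    channels = len(grid)
    height, width = len(grid[0]), len(grid[0][0])
    return [
        [grid[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


def tokens_to_grid(tokens, height, width):
    """(N, C) token list -> (C, H, W) nested lists; inverse of grid_to_tokens."""
    if len(tokens) != height * width:
        raise ValueError("token count must equal height * width")
    channels = len(tokens[0])
    return [
        [[tokens[y * width + x][c] for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


def run_token_expert_on_grid(grid, token_expert):
    """Let a token-based expert refine features that live on a grid path."""
    height, width = len(grid[0]), len(grid[0][0])
    tokens = grid_to_tokens(grid)                  # plays the role of S_in
    refined = token_expert(tokens)                 # expert sees a sequence
    return tokens_to_grid(refined, height, width)  # plays the role of S_out


# Illustrative feature map with C=2, H=2, W=3.
feature_map = [
    [[0, 1, 2], [3, 4, 5]],
    [[10, 11, 12], [13, 14, 15]],
]
tokens = grid_to_tokens(feature_map)
restored = run_token_expert_on_grid(feature_map, lambda t: t)  # identity expert
```

Each token collects the channel values at one spatial position, so the round trip through an identity expert returns the original grid unchanged.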

3. Key Contributions

Dynamic Routing for Static Backbones: A novel formulation that converts static CNN/Transformer backbones into dynamically routed architectures via parameter reuse (upcycling), enabling input-adaptive computation.
Hierarchical Gating Mechanism: A routing system that balances shared (domain-general) and fine-grained (input-specific) specialization using group-level gating and logit modulation.
Shape-Adapting Hub (SA-Hub): A lightweight module enabling seamless interaction between heterogeneous expert types (CNN and Transformer) by aligning feature shapes dynamically.
Scalable Foundation: Demonstrates that dynamic expert routing can be effectively deployed on gigapixel WSIs via sliding-window reconstruction, maintaining compatibility with high-resolution pathology workflows.
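Sliding-window reconstruction, mentioned in the last contribution, can be sketched as follows. The summary does not say how overlapping windows are merged, so averaging the overlapping predictions is an assumption here, as are the function names and the predict_patch callback, which stands in for the segmentation model.

```python
def window_starts(length, window, stride):
    """Start offsets covering [0, length); the last window sits flush to the edge."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)
    return starts


def sliding_window_predict(image, window, stride, predict_patch):
    """Run predict_patch on overlapping tiles and average the overlaps.

    image         : 2D list (H x W) of pixel values
    predict_patch : callable mapping a 2D patch to a same-sized 2D prediction
    """
    if stride > window:
        raise ValueError("stride must not exceed window, or pixels are skipped")
    height, width = len(image), len(image[0])
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for top in window_starts(height, window, stride):
        for left in window_starts(width, window, stride):
            patch = [row[left:left + window] for row in image[top:top + window]]
            pred = predict_patch(patch)
            for dy, pred_row in enumerate(pred):
                for dx, value in enumerate(pred_row):
                    total[top + dy][left + dx] += value
                    count[top + dy][left + dx] += 1
    # ASSUMPTION: overlapping predictions are merged by a plain average.
    return [[total[y][x] / count[y][x] for x in range(width)] for y in range(height)]


# Illustrative 5 x 7 "slide" and an identity model, so the output equals the input.
slide = [[y * 7 + x for x in range(7)] for y in range(5)]
stitched = sliding_window_predict(slide, window=3, stride=2, predict_patch=lambda p: p)
```

Because only one window is held in memory at a time, the same loop structure applies to images far larger than the model's input size.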

4. Experimental Results

The model, specifically SAGE-ConvNeXt+ViT-UNet, was evaluated on three major benchmarks: EBHI, GlaS, and DigestPath.
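The results below are reported as Dice scores (DSC), which measure the overlap between a predicted mask and the ground-truth mask: twice the size of their intersection divided by the sum of their sizes. A minimal sketch for binary masks follows; the function name, the example masks, and the handling of two empty masks are illustrative choices, not details from the paper.

```python
def dice_score(pred_mask, true_mask):
    """Dice = 2 * |P & T| / (|P| + |T|) for two binary 2D masks of equal shape."""
    pred_flat = [bool(v) for row in pred_mask for v in row]
    true_flat = [bool(v) for row in true_mask for v in row]
    if len(pred_flat) != len(true_flat):
        raise ValueError("masks must contain the same number of pixels")
    intersection = sum(p and t for p, t in zip(pred_flat, true_flat))
    denominator = sum(pred_flat) + sum(true_flat)
    if denominator == 0:
        # Convention chosen for this sketch: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / denominator


# Illustrative masks: 3 predicted pixels, 3 true pixels, 2 in common.
predicted = [[1, 1, 0],
             [0, 1, 0]]
annotated = [[1, 0, 0],
             [0, 1, 1]]
score = dice_score(predicted, annotated)  # 2 * 2 / (3 + 3)
```

A score of 1.0 means the predicted and annotated regions coincide exactly, and 0.0 means they do not overlap at all.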

Performance Metrics:
- EBHI: Achieved a 95.23% Dice Score (DSC), outperforming the best baseline (EViT-UNet) by 0.37%.
- GlaS (Test A/B): Achieved 92.78% and 91.42% DSC respectively, surpassing previous SOTA models.
- DigestPath: Achieved 91.26% DSC at the WSI level and 92.66% at the Patch level, showing superior generalization on gigapixel images.
Ablation Studies:
- Logit Modulation: Consistently improved performance across all metrics (e.g., +0.17% DSC gain on EBHI).
- Gating Mechanism: Sigmoid gating outperformed Softmax gating, likely due to its non-competitive nature which allows multiple distinct feature extractors to activate simultaneously for complex tissues.
- Expert Capacity: Increasing the Top-K value (number of active experts) yielded the most significant performance gains.
Qualitative Analysis: Visualizations (Grad-CAM) show SAGE effectively routes computation to different experts to refine local boundaries while preserving global context, reducing over-segmentation in complex tissue areas compared to static models.
Efficiency: While the parameter count increases slightly (~5.5%), the computational cost (FLOPs) scales linearly with the number of active experts ($K$). The model remains efficient for $K=1$ or $K=2$, with higher $K$ reserved for maximum accuracy.
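The gating ablation above attributes the advantage of sigmoid gating to its non-competitive nature. The small sketch below illustrates what that means numerically; the logit values are made up for illustration and do not come from the paper.

```python
import math


def softmax_gates(logits):
    """Competitive gating: weights are normalised and always sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_gates(logits):
    """Non-competitive gating: each weight depends only on its own logit."""
    return [1.0 / (1.0 + math.exp(-v)) for v in logits]


# Two selected experts that are both strongly relevant (illustrative logits).
selected_logits = [3.0, 3.0]

competitive = softmax_gates(selected_logits)    # each expert is capped at 0.5
independent = sigmoid_gates(selected_logits)    # each expert stays above 0.95
```

Under softmax, two equally relevant experts must split a fixed budget, so neither can contribute at full strength. Under sigmoid gating, both can be strongly active at once, which matches the explanation given for complex tissue regions.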

5. Significance and Impact

Overcoming Static Limitations: SAGE provides evidence that static computation graphs are suboptimal for highly heterogeneous medical data. By introducing dynamic routing, the model adapts its complexity to the input, trading off accuracy against resource usage.
Heterogeneous Integration: The SA-Hub solves a major architectural bottleneck, allowing the seamless combination of CNNs (local texture) and Transformers (global context) within a single Mixture-of-Experts (MoE) framework.
Clinical Relevance: The model's robustness in handling distribution shifts (e.g., different staining, tissue types) and its ability to accurately segment complex glandular structures on gigapixel WSIs make it a strong candidate for real-world digital pathology applications, potentially improving cancer diagnosis speed and accuracy.
Future Direction: The work establishes a scalable foundation for dynamic expert routing in visual networks, suggesting future applications in other dense prediction tasks and broader clinical settings.