Imagine you are a detective trying to solve a mystery on a massive, ancient tapestry. This tapestry is a Whole-Slide Image (WSI) of a tissue sample, used by doctors to find cancer. The tapestry is so huge that it's impossible to look at the whole thing at once.
The Old Way: The "Sticker" Problem
Traditionally, detectives (AI models) would cut this giant tapestry into thousands of tiny, square stickers (patches). They would study each sticker individually, trying to guess if it contained a clue (a lesion).
Why this failed:
- Lost Context: If you cut a picture of a face into squares, a square containing just the tip of a nose doesn't tell you it's a face. The AI loses the "big picture" because the stickers are treated as separate islands.
- The Zoom Confusion: Pathologists often look at the same spot on the tapestry with a magnifying glass (high resolution) and then zoom out (low resolution). The old AI thought these were two different pictures. If it learned to spot a cancer cell when zoomed in, it would get completely confused when zoomed out, often breaking the cancer spot into tiny, scattered dots.
The New Solution: WSI-INR (The "Infinite Paintbrush")
The authors propose a new method called WSI-INR. Instead of cutting the tapestry into stickers, they treat the entire slide as a single, continuous, infinite painting.
Think of it like a magical paintbrush that doesn't need a canvas. You simply tell the brush, "Paint the spot at coordinates (X, Y)." The brush instantly knows what color (healthy tissue or cancer) goes there, no matter how close or far you are from the canvas.
Here is how it works, using simple analogies:
1. The Continuous Map (No More Stickers)
Instead of a grid of stickers, WSI-INR uses a continuous map. It learns a secret formula that connects any point on the slide to its meaning.
- Analogy: Imagine a GPS. Old methods tried to memorize every single street corner as a separate photo. WSI-INR instead learns the rules of the city: you can ask it about any location, even one that was never photographed, and it can still tell you what belongs there.
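To make the "continuous map" idea concrete, here is a minimal Python sketch of an implicit neural representation: a tiny network that maps any continuous (x, y) coordinate to a value. The weights here are random for illustration (in the real method they are trained to reproduce the slide), and all names are hypothetical, not the paper's actual code.

```python
import numpy as np

# A toy implicit neural representation (INR): a small MLP that maps a
# continuous (x, y) coordinate to a value. Weights are random here;
# training would fit them so the output matches the tissue at (x, y).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 1)), np.zeros(1)

def inr(coords):
    """coords: (N, 2) array of continuous (x, y) points in [0, 1]^2."""
    h = np.tanh(coords @ W1 + b1)   # hidden features at each point
    return h @ W2 + b2              # predicted value at each point

# Any coordinate can be queried -- including ones "between" pixels.
out = inr(np.array([[0.5, 0.5], [0.123456, 0.987654]]))
print(out.shape)  # (2, 1)
```

The key property is that there is no grid of stickers anywhere: the function itself is the image, so resolution is just a choice of how densely you sample it.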
2. The Multi-Resolution Hash Grid (The "Smart Zoom")
This is the paper's secret sauce. In the real world, looking at a city from a helicopter (low res) and from a street corner (high res) shows different details, but it's the same city.
- The Problem: Old AI thought the helicopter view and the street view were two different cities.
- The Fix: WSI-INR uses a Multi-Resolution Hash Grid. Think of this as a set of smart, layered lenses.
- Low Layers: These are like a wide-angle lens. They see the big shapes (the neighborhood).
- High Layers: These are like a microscope. They see the tiny details (the individual bricks).
- The Magic: The system understands that these are just different "sampling densities" of the same continuous city. It realizes that a "blob" seen from the helicopter is just a cluster of "dots" seen from the street. This allows the AI to stay consistent whether you zoom in or out.
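The layered-lenses idea can be sketched in a few lines. This is a toy 2D multi-resolution hash encoding in the general spirit of such grids (coarse-to-fine levels, a hashed feature table per level, bilinear interpolation); the constants and helper names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Toy multi-resolution hash grid: each level snaps the coordinate to a
# grid of a different resolution, hashes the surrounding cell corners
# into a feature table, and bilinearly interpolates their features.
rng = np.random.default_rng(0)
LEVELS = [16, 64, 256]            # coarse -> fine grid resolutions
TABLE_SIZE, FEAT_DIM = 2**14, 2
tables = [rng.normal(scale=1e-2, size=(TABLE_SIZE, FEAT_DIM))
          for _ in LEVELS]
PRIME = 2654435761                # a common spatial-hashing prime

def hash_cell(ix, iy):
    """Hash integer 2D cell indices into the feature table."""
    return (ix ^ (iy * PRIME)) % TABLE_SIZE

def encode(x, y):
    """Concatenate interpolated features from every resolution level."""
    feats = []
    for res, table in zip(LEVELS, tables):
        fx, fy = x * res, y * res
        x0, y0 = int(fx), int(fy)
        tx, ty = fx - x0, fy - y0            # interpolation weights
        f = ((1 - tx) * (1 - ty) * table[hash_cell(x0,     y0)]
             + tx * (1 - ty)     * table[hash_cell(x0 + 1, y0)]
             + (1 - tx) * ty     * table[hash_cell(x0,     y0 + 1)]
             + tx * ty           * table[hash_cell(x0 + 1, y0 + 1)])
        feats.append(f)
    return np.concatenate(feats)  # in a full INR, this feeds a small MLP

vec = encode(0.37, 0.82)
print(vec.shape)  # (6,) = 3 levels x 2 features each
```

Because the same coordinate always passes through all levels at once, the "helicopter view" (coarse levels) and the "street view" (fine levels) are tied to one shared representation rather than learned as separate pictures.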
3. The Two-Step Training (Learn to See, Then Learn to Find)
Training this AI is like training a new artist.
- Step 1 (The Sketch): First, the AI is told to just reconstruct the image. "Look at the coordinates and tell me what color the tissue is." It learns the texture and structure of the slide without worrying about finding cancer yet. It's like learning to draw a realistic landscape before trying to find a hidden treasure in it.
- Step 2 (The Hunt): Once the AI has built a solid mental map of the slide, the researchers "freeze" that knowledge and teach it to spot the cancer. Because it already understands the landscape, it can find the lesions much more accurately.
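The two-step recipe can be sketched end-to-end in a few lines. In this toy version, random Fourier features stand in for the hash-grid encoder, least squares stands in for the reconstruction stage, and a small logistic-regression loop stands in for the segmentation head; the fake "tissue texture" and "lesion mask" are illustrative assumptions only.

```python
import numpy as np

# Toy two-stage training: (1) fit a reconstruction head on shared
# features, (2) freeze the features and fit a segmentation head.
rng = np.random.default_rng(0)
coords = rng.uniform(size=(500, 2))                 # sampled (x, y) points
B = rng.normal(scale=3.0, size=(2, 16))
phases = rng.uniform(0, 2 * np.pi, size=16)
encode = lambda c: np.cos(c @ B + phases)           # stand-in "encoder"
H = encode(coords)                                  # shared, later frozen

# --- Step 1 (The Sketch): reconstruct the slide's appearance ---------
intensity = np.sin(4 * coords[:, 0]) * np.cos(3 * coords[:, 1])  # fake texture
w_recon, *_ = np.linalg.lstsq(H, intensity, rcond=None)
recon_err = np.mean((H @ w_recon - intensity) ** 2)

# --- Step 2 (The Hunt): features frozen, learn the lesion mask -------
labels = (coords[:, 0] + coords[:, 1] > 1.0).astype(float)  # fake lesion mask
w_seg = np.zeros(16)
for _ in range(200):                                # logistic-regression steps
    p = 1 / (1 + np.exp(-(H @ w_seg)))
    w_seg -= 0.3 * H.T @ (p - labels) / len(labels)

acc = np.mean(((1 / (1 + np.exp(-(H @ w_seg)))) > 0.5) == (labels > 0.5))
print(round(recon_err, 4), round(acc, 3))
```

The structural point is that step 2 only ever touches `w_seg`: the representation learned during reconstruction is reused as-is, which is what "freezing" means.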
4. Inference-Time Optimization (The "Personalized Tune-Up")
When the AI meets a new patient's slide it has never seen before, it doesn't just guess. It does a quick "warm-up" (called Inference-Time Optimization).
- Analogy: Imagine a musician who knows how to play a song perfectly. When they walk onto a new stage with different acoustics, they don't relearn the song; they just tweak their instrument slightly to match the room. WSI-INR tweaks its internal "hash grid" to match the texture of the new slide, tailoring its predictions to that specific patient.
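Here is a toy version of that warm-up. A single-level grid stands in for the hash grid, and the "new slide" is a synthetic texture; crucially, the tune-up uses only a self-supervised reconstruction loss (the slide's own pixels), so no cancer labels are needed. All details are illustrative assumptions.

```python
import numpy as np

# Toy inference-time optimization: fine-tune only the grid features on
# an unseen slide, using reconstruction of its own pixels as the loss.
rng = np.random.default_rng(0)
RES = 8
grid = rng.normal(scale=0.1, size=(RES, RES))    # stand-in for hash-grid features

def cells(coords):
    idx = np.minimum((coords * RES).astype(int), RES - 1)
    return idx[:, 0], idx[:, 1]

coords = rng.uniform(size=(2000, 2))             # pixels sampled from the new slide
target = np.sin(6 * coords[:, 0] * coords[:, 1]) # the new slide's actual texture

ix, iy = cells(coords)
before = np.mean((grid[ix, iy] - target) ** 2)
for _ in range(100):                             # quick "warm-up" steps
    err = grid[ix, iy] - target
    g = np.zeros_like(grid)
    np.add.at(g, (ix, iy), err)                  # accumulate per-cell error
    cnt = np.zeros_like(grid)
    np.add.at(cnt, (ix, iy), 1.0)
    grid -= 0.5 * g / np.maximum(cnt, 1.0)       # gradient step on the grid only
after = np.mean((grid[ix, iy] - target) ** 2)
print(before > after)  # True: reconstruction improves after the tune-up
```

In the full method the frozen segmentation head then reads from these re-tuned features, which is why the adaptation helps predictions without any new labels.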
The Results: Why It Matters
The paper tested this against the old "sticker" methods (like U-Net).
- The Old Way: When the resolution changed (zoomed out), the old AI's performance crashed. It started seeing cancer as scattered, broken fragments.
- The New Way: WSI-INR stayed strong. Even when the resolution changed drastically, it maintained a clear, continuous picture of the cancer. In fact, when they optimized it for a specific lower resolution, it actually got better (improving scores by over 26%), while the old methods got much worse.
The Bottom Line
WSI-INR is a shift from thinking of medical images as a pile of disconnected photos to viewing them as a living, continuous landscape. By understanding that zooming in and out is just looking at the same thing with different eyes, this new method helps doctors spot diseases more accurately, even when the image quality or zoom level varies. It's a step toward AI that truly "sees" the whole picture.