TopoCL: Topological Contrastive Learning for Medical Imaging

Imagine you are trying to teach a computer to recognize different types of skin lesions, like moles or melanomas, just by looking at photos. This is a tough job because medical images are tricky: two different diseases can look almost identical in terms of color and brightness, but they have completely different shapes and structures.

For example, a harmless mole might be a solid circle, while a dangerous one might have a ring-like shape with a hole in the middle. Standard AI models are like students who only study the color of the paint on a canvas. They might miss the fact that the painting is actually a donut (a shape with a hole) versus a solid cookie, even if they are both brown.

This paper introduces TopoCL, a new way to teach AI to "see" the shape and structure of medical images, not just the colors. Here is how it works, broken down into simple concepts:

1. The Problem: The "Color-Blind" AI

Current AI methods (called Contrastive Learning) are great at learning visual details like texture and color. They do this by showing the AI two slightly different versions of the same photo (like a cropped version and a brightened version) and asking, "Are these the same thing?"

However, these methods often ignore topology. In math, topology is the study of the properties of a shape that stay the same when you stretch or twist it without tearing or gluing.

Holes: Does the shape have a hole in the middle?
Connectivity: Is the object one solid piece, or is it broken into islands?
Boundaries: Does the edge close into a loop, or does it spread outward like spokes?

In medicine, these structural details are often the difference between a benign (harmless) tumor and a malignant (cancerous) one. Standard AI misses these clues.
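To make "pieces" and "holes" concrete, here is a minimal sketch that counts both on a tiny black-and-white grid. It is only an illustration of the idea, not the method used in the paper: the function names, the toy grids, and the choice to treat any background region that does not touch the image border as a hole are all assumptions made for this example.

```python
from collections import deque

FOUR = [(-1, 0), (1, 0), (0, -1), (0, 1)]
EIGHT = FOUR + [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def count_regions(grid, target, neighbors):
    """Count connected regions of `target` cells and how many of them touch the border."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    total, touching_border = 0, 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != target or seen[r][c]:
                continue
            total += 1
            on_border = False
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:  # breadth-first flood fill of one region
                y, x = queue.popleft()
                if y in (0, rows - 1) or x in (0, cols - 1):
                    on_border = True
                for dy, dx in neighbors:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and grid[ny][nx] == target):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            touching_border += on_border
    return total, touching_border

def components_and_holes(grid):
    """Return (number of foreground pieces, number of enclosed holes)."""
    pieces, _ = count_regions(grid, 1, EIGHT)
    # Assumption: a background region that never reaches the border is a hole.
    background, outside = count_regions(grid, 0, FOUR)
    return pieces, background - outside

cookie = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
donut = [row[:] for row in cookie]
donut[2][2] = 0  # punch a hole in the middle

print(components_and_holes(cookie))  # (1, 0)
print(components_and_holes(donut))   # (1, 1)
```

Both grids have the same number of pieces, and a model that only looked at average brightness would see them as nearly identical. The hole count is what separates them.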

2. The Solution: TopoCL (Topological Contrastive Learning)

TopoCL is like giving the AI a new pair of glasses that lets it see the "skeleton" of the image, not just the skin. It does this in three clever steps:

Step A: The "Shape-Preserving" Augmentations

Usually, when training AI, we mess with images (blur them, change colors) to make the data diverse. But if you blur a medical image too much, you might accidentally erase a tiny hole that is crucial for diagnosis.

TopoCL uses a special "Shape Ruler" (called Relative Bottleneck Distance). Before it messes with an image, it measures the "shape distance."

Weak Augmentation: It makes small changes that keep the shape mostly the same (like slightly wiggling the edge of a circle).
Strong Augmentation: It makes bigger changes that alter the shape a bit more (like turning a circle into an oval), but it ensures the change isn't too wild.

This is like a sculptor who knows exactly how much clay they can remove without breaking the statue's essential form.

Step B: The "Shape Detective" (Hierarchical Topology Encoder)

Once the AI has these shape-aware images, it needs to analyze the structure. TopoCL uses a special module called the Hierarchical Topology Encoder.

Think of this as a pair of detective teams:

Team A (The Counters): They count how many separate pieces the object has (e.g., is it one big blob or three small islands?).
Team B (The Hole Hunters): They count how many holes or rings are inside the object.

Crucially, these two teams talk to each other. They ask, "Hey, is this hole sitting inside that specific blob?" This helps the AI understand complex relationships, like a gland inside a tumor, which is a key sign of cancer.

Step C: The "Smart Mixer" (Mixture of Experts)

Finally, the AI has two sets of notes: one about colors/textures (from the standard camera) and one about shapes/structures (from the Shape Detective).

Sometimes, the color is the most important clue (like in a skin rash). Sometimes, the shape is the only thing that matters (like a broken bone). TopoCL uses a Mixture-of-Experts system. Imagine a panel of five different consultants:

Consultant 1: "I only trust the colors."
Consultant 2: "I only trust the shapes."
Consultant 3: "Let's combine them."
Consultant 4: "Let's blend them carefully."
Consultant 5: "Let's see how they interact."

A smart "Manager" (the Gating Network) looks at the specific patient's image and decides how much weight to give each consultant. For a skin photo rich in color and texture, the Manager might lean on the Color Consultant. For a tissue slide where structure dominates, it might lean on the Shape Consultant. This lets the AI adapt to each image instead of following one fixed recipe.

3. The Results: Why It Matters

The researchers tested TopoCL on five different types of medical images (skin, eyes, organs, etc.). They added it to five widely used contrastive learning methods and compared each one against its original version.

The Outcome: TopoCL improved classification accuracy by 3.26% on average.
The Stakes: In medical diagnosis, a gain of about 3% is not just a number. Every additional correct call is a case that might otherwise have been missed.
The Example: In one test case, a standard AI misclassified a skin lesion because its brown color resembled a different type of mole. TopoCL noticed the circular boundary and the internal structure, and identified the lesion correctly.

Summary

TopoCL teaches AI to stop just "looking" at pictures and start "understanding" the geometry of the human body. By combining standard visual learning with a mathematical understanding of shapes, holes, and connections, it aims to be a more reliable doctor's assistant that can spot the subtle structural clues that appearance-only models often miss.

1. Problem Statement

The Limitation of Visual-Only Contrastive Learning:
Current Contrastive Learning (CL) methods (e.g., SimCLR, MoCo, DINO) have become powerful tools for learning representations from unlabeled medical images. However, they rely heavily on visual appearance features (textures, colors, intensities) derived from local pixel neighborhoods. They fundamentally neglect topological characteristics such as connectivity patterns, boundary configurations, and cavity formations.

The Medical Context:
In medical imaging, diagnostic decisions often rely on structural properties rather than just texture. For example, distinguishing between lesion types may depend on whether a boundary is circular or radial, or the presence of specific connectivity structures. Standard CL augmentations (random cropping, color jitter) can inadvertently destroy these critical topological structures, leading to misclassifications where visually similar but topologically distinct lesions are confused.

2. Methodology: The TopoCL Framework

The authors propose TopoCL, a general framework that augments standard CL with explicit topological feature learning. The framework consists of three core components:

A. Topology-Aware Augmentations

To ensure that data augmentations preserve medically relevant structures while providing diversity, the authors introduce a controlled augmentation strategy:

Metric: They use the Relative Bottleneck Distance ( $d_B^{rel}$ ), computed on Regions of Interest (ROIs) extracted with the Segment Anything Model (SAM), to quantify the topological change between the persistence diagrams (PDs) of the original image and the augmented image.
Strategy: Augmentations are categorized into Topology-Weak (preserving structure, $d_B^{rel} \approx 5\text{--}15\%$ ) and Topology-Strong (introducing controlled structural variation, $d_B^{rel} \approx 15\text{--}25\%$ ).
Operations: Specific operations (e.g., Gaussian noise, morphological dilation/erosion, contrast adjustments) are selected and parameterized to fall within these specific $d_B^{rel}$ ranges, ensuring that positive pairs in contrastive learning remain topologically consistent.
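The selection rule above can be sketched in a few lines of Python. This is a hedged illustration, not the authors' implementation. The bottleneck distance is computed exactly by brute force, which is only feasible for toy diagrams. The normalization in `relative_bottleneck` (dividing by the largest persistence in the original diagram) is an assumption, because the summary does not define how the distance is made relative. The hard cut-offs in `topology_level` are likewise a simplification of the approximate ranges quoted above, and all function names are hypothetical.

```python
from itertools import permutations

def linf(p, q):
    """L-infinity distance between two (birth, death) points."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def bottleneck(diag_a, diag_b):
    """Exact bottleneck distance by brute force; only practical for tiny diagrams."""
    n, m = len(diag_a), len(diag_b)
    size = n + m
    # Rows: points of A, then m diagonal slots. Columns: points of B, then n diagonal slots.
    cost = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < n and j < m:
                cost[i][j] = linf(diag_a[i], diag_b[j])
            elif i < n:   # A point sent to the diagonal
                cost[i][j] = (diag_a[i][1] - diag_a[i][0]) / 2
            elif j < m:   # B point sent to the diagonal
                cost[i][j] = (diag_b[j][1] - diag_b[j][0]) / 2
            # diagonal matched to diagonal costs 0
    return min(
        max((cost[i][perm[i]] for i in range(size)), default=0.0)
        for perm in permutations(range(size))
    )

def relative_bottleneck(original, augmented):
    # ASSUMPTION: normalize by the largest persistence in the original diagram.
    # Requires a non-empty original diagram.
    scale = max(death - birth for birth, death in original)
    return bottleneck(original, augmented) / scale

def topology_level(d_rel):
    """Map a relative distance onto the ranges quoted in the text (hard cut-offs assumed)."""
    if 0.05 <= d_rel < 0.15:
        return "weak"
    if 0.15 <= d_rel <= 0.25:
        return "strong"
    return "rejected"

original = [(0.0, 1.0), (0.2, 0.5)]
mild = [(0.0, 0.9), (0.25, 0.5)]    # hypothetical diagram after a gentle augmentation
harsh = [(0.0, 0.8), (0.2, 0.5)]    # hypothetical diagram after a stronger one

for name, diagram in [("mild", mild), ("harsh", harsh)]:
    d_rel = relative_bottleneck(original, diagram)
    print(name, round(d_rel, 3), topology_level(d_rel))
```

In this toy case the gentle augmentation lands at roughly 10% and the stronger one at roughly 20%. For real persistence diagrams, a dedicated topological data analysis library would replace the brute-force matching.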

B. Hierarchical Topology Encoder (H-Topo. Encoder)

Since persistence diagrams are unordered sets of birth-death pairs, standard CNNs cannot process them directly. The authors design a specialized encoder:

Input Processing: The top- $k$ most persistent features ( $k=48$ for $H_0$ connected components, $k=96$ for $H_1$ holes) are extracted and augmented with one-hot encodings to distinguish homology dimensions.
Architecture:
1. PH Encoder: A PointNet-like network processes individual birth-death pairs.
2. Self-Attention: Applied within each homology dimension ( $H_0$ and $H_1$ ) to weigh the importance of specific features (e.g., distinguishing tumor regions from background).
3. Cross-Attention: Bidirectional cross-attention models the geometric dependencies between $H_0$ (components) and $H_1$ (holes), capturing relationships like "holes bounded by components."
4. Pooling & Projection: Max and mean pooling aggregate features, which are then projected to a 256-dimensional embedding.
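The data flow through the encoder can be sketched without any learned weights. The example below is a simplified illustration under stated assumptions: it skips the PointNet-like network, the within-dimension self-attention, and the 256-dimensional projection, uses identity projections inside the attention step, and does not pad diagrams that have fewer than $k$ points. The function names and toy diagrams are hypothetical.

```python
import math

def top_k_tokens(diagram, k, dim_index, num_dims=2):
    """Keep the k most persistent (birth, death) pairs and append a one-hot homology tag."""
    ranked = sorted(diagram, key=lambda bd: bd[1] - bd[0], reverse=True)[:k]
    one_hot = [1.0 if i == dim_index else 0.0 for i in range(num_dims)]
    return [[birth, death] + one_hot for birth, death in ranked]

def attend(queries, keys_values):
    """Single-head scaled dot-product attention with identity projections (no learned weights)."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, kv)) / scale for kv in keys_values]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * kv[j] for w, kv in zip(weights, keys_values))
            for j in range(len(q))
        ])
    return outputs

def pool(tokens):
    """Concatenate max pooling and mean pooling over the token axis."""
    columns = list(zip(*tokens))
    return [max(col) for col in columns] + [sum(col) / len(col) for col in columns]

# Toy diagrams: H0 = connected components, H1 = holes.
h0 = [(0.0, 1.0), (0.1, 0.3), (0.2, 0.25)]
h1 = [(0.3, 0.9), (0.4, 0.5)]

h0_tokens = top_k_tokens(h0, k=2, dim_index=0)
h1_tokens = top_k_tokens(h1, k=2, dim_index=1)

# Bidirectional cross-attention: each dimension queries the other.
h0_context = attend(h0_tokens, h1_tokens)
h1_context = attend(h1_tokens, h0_tokens)

embedding = pool(h0_context) + pool(h1_context)
print(len(embedding))  # 16
```

Because persistence diagrams are unordered sets, every step here is indifferent to the order of the input points: sorting by persistence fixes a canonical order, and max and mean pooling ignore token order altogether.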

C. Adaptive Mixture-of-Experts (MoE) Fusion

Recognizing that different medical images require different balances of visual and topological information (e.g., texture-heavy dermoscopy vs. structure-heavy histopathology), the authors replace fixed fusion strategies with an adaptive MoE module:

Five Experts: The module employs five parallel experts:
1. Vis-Only: Uses only visual features.
2. Topo-Only: Uses only topological features.
3. Concat: Simple concatenation of features.
4. Gated Blending: Learns a soft gate to blend features element-wise.
5. Cross-Attn: Applies cross-attention between visual and topological features.
Dynamic Gating: A multi-gating network analyzes the input and learns sample-specific weights that determine how much each expert contributes to the final representation.
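A minimal sketch of the gating step follows. It assumes a soft combination, in which every expert contributes in proportion to a softmax weight, and it takes the five expert outputs and the gate logits as given inputs. Whether the paper uses soft weighting or sparser routing is not stated in this summary, so treat that choice, the function names, and the toy numbers as assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_fuse(expert_outputs, gate_logits):
    """Weighted sum of expert outputs, with weights from a softmax over per-sample gate logits."""
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    fused = [
        sum(w * out[j] for w, out in zip(weights, expert_outputs))
        for j in range(dim)
    ]
    return fused, weights

# Toy 2-dimensional outputs standing in for the five experts, in the order:
# Vis-Only, Topo-Only, Concat, Gated Blending, Cross-Attn.
expert_outputs = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
    [0.8, 0.2],
    [0.2, 0.8],
]

# Equal logits: every expert gets weight 0.2 and the result is the plain average.
uniform_fused, uniform_weights = moe_fuse(expert_outputs, [0.0] * 5)

# A large logit for the first expert: the result is dominated by Vis-Only.
visual_fused, visual_weights = moe_fuse(expert_outputs, [10.0, 0.0, 0.0, 0.0, 0.0])
```

In a trained model the gate logits would be produced per sample by the gating network, so two images from the same dataset can receive different expert mixtures.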

Training Strategy:
The framework follows a pretrain-then-fuse approach:

Independent Pretraining: The Visual Encoder and Topology Encoder are pretrained separately using contrastive losses.
Joint Fine-tuning: Both encoders and the MoE fusion module are jointly fine-tuned using the contrastive objective, allowing the model to align the feature spaces and learn optimal fusion weights.
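As a reminder of what a contrastive objective computes, here is a simplified one-directional InfoNCE loss in plain Python. It is a generic example rather than the loss used in the paper: the exact objective depends on the baseline method (BYOL and Barlow Twins, for instance, do not use this form), and the temperature value and function names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def info_nce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss: row i of view_a should match row i of view_b, not the other rows."""
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum - logits[i])  # negative log-probability of the true pair
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]      # each row is close to its own anchor
mismatched = [[0.1, 0.9], [0.9, 0.1]]   # rows are swapped

aligned_loss = info_nce(anchors, aligned)
mismatched_loss = info_nce(anchors, mismatched)
```

The loss is low when the two views of the same image are embedded close together and high when they are not. In the pretrain-then-fuse setup described above, an objective of this kind would be applied first to each encoder separately and then to the fused representation.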

3. Key Contributions

Topology-Aware Augmentation: A systematic method to quantify and control topological perturbations using relative bottleneck distance on ROIs, ensuring structural preservation during contrastive learning.
Hierarchical Topology Encoder: A novel architecture using self- and cross-attention to capture inter-dimensional relationships between connected components ( $H_0$ ) and holes ( $H_1$ ).
Adaptive MoE Fusion: The first application of Mixture-of-Experts to fuse visual and topological features, allowing the model to adaptively select the best fusion strategy per sample.
Generalizability: The framework is designed to be seamlessly integrated with existing CL methods (SimCLR, MoCo-v3, BYOL, DINO, Barlow Twins).

4. Experimental Results

The authors evaluated TopoCL on five diverse medical image datasets (PathMNIST, OCTMNIST, OrganSMNIST, ISIC2019, Kvasir) across five baseline CL methods.

Performance Gains: TopoCL achieved a consistent average improvement of +3.26% in linear probe classification accuracy and +0.90% in AUC across all benchmarks.
Statistical Significance: The improvements were statistically significant ( $p < 0.05$ ) in 86% of individual dataset-metric comparisons, and 80% of dataset-averaged metrics reached $p < 0.001$ .
Best Performance: The integration with DINO yielded the largest gains (+4.60% ACC), while MoCo-v3 also showed robust improvements.
Ablation Studies:
- Removing pretraining caused severe performance drops.
- The hierarchical attention (specifically cross-attention) was critical for capturing structural relationships.
- The MoE fusion outperformed fixed fusion strategies, with the "Cross-Attn" expert being the most impactful single component.
Computational Cost: Training time increases by ~13% and FLOPs by ~51%. The authors note that topological features can be precomputed offline, which keeps inference practical for clinical deployment.

5. Significance

Bridging the Gap: TopoCL addresses a critical blind spot in medical AI by explicitly modeling structural topology, which is often more diagnostically relevant than texture.
Robustness: The method corrects failure cases where visual-only models misclassify lesions with similar appearances but different topological structures (e.g., distinguishing dermatofibroma from melanocytic nevi).
General Framework: By being compatible with any existing CL method, TopoCL provides a plug-and-play enhancement for the broader medical imaging community, offering a new paradigm for self-supervised learning that goes beyond pixel-level semantics.