Topological Inductive Bias fosters Multiple Instance Learning in Data-Scarce Scenarios

The paper proposes Topology Guided MIL (TG-MIL), a method that incorporates topological inductive biases to preserve instance distribution structures within bags, significantly improving the performance and generalizability of Multiple Instance Learning in data-scarce scenarios.

Salome Kazeminia, Carsten Marr, Bastian Rieck

Published 2026-03-03

The Big Problem: Learning with Few Examples

Imagine you are a teacher trying to teach a student how to spot a sick cell in a blood sample.

  • The Challenge: You don't have thousands of blood samples to show the student. You only have a few (maybe 17 to 120).
  • The Complication: You can't point to a single cell and say, "This one is sick." You can only look at the whole slide (the "bag") and say, "This slide is sick" or "This slide is healthy." Inside a "sick" slide, there might be thousands of healthy cells and just a few sick ones.

This is called Multiple Instance Learning (MIL). The model has to figure out which cells make a slide "sick" while only ever seeing whole-bag labels. And when there is very little data, the model gets confused: it starts guessing randomly or memorizing the few examples it has, and fails when it sees new data.
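To make the setup concrete, here is a minimal sketch of attention-based MIL pooling, a common MIL baseline (not necessarily the paper's exact model). All names, shapes, and the random "cells" are illustrative; in a real system the weight vectors would be learned.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_mil_score(bag, w_att, w_cls):
    """Score a whole bag of instance features with attention pooling.

    bag:   (n_instances, d) feature matrix -- one row per cell
    w_att: (d,) attention projection (learned in practice)
    w_cls: (d,) bag-level classifier weights (learned in practice)
    """
    attn = softmax(bag @ w_att)      # one weight per instance
    bag_repr = attn @ bag            # weighted average of the instances
    return float(bag_repr @ w_cls)   # single bag-level score

rng = np.random.default_rng(0)
bag = rng.normal(size=(100, 8))      # a "slide" of 100 cells, 8 features each
score = attention_mil_score(bag, rng.normal(size=8), rng.normal(size=8))
```

Note that the label lives at the bag level only: the model never receives per-cell supervision, which is exactly why it is so easy to overfit with few bags.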

The Solution: Giving the Model a "Shape Sense"

The authors propose a new method called TG-MIL. Instead of just teaching the model to recognize pixels, they teach it to understand the shape and structure of the data.

Think of it like this:

  • Standard MIL: Imagine trying to recognize a friend in a crowd by looking at their face. If you only see them once or twice, you might mistake a stranger for them.
  • TG-MIL: Now, imagine you also know your friend's personality and how they move. Even if you only see them from the back, or in a dark room, you know, "That's definitely my friend because of how they walk and stand in a group."

The "Topological Inductive Bias" is that extra "shape sense." It forces the computer to preserve the geometric relationships between the cells when it processes them.

How It Works: The "Point Cloud" Analogy

The paper treats every group of cells (a "bag") as a cloud of points floating in space.

  1. The Input: Imagine a bag of cells. Some are healthy, some are sick. They form a specific 3D shape or pattern in the data space.
  2. The Transformation: The computer tries to shrink this complex cloud into a simpler, smaller representation (a "latent space") to make a decision.
  3. The Problem: Usually, when you shrink a cloud, you might squash it flat or twist it, losing the original shape.
  4. The Fix (TG-MIL): The authors add a special rule (a "loss function") that acts like a rubber band. It checks: "Did we keep the shape of the cloud the same after shrinking it?"
    • If the model squashes the shape too much, the rubber band pulls it back.
    • If the model keeps the "connectivity" (who is close to whom) intact, it gets a reward.

This ensures that even with very few examples, the model learns the fundamental structure of what a "sick bag" looks like, rather than just memorizing specific pixels.
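The "rubber band" above can be sketched in code. The paper's actual loss is built on topological signatures of the point cloud; the version below is a much-simplified stand-in that compares minimum-spanning-tree edge lengths (which, for a point cloud, capture 0-dimensional connectivity: who connects to whom, and at what distance) before and after embedding. All function names and the toy "embeddings" are assumptions for illustration.

```python
import numpy as np

def mst_edge_lengths(points):
    """Edge lengths of the minimum spanning tree (Prim's algorithm).
    These summarize the cloud's connectivity structure."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    n = len(points)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = d[0].copy()               # cheapest known edge into each node
    edges = []
    for _ in range(n - 1):
        best[in_tree] = np.inf       # never re-attach tree nodes
        j = int(np.argmin(best))
        edges.append(best[j])        # cheapest edge joining the tree
        in_tree[j] = True
        best = np.minimum(best, d[j])
    return np.sort(np.array(edges))

def connectivity_loss(bag_input, bag_latent):
    """Penalize changes in connectivity between input and latent space.
    A simplified stand-in for the paper's topological loss term."""
    e_in = mst_edge_lengths(bag_input)
    e_lat = mst_edge_lengths(bag_latent)
    return float(np.mean((e_in - e_lat) ** 2))

rng = np.random.default_rng(1)
cloud = rng.normal(size=(30, 16))            # a bag of 30 cells, 16 features
faithful = cloud[:, :4]                      # shrinks but keeps some structure
collapsed = 1e-3 * rng.normal(size=(30, 4))  # squashes the cloud flat
```

Minimizing such a term alongside the usual bag-classification loss plays the rubber-band role: an embedding that collapses the cloud pays a large penalty, one that preserves connectivity pays almost none.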

Why This Matters: The "Rare Disease" Superpower

The paper evaluated this in three settings:

  1. Synthetic Data: Made-up examples designed to probe the theory.
  2. Standard Benchmarks: Established machine-learning benchmark datasets.
  3. Real Life: Diagnosing rare anemias (blood diseases), where doctors have very few patient samples.

The Results:

  • In situations with very little data, standard models were like a student guessing in the dark (getting about 50-60% accuracy).
  • The TG-MIL model was like a student with a flashlight (getting 70-80%+ accuracy).
  • It improved performance by 15% on synthetic data and 5.5% on real rare disease cases.

The "Unit Test" Analogy: Did the Cheater Pass?

The researchers ran a special "lie detector test" (called a Unit Test) to see if the models were cheating.

  • The Trap: They created a scenario where a specific "poison" cell appeared only in healthy bags. A "cheating" model would learn: "If I see the poison cell, it's healthy. If I don't, it's sick." This is wrong because the bag label should depend on the sick cells, not the absence of the poison.
  • The Result: Standard models often fell for the trap. TG-MIL, however, passed the test. Because it was forced to understand the overall shape of the data, it couldn't rely on the easy "cheat code" of the poison cell. It learned the actual rule.
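The trap can be sketched as a tiny synthetic generator. This is an illustrative reconstruction, not the authors' unit-test code: the bag sizes, feature layout, and the way "sick" and "poison" instances are injected are all assumptions.

```python
import numpy as np

def make_unit_test_bags(n_bags=40, bag_size=50, seed=0):
    """Synthetic MIL 'trap': positive bags contain a few 'sick' instances,
    while a confounding 'poison' instance type appears ONLY in healthy bags.
    A shortcut learner keys on poison-absence; the true rule is sick-presence.
    (Illustrative reconstruction of the unit-test idea, not the paper's code.)
    """
    rng = np.random.default_rng(seed)
    bags, labels = [], []
    for i in range(n_bags):
        sick_bag = i % 2 == 1
        bag = rng.normal(0.0, 1.0, size=(bag_size, 2))  # healthy background
        if sick_bag:
            bag[:3] += np.array([4.0, 0.0])             # a few sick instances
        else:
            bag[:3] += np.array([0.0, 4.0])             # poison confounder
        bags.append(bag)
        labels.append(int(sick_bag))
    return bags, np.array(labels)
```

A model that merely detects the poison direction scores perfectly on this training set but has learned the wrong rule, which is exactly what the unit test is designed to expose.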

The Trade-off: A Little Slower, Much Smarter

Is there a downside? Yes. Calculating these "shapes" takes a bit more computing power.

  • Analogy: It's like driving a car. A standard model is a sports car that goes fast but might crash on a slippery road. TG-MIL is a car with all-wheel drive and stability control. It might go slightly slower (taking about 3.7x longer to train), but it handles the slippery, data-scarce roads much better and doesn't crash.

Summary

TG-MIL is a new way to teach computers to diagnose diseases when there aren't many examples to learn from. Instead of just memorizing pictures, it teaches the computer to understand the shape and structure of the data. This helps the computer make better, more reliable decisions even when it's working with very little information, which is crucial for diagnosing rare diseases.