Topological Inductive Bias fosters Multiple Instance Learning in Data-Scarce Scenarios

The paper proposes Topology Guided MIL (TG-MIL), a method that incorporates topological inductive biases to preserve instance distribution structures within bags, significantly improving the performance and generalizability of Multiple Instance Learning in data-scarce scenarios.

Salome Kazeminia, Carsten Marr, Bastian Rieck

Published 2026-03-03

The Big Problem: Learning with Few Examples

Imagine you are a teacher trying to teach a student how to spot a sick cell in a blood sample.

  • The Challenge: You don't have thousands of blood samples to show the student. You only have a few (maybe 17 to 120).
  • The Complication: You can't point to a single cell and say, "This one is sick." You can only look at the whole slide (the "bag") and say, "This slide is sick" or "This slide is healthy." Inside a "sick" slide, there might be thousands of healthy cells and just a few sick ones.

This is called Multiple Instance Learning (MIL). The model has to figure out which cells make a slide "sick" while only ever seeing whole-bag labels. And when there is very little data, the model gets confused: it starts guessing randomly or memorizing the few examples it has, and fails when it sees new data.
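To make the setup concrete, here is a minimal sketch of attention-based MIL pooling, a common MIL baseline (not necessarily the paper's exact model). All names, shapes, and the random "cells" are illustrative; in a real system the weight vectors would be learned.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_mil_score(bag, w_att, w_cls):
    """Score a whole bag of instance features with attention pooling.

    bag:   (n_instances, d) feature matrix -- one row per cell
    w_att: (d,) attention projection (learned in practice)
    w_cls: (d,) bag-level classifier weights (learned in practice)
    """
    attn = softmax(bag @ w_att)      # one weight per instance
    bag_repr = attn @ bag            # weighted average of the instances
    return float(bag_repr @ w_cls)   # single bag-level score

rng = np.random.default_rng(0)
bag = rng.normal(size=(100, 8))      # a "slide" of 100 cells, 8 features each
score = attention_mil_score(bag, rng.normal(size=8), rng.normal(size=8))
```

Note that the label lives at the bag level only: the model never receives per-cell supervision, which is exactly why it is so easy to overfit with few bags.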

The Solution: Giving the Model a "Shape Sense"

The authors propose a new method called TG-MIL. Instead of just teaching the model to recognize pixels, they teach it to understand the shape and structure of the data.

Think of it like this:

  • Standard MIL: Imagine trying to recognize a friend in a crowd by looking at their face. If you only see them once or twice, you might mistake a stranger for them.
  • TG-MIL: Now, imagine you also know your friend's personality and how they move. Even if you only see them from the back, or in a dark room, you know, "That's definitely my friend because of how they walk and stand in a group."

The "Topological Inductive Bias" is that extra "shape sense." It forces the computer to preserve the geometric relationships between the cells when it processes them.

How It Works: The "Point Cloud" Analogy

The paper treats every group of cells (a "bag") as a cloud of points floating in space.

  1. The Input: Imagine a bag of cells. Some are healthy, some are sick. They form a specific 3D shape or pattern in the data space.
  2. The Transformation: The computer tries to shrink this complex cloud into a simpler, smaller representation (a "latent space") to make a decision.
  3. The Problem: Usually, when you shrink a cloud, you might squash it flat or twist it, losing the original shape.
  4. The Fix (TG-MIL): The authors add a special rule (a "loss function") that acts like a rubber band. It checks: "Did we keep the shape of the cloud the same after shrinking it?"
    • If the model squashes the shape too much, the rubber band pulls it back.
    • If the model keeps the "connectivity" (who is close to whom) intact, it gets a reward.

This ensures that even with very few examples, the model learns the fundamental structure of what a "sick bag" looks like, rather than just memorizing specific pixels.
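The "rubber band" above can be sketched in code. The paper's actual loss is built on topological signatures of the point cloud; the version below is a much-simplified stand-in that compares minimum-spanning-tree edge lengths (which, for a point cloud, capture 0-dimensional connectivity: who connects to whom, and at what distance) before and after embedding. All function names and the toy "embeddings" are assumptions for illustration.

```python
import numpy as np

def mst_edge_lengths(points):
    """Edge lengths of the minimum spanning tree (Prim's algorithm).
    These summarize the cloud's connectivity structure."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    n = len(points)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = d[0].copy()               # cheapest known edge into each node
    edges = []
    for _ in range(n - 1):
        best[in_tree] = np.inf       # never re-attach tree nodes
        j = int(np.argmin(best))
        edges.append(best[j])        # cheapest edge joining the tree
        in_tree[j] = True
        best = np.minimum(best, d[j])
    return np.sort(np.array(edges))

def connectivity_loss(bag_input, bag_latent):
    """Penalize changes in connectivity between input and latent space.
    A simplified stand-in for the paper's topological loss term."""
    e_in = mst_edge_lengths(bag_input)
    e_lat = mst_edge_lengths(bag_latent)
    return float(np.mean((e_in - e_lat) ** 2))

rng = np.random.default_rng(1)
cloud = rng.normal(size=(30, 16))            # a bag of 30 cells, 16 features
faithful = cloud[:, :4]                      # shrinks but keeps some structure
collapsed = 1e-3 * rng.normal(size=(30, 4))  # squashes the cloud flat
```

Minimizing such a term alongside the usual bag-classification loss plays the rubber-band role: an embedding that collapses the cloud pays a large penalty, one that preserves connectivity pays almost none.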

Why This Matters: The "Rare Disease" Superpower

The paper evaluated this in three settings:

  1. Synthetic Data: Made-up examples designed to probe the theory.
  2. Standard Benchmarks: Established machine-learning benchmark datasets.
  3. Real Life: Diagnosing rare anemias (blood diseases), where doctors have very few patient samples.

The Results:

  • In situations with very little data, standard models were like a student guessing in the dark (getting about 50-60% accuracy).
  • The TG-MIL model was like a student with a flashlight (getting 70-80%+ accuracy).
  • It improved performance by 15% on synthetic data and 5.5% on real rare disease cases.

The "Unit Test" Analogy: Did the Cheater Pass?

The researchers ran a special "lie detector test" (called a Unit Test) to see if the models were cheating.

  • The Trap: They created a scenario where a specific "poison" cell appeared only in healthy bags. A "cheating" model would learn: "If I see the poison cell, it's healthy. If I don't, it's sick." This is wrong because the bag label should depend on the sick cells, not the absence of the poison.
  • The Result: Standard models often fell for the trap. TG-MIL, however, passed the test. Because it was forced to understand the overall shape of the data, it couldn't rely on the easy "cheat code" of the poison cell. It learned the actual rule.
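The trap can be sketched as a tiny synthetic generator. This is an illustrative reconstruction, not the authors' unit-test code: the bag sizes, feature layout, and the way "sick" and "poison" instances are injected are all assumptions.

```python
import numpy as np

def make_unit_test_bags(n_bags=40, bag_size=50, seed=0):
    """Synthetic MIL 'trap': positive bags contain a few 'sick' instances,
    while a confounding 'poison' instance type appears ONLY in healthy bags.
    A shortcut learner keys on poison-absence; the true rule is sick-presence.
    (Illustrative reconstruction of the unit-test idea, not the paper's code.)
    """
    rng = np.random.default_rng(seed)
    bags, labels = [], []
    for i in range(n_bags):
        sick_bag = i % 2 == 1
        bag = rng.normal(0.0, 1.0, size=(bag_size, 2))  # healthy background
        if sick_bag:
            bag[:3] += np.array([4.0, 0.0])             # a few sick instances
        else:
            bag[:3] += np.array([0.0, 4.0])             # poison confounder
        bags.append(bag)
        labels.append(int(sick_bag))
    return bags, np.array(labels)
```

A model that merely detects the poison direction scores perfectly on this training set but has learned the wrong rule, which is exactly what the unit test is designed to expose.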

The Trade-off: A Little Slower, Much Smarter

Is there a downside? Yes. Calculating these "shapes" takes a bit more computing power.

  • Analogy: It's like driving a car. A standard model is a sports car that goes fast but might crash on a slippery road. TG-MIL is a car with all-wheel drive and stability control. It might go slightly slower (taking about 3.7x longer to train), but it handles the slippery, data-scarce roads much better and doesn't crash.

Summary

TG-MIL is a new way to teach computers to diagnose diseases when there aren't many examples to learn from. Instead of just memorizing pictures, it teaches the computer to understand the shape and structure of the data. This helps the computer make better, more reliable decisions even when it's working with very little information, which is crucial for diagnosing rare diseases.