Deep EM with Hierarchical Latent Label Modelling for Multi-Site Prostate Lesion Segmentation

This paper proposes a Hierarchical Expectation-Maximisation (HierEM) framework that models multi-site prostate lesion annotations as noisy observations of a latent clean mask, explicitly accounting for site-specific variability. This significantly improves cross-site generalisation and yields interpretable estimates of annotation quality.

Wen Yan, Yipei Wang, Shiqi Huang, Natasha Thorley, Mark Emberton, Vasilis Stavrinides, Yipeng Hu, Dean Barratt

Published 2026-03-17

The Big Problem: "The Art of Contouring"

Imagine you are trying to teach a computer to find prostate cancer tumors in MRI scans. To do this, you need to show the computer examples where experts have drawn lines around the tumors.

However, there's a catch: Experts don't all draw the lines the same way.

  • Doctor A (at Hospital X) might draw a very tight, precise circle around a tumor.
  • Doctor B (at Hospital Y) might draw a slightly larger, fuzzier circle because they were trained differently or use different MRI machines.
  • Doctor C (at Hospital Z) might be very conservative and only draw the darkest parts.

If you train a computer using only Doctor A's drawings, the computer learns to be "Doctor A." When you send that computer to Hospital Y to look at new patients, it gets confused. It sees a tumor that looks like Doctor B's style, but the computer is looking for Doctor A's style. It fails to generalize.

This is the problem the paper solves: How do we teach a computer to find the real tumor, even when the human experts disagree on where the edges are?


The Solution: The "Detective" Approach (Hierarchical EM)

The authors propose a method called Hierarchical EM (Expectation-Maximization). Think of this not as a simple teacher-student relationship, but as a detective investigation.

1. The "Clean" Truth vs. The "Noisy" Clues

The computer assumes that there is a "Perfect, Clean Truth" (the actual tumor) that no one has seen yet. The drawings provided by the doctors are just noisy clues or "imperfect observations" of that truth.

  • The Goal: The computer wants to reconstruct the "Perfect Truth" by combining all the noisy clues, while figuring out how "reliable" each doctor is.

2. The Two-Step Dance (The EM Algorithm)

The computer learns by doing a two-step dance, over and over again:

  • Step A: The "Guesstimate" (E-Step)
    The computer looks at the MRI scan and the doctor's drawing. It asks: "If the doctor is usually very strict, and the MRI looks a bit fuzzy, where is the tumor really likely to be?"
    It creates a "Soft Map." Instead of saying "This pixel is definitely tumor," it says, "There is a 70% chance this is tumor, based on the image and the doctor's habit." This is the computer trying to infer the "Clean Truth."

  • Step B: The "Report Card" (M-Step)
    Now the computer looks at its "Soft Map" and the doctor's original drawing. It asks: "How good was this doctor?"

    • If the doctor's drawing matches the computer's "Soft Map" well, the computer gives them a high score (High Sensitivity/Specificity).
    • If the doctor's drawing is weirdly different, the computer realizes, "Ah, this hospital has a weird style. I need to trust their drawing less."

    Crucially, the computer doesn't just grade each doctor individually. It learns a hierarchy:

    • Global Level: What is the average quality of all doctors?
    • Site Level: How does Hospital X differ from the average?
    • Case Level: Is this specific tumor just really hard to see (ambiguous), or is the doctor just bad at this one case?

By doing this, the computer learns to ignore the "noise" (the specific drawing style of one hospital) and focus on the signal (the actual tumor).
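
The two-step dance can be sketched as a minimal, single-annotator EM loop. This is an illustration of the general idea (per-pixel Bayes in the E-step, reliability re-estimation in the M-step), not the paper's actual HierEM implementation, which additionally ties the sensitivity/specificity parameters together across the global, site, and case levels; all data and numbers here are made up.

```python
import numpy as np

def e_step(prior, label, sens, spec):
    """E-step ("guesstimate"): posterior probability that each pixel is truly
    tumor, combining the image-based prior (e.g. a network's prediction) with
    the annotator's drawing via Bayes' rule."""
    p_label_if_tumor = np.where(label == 1, sens, 1 - sens)
    p_label_if_bg = np.where(label == 1, 1 - spec, spec)
    num = prior * p_label_if_tumor
    return num / (num + (1 - prior) * p_label_if_bg)

def m_step(soft_map, label):
    """M-step ("report card"): re-estimate the annotator's sensitivity and
    specificity, weighting each pixel by the soft map. In the full hierarchical
    model these estimates would be shrunk toward site- and global-level means."""
    sens = (soft_map * (label == 1)).sum() / soft_map.sum()
    spec = ((1 - soft_map) * (label == 0)).sum() / (1 - soft_map).sum()
    return sens, spec

# Toy run: a fuzzy image-based prior and one annotator's drawing.
rng = np.random.default_rng(1)
truth = np.zeros((16, 16)); truth[4:12, 4:12] = 1
prior = np.clip(truth + rng.normal(0, 0.2, truth.shape), 0.01, 0.99)
label = (rng.random(truth.shape) < np.where(truth == 1, 0.9, 0.05)).astype(int)

sens, spec = 0.8, 0.8            # initial guess at the annotator's quality
for _ in range(10):              # the "two-step dance", over and over
    soft = e_step(prior, label, sens, spec)
    sens, spec = m_step(soft, label)
```

After a few iterations, `soft` is the computer's best guess at the clean truth, and `sens`/`spec` are the annotator's learned report card.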


The Analogy: The "Weather Forecast"

Imagine you want to know the true temperature in a city, but you have three different weather stations reporting data.

  • Station A always reads 2 degrees too high (maybe their thermometer is in the sun).
  • Station B is accurate but only checks once a day.
  • Station C is very accurate but sometimes makes typos.

If you just average them, you get a bad forecast.
The Hierarchical EM method is like a smart meteorologist who:

  1. Looks at the data from all three stations.
  2. Realizes Station A is consistently "hot" and adjusts for that bias.
  3. Realizes Station C is usually right but has random typos, so it trusts them less on weird days.
  4. Combines the corrected data to guess the True Temperature.

Once the meteorologist learns these patterns, they can predict the temperature for a new station they've never seen before, because they understand the system of errors, not just the specific numbers.
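
A rough numerical sketch of the meteorologist's reasoning (hypothetical data; this simple bias-plus-noise model is only loosely analogous to the paper's method): alternate between guessing the true temperature and grading each station's bias and noise, then combine with trust-weighted averaging.

```python
import numpy as np

rng = np.random.default_rng(2)
days = 365
true_temp = 15 + 10 * np.sin(np.linspace(0, 2 * np.pi, days))  # hidden truth

# Three hypothetical stations with the error patterns described above.
station_a = true_temp + 2.0 + rng.normal(0, 0.3, days)  # always reads ~2 degrees high
station_b = true_temp + rng.normal(0, 0.5, days)        # unbiased but a bit noisy
station_c = true_temp + rng.normal(0, 0.2, days)        # very accurate...
typos = rng.random(days) < 0.05
station_c[typos] += 15                                  # ...but occasional typos
readings = np.stack([station_a, station_b, station_c])

naive = readings.mean(axis=0)  # "just average them"

# Smart meteorologist: alternate between guessing the truth and grading
# the stations, much like the E- and M-steps described earlier. (Caveat:
# a bias shared by ALL stations is undetectable from the readings alone.)
estimate = np.median(readings, axis=0)  # robust starting guess
for _ in range(5):
    bias = (readings - estimate).mean(axis=1, keepdims=True)        # per-station bias
    noise = (readings - estimate - bias).var(axis=1, keepdims=True) # per-station noise
    weight = 1.0 / noise                                            # trust quiet stations more
    estimate = ((readings - bias) * weight).sum(axis=0) / weight.sum(axis=0)

naive_err = np.mean(np.abs(naive - true_temp))
smart_err = np.mean(np.abs(estimate - true_temp))
```

On this toy data, `smart_err` comes out well below `naive_err`: the loop learns Station A's +2° bias and downweights Station C's typo-corrupted readings instead of averaging them in.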


What Did They Find? (The Results)

The researchers tested this on data from three different hospitals.

  1. The "Old Way" (Standard AI): When they trained on all three hospitals mixed together, the AI did okay. But when they tested it on a new hospital it hadn't seen before, it performed poorly. It was just memorizing the drawing styles of the training hospitals.
  2. The "New Way" (Hierarchical EM): The AI learned to separate the "tumor" from the "drawing style."
    • Result: It generalized much better. When tested on a new hospital, it found the tumors more accurately than any other method.
    • Bonus: It also gave the researchers a "Report Card" for each hospital. It could say, "Hospital X tends to draw tumors slightly larger than reality," or "Hospital Y is very strict." This helps doctors understand why their data looks the way it does.

Why Does This Matter?

In the real world, you can't always get a perfect "Gold Standard" label for every patient. You have to work with the messy, inconsistent data that real doctors produce.

This paper shows that if you build AI that understands human inconsistency (by modeling it mathematically), you can create medical tools that work reliably across different hospitals, without needing to re-train the AI every time you move to a new city. It turns "bad data" into "good training" by understanding the source of the noise.
