COMPASS: Robust Feature Conformal Prediction for Medical Segmentation Metrics

Imagine you are a doctor looking at an X-ray or a microscope slide. You use a smart computer program (an AI) to draw a line around a tumor or a specific organ. This is called segmentation.

But here's the thing: Doctors don't just care about the line itself. They care about the numbers that come from that line.

"How big is this tumor?" (Area/Volume)
"Is it growing?"

If the AI draws the line slightly too wide, the calculated size might be 10% too big. If it's too narrow, the size is too small. In medicine, getting that number right is critical for deciding on surgery or medication.

The problem is: How much can we trust that number?

The Old Way: Guessing in the Dark

Traditionally, if you want to know how uncertain an AI is, you treat the whole process like a "black box." You feed an image in, get a size out, and say, "I'm 95% sure the size is between 10 and 20 millimeters."

But because the AI is so complex, this "black box" guess is often very lazy. To be safe, the computer makes the range huge (e.g., "It's between 5 and 50 millimeters"). It's technically correct (it covers the truth), but it's useless. A doctor can't make a decision based on a range that wide.

The New Solution: COMPASS

The paper introduces COMPASS (Conformal Metric Perturbation Along Sensitive Subspaces). Think of COMPASS as a smart, surgical navigator that understands how the AI thinks, rather than just guessing the result.

Here is how it works, using a simple analogy:

1. The "Knob" Analogy

Imagine the AI is a giant radio with thousands of tiny knobs (these are the "features" inside the computer's brain).

The Old Way: You just turn the volume up and down randomly to see how the sound changes. It's messy and inefficient.
The COMPASS Way: COMPASS works out which small group of knobs controls the "size" of the tumor. It leaves the other knobs alone and gently turns only that group to see how much the size changes.

2. Finding the "Sensitive" Direction

COMPASS looks at the AI's internal brain and asks: "If I wiggle the AI's internal features slightly (not the image itself), which direction of wiggle changes the tumor size the most?"

It finds a "sensitive direction" (a specific combination of knobs). It then says:

"Okay, if I wiggle this specific direction by a tiny bit, the size changes by 1mm. If I wiggle it a medium bit, it changes by 5mm. If I wiggle it a lot, it changes by 20mm."

Because it understands the mechanism (the knobs), it can calculate a tight, precise range (e.g., "With 95% confidence, the size is between 12 and 14mm") instead of a huge, useless guess.

3. The "Safety Net" (Conformal Prediction)

The paper uses a statistical trick called Conformal Prediction. Think of this as a safety net.

The computer tests itself on a bunch of known examples first (calibration).
It learns: "When I wiggle the knobs this much, the true size falls inside my range about 95% of the time."
When a new patient comes in, it applies that same amount of wiggle.
The Result: The true size falls inside the range with the chosen probability (e.g., 95%), and because it used the "smart knob" method, the range is much smaller and more helpful than the old "black box" method.

Why is this a big deal?

It's Tighter: The paper shows that COMPASS gives ranges that are much narrower (more precise) than previous methods, while still keeping the promised coverage level (e.g., the true value lands inside the range 95% of the time).
It Handles "Drift": Sometimes, the data changes (e.g., a new hospital uses a different camera). Old methods break or give bad guesses. COMPASS can adjust its "safety net" to account for these changes, keeping the doctor safe even when the data looks different.
It's Practical: It doesn't require rebuilding the AI. It just adds a smart layer on top that understands the AI's internal "feelings" about the image.

Summary

COMPASS is like giving a doctor a precision ruler instead of a fuzzy tape measure.

Instead of saying, "The tumor is somewhere between the size of a grape and a watermelon," COMPASS says, "The tumor is between the size of a grape and a cherry, and ranges built this way are statistically guaranteed to contain the truth at the chosen rate (e.g., 95% of the time)." This allows doctors to make life-saving decisions with much more confidence.

1. Problem Statement

In clinical applications, the utility of medical image segmentation models is often determined not by pixel-level accuracy (e.g., Dice score), but by downstream derived metrics such as organ volume, lesion area, or texture patterns. These metrics drive critical clinical decisions.

Existing uncertainty quantification methods face three main limitations in this context:

Pixel-level Conformal Prediction (CP): Traditional CP methods generate uncertainty bounds for pixel masks. While valid, these do not translate efficiently into tight, meaningful bounds for derived scalar metrics (e.g., total area), often resulting in intervals that are too wide to be clinically useful.
Black-Box Metric CP: Applying CP directly to the final scalar metric treats the complex, non-linear pipeline (Image $\to$ Segmentation $\to$ Metric) as a black box. This ignores the internal inductive biases of the neural network, leading to inefficient (overly conservative) prediction intervals.
Computational Intractability: Previous attempts to use Feature Conformal Prediction (FCP) require solving complex optimization problems to find adversarial feature vectors for every data point, which is computationally prohibitive for high-dimensional feature spaces in modern architectures (CNNs/Transformers).

2. Methodology: COMPASS

The authors introduce COMPASS (Conformal Metric Perturbation Along Sensitive Subspaces), a framework that generates efficient, metric-based prediction intervals by calibrating directly in the model's latent representation space.

Core Concept

Instead of perturbing the input image or the final output mask, COMPASS perturbs the intermediate feature representations ( $\hat{z}$ ) of the neural network along specific, data-dependent directions ( $\Delta$ ) that are highly sensitive to the target metric.

Key Technical Components

Linear Latent Perturbations:
- The method defines a prediction set $S_\beta(x)$ as the range of the metric function when the latent features are perturbed by a magnitude $\beta$ along a direction $\Delta$ .
- Theoretical Guarantee: The construction satisfies the nestedness condition (a larger $\beta$ yields a larger interval); the authors prove that, together with exchangeability of the calibration and test data, this ensures valid marginal coverage for the target metric.
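The calibration that this construction implies can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the grid of $\beta$ values, and the toy linear intervals are all assumptions made for the example. Because the sets are nested, each calibration sample can be scored by the smallest $\beta$ whose interval contains the true metric, and the usual split-conformal quantile of those scores gives the $\beta$ to use at test time.

```python
import math

def calibration_score(interval_fn, y_true, betas):
    """Smallest beta on an increasing grid whose interval contains y_true.

    interval_fn(beta) -> (lo, hi) is assumed to be nested in beta.
    Returns inf if no beta on the grid covers the true value.
    """
    for beta in betas:
        lo, hi = interval_fn(beta)
        if lo <= y_true <= hi:
            return beta
    return math.inf

def conformal_quantile(scores, alpha):
    """The ceil((n + 1) * (1 - alpha))-th smallest score, or inf if that rank exceeds n."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(scores)[k - 1] if k <= n else math.inf

def linear_interval(pred, slope):
    """Hypothetical stand-in for the real metric range: pred +/- slope * beta."""
    return lambda beta: (pred - slope * beta, pred + slope * beta)

# Toy calibration set (prediction, sensitivity, true value); all numbers are made up.
calibration = [(10.0, 2.0, 11.0), (8.0, 1.0, 8.2), (12.0, 2.0, 9.5), (9.0, 1.5, 9.0)]
betas = [i * 0.25 for i in range(13)]  # 0.0, 0.25, ..., 3.0

scores = [calibration_score(linear_interval(p, s), y, betas) for p, s, y in calibration]
beta_hat = conformal_quantile(scores, alpha=0.25)

# Interval for a new sample, using the calibrated perturbation magnitude.
new_lo, new_hi = linear_interval(9.0, 1.0)(beta_hat)
```

In a real pipeline `interval_fn` would evaluate the segmentation metric on perturbed latent features rather than a linear formula; only the scoring and quantile steps carry over unchanged.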
Identifying Sensitive Subspaces (COMPASS-J):
- To avoid the intractable optimization of FCP, COMPASS uses a data-driven approach to find the "sensitive direction."
- It computes the Jacobian of the target metric with respect to the latent features for the training set.
- It applies Principal Component Analysis (PCA) to these Jacobians to identify the low-dimensional subspace (principal directions) where the metric is most sensitive.
- For a new sample, the perturbation direction is found by projecting its Jacobian onto this learned subspace.
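The three steps above (collect Jacobians, run PCA, project a new Jacobian) can be sketched with NumPy as follows. This is a hedged sketch under stated assumptions: the function names are hypothetical, the Jacobians are assumed to be already flattened into vectors, and the mean-centring step follows textbook PCA, which this summary does not confirm the authors use.

```python
import numpy as np

def fit_sensitive_subspace(jacobians, k):
    """Top-k principal directions of the training Jacobians.

    jacobians: array of shape (n_samples, n_features), one flattened Jacobian per row.
    Returns an array of shape (k, n_features) with orthonormal rows.
    """
    # Assumption: textbook PCA, so the mean Jacobian is removed first.
    centered = jacobians - jacobians.mean(axis=0, keepdims=True)
    # Rows of vt are right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def perturbation_direction(jacobian, basis):
    """Project one sample's Jacobian onto the learned subspace and scale to unit length."""
    projected = basis.T @ (basis @ jacobian)
    norm = np.linalg.norm(projected)
    return projected / norm if norm > 0 else projected

# Toy training Jacobians (made-up numbers) whose variation is mostly along the first axis.
train_jacobians = np.array([
    [2.0, 0.1, 0.0],
    [-2.0, -0.1, 0.0],
    [1.0, 0.0, 0.1],
    [-1.0, 0.0, -0.1],
])
basis = fit_sensitive_subspace(train_jacobians, k=1)
delta = perturbation_direction(np.array([3.0, 4.0, 0.0]), basis)
```

With `k=1` every new sample is perturbed along the same line, up to sign; a larger `k` lets the direction vary from sample to sample while staying inside the learned subspace.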
Efficient Endpoint Approximation:
- Computing the full range of the metric over a perturbation interval is expensive. However, the authors empirically demonstrate that the relationship between the latent perturbation and the metric is monotonic.
- This allows the use of an Endpoint Method: The interval is computed simply by evaluating the metric at $+\beta$ and $-\beta$ , requiring only two forward passes per sample during calibration, rather than a full sweep.
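To make the role of monotonicity concrete, the sketch below compares the endpoint shortcut with a full sweep on a toy metric. Everything here is an assumption for illustration: `soft_area` is a made-up differentiable stand-in for a segmentation metric, and the grid size is arbitrary. When the metric is monotonic along the direction, the two intervals agree; when it is not, the endpoints can miss part of the range.

```python
import math

def soft_area(z):
    """Toy differentiable 'area': sum of sigmoids of the latent values (hypothetical)."""
    return sum(1.0 / (1.0 + math.exp(-v)) for v in z)

def shifted(z, delta, t):
    """Latent vector z moved by t along direction delta."""
    return [v + t * d for v, d in zip(z, delta)]

def grid(beta, steps):
    """Evenly spaced perturbation magnitudes from -beta to +beta."""
    return [-beta + 2 * beta * i / (steps - 1) for i in range(steps)]

def endpoint_interval(metric, z, delta, beta):
    """Two metric evaluations: at -beta and at +beta."""
    a = metric(shifted(z, delta, -beta))
    b = metric(shifted(z, delta, beta))
    return min(a, b), max(a, b)

def sweep_interval(metric, z, delta, beta, steps=41):
    """Reference interval from a full sweep over the grid."""
    values = [metric(shifted(z, delta, t)) for t in grid(beta, steps)]
    return min(values), max(values)

def is_monotonic(metric, z, delta, beta, steps=41):
    """True if the metric never changes direction along the sweep."""
    values = [metric(shifted(z, delta, t)) for t in grid(beta, steps)]
    diffs = [b - a for a, b in zip(values, values[1:])]
    return all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)

# Monotonic case: every component of delta is positive, so soft_area increases with t.
z, delta = [0.2, -1.0, 0.7], [0.6, 0.3, 0.74]
fast = endpoint_interval(soft_area, z, delta, 1.0)
slow = sweep_interval(soft_area, z, delta, 1.0)

# Non-monotonic counterexample: a squared metric dips to 0 in the middle of the sweep.
squared = lambda v: sum(x * x for x in v)
fast_bad = endpoint_interval(squared, [0.0], [1.0], 1.0)
slow_bad = sweep_interval(squared, [0.0], [1.0], 1.0)
```

A check like `is_monotonic` on a handful of calibration samples is one inexpensive way to confirm the assumption before relying on the two-evaluation shortcut.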
Handling Distribution Shift (Weighted COMPASS):
- To address covariate shifts (where test data distribution differs from calibration data), COMPASS integrates Weighted Conformal Prediction (WCP).
- It uses auxiliary classifiers (trained on latent features or Jacobians) to estimate density ratios and re-weight calibration samples, restoring target coverage under distribution shifts.
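The re-weighting step can be illustrated with the standard weighted conformal quantile. This is a sketch of the general WCP recipe rather than the paper's code: the function names are hypothetical, and the density ratio is taken from a domain classifier's probability as $p / (1 - p)$, which is correct up to a constant factor that cancels when the weights are normalized.

```python
import math

def density_ratio(p_test):
    """Likelihood ratio implied by a classifier's probability that a sample is from the test domain."""
    return p_test / (1.0 - p_test)

def weighted_conformal_quantile(scores, weights, test_weight, alpha):
    """Smallest score whose normalized cumulative weight reaches 1 - alpha.

    The test point carries its own weight at +inf, so the result is inf
    when the calibration weights alone cannot reach the target level.
    """
    total = sum(weights) + test_weight
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight / total
        if cumulative >= 1 - alpha:
            return score
    return math.inf

# Toy calibration scores (made-up values of the calibrated perturbation magnitude).
scores = [0.1, 0.2, 0.3]

# Equal weights reproduce the unweighted split-conformal quantile.
unweighted = weighted_conformal_quantile(scores, [1.0, 1.0, 1.0], 1.0, alpha=0.25)

# If the test domain resembles the high-score sample, the quantile moves up.
shifted_up = weighted_conformal_quantile(scores, [1.0, 1.0, 6.0], 2.0, alpha=0.5)
# If it resembles the low-score sample, the quantile moves down.
shifted_down = weighted_conformal_quantile(scores, [6.0, 1.0, 1.0], 2.0, alpha=0.5)
```

In the setting described above, the classifier behind `density_ratio` would take latent features or Jacobians as input rather than raw images.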

3. Key Contributions

Novel Framework: Introduction of COMPASS, the first practical framework to perform feature-level conformal prediction specifically for derived segmentation metrics.
Theoretical Validity: Proof that linear perturbations in latent space achieve valid marginal coverage under exchangeability, provided the prediction sets are constructed as metric ranges.
Computational Efficiency: A tractable algorithm that replaces adversarial search with PCA-based sensitivity analysis and endpoint evaluation, making it feasible for high-dimensional medical imaging tasks.
Robustness to Shift: Extension of the framework to handle covariate shifts via weighted calibration using internal model features (Jacobian-based weighting), outperforming standard class-label weighting.

4. Experimental Results

The authors evaluated COMPASS on four medical image segmentation tasks:

H&E Histopathology: Colorectal cancer segmentation (EBHI).
Dermoscopy: Skin lesion segmentation (HAM10000).
Ultrasound: Thyroid nodule segmentation (TN3K).
Endoscopy: Polyp segmentation (Kvasir).

Key Findings:

Tighter Intervals: COMPASS (specifically the COMPASS-J variant using internal features) produced significantly tighter prediction intervals (up to 2-3x smaller) compared to standard output-space CP (SCP, CQR) and end-to-end methods (E2E-CQR), while maintaining valid coverage (e.g., 95%).
Feature vs. Logits: Using deeper internal features (COMPASS-J) generally yielded tighter intervals than perturbing final logits (COMPASS-L), as deeper features capture more semantic information relevant to the metric.
Monotonicity: The experiments confirmed that perturbing features along the PCA-derived directions results in monotonic metric changes, validating the efficient endpoint approximation.
Distribution Shift: Under adversarial covariate shifts, unweighted methods failed to maintain coverage. Weighted COMPASS-J (using Jacobian or feature weights) was the only method to consistently recover target coverage while maintaining the most efficient interval sizes.
Comparison to FCP: COMPASS achieved much tighter intervals than a conceptual oracle version of the original FCP method, which struggled with convergence and produced extremely wide intervals.
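The two quantities these findings trade off, coverage and interval size, are simple to compute. The sketch below uses hypothetical function names and made-up numbers; it does not reproduce any result from the paper.

```python
def empirical_coverage(intervals, truths):
    """Fraction of true metric values that fall inside their prediction interval."""
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, truths))
    return hits / len(truths)

def mean_width(intervals):
    """Average interval length; smaller is better at equal coverage."""
    return sum(hi - lo for lo, hi in intervals) / len(intervals)

# Made-up test intervals and ground-truth metric values.
intervals = [(1.0, 3.0), (2.0, 4.0), (0.0, 1.0)]
truths = [2.0, 5.0, 0.5]

coverage = empirical_coverage(intervals, truths)
width = mean_width(intervals)
```

A method is only preferable on width when its coverage is at or above the target level, which is why the two numbers are reported together.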

5. Significance and Impact

Clinical Utility: By providing uncertainty quantification for derived metrics (e.g., tumor volume) rather than just pixel masks, COMPASS offers decision support that is directly aligned with clinical workflows.
Efficiency: The method bridges the gap between the theoretical rigor of conformal prediction and the computational constraints of modern deep learning, making principled uncertainty quantification practical for real-world medical AI.
Generalizability: The framework is architecture-agnostic (tested on U-Net and SegResNet) and applicable to any differentiable metric derived from segmentation, paving the way for robust, trustworthy medical AI systems.

In summary, COMPASS leverages the internal structure of neural networks to create efficient, statistically valid uncertainty bounds for clinical metrics, addressing both the inefficiency of black-box approaches and the poor fit of pixel-level bounds to scalar clinical metrics.