When should we trust the annotation? Selective prediction for molecular structure retrieval from mass spectra

This paper introduces a selective prediction framework for molecular structure retrieval from mass spectra. It leverages retrieval-level uncertainty and distribution-free risk control to let models abstain from low-confidence predictions, ensuring that the returned annotations meet a user-specified error-rate constraint in high-stakes applications.

Mira Jürgens, Gaetan De Waele, Morteza Rakhshaninejad, Willem Waegeman

Published Thu, 12 Ma

Imagine you are a detective trying to identify a suspect from a blurry, grainy security camera photo (the mass spectrum). You have a massive database of mugshots (the molecular database) and you need to find the matching face.

In the world of chemistry, this is called metabolomics. Scientists use mass spectrometers to break tiny molecules into fragments and weigh the pieces, but the resulting data is messy. Often, the computer guesses the wrong molecule, or it's just not sure at all. In a hospital or an environmental lab, a wrong guess could mean misdiagnosing a patient or missing a toxic chemical.

This paper asks a simple but crucial question: "When should we trust the computer's guess, and when should we say, 'I don't know'?"

Here is the breakdown of their solution, using some everyday analogies.

1. The Problem: The "Confident but Wrong" Trap

Current AI models are getting better at guessing molecules, but they still make mistakes. The danger is that they often sound very confident even when they are wrong.

  • The Analogy: Imagine a student taking a test. They might answer a question with 100% confidence, but if they studied the wrong textbook, they are confidently wrong. In high-stakes fields like medicine, we need a system that knows when to raise its hand and say, "I'm not sure, let me skip this one."

2. The Solution: "Selective Prediction"

The authors propose a framework called Selective Prediction. Think of this as a Quality Control Gatekeeper.

  • How it works: The AI looks at a molecule. If it feels very confident, it gives the answer. If it feels shaky or confused, it abstains (refuses to answer).
  • The Trade-off: By refusing to answer the hard questions, the AI makes fewer mistakes overall. You get a smaller list of answers, but you can be almost 100% sure they are correct.
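
The gatekeeper idea fits in a few lines of code. This is a minimal sketch, not the paper's actual implementation; the function and variable names (and the toy confidence numbers) are illustrative.

```python
def selective_predict(candidates, threshold):
    """candidates: list of (molecule_id, confidence) sorted best-first.
    Returns the top answer if confidence clears the threshold, else None."""
    best_id, best_conf = candidates[0]
    if best_conf >= threshold:
        return best_id   # confident: commit to the top answer
    return None          # shaky: abstain ("I don't know")

# Toy usage: one clear-cut spectrum, one ambiguous one.
clear = [("aspirin", 0.93), ("salicylic_acid", 0.41)]
murky = [("leucine", 0.52), ("isoleucine", 0.51)]

print(selective_predict(clear, threshold=0.8))  # answers: "aspirin"
print(selective_predict(murky, threshold=0.8))  # abstains: None
```

Everything interesting lives in how you compute the confidence score and where you set the threshold, which is exactly what the rest of the paper is about.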

3. The Big Discovery: Where to Look for "Confidence"

The researchers tested many different ways to measure "confidence." They found that where you look for confidence matters more than how you look for it.

❌ The Wrong Approach: Looking at the "Ingredients" (Fingerprint Level)

The AI breaks a molecule down into tiny chemical "ingredients" (like a list of features: "has a ring," "has an oxygen," etc.).

  • The Mistake: The researchers found that checking if the AI is confident about each individual ingredient is useless.
  • The Analogy: Imagine you are trying to identify a specific car (a red 2020 Ford Mustang). The AI is 100% sure the car has "four wheels," "a steering wheel," and "a red paint job." But it could be a red Ford F-150, a red Ferrari, or a red Mustang. Being confident about the parts doesn't mean you know the whole car.
  • Result: Checking the confidence of the ingredients gave results no better than random guessing.

✅ The Right Approach: Looking at the "Final Verdict" (Retrieval Level)

Instead of checking the ingredients, the researchers looked at how confident the AI was about the final ranking of candidates.

  • The Strategy: They asked: "How much better is the #1 guess compared to the #2 guess?" or "How stable is the top 5 list?"
  • The Analogy: This is like a judge in a talent show. Instead of asking, "Is the singer confident about hitting the high note?" (the ingredient), the judge asks, "Is the singer clearly the best act compared to everyone else?"
  • Result: Simple, fast calculations based on the final ranking worked incredibly well. They could tell when the AI was likely to be right or wrong.
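
The contrast between the two approaches can be sketched numerically. In this hypothetical example (the scoring functions and numbers are illustrative, not the paper's), the fingerprint-level score depends only on the model's own bit predictions, so it cannot distinguish a clear winner from a photo finish, while the retrieval-level margin can.

```python
import numpy as np

def fingerprint_confidence(bit_probs):
    # Fingerprint level: how far each predicted "ingredient" bit is from 0.5.
    # This depends only on the model's output, not on the candidate list,
    # so it is identical whether the ranking is clear-cut or a dead heat.
    return float(np.mean(np.abs(np.asarray(bit_probs) - 0.5) * 2))

def retrieval_margin(scores):
    # Retrieval level: gap between the #1 and #2 candidate similarity scores.
    top2 = sorted(scores, reverse=True)[:2]
    return top2[0] - top2[1]

bit_probs = [0.95, 0.02, 0.88, 0.97, 0.10]  # model is sure about every bit

easy_case = [0.91, 0.40, 0.35]  # clear winner among the candidates
hard_case = [0.91, 0.90, 0.35]  # photo finish between #1 and #2

print(fingerprint_confidence(bit_probs))  # same number in both cases
print(retrieval_margin(easy_case))        # large gap: trustworthy
print(retrieval_margin(hard_case))        # tiny gap: better to abstain
```

The margin is one of the simplest retrieval-level signals; the paper's point is that even cheap scores like this, computed on the final ranking, beat elaborate ones computed on the ingredients.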

4. The "Magic" Guarantee: The Safety Net

The most exciting part of the paper is how they handle the "How sure are we?" part.

  • The Problem: Usually, you have to guess a threshold (e.g., "If confidence is above 80%, accept it"). But what if you're wrong about that 80%?
  • The Solution: They used a mathematical tool called SGR (Selection with Guaranteed Risk) that acts like a legal contract.
  • The Analogy: Imagine you are a factory manager. You tell the AI: "I am willing to accept a 5% error rate. I don't care how many products you reject, but you must guarantee that the ones you keep are 95% correct."
  • The Result: The system calculates a specific cutoff point that mathematically guarantees this safety. If you set the rule, the system obeys it with a statistical confidence level you choose in advance (e.g., 99%).
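
The core of the calibration step can be sketched as follows. This simplified version (names and data are hypothetical) keeps only the empirical part: find the most permissive threshold whose accepted predictions stay under the target error rate on held-out calibration data. The actual SGR procedure additionally applies a binomial tail bound so the guarantee holds with high probability on new data, not just on the calibration set.

```python
def calibrate_threshold(confidences, correct, target_error):
    """Return the lowest confidence threshold whose accepted predictions
    have an empirical error rate <= target_error on calibration data.
    (SGR adds a binomial tail bound on top of this empirical check.)"""
    # Try thresholds from most permissive to most strict.
    for t in sorted(set(confidences)):
        kept = [c for conf, c in zip(confidences, correct) if conf >= t]
        if kept and (1 - sum(kept) / len(kept)) <= target_error:
            return t
    return None  # no threshold meets the target: abstain on everything

# Toy calibration set: confidence scores and whether the top hit was correct.
confs   = [0.95, 0.90, 0.85, 0.70, 0.60, 0.55]
correct = [1,    1,    1,    0,    1,    0]

t = calibrate_threshold(confs, correct, target_error=0.05)
print(t)  # accept only predictions at least this confident
```

With this toy data the procedure rejects every threshold that lets the one wrong 0.70-confidence prediction through, and settles on 0.85, where all accepted predictions are correct.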

5. What They Found (The Takeaways)

  1. Don't overcomplicate it: You don't need fancy, slow, complex math (like "Bayesian Epistemic Uncertainty") to know if a guess is good. Simple, fast checks on the final ranking work best.
  2. Context is King: The "ingredients" (fingerprint bits) don't tell you if the final answer is right. You have to look at the competition (the other candidates).
  3. Safety First: Scientists can now set a "tolerable error rate" (e.g., "I can only accept 1 mistake in 100") and the system will automatically filter out everything that doesn't meet that standard.

Summary

This paper teaches us that in the high-stakes world of identifying molecules, knowing when not to guess is just as important as guessing correctly.

By building a "gatekeeper" that looks at the final ranking rather than the tiny details, and by using math to guarantee safety, we can turn a noisy, error-prone AI into a trustworthy tool for doctors and scientists. It transforms molecular identification from a "wild guess" into a calculated, uncertainty-aware decision.