ConfHit: Conformal Generative Design with Oracle Free Guarantees

Imagine you are a master chef trying to invent a new, revolutionary dish. You have a powerful AI assistant (a Generative Model) that can dream up thousands of unique recipes in seconds.

The problem? The AI is a bit of a dreamer. It might suggest a dish that sounds amazing but tastes terrible, or worse, is actually poisonous. In the real world (like drug discovery), you can't just taste-test every single recipe. You have to send them to a lab for expensive, time-consuming chemical tests. You only have a limited budget for these tests.

The Big Question: How do you pick a small group of recipes from the AI's thousands of suggestions, guaranteeing that at least one of them will actually work, without wasting your money on the duds?

This is the problem that CONFHIT sets out to solve.

The Old Way vs. The New Way

The Old Way (The "Oracle" Problem):
Previous methods tried to solve this by assuming you had a "Magic Taste Tester" (an Oracle) who could instantly tell you if a recipe was good or bad before you sent it to the lab.

Reality Check: In drug discovery, there is no magic taste tester. You have to actually make the drug and test it. So, these old methods were stuck in theory, unable to be used in the real world.

The New Way (CONFHIT):
CONFHIT is a new framework that says, "We don't need a magic taste tester. We just need to be smart about how we look at the data." It uses a statistical trick called Conformal Prediction to give you a mathematically guaranteed safety net.

How CONFHIT Works (The Analogy)

Think of the process like a Quality Control Inspector at a factory.

1. The "Calibration" (Learning the Rules)

First, the inspector looks at a box of "known bad" products from the past (historical data). They learn what these bad products look like.

The Twist: The new products coming off the line (the AI's new designs) might look slightly different because the AI is creative. This is called Distribution Shift.
The Fix: CONFHIT uses a "Density Ratio" (think of it as a magnifying glass) to adjust for these differences. It says, "Okay, these new designs look a bit different, so let's weigh them differently so we can compare them fairly to the old bad ones."

2. The "Certification" (The Safety Net)

The AI generates a batch of 50 new recipes. The inspector doesn't check them one by one. Instead, they run a statistical test on the whole batch.

They ask: "Is there any chance that none of these 50 recipes work?"
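CONFHIT calculates a p-value: a measure of how plausible the batch's scores would be if none of the 50 recipes worked. If the p-value is small enough (say, below 0.05), the inspector certifies the batch: "At least one of these 50 recipes is a winner." The chance of issuing that certificate for a batch with no winner is at most 5% (or 1%, if you ask for 99% confidence).
If the p-value is too large, they say, "I'm not confident enough. Don't waste your money testing this batch."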

3. The "Design" (The Shortlist)

Once the inspector is confident the batch has a winner, they don't just send all 50 to the lab. That's too expensive!

They start trimming the list. They check smaller and smaller groups (40, then 30, then 10...).
They stop at the smallest group that still holds the guarantee: "At the same confidence level, this small group of 5 recipes contains a winner."
This saves huge amounts of money and time.

Why This is a Big Deal

No Magic Required: It works even though you can't test the drugs until the very end. It removes the need for an impossible "Oracle."
It Handles the "Weirdness": AI models often generate things that look different from the data they were trained on. CONFHIT has a built-in correction for this, so it doesn't get confused.
It's Efficient: Instead of testing 100 things and hoping for the best, it first tells you whether a batch is worth testing at all, and then prunes that batch down to the smallest subset that still carries the guarantee.

The Bottom Line

Imagine you are looking for a needle in a haystack.

Old methods said: "We can't find the needle unless we have a metal detector that works perfectly on every type of hay." (Impossible).
CONFHIT says: "We don't need a perfect detector. We just need to look at the hay in a specific way, weigh the different types of hay, and use a statistical rule to say, 'With 95% confidence, this small handful contains a needle.'"

This allows scientists to use powerful AI to design new drugs and materials with mathematical confidence, saving millions of dollars in failed experiments and speeding up the discovery of life-saving medicines.

1. Problem Statement

The paper addresses a critical bottleneck in scientific discovery (specifically drug discovery) using deep generative models. While generative models can produce novel molecules or materials, they lack reliable statistical guarantees that the generated candidates satisfy desired properties (e.g., binding affinity, drug-likeness).

Existing Conformal Prediction (CP) methods for generative models face three major limitations in resource-constrained settings:

Oracle Dependency: Standard CP requires an "oracle" to evaluate generated samples immediately to calibrate the model. In drug discovery, this implies wet-lab synthesis and testing, which is too expensive and slow for the iterative generation process.
Distribution Shift: Generated candidates often follow a different distribution ( $Q$ ) than the historical calibration data ( $P$ ), violating the exchangeability assumption required for standard CP.
Budget Constraints: Researchers have a limited budget for generating and validating candidates. They need to know before synthesis whether a batch contains a "hit" (a valid candidate) and, if so, how to refine that batch to a compact set without wasting resources.

Core Questions:

Certification: Given a batch of $N$ generated samples, can we guarantee with confidence $1-\alpha$ that the batch contains at least one hit?
Design: Can we refine this batch into a smaller, compact subset that still maintains the $1-\alpha$ guarantee?

2. Methodology: CONFHIT Framework

CONFHIT is a model-agnostic framework that solves these problems without requiring an experimental oracle during the generation phase. It relies on weighted exchangeability and conformal nested testing.

A. Theoretical Foundation: Weighted Exchangeability

The method assumes a covariate shift between the calibration data (historical inactive molecules) and the generated test samples.

Let $P$ be the distribution of calibration data and $Q$ be the distribution of generated data.
The density ratio is defined as $w(x) = \frac{dQ}{dP}(x)$ .
Under the null hypothesis that a generated batch contains no hits (all samples are inactive), the paper proves that the inactive calibration samples and the generated samples satisfy a weighted exchangeability structure. This allows the construction of valid p-values even when distributions differ.
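
To make the role of the density ratio concrete, here is a minimal sketch of reweighting samples drawn from $P$ so that they behave like samples from $Q$. Everything in it is an illustrative assumption rather than the paper's setup: $P$ and $Q$ are taken to be one-dimensional Gaussians so that $w(x)$ has a closed form, and the function names are hypothetical.

```python
import math
import random

def density_ratio(x, shift=1.0):
    """w(x) = q(x) / p(x) for P = N(0, 1) and Q = N(shift, 1) (toy assumption)."""
    return math.exp(shift * x - 0.5 * shift ** 2)

def weighted_mean(samples, weights):
    """Self-normalized weighted average of the samples."""
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, samples)) / total

rng = random.Random(0)
calib = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # draws from P only
weights = [density_ratio(x) for x in calib]

plain = sum(calib) / len(calib)             # estimates the mean under P (near 0)
reweighted = weighted_mean(calib, weights)  # estimates the mean under Q (near 1)
```

The unweighted average describes $P$, while the weighted average recovers a property of $Q$ without ever sampling from it. The weighted p-values in the next subsection rely on the same idea, applied to ranks instead of means.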

B. Certification: Joint Weighted Conformal P-Value

To certify a batch of $N$ samples, CONFHIT constructs a p-value testing the global null hypothesis: None of the $N$ samples are hits.

Conformity Score ( $V$ ): A pre-trained property predictor $\hat{\mu}$ estimates the likelihood of a sample being a hit. The score function aggregates these predictions (e.g., Max-pooling: $V = \max_j \hat{\mu}(x_j)$ ).
Permutation Test: The method treats the calibration data (inactive) and test data as a pool. It generates random permutations of this pool.
Weighted P-Value: For each permutation, it calculates a weighted rank of the observed conformity score. The weights are the density ratios $w(x)$ .
$p_{rand} = \frac{\sum_{b=0}^B \bar{w}(\pi^{(b)}) \mathbb{I}(V(\pi^{(0)}) \leq V(\pi^{(b)}))}{\sum_{b=0}^B \bar{w}(\pi^{(b)})}$
Where $\bar{w}$ is the joint likelihood ratio of the permuted test samples.
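Decision: If $p_{rand} \leq \alpha$ , the global null is rejected and the batch is certified to contain at least one hit. The guarantee is on the error rate: a batch with no hits is wrongly certified with probability at most $\alpha$ , which is what the $1-\alpha$ confidence level refers to.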
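
The steps above can be sketched in a few lines of Python. This is a simplified illustration of the formula for $p_{rand}$, not the authors' implementation: the function names, the toy predictor outputs, and the unit weights in the demo are all assumptions. Because max-pooling and the joint likelihood ratio depend only on which samples land in the test slots, the sketch draws random subsets of the pool instead of full permutations.

```python
import math
import random

def max_pool_score(preds):
    """Conformity score V: the largest predicted hit likelihood in a batch."""
    return max(preds)

def joint_weighted_p_value(calib_preds, calib_w, test_preds, test_w,
                           num_perm=999, seed=0):
    """Monte Carlo weighted conformal p-value for the global null (no hits).

    calib_*: predictor outputs and density ratios for inactive calibration samples.
    test_*:  the same quantities for the generated batch.
    """
    rng = random.Random(seed)
    preds = list(calib_preds) + list(test_preds)
    weights = list(calib_w) + list(test_w)
    n_total, n_test = len(preds), len(test_preds)

    observed = list(range(n_total - n_test, n_total))  # pi^(0): the real batch
    v_obs = max_pool_score([preds[i] for i in observed])

    num = den = 0.0
    for b in range(num_perm + 1):
        # b = 0 is the observed arrangement; b >= 1 draws random test slots.
        slots = observed if b == 0 else rng.sample(range(n_total), n_test)
        w_bar = math.prod(weights[i] for i in slots)  # joint likelihood ratio
        v_b = max_pool_score([preds[i] for i in slots])
        den += w_bar
        if v_obs <= v_b:
            num += w_bar
    return num / den

# Toy data (hypothetical): 50 inactive calibration samples with low scores.
calib_preds = [0.01 * i for i in range(50)]
ones = [1.0] * 50
p_strong = joint_weighted_p_value(calib_preds, ones, [0.95, 0.40], [1.0, 1.0])
p_weak = joint_weighted_p_value(calib_preds, ones, [0.10, 0.05], [1.0, 1.0])
```

A batch whose best score clearly exceeds every calibration score yields a small p-value, while a batch that looks like the inactive pool yields a large one. With non-unit weights, arrangements that are more likely under $Q$ simply count for more in both sums.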

C. Design: Conformal Nested Testing

To find a compact subset, CONFHIT employs a nested testing strategy:

Consider nested subsets of the generated batch: $C_1 \subset C_2 \subset \dots \subset C_N$ .
Compute p-values $p_k$ for each subset $C_k$ .
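Monotonicity Adjustment: Ensure the sequence of p-values is monotonic ( $p_1 \geq p_2 \geq \dots \geq p_N$ ) by taking a running maximum. For the result to be non-increasing in $k$ , the maximum must run from the largest subset downward, i.e., each $p_k$ is replaced by $\max_{j \geq k} p_j$ .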
Stopping Rule: Select the smallest index $\hat{N}$ such that $p_{\hat{N}} \leq \alpha$ .
Guarantee: Theorem 3.4 proves that this procedure controls the error rate (probability of returning a set with no hits) at level $\alpha$ , regardless of the generative model or scoring function.
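
As a minimal sketch of the adjustment and stopping rule, assume the per-subset p-values have already been computed. The function names and the example p-values are hypothetical, and the direction of the running maximum (from $C_N$ down to $C_1$) is an assumption inferred from the stated ordering $p_1 \geq \dots \geq p_N$ .

```python
def monotone_adjust(p_values):
    """Replace p_k with max_{j >= k} p_j so the sequence is non-increasing in k."""
    adjusted = list(p_values)
    for k in range(len(adjusted) - 2, -1, -1):
        adjusted[k] = max(adjusted[k], adjusted[k + 1])
    return adjusted

def smallest_certified_size(p_values, alpha):
    """Return the smallest k (1-indexed) with adjusted p_k <= alpha, or None."""
    for k, p in enumerate(monotone_adjust(p_values), start=1):
        if p <= alpha:
            return k
    return None

raw = [0.50, 0.04, 0.20, 0.03, 0.01]  # hypothetical p-values for C_1..C_5
adjusted = monotone_adjust(raw)
chosen = smallest_certified_size(raw, alpha=0.05)
```

Here the raw p-value for $C_2$ dips below $\alpha = 0.05$ , but the larger subset $C_3$ does not, so the adjustment raises $p_2$ to 0.20 and the procedure selects $C_4$ . Without the adjustment, an isolated low p-value could trigger an early stop that the larger subsets do not support.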

D. Practical Implementation

Density Ratio Estimation: Since $w(x)$ is unknown, it is estimated using Kernel Density Estimation (KDE) on latent features extracted from a property predictor.
Robustness: The framework includes diagnostics (balance checks, sensitivity analysis) to ensure the estimated weights do not degrade the statistical guarantees significantly.
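
For intuition, a density ratio can be estimated as the quotient of two kernel density estimates. The sketch below is a toy version under loud assumptions: features are one-dimensional synthetic Gaussians rather than latent features from a property predictor, the bandwidth is fixed by hand, and all names are hypothetical.

```python
import math
import random

def gaussian_kde(points, bandwidth):
    """Return a 1-D Gaussian kernel density estimate built from points."""
    norm = len(points) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

    return density

def estimate_density_ratio(calib_feats, gen_feats, bandwidth=0.3, floor=1e-12):
    """Estimate w(x) = dQ/dP as a ratio of two KDEs (Q: generated, P: calibration)."""
    p_hat = gaussian_kde(calib_feats, bandwidth)
    q_hat = gaussian_kde(gen_feats, bandwidth)

    def w_hat(x):
        # The floor guards against division by zero far from the calibration data.
        return q_hat(x) / max(p_hat(x), floor)

    return w_hat

rng = random.Random(0)
calib_feats = [rng.gauss(0.0, 1.0) for _ in range(500)]  # toy stand-in for P
gen_feats = [rng.gauss(1.0, 1.0) for _ in range(500)]    # toy stand-in for Q
w_hat = estimate_density_ratio(calib_feats, gen_feats)
```

The estimated ratio exceeds 1 where generated samples are over-represented (for example, `w_hat(2.0)`) and falls below 1 where calibration samples dominate (for example, `w_hat(-1.0)`). Estimates become unstable in regions where the calibration density is close to zero, which is one reason diagnostics such as balance checks are useful.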

3. Key Contributions

Oracle-Free Guarantees: CONFHIT is the first framework to provide finite-sample validity guarantees for generative design without requiring an experimental oracle to evaluate generated samples during the process.
Handling Distribution Shift: It introduces a density-ratio-weighted conformal p-value specifically designed for multiple test samples under covariate shift, extending previous single-sample weighted CP methods.
Nested Testing for Design: It proposes a novel conformal nested testing procedure that simultaneously certifies the existence of a hit and refines the candidate set to the smallest possible size while maintaining statistical validity.
Model Agnosticism: The method works with any generative model (VAE, Diffusion, Transformers) and any property predictor, relying only on the density ratio and a conformity score.

4. Experimental Results

The authors evaluated CONFHIT on two tasks:

Constrained Molecule Optimization (CMO): Optimizing molecules for DRD2 binding and QED scores while maintaining similarity to a seed.
Structure-Based Drug Discovery (SBDD): Generating ligands for specific protein pockets using 3D diffusion models.

Key Findings:

Valid Coverage: Across various models (HGraph2Graph, SelfEdit, TargetDiff, MolCraft) and budgets, CONFHIT consistently maintained error rates below the target $\alpha$ (e.g., $\alpha=0.1$ ), whereas baselines without shift correction failed.
Compact Sets: The nested testing procedure successfully pruned large generated batches (e.g., $N=15$ ) down to small, actionable sets (e.g., 2–5 molecules) without sacrificing the validity guarantee.
Comparison to Baselines:
- Bonferroni Correction: Too conservative, often resulting in empty sets (100% failure rate in SBDD at $\alpha=0.1$ ).
- Oracle-Based Methods: Failed to control error rates when distribution shift was ignored, even with oracle access.
- Heuristic Baselines: Unreliable error control when predictor quality degraded.
Robustness: The method remained valid even when the property predictor was noisy or inverted, indicating that statistical validity is decoupled from predictor accuracy (though accuracy still affects power and efficiency, i.e., how often batches are certified and how small the final set is).

5. Significance

CONFHIT bridges the gap between theoretical statistical guarantees and practical scientific discovery.

Resource Efficiency: By certifying batches before expensive wet-lab experiments, it prevents wasted resources on batches that are statistically unlikely to contain hits.
Actionable Output: It moves beyond simple "generation" to "certified design," providing researchers with a shortlist of candidates that are mathematically guaranteed to contain a success with high probability.
Generalizability: The framework is applicable beyond drug discovery to any scientific domain involving generative design under resource constraints and distribution shifts (e.g., materials science, protein engineering).

In summary, CONFHIT establishes a principled, reliable, and efficient workflow for using generative AI in high-stakes scientific discovery, ensuring that the "black box" of generative models is accompanied by rigorous, oracle-free statistical safety nets.