Original authors: Xinwu Ye, He Cao, Hao Li, Bin Feng, Zijing Liu, Xiangru Tang, Yu Li, Shenghua Gao

Published 2026-05-12

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a master chef trying to invent a new dish. You have two main goals:

Taste (Utility): The dish must be delicious and meet specific criteria (e.g., "low calorie," "spicy," or "gluten-free").
Variety (Diversity): You don't want to make 100 versions of the exact same spicy pasta. You want a menu with many different, unique options.

The Problem:
In the world of AI-generated molecules (for drugs) or proteins (for biology), current AI chefs are great at making one perfect dish. But if you ask them to make 100 dishes that are all "delicious," they tend to get stuck in a rut. They might make 100 slightly different versions of the same pasta dish. They lose variety because they are so focused on the "perfect taste."

Existing methods try to fix this by telling the AI, "Don't make anything you've made before." But this is like telling a chef, "Don't use salt because you used it yesterday." It's an indirect fix that can sometimes make the food taste weird or cause the chef to wander off into making inedible nonsense.

The Solution: SGRPO (Supergroup Relative Policy Optimization)
The paper introduces a new training method called SGRPO. Think of it as a "Group Taste-Test" strategy.

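Instead of judging every single dish one by one, SGRPO organizes the AI's attempts into groups (batches of dishes made for the same request) and bundles several such groups into a "Supergroup" so the batches can be compared against each other.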

Here is how it works, step-by-step, using our kitchen analogy:

The Group Challenge: The AI is asked to make a batch of 100 dishes for a specific request (e.g., "Make a spicy, low-calorie meal").
The Group Score: The system looks at the entire batch and asks: "How different are these 100 dishes from each other?"
- If the batch has 100 unique recipes (a salad, a curry, a soup, a taco, etc.), the group gets a high diversity score.
- If the batch has 100 versions of the same pasta, the group gets a low diversity score.
The Comparison: The AI makes several of these batches. It compares them. "Batch A was very diverse; Batch B was boring."
The "Leave-One-Out" Trick (The Secret Sauce): This is the clever part. The system doesn't just reward the whole group; it figures out which specific dishes made the group diverse.
- It asks: "If we removed this specific taco from the group, would the group still be diverse?"
- If removing the taco makes the group boring, that taco gets a huge bonus for being unique.
- If removing the taco doesn't change much (because there are 10 other tacos), that taco gets a smaller bonus.
The Reward: The AI learns that to get the best score, it needs to create dishes that are both tasty and genuinely unique contributors to the group's variety.
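The leave-one-out idea can be made concrete with a toy batch. The sketch below is only an illustration of the analogy, not the paper's implementation: dishes are represented as ingredient sets, the batch's variety is assumed to be the mean pairwise Jaccard distance, and the names `batch_diversity` and `leave_one_out_bonus` are hypothetical.

```python
from itertools import combinations

def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """1 - |a & b| / |a | b|: 0 means identical, 1 means nothing shared."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def batch_diversity(batch: list) -> float:
    """Assumed variety score: mean pairwise distance within the batch."""
    pairs = list(combinations(batch, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

def leave_one_out_bonus(batch: list) -> list:
    """For each dish: batch variety with it minus batch variety without it."""
    full = batch_diversity(batch)
    return [
        full - batch_diversity(batch[:i] + batch[i + 1:])
        for i in range(len(batch))
    ]

# Three near-identical pastas and one taco (made-up ingredients).
pasta_a = frozenset({"pasta", "tomato", "chili"})
pasta_b = frozenset({"pasta", "tomato", "chili", "basil"})
pasta_c = frozenset({"pasta", "tomato", "chili", "garlic"})
taco = frozenset({"tortilla", "beans", "lime"})

batch = [pasta_a, pasta_b, pasta_c, taco]
bonuses = leave_one_out_bonus(batch)

for name, bonus in zip(["pasta_a", "pasta_b", "pasta_c", "taco"], bonuses):
    print(f"{name}: {bonus:+.2f}")
```

In this toy batch, removing the taco drops the variety score from 0.65 to 0.30, so the taco earns the largest bonus. Removing any one pasta actually raises the score, so under this particular metric the pastas' contributions are slightly negative.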

Why is this better?

No More "Echo Chambers": Old methods often forced the AI to be different just for the sake of being different, leading to bad results. SGRPO rewards useful diversity.
The "Pareto Frontier": In math terms, this paper claims SGRPO pushes the "frontier" outward. Imagine a graph where the X-axis is "Taste" and the Y-axis is "Variety."
- Old AI could get high taste but low variety, or high variety but low taste.
- SGRPO allows the AI to get high taste AND high variety at the same time. It expands the menu of possible options.
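To make "pushing the frontier outward" concrete, here is a minimal sketch that finds the non-dominated (taste, variety) points in a set of operating points. The numbers are invented for illustration and are not results from the paper; `pareto_front` is a hypothetical helper name.

```python
def pareto_front(points):
    """Keep points that no other point beats on both axes (higher is better)."""
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)

# Illustrative (taste, variety) operating points, not measured values.
old_method = [(0.9, 0.2), (0.6, 0.5), (0.3, 0.8)]
new_method = [(0.9, 0.5), (0.7, 0.7), (0.4, 0.9)]

front = pareto_front(old_method + new_method)
print(front)
```

With these made-up points, every point on the combined frontier comes from the second list, which is what an outward shift of the frontier looks like: for each old trade-off there is a new one that is at least as good on both axes.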

Where did they test this?
The authors tested this "Group Taste-Test" on three real-world cooking challenges:

Inventing new small molecules (like creating new chemical structures for drugs from scratch).
Designing molecules that fit a specific pocket (like designing a key that fits a specific lock, such as a protein binding site).
Designing new proteins (creating new biological sequences).

The Results:
In all three cases, SGRPO produced a wider range of high-quality options than previous methods. It didn't just make the AI "random"; it made the AI smarter at balancing the need for a specific function with the need for creative variety.

In a Nutshell:
SGRPO is a training method that teaches AI to stop copying itself. Instead of judging every attempt in isolation, it judges groups of attempts, rewarding the AI for creating batches where every single item adds something new and valuable to the mix. This leads to a wider, more useful selection of scientific discoveries.

Technical Summary: Pushing Biomolecular Utility-Diversity Frontiers with Supergroup Relative Policy Optimization

1. Problem Statement

Biomolecular generation aims to produce candidates (small molecules or proteins) that satisfy specific chemical or biological objectives. While Reinforcement Learning (RL) is a standard framework for post-training pretrained generators to maximize utility (e.g., drug-likeness, docking scores, stability), optimizing for utility alone often leads to mode collapse, where the generator concentrates probability mass on a narrow family of high-scoring candidates. Conversely, maintaining diversity is challenging because diversity is a set-level property (defined over collections of samples), whereas standard policy optimization updates individual rollouts.

Existing diversity-promoting RL methods often rely on indirect proxies, such as novelty penalties relative to historical samples or memory-based filters. These approaches can over-penalize useful high-density modes or induce distributional drift. The central challenge addressed by this paper is: How can we optimize sample-set diversity directly as a first-class objective while still assigning useful credit to individual generated candidates?

2. Methodology: Supergroup Relative Policy Optimization (SGRPO)

The authors propose Supergroup Relative Policy Optimization (SGRPO), a flexible framework inspired by Group Relative Policy Optimization (GRPO) that directly optimizes the trade-off between rollout-level utility and set-level diversity.

Core Mechanism

SGRPO operates by sampling multiple candidate sets (groups) under the same condition and comparing them to redistribute diversity rewards. The process involves four key steps:

Same-Condition Supergroup Sampling:
For a given condition $C$ (e.g., a specific protein pocket or no condition), the policy samples $M$ groups, each containing $K$ independently generated rollouts. This collection $S(C) = \{G_1, \dots, G_M\}$ forms a "supergroup" of $N=MK$ candidates. Comparisons are restricted to this supergroup to avoid confounding policy quality with the intrinsic difficulty of different conditions.
Group-Level Diversity Scoring:
Each group $G_m$ is assigned a diversity score $R_m = D(G_m)$ based on a user-specified metric (e.g., internal Tanimoto distance for molecules, or a Levenshtein-based sequence dissimilarity for proteins). The group-relative diversity signal is computed via a leave-one-out comparison against the other groups in the supergroup:
$A_m^{grp} = R_m - \frac{1}{M-1}\sum_{h \neq m} R_h$
A positive $A_m^{grp}$ indicates the group is more diverse than its alternatives.
Set-Aware Redistribution (Leave-One-Out Contribution):
To bridge the gap between set-level signals and individual policy updates, SGRPO redistributes the group diversity reward to individual rollouts based on their contribution to the group's diversity.
- For each rollout $x_{m,i}$, a leave-one-out contribution is calculated: $c_{m,i} = D(G_m) - D(G_m \setminus \{x_{m,i}\})$.
- These contributions are standardized and used to form sign-aware softmax weights.
- If a group is more diverse than its alternatives ($A_m^{grp} > 0$), the positive signal is concentrated on candidates with high diversity contributions. If a group is less diverse ($A_m^{grp} < 0$), the negative signal is concentrated on candidates with low contributions.
- This ensures the average group reward is preserved while assigning credit to the specific candidates driving diversity.
Supergroup-Relative Policy Update:
The final reward for a rollout combines its individual utility $r_{m,i}$ and the redistributed diversity reward $\tilde{R}_{m,i}$:
$\bar{r}_{m,i} = (1-\lambda)r_{m,i} + \lambda \tilde{R}_{m,i}$
where $\lambda$ controls the utility-diversity trade-off. Each rollout's advantage is computed by subtracting a leave-one-out baseline taken over the other rollouts in the supergroup, and the policy is updated using a clipped PPO-style objective with KL regularization.
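The four steps above can be sketched end to end in a few lines. This is a hedged illustration, not the authors' code. It assumes candidates are hypothetical fingerprint bit sets, that $D$ is the mean pairwise Tanimoto distance, that contributions are standardized within each group, and that the redistributed reward takes the form $\tilde{R}_{m,i} = K \, w_{m,i} \, A_m^{grp}$ with softmax weights $w_{m,i}$ and temperature `tau`. The summary does not give the exact redistribution formula, so that form is only one choice that keeps the group mean equal to $A_m^{grp}$. The function name `sgrpo_advantages` is hypothetical, and the clipped policy update itself is omitted.

```python
import math
from itertools import combinations

def tanimoto_distance(a: frozenset, b: frozenset) -> float:
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0

def diversity(group: list) -> float:
    """Assumed D(G): mean pairwise Tanimoto distance within a group."""
    pairs = list(combinations(group, 2))
    if not pairs:
        return 0.0
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)

def sgrpo_advantages(supergroup, utilities, lam=0.5, tau=1.0):
    """Return (A_grp, mixed rewards, advantages) for M groups of K rollouts."""
    M, K = len(supergroup), len(supergroup[0])
    R = [diversity(g) for g in supergroup]
    # Step 2: group-relative signal, leave-one-out over the other groups.
    A_grp = [R[m] - (sum(R) - R[m]) / (M - 1) for m in range(M)]

    mixed = []
    for m, group in enumerate(supergroup):
        # Step 3: leave-one-out contribution c_{m,i} = D(G) - D(G without x_i).
        c = [R[m] - diversity(group[:i] + group[i + 1:]) for i in range(K)]
        mean_c = sum(c) / K
        std_c = math.sqrt(sum((x - mean_c) ** 2 for x in c) / K) or 1.0
        z = [(x - mean_c) / std_c for x in c]
        # Sign-aware softmax: favor high contributors when A > 0,
        # low contributors when A < 0 (assumed form).
        sign = 1.0 if A_grp[m] >= 0 else -1.0
        logits = [sign * zi / tau for zi in z]
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]
        w = [e / sum(exps) for e in exps]
        R_tilde = [K * w[i] * A_grp[m] for i in range(K)]  # group mean = A_grp[m]
        # Step 4: mix utility and redistributed diversity reward.
        mixed.append(
            [(1 - lam) * utilities[m][i] + lam * R_tilde[i] for i in range(K)]
        )

    # Leave-one-out baseline over all N = M * K rollouts in the supergroup.
    N = M * K
    total = sum(sum(row) for row in mixed)
    adv = [[r - (total - r) / (N - 1) for r in row] for row in mixed]
    return A_grp, mixed, adv

# Toy supergroup: one diverse group and one nearly collapsed group.
diverse = [frozenset({1, 2, 3}), frozenset({4, 5, 6}), frozenset({7, 8, 9})]
collapsed = [frozenset({1, 2, 3}), frozenset({1, 2, 3}), frozenset({1, 2, 4})]
utilities = [[0.8, 0.8, 0.8], [0.8, 0.8, 0.8]]

A_grp, mixed, adv = sgrpo_advantages([diverse, collapsed], utilities)
print(A_grp)
print(adv)
```

With equal utilities, the diverse group's rollouts all receive positive advantages. Inside the collapsed group, the two exact duplicates absorb most of the negative signal, while the one candidate that differs is penalized least.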

3. Key Contributions

Direct Set-Level Optimization: SGRPO is the first framework to directly optimize set-level diversity as a primary objective in biomolecular generation, rather than relying on indirect novelty proxies or history-dependent penalties.
Decoupled Design: The method is decoupled from specific generator architectures (autoregressive or diffusion), utility rewards, or diversity metrics, allowing instantiation with various GRPO-style optimizers (e.g., standard GRPO for autoregressive models, Coupled-GRPO for discrete diffusion).
Credit Assignment Mechanism: The introduction of a leave-one-out diversity contribution mechanism allows the system to assign credit to individual candidates that genuinely support set diversity, solving the problem of optimizing a set property with individual updates.
Robustness to Group Size: The method remains effective even with small group sizes, making it computationally feasible for large-scale biomolecular design.

4. Experimental Results

The authors evaluated SGRPO across three distinct biomolecular generation settings:

De Novo Small-Molecule Design: Using GenMol (discrete diffusion).
Pocket-Based Small-Molecule Design: Using GenMol-P (pocket-conditioned diffusion).
De Novo Protein Design: Using ProGen2 (autoregressive).

Performance Metrics:
The evaluation focused on the Utility-Diversity Pareto Frontier, measured by:

Hypervolume (HV): The area dominated by the non-dominated operating points, measured relative to a reference point (larger is better).
Distance to Ideal Point (DIP): Distance to the theoretical best utility and diversity (smaller is better).
R2 Indicator: Average weighted worst-case shortfall (smaller is better).
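A small sketch of how these three indicators can be computed for a two-objective frontier follows. It rests on several assumptions that the summary does not confirm: both axes are normalized to $[0, 1]$, the reference point is $(0, 0)$, the ideal point is $(1, 1)$, DIP is taken as the minimum Euclidean distance from any frontier point to the ideal point, and R2 uses the standard weighted Tchebycheff form with evenly spaced weight vectors. The operating points are invented for illustration.

```python
import math

def hypervolume_2d(front, ref=(0.0, 0.0)):
    """Area dominated by `front` above `ref` (both objectives maximized)."""
    area, best_y = 0.0, ref[1]
    for x, y in sorted(front, key=lambda p: p[0], reverse=True):
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area

def distance_to_ideal(front, ideal=(1.0, 1.0)):
    """Assumed DIP: smallest Euclidean distance from the front to the ideal."""
    return min(math.dist(p, ideal) for p in front)

def r2_indicator(front, ideal=(1.0, 1.0), n_weights=11):
    """Mean over weight vectors of the best weighted worst-case shortfall."""
    total = 0.0
    for k in range(n_weights):
        w0 = k / (n_weights - 1)
        w1 = 1.0 - w0
        total += min(
            max(w0 * (ideal[0] - x), w1 * (ideal[1] - y)) for x, y in front
        )
    return total / n_weights

# Illustrative (utility, diversity) operating points, not reported results.
baseline = [(0.9, 0.2), (0.6, 0.5), (0.3, 0.8)]
expanded = [(0.9, 0.5), (0.7, 0.7), (0.4, 0.9)]

for name, front in [("baseline", baseline), ("expanded", expanded)]:
    print(
        name,
        round(hypervolume_2d(front), 3),
        round(distance_to_ideal(front), 3),
        round(r2_indicator(front), 3),
    )
```

An outward-shifted frontier should score higher on HV and lower on both DIP and R2, which is the pattern the made-up `expanded` points show against `baseline`.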

Findings:

Frontier Expansion: Across all tasks and decoding sweeps, SGRPO consistently expanded the attainable utility-diversity Pareto frontier compared to pretrained generators, standard GRPO, and memory-assisted GRPO.
Preservation of Diversity: In high-utility regimes where baselines (especially GRPO) suffered from severe diversity collapse, SGRPO maintained significantly higher diversity. For example, in pocket-based design, SGRPO shifted the operating-point trajectory outward, retaining markedly higher diversity at comparable utility levels.
Ablation Studies: Removing the diversity reward reverted performance to standard GRPO levels. Removing the leave-one-out credit assignment weakened performance, confirming the importance of redistributing diversity signals to specific contributors.
Distribution Dynamics: Visualizations of sequence distributions showed that while GRPO and memory-assisted GRPO contracted into narrow regions or drifted, SGRPO explored and preserved multiple clusters, indicating broader coverage of the sequence space.

5. Significance and Claims

The paper claims that SGRPO offers a broadly applicable post-training principle for expanding the operating points available to biomolecular generation models. By directly rewarding diverse generated sets, the method pushes the utility-diversity frontier outward, allowing models to achieve better utility at fixed diversity levels (and vice versa) without sacrificing coverage.

The authors emphasize that this approach addresses a fundamental limitation in current RL-based biomolecular design: the inability to treat diversity as a first-class objective during policy optimization. The results suggest that directly optimizing set-level diversity is a practical and robust strategy for mitigating mode collapse in both small-molecule and protein generation tasks. The code is made available to facilitate further research and instantiation in different domains.
