🧪 The Big Picture: Teaching a Robot Chemist to Invent New Drugs
Imagine you have a brilliant but inexperienced Robot Chemist (a Large Language Model, or LLM). Your goal is to give it a starting molecule and a specific instruction, like: "Make this molecule better at treating headaches, but keep its structure mostly intact, or it may no longer behave safely in the body."
This is called Molecular Optimization. It's a balancing act: you want to improve a property (like effectiveness) while keeping the structure similar to the original (to ensure safety).
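This balancing act can be sketched as a toy scoring function. Everything below is a hypothetical stand-in, not the paper's actual objective: real pipelines would use a chemistry toolkit (e.g. fingerprint-based Tanimoto similarity), while here the property and similarity scores are deliberately simplistic.

```python
# Toy sketch of the molecular-optimization objective: improve a property
# while staying structurally similar to the original molecule.
# Both scoring functions are hypothetical stand-ins, not real chemistry.

def property_score(molecule: str) -> float:
    # Hypothetical stand-in: pretend longer SMILES strings are "more effective".
    return len(molecule) / 10.0

def similarity(a: str, b: str) -> float:
    # Hypothetical stand-in for fingerprint (Tanimoto) similarity:
    # here, just the overlap between the two molecules' character sets.
    shared = len(set(a) & set(b))
    total = len(set(a) | set(b))
    return shared / total if total else 1.0

def optimization_score(candidate: str, original: str, min_sim: float = 0.4) -> float:
    """Reward property improvement only if the structure stays similar."""
    if similarity(candidate, original) < min_sim:
        return 0.0  # too different from the original: rejected outright
    return property_score(candidate) - property_score(original)

original = "CCO"    # ethanol, as a toy starting point
candidate = "CCCO"  # a small edit that keeps most of the structure
print(optimization_score(candidate, original))
```

The key design choice is the hard similarity gate: a candidate that drifts too far from the original scores zero no matter how "effective" it is, which is exactly the tension the rest of this article is about.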
The paper argues that the current ways of teaching this robot are failing, and they propose a new, smarter method called RePO.
🚧 The Problem: Why Current Methods Fail
The researchers tried two standard ways to teach the robot, and both had major flaws:
1. The "Copycat" Method (Supervised Fine-Tuning / SFT)
- How it works: You show the robot a list of problems and their perfect answers. It memorizes them.
- The Flaw: The robot becomes a parrot. It stops thinking.
- Analogy: Imagine a student who is only shown the final answer to a math problem. They memorize the number "42" but have no idea how to solve the equation. If you give them a slightly different problem, they freeze.
- In the paper, this method caused the robot to skip the "thinking process" (reasoning) entirely and just spit out a molecule. Because it never learned how to get there, it couldn't handle new or complex instructions.
2. The "Trial and Error" Method (Reinforcement Learning / RLVR)
- How it works: You let the robot try millions of random changes. If it gets a good result, you give it a "treat" (reward). If it fails, you give it a "thumbs down."
- The Flaw: The "treats" are too rare.
- Analogy: Imagine trying to teach a dog to find a specific hidden key in a massive, dark forest. If the dog only gets a treat when it finds the exact key, it might wander for years without ever getting a reward. It gets discouraged and stops trying.
- In chemistry, finding a molecule that is both effective and structurally similar is very hard. The robot gets stuck making tiny, safe changes that don't actually improve the drug, because it never gets a "win" signal to encourage bigger leaps.
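The sparse-reward problem above can be simulated in a few lines. The probabilities here are invented toy numbers chosen only to illustrate the point, not figures from the paper: each random edit independently has a small chance of improving the property and a moderate chance of staying similar, and the all-or-nothing reward fires only when both happen.

```python
import random

# Toy simulation of why all-or-nothing rewards stall learning.
# The success probabilities are illustrative assumptions, not data.

random.seed(0)

def random_edit_outcome() -> bool:
    # Assume a random edit improves the property 5% of the time
    # and stays structurally similar 30% of the time (toy numbers).
    improves = random.random() < 0.05
    stays_similar = random.random() < 0.30
    return improves and stays_similar

def sparse_reward(outcome: bool) -> int:
    return 1 if outcome else 0  # no partial credit whatsoever

trials = 10_000
wins = sum(sparse_reward(random_edit_outcome()) for _ in range(trials))
print(f"reward received in {wins}/{trials} trials")
```

With both conditions required, the agent is rewarded on only a small fraction of attempts; almost every trajectory returns a flat zero, so there is no gradient telling the model which of its failed attempts were *almost* right.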
💡 The Solution: RePO (Reference-Guided Policy Optimization)
The authors created RePO, which is like giving the robot a Mentor and a Map at the same time.
RePO combines the best of both worlds:
- The "Mentor" (Reference Guidance): The robot is shown a "Reference Molecule" (a good example of a solution).
- The "Explorer" (RL Reasoning): The robot is allowed to think through the steps and try different paths to get there.
How RePO Works (The "Chef" Analogy)
Imagine you are teaching a Junior Chef to make a perfect soup.
- The Goal: Make the soup tastier (improve the property) but keep the base ingredients recognizable (maintain similarity).
- The Reference: You have a photo of a delicious soup made by a Master Chef.
- The RePO Process:
- The Chef thinks out loud: "Okay, I need to add more salt. Maybe I'll swap the carrots for celery..." (This is the Reasoning/Trajectory).
- The Mentor checks the final bowl: The Chef serves the soup, and you compare the final taste to the Master Chef's photo.
- The Feedback Loop:
- If the soup tastes good, the Chef gets a reward for its thinking process (encouraging it to keep exploring).
- Crucially: The Chef is also told, "Hey, your final bowl looks a lot like the Master Chef's photo. Good job matching the target!" (This is the Reference Guidance).
Why this is genius:
- The Reference keeps the robot from wandering off into nonsense (like adding chocolate to the soup). It anchors the robot to a valid solution.
- The Reasoning part allows the robot to figure out how to get there, rather than just copying the photo. It learns the logic of cooking, not just the picture.
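The feedback loop above amounts to reward shaping: add a dense "closeness to the reference" term on top of the sparse task reward. The sketch below captures that spirit only; the weighting, the similarity function, and the task check are all illustrative assumptions, not the paper's exact formula.

```python
# Sketch of reference-guided reward shaping, in the spirit of RePO.
# NOT the paper's actual objective: the weight and scoring functions
# below are toy stand-ins.

def similarity(a: str, b: str) -> float:
    # Hypothetical stand-in for a fingerprint similarity (e.g. Tanimoto).
    shared = len(set(a) & set(b))
    total = len(set(a) | set(b))
    return shared / total if total else 1.0

def task_reward(candidate: str, original: str) -> float:
    # Sparse check: did the edit improve the (toy) property at all?
    return 1.0 if len(candidate) > len(original) else 0.0

def repo_style_reward(candidate: str, original: str, reference: str,
                      guidance_weight: float = 0.5) -> float:
    """Sparse task reward plus dense credit for resembling a known-good reference."""
    return (task_reward(candidate, original)
            + guidance_weight * similarity(candidate, reference))

# A candidate that misses the sparse reward still earns partial credit for
# resembling the reference, so exploration keeps receiving feedback.
print(repo_style_reward("CC", "CCO", reference="CCCO"))    # → 0.25 (guidance only)
print(repo_style_reward("CCCO", "CCO", reference="CCCO"))  # → 1.5  (both terms)
```

This is why the "treats" stop being rare: instead of a flat zero for every near-miss, the reference term grades how close each attempt landed to a valid solution, while the reasoning trajectory remains free to find its own path there.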
🏆 The Results: Why RePO Wins
The paper tested RePO on real-world chemical datasets (TOMG-Bench and MuMOInstruct). Here is what happened:
- Better Balance: RePO found molecules that were both effective and safe. Other methods either made great drugs that were too different from the original (unsafe) or kept the shape but didn't improve the drug (useless).
- Better Thinking: Unlike the "Copycat" method, RePO actually generated step-by-step reasoning. It could explain why it changed a molecule (e.g., "I swapped bromine for chlorine to reduce toxicity").
- Generalization: When the instructions changed (e.g., "Make it taste like strawberries" instead of "Make it sweet"), RePO adapted. The other methods got confused.
🚀 Summary in One Sentence
RePO teaches AI chemists to explore new ideas freely, but uses a "good example" as a safety net to ensure they don't wander off into chemical nonsense, resulting in smarter, safer, and more innovative drug designs.