SEISMO: Increasing Sample Efficiency in Molecular… — Plain-Language Explanation

Original authors: Fabian P. Krüger, Andrea Hunklinger, Adrian Wolny, Tim J. Adler, Igor Tetko, Santiago David Villalba

Published 2026-02-19

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a master chef trying to invent the perfect new recipe for a dish that needs to be delicious, healthy, and cheap to make. However, there's a catch: tasting the dish is incredibly expensive and slow. You can only afford to taste it 50 times before your budget runs out.

Most computer programs trying to solve this problem act like a blindfolded chef throwing random ingredients into a pot, tasting the result, and hoping for the best. They might taste thousands of bad dishes before finding a good one. This is inefficient and wasteful.

This paper introduces SEISMO, a new "chef" powered by a Large Language Model (an AI that has read almost every cookbook and scientific paper ever written). SEISMO doesn't just guess; it learns from every single taste test and uses its vast knowledge of cooking chemistry to figure out exactly what to change next.

Here is how SEISMO works, broken down into simple concepts:

1. The "Blind" vs. The "Trajectory-Aware" Chef

Old Methods (The Blind Chef): Imagine a chef who tastes a dish, gets a score of "6/10," and then forgets the recipe. They try a completely new, random recipe next time. They don't remember why it was a 6 or what specific ingredient caused the problem. They are like a gambler rolling dice.
SEISMO (The Trajectory-Aware Chef): SEISMO is different. It keeps a detailed diary of every single attempt. When it tastes a dish and gets a "6," it doesn't just see the number. It reads the "taste notes" (explanations) that say, "Too salty, not enough spice, and the meat is too tough."
- It uses this diary to say: "Okay, last time I added too much salt. This time, I'll reduce the salt and add more paprika, because I know from my training that paprika pairs well with this meat."
- It connects the dots between the history of its attempts and the future of its guesses.

2. The "Oracle" (The Expensive Taste Test)

In the real world, testing a new drug molecule is like that expensive taste test. You have to run complex lab experiments or supercomputer simulations to see if it works. These are called Oracles.

Because these tests are so costly, scientists want to find the perfect molecule in as few tests as possible.
SEISMO is designed to be ultra-efficient. While other methods might need 1,000 taste tests to find a winner, SEISMO often finds a near-perfect recipe in just 50 tests.

3. The Secret Sauce: "Explanations"

The paper found that just giving the AI a score (like "6/10") isn't enough. The AI needs to know why.

Scenario A (No Explanation): The AI gets a score of "6." It guesses it needs to change the salt, but it might actually need to change the heat. It's guessing in the dark.
Scenario B (With Explanation): The AI gets a score of "6" and a note saying, "The salt is fine, but the heat was too low, making the meat tough."
- SEISMO uses this note to make a precise correction.
- The paper showed that when SEISMO gets these "explanations" (generated by AI tools that analyze why a molecule scored poorly), it becomes much faster at finding the solution. It's like having a sous-chef whispering, "Don't add more salt, turn up the heat!"

4. Why This Matters

Think of drug discovery as searching for a needle in a haystack the size of the Earth.

Traditional AI is like a robot that randomly grabs handfuls of hay, checks for a needle, and throws them away. It takes forever.
SEISMO is like a robot that has read every book on needles and hay. It knows that needles are usually metallic and sharp. It uses its "common sense" (pre-trained knowledge) to ignore the straw and focus on the metal. When it picks up a piece of hay that almost looks like a needle, it gets a note saying, "It's metal, but it's bent." It then straightens it out immediately.

The Bottom Line

SEISMO is a smart, memory-keeping AI agent that optimizes molecules by talking to itself about its past failures and successes. By combining its vast library of chemical knowledge with detailed feedback from expensive tests, it finds better drugs much faster than previous methods.

In short: It turns the process of drug discovery from a game of "guess and check" into a strategic conversation, saving time, money, and resources in the race to cure diseases.

1. Problem Statement

Molecular optimization is a critical bottleneck in drug discovery, requiring the identification of molecules with specific properties (e.g., potency, selectivity, drug-likeness). The primary challenge is sample efficiency:

Costly Evaluations: Real-world evaluation of molecular properties often relies on experimental assays or high-fidelity computational simulations (e.g., protein-ligand co-folding), which are expensive, time-consuming, and rate-limited.
Inefficiency of Current Methods: Existing approaches, such as Reinforcement Learning (REINVENT), Genetic Algorithms (Graph-GA), and Bayesian Optimization, typically operate in batches or require thousands of oracle calls to converge. They often treat the optimization process as a black-box search, discarding rich contextual information available in the optimization history.
The Gap: There is a need for an agent that can optimize molecules strictly online (updating after every single oracle call) while leveraging domain knowledge and structured feedback to minimize the number of expensive evaluations required.

2. Methodology: SEISMO

The authors introduce SEISMO (Sample-Efficient Inference-Stage Molecular-Optimization agent), a goal-directed Large Language Model (LLM) agent that performs optimization entirely at inference time.

Core Architecture

Trajectory-Aware Conditioning: Unlike traditional methods that learn a policy or surrogate model over batches, SEISMO conditions every new molecule proposal on the full optimization trajectory ($H_{t-1}$). This includes the initial task description, all previous molecule proposals (SMILES), their scores, and any explanatory feedback.
Strictly Online Loop: The agent operates in a cyclic workflow:
1. Generation: The LLM proposes a new molecule (SMILES) and a rationale based on the full conversation history.
2. Parsing & Validation: The output is parsed into JSON; invalid SMILES trigger a re-generation.
3. Oracle Evaluation: The molecule is evaluated by an oracle (simulator or experimental proxy) returning a score.
4. Feedback Integration: The score, sub-scores, and structured explanations (if available) are appended to the history.
5. Iteration: The loop repeats immediately without waiting for a batch to complete.
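The five steps above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the names `Trajectory`, `parse_proposal`, and `optimize` are hypothetical, and the LLM (`propose`), the oracle, and the SMILES validity check (`is_valid`) are passed in as plain callables so that no particular LLM or chemistry library is assumed. The retry limit is also an assumption.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Step:
    smiles: str
    rationale: str
    score: float
    feedback: str = ""


@dataclass
class Trajectory:
    """Full optimization history: task description plus every evaluated step."""
    task: str
    steps: list = field(default_factory=list)

    def to_prompt(self) -> str:
        # Render the whole history as text so the next proposal is
        # conditioned on every earlier molecule, score, and explanation.
        lines = [f"Task: {self.task}"]
        for i, s in enumerate(self.steps, 1):
            lines.append(f"[{i}] {s.smiles} score={s.score:.3f} {s.feedback}".rstrip())
        lines.append('Reply as JSON: {"smiles": ..., "rationale": ...}')
        return "\n".join(lines)


def parse_proposal(raw, is_valid):
    """Return (smiles, rationale), or None if the reply is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    smiles = data.get("smiles")
    if not isinstance(smiles, str) or not is_valid(smiles):
        return None
    return smiles, str(data.get("rationale", ""))


def optimize(task, propose, oracle, is_valid, budget=50, max_retries=3):
    """Strictly online loop: one proposal, one oracle call, one history update."""
    traj = Trajectory(task)
    for _ in range(budget):
        parsed = None
        for _ in range(max_retries):  # re-generate on invalid output
            parsed = parse_proposal(propose(traj.to_prompt()), is_valid)
            if parsed is not None:
                break
        if parsed is None:
            break  # assumption: stop if no valid molecule can be produced
        smiles, rationale = parsed
        score, feedback = oracle(smiles)  # the expensive evaluation
        traj.steps.append(Step(smiles, rationale, score, feedback))
    return traj
```

In this sketch only valid proposals consume oracle budget, which matches the idea that the oracle call is the expensive step; a real validity check would typically come from a cheminformatics toolkit.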

Information Modalities

SEISMO leverages three distinct types of information to guide the search:

Task Descriptions: Natural language descriptions of the objective (e.g., "Minimize IC50 while maintaining QED > 0.6").
Scalar Scores: The aggregate objective value and sub-component scores (e.g., individual QED metrics).
Explanatory Feedback (XAI): Post-hoc explanations generated by methods like SHAP (for binding affinity) or property decomposition (for QED). These explain why a molecule received a specific score (e.g., "High molecular weight reduced QED" or "Sulfur atom increased binding affinity").
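To make the three modalities concrete, the sketch below shows one way an oracle result could be turned into text before it is appended to the history. The `OracleFeedback` record and the `render_feedback` function are hypothetical names chosen for illustration, and the output layout is an assumption rather than the format used in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class OracleFeedback:
    score: float                                       # aggregate objective value
    sub_scores: dict = field(default_factory=dict)     # e.g. {"qed": 0.35}
    explanations: list = field(default_factory=list)   # post-hoc XAI notes


def render_feedback(smiles, fb):
    """Format one oracle result as plain text for the LLM's history."""
    parts = [f"Molecule: {smiles}", f"Score: {fb.score:.2f}"]
    if fb.sub_scores:
        subs = ", ".join(f"{k}={v:.2f}" for k, v in sorted(fb.sub_scores.items()))
        parts.append(f"Sub-scores: {subs}")
    for note in fb.explanations:
        parts.append(f"Explanation: {note}")
    return "\n".join(parts)


if __name__ == "__main__":
    fb = OracleFeedback(
        score=0.41,
        sub_scores={"qed": 0.35, "potency": 0.62},
        explanations=["High molecular weight reduced QED"],
    )
    print(render_feedback("CCO", fb))
```

The numbers in the usage example are made up; only the explanation string is taken from the example given in the text.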

Implementation Details

LLM Backbone: The system uses Claude Opus 4.5 (selected via ablation studies for its balance of performance and cost).
Workflow: Implemented using LangGraph to manage stateful, multi-step workflows including generation, validation, and final summarization.
Oracles: Tested on the Practical Molecular Optimization (PMO) benchmark, custom hit-identification proxies (SARS-CoV-2 Mpro), and expensive co-folding oracles (Boltz-2).

3. Key Contributions

Inference-Time Optimization: SEISMO shifts the paradigm from learning a policy to performing optimization as a sequential decision process within the LLM's context window, eliminating the need for training or batch updates.
Trajectory Conditioning: It demonstrates that conditioning on the full history of trials allows the LLM to reason about cause-and-effect across iterations, effectively acting as an "experience replay" mechanism without explicit memory buffers.
Integration of Explanatory Feedback: The paper introduces the use of structured explanations (beyond scalar scores) as control signals. This allows the agent to understand which structural features drive performance, significantly accelerating convergence.
Zero-Shot Adaptability: SEISMO requires no fine-tuning or hyperparameter tuning. It leverages the pre-trained chemical knowledge of the LLM and adapts to new tasks purely through prompt engineering and in-context learning.

4. Experimental Results

A. PMO Benchmark (23 Tasks)

Performance: SEISMO achieved a 2–3× improvement in the Area Under the Optimization Curve (AUC) compared to strong baselines (REINVENT, Graph-GA, GP-BO).
Sample Efficiency: SEISMO reached near-maximal task scores (≥0.9 of the maximum) within 50 oracle calls. In contrast, baselines typically require thousands of calls to reach similar performance.
Ablation: The "No Description" variant (black-box setting) performed significantly worse, indicating that explicit task context is crucial for leveraging the LLM's prior knowledge.
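The two metrics mentioned above, the area under the optimization curve and the number of oracle calls needed to reach a score threshold, can be illustrated with a small sketch. It assumes scores lie in [0, 1] and tracks only the single best molecule found so far; the benchmark's exact metric may differ (for example, by averaging over the top-k molecules), and the function names are hypothetical.

```python
def best_so_far(scores):
    """Running maximum of the scores, one entry per oracle call."""
    best, curve = float("-inf"), []
    for s in scores:
        best = max(best, s)
        curve.append(best)
    return curve


def optimization_auc(scores, budget):
    """Mean of the best-so-far curve over a fixed budget of oracle calls.

    With scores in [0, 1] the result is also in [0, 1]; a run that finds
    good molecules early gets a higher value than one that finds them late.
    """
    curve = best_so_far(scores[:budget])
    if not curve:
        return 0.0
    # If the run stopped early, carry the final best value to the end.
    curve += [curve[-1]] * (budget - len(curve))
    return sum(curve) / budget


def calls_to_reach(scores, threshold):
    """1-based index of the first oracle call scoring >= threshold, or None."""
    for i, s in enumerate(scores, 1):
        if s >= threshold:
            return i
    return None


if __name__ == "__main__":
    run = [0.2, 0.5, 0.4, 0.9]  # made-up scores for illustration
    print(best_so_far(run))
    print(optimization_auc(run, budget=4))
    print(calls_to_reach(run, threshold=0.9))
```

Because the curve is averaged over the whole budget, two runs that end at the same best score can still have very different AUC values, which is why the metric rewards sample efficiency rather than only the final result.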

B. Impact of Information Levels

Experiments on medicinal chemistry tasks (Hit Identification and Lead Optimization) revealed a hierarchy of efficiency:

Full Explanation (Score + Sub-scores + Task Description + XAI): Highest efficiency. The agent rapidly identified structural changes needed to improve scores.
Partial Explanation: Moderate efficiency.
No Explanation (Score + Sub-scores + Task Description): Lower efficiency.
No Description: Worst performance, often failing to converge.

Finding: Telling the agent why a molecule scored poorly (via explanations) matters as much as telling it what the score was.
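The four information levels can be pictured as filters on what the agent is allowed to see after each oracle call. The sketch below is illustrative only and rests on two assumptions that the summary does not spell out: that "No Description" still shows scores and sub-scores but hides the task description, and that "Partial Explanation" keeps only a subset of the explanations (here, the first one). The names `InfoLevel` and `build_context` are hypothetical.

```python
from enum import Enum


class InfoLevel(Enum):
    NO_DESCRIPTION = 0
    NO_EXPLANATION = 1
    PARTIAL_EXPLANATION = 2
    FULL_EXPLANATION = 3


def build_context(level, task, score, sub_scores, explanations):
    """Return only the fields the agent may see at the given level."""
    ctx = {"score": score, "sub_scores": dict(sub_scores)}
    if level is not InfoLevel.NO_DESCRIPTION:
        ctx["task"] = task  # assumption: only the black-box variant hides it
    if level is InfoLevel.PARTIAL_EXPLANATION:
        ctx["explanations"] = list(explanations[:1])  # assumption: a subset
    elif level is InfoLevel.FULL_EXPLANATION:
        ctx["explanations"] = list(explanations)
    return ctx


if __name__ == "__main__":
    notes = ["High molecular weight reduced QED", "Sulfur atom increased binding affinity"]
    for level in InfoLevel:
        ctx = build_context(level, "Maximize QED", 0.4, {"qed": 0.4}, notes)
        print(level.name, sorted(ctx))
```

Running an agent once per level with the same oracle and budget is one way such an ablation could be set up.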

C. Co-folding Oracle (Boltz-2)

Scenario: Optimizing binding to a novel protein with no known binders.
Result: SEISMO successfully increased binding probabilities. Crucially, providing residue-level explanations (which amino acids were close to the ligand) significantly outperformed providing only the protein sequence or the score alone. This suggests the LLM can utilize structural context to guide search even for unseen targets.
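A residue-level explanation of the kind described above could, in principle, be derived from a predicted complex by listing the residues that sit close to the ligand. The sketch below shows that idea with plain coordinate tuples; it does not use any actual co-folding output format, and the 4 Å cutoff, the residue labels, the coordinates, and the function names are all assumptions made for illustration.

```python
import math


def residues_near_ligand(residue_atoms, ligand_atoms, cutoff=4.0):
    """List (label, min_distance) for residues within `cutoff` of the ligand.

    residue_atoms: dict mapping a residue label to a list of (x, y, z) tuples.
    ligand_atoms:  list of (x, y, z) tuples.
    The result is sorted from closest to farthest.
    """
    near = []
    for label, atoms in residue_atoms.items():
        d = min(
            (math.dist(a, b) for a in atoms for b in ligand_atoms),
            default=math.inf,  # residue or ligand with no atoms: never "near"
        )
        if d <= cutoff:
            near.append((label, d))
    near.sort(key=lambda pair: pair[1])
    return near


def describe_contacts(near):
    """Turn the contact list into one sentence for the agent's history."""
    if not near:
        return "No residues within the cutoff of the ligand."
    items = ", ".join(f"{label} ({d:.1f} Å)" for label, d in near)
    return "Residues close to the ligand: " + items


if __name__ == "__main__":
    # Made-up coordinates for three hypothetical residues and a one-atom ligand.
    residues = {
        "ALA12": [(0.0, 0.0, 0.0)],
        "SER45": [(3.0, 0.0, 0.0)],
        "LYS90": [(10.0, 0.0, 0.0)],
    }
    ligand = [(1.0, 0.0, 0.0)]
    print(describe_contacts(residues_near_ligand(residues, ligand)))
```

A sentence like this is short enough to append after every oracle call, giving the agent structural context without requiring it to read raw coordinates.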

5. Significance and Implications

Redefining Sample Efficiency: SEISMO demonstrates that for expensive, rate-limited evaluations, the most efficient strategy is not a larger batch size, but a smarter, trajectory-aware single-step agent.
Bridging AI and Chemistry: By integrating post-hoc explainability (XAI) directly into the optimization loop, SEISMO transforms opaque oracle scores into actionable chemical insights, mimicking the iterative reasoning of human medicinal chemists.
Scalability: As LLMs improve in chemical reasoning and long-horizon planning, the SEISMO framework is expected to scale naturally without algorithmic changes.
Practical Application: The method is particularly suited for early-stage drug discovery where computational or experimental budgets are tight, allowing researchers to find high-quality candidates with minimal resource expenditure.

In conclusion, SEISMO establishes a new standard for molecular optimization by treating the LLM not just as a generator, but as an intelligent, context-aware optimizer that learns continuously from every interaction.

SEISMO: Increasing Sample Efficiency in Molecular Optimization with a Trajectory-Aware LLM Agent