Original authors: Zhiyuan Yan, Chen Liu, Boxuan Zhao, Kaiqing Lin, Jixiang Zhao, Yimi Wang, Liuzhenghao Lv, Hao Li, Shanzhuo Zhang, Li Yuan, Fanyang Mo

Published 2026-05-19

CC0 1.0

Original paper dedicated to the public domain under CC0 1.0 (http://creativecommons.org/publicdomain/zero/1.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Problem: The "Mystery Box" of Molecules

Imagine you are an architect trying to build a house, but instead of giving you a blueprint with clear walls, doors, and windows, someone hands you a single, long sentence that describes the house.

"Start with a brick, go left, turn right, put a window here, then a door, then a loop of bricks that connects back to the start..."

This is how current AI models (Large Language Models or LLMs) usually see molecules. The standard way to write a molecule is called SMILES. It's a compact string of text in which the molecule's connections (which atoms are bonded to which, where the branches are, where the rings close) are only implied by the order of the characters.

When an AI tries to understand a molecule written in SMILES, it has to do a lot of mental gymnastics. It has to read the sentence, pause, and say, "Wait, let me reconstruct the whole house in my head before I can tell you if the roof is safe or if I can add a new room." This "reconstruction" step is slow, prone to errors, and gets much harder when the house (molecule) is huge or the AI has never seen that specific design before.

The Solution: MoleCode (The "Lego Blueprint")

The researchers introduced a new language called MoleCode. Think of MoleCode not as a sentence, but as a digital Lego instruction manual or a spreadsheet.

Instead of a long, confusing sentence, MoleCode lists every single piece (atom) and every connection (bond) explicitly.

Atom 1 is a Carbon.
Atom 2 is an Oxygen.
Connection: Atom 1 is linked to Atom 2.

In this format, the "blueprint" is right there in front of the AI. It doesn't need to guess or reconstruct the shape; the shape is already visible and editable.
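As a rough illustration of the idea, here is a minimal Python sketch of an explicit atom-and-bond listing. The dictionary layout and the helper names `neighbors` and `count_element` are assumptions made for this example; they are not MoleCode's actual syntax, which this explanation does not reproduce.

```python
# Hypothetical explicit listing for ethanol (heavy atoms only); not MoleCode syntax.
atoms = {1: "C", 2: "C", 3: "O"}
bonds = [(1, 2, "single"), (2, 3, "single")]


def neighbors(atom_id):
    """Return the IDs of atoms bonded to atom_id, read straight from the bond list."""
    linked = []
    for a, b, _order in bonds:
        if a == atom_id:
            linked.append(b)
        elif b == atom_id:
            linked.append(a)
    return sorted(linked)


def count_element(symbol):
    """Count atoms of one element by scanning the atom table."""
    return sum(1 for s in atoms.values() if s == symbol)


print(neighbors(2), count_element("C"))
```

Both questions ("what is atom 2 attached to?" and "how many carbons are there?") are answered by a lookup, with no string parsing in between.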

What Happened When They Tried It?

The team tested this new language on several tasks using top-tier AI models. Here is what they found, using simple comparisons:

1. Solving Puzzles (Reasoning)

The Old Way: When asked to count the windows in a complex, unfamiliar house, the AI using SMILES often got lost in the long sentence and gave the wrong answer.
The MoleCode Way: With MoleCode, the AI could just look at the list of parts and count them instantly. The AI got significantly better at tasks like predicting chemical reactions or counting atoms, especially for complex or unfamiliar molecules.

2. Renovating Houses (Optimization)

The Old Way: If you asked the AI to "make this house more energy-efficient," it sometimes tore down the whole structure and built something totally different, or it made changes that broke the house.
The MoleCode Way: Because the AI could see exactly which "brick" (atom) was where, it made small, precise changes (like swapping a window for a better one) that improved the house without breaking the structure. It made smarter, safer edits.
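A small, precise edit of this kind can be sketched in Python as follows. This is only an illustration under stated assumptions: the `molecule` dictionary and the `swap_atom` helper are hypothetical, hydrogens are left implicit, and nothing here checks whether the edited molecule is chemically sensible.

```python
import copy

# Hypothetical explicit graph for ethanol (heavy atoms only).
molecule = {
    "atoms": {1: "C", 2: "C", 3: "O"},
    "bonds": [(1, 2, 1), (2, 3, 1)],  # (atom_id, atom_id, bond order)
}


def swap_atom(mol, atom_id, new_symbol):
    """Return an edited copy in which only one atom's element changes."""
    if atom_id not in mol["atoms"]:
        raise KeyError(f"no atom with id {atom_id}")
    edited = copy.deepcopy(mol)
    edited["atoms"][atom_id] = new_symbol
    return edited


edited = swap_atom(molecule, 3, "N")

# List the atom IDs whose element differs between the original and the edit.
changed = [i for i in molecule["atoms"] if molecule["atoms"][i] != edited["atoms"][i]]
print(changed)
```

Because atoms keep their IDs, the edit can be audited afterwards: exactly one entry changed and the bond list is untouched.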

3. Thinking Faster (Efficiency)

The Old Way: The AI spent most of its "thinking time" just trying to figure out what the molecule looked like. It was like a student spending 10 minutes drawing the map before solving the math problem.
The MoleCode Way: The AI spent less time drawing the map because the map was already there. Even though the MoleCode "instructions" were longer to read, the AI spent less time thinking, resulting in a faster and more accurate total process.

4. Building Skyscrapers (Polymers)

The Old Way: Polymers are like giant chains of repeating links. Writing them out in a sentence (SMILES) creates a massive, unreadable block of text. The AI would get confused and fail.
The MoleCode Way: MoleCode treats these chains like a "Repeat this block 100 times" instruction. The AI could handle these giant, repetitive structures with near-perfect accuracy, whereas the old method collapsed under the weight of the long text.
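To see why a "repeat this block" instruction helps, consider this minimal Python sketch. The `repeat_unit` layout and the `element_counts` function are assumptions for illustration, not the paper's format; the point is only that a count over the whole chain follows from one unit and a multiplier.

```python
from collections import Counter

# Hypothetical repeat unit: the heavy atoms of one -CH2-CH2- block (polyethylene).
repeat_unit = {"atoms": {1: "C", 2: "C"}, "bonds": [(1, 2, 1)]}


def element_counts(unit, n_repeats, end_groups=None):
    """Count heavy atoms in a chain of n_repeats units without expanding the chain."""
    per_unit = Counter(unit["atoms"].values())
    total = Counter({el: k * n_repeats for el, k in per_unit.items()})
    total.update(end_groups or {})  # add any atoms that cap the chain ends
    return total


counts = element_counts(repeat_unit, 100)
print(counts["C"])
```

The work stays the same size whether the chain has 100 units or 100,000, which is not true of a fully written-out string.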

5. Reading Complex Documents

The researchers also showed that MoleCode works for more than just single molecules. It can read scientific papers and patents that mix text with diagrams, turning them into a single, organized graph. It can even handle "Markush structures" (chemical formulas with variable parts, like "add any fruit here"), which are very hard for standard text formats to describe.
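The "add any fruit here" idea behind a Markush structure can be sketched in a few lines of Python. Everything in this example is hypothetical (the scaffold name, the `R1`/`R2` site labels, the substituent lists, and the `enumerate_members` helper); it shows only how a family with variable parts expands into concrete members.

```python
from itertools import product

# Hypothetical scaffold: a benzene ring carrying two variable positions, R1 and R2.
scaffold = "benzene"
variable_sites = {
    "R1": ["H", "CH3", "Cl"],
    "R2": ["H", "OH"],
}


def enumerate_members(sites):
    """Yield one {site: substituent} assignment per concrete member of the family."""
    names = sorted(sites)
    for choice in product(*(sites[name] for name in names)):
        yield dict(zip(names, choice))


members = list(enumerate_members(variable_sites))
print(len(members))
```

Keeping the variable sites as named slots, rather than writing out every member, is what lets one description stand for the whole family.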

The Big Takeaway

The main lesson of this paper is about how we talk to AI about science.

Currently, we force AI to translate scientific shapes into text, and then translate that text back into shapes in its mind. This paper argues that if the object we are studying is a structure (like a molecule), we should give the AI a structural language to work with.

By switching from "mystery sentences" (SMILES) to "explicit blueprints" (MoleCode), the AI stops wasting energy guessing what the molecule looks like and starts using its brain to actually solve chemical problems.

Note on Limitations: The paper clarifies that MoleCode doesn't magically give the AI new chemical knowledge it didn't already have. If the AI doesn't know chemistry, it still won't know chemistry. But it allows the AI to use the knowledge it does have much more effectively. Also, a MoleCode description is longer than the equivalent SMILES string, but the paper reports that the trade-off pays off: the AI spends less of its reasoning on reconstructing the structure and gets more answers right.

Technical Summary: MoleCode Unlocks Structural Intelligence in Large Language Models

Problem Statement

Large Language Models (LLMs) are increasingly applied to molecular science, yet they face a fundamental interface problem: molecules are inherently graphs (nodes and edges), but the dominant representation, SMILES, encodes them as one-dimensional linear strings. In SMILES, connectivity, branches, and ring closures are implicit, requiring the model to internally reconstruct the molecular topology before it can perform chemical reasoning. This "structural reconstruction" bottleneck forces LLMs to decode syntax rather than operate directly on structure, leading to performance degradation in tasks requiring topological awareness, such as predicting reactions, counting atoms in complex rings, or editing unfamiliar molecules. Current approaches, including Graph Neural Networks (GNNs) and hybrid systems, either lack general reasoning capabilities or compress graph information into vectors/sequences, losing locality and editability.
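To make the "structural reconstruction" step concrete, the following Python sketch recovers atoms and bonds from a deliberately tiny SMILES subset. It is an illustration written for this summary, not code from the paper and not a full SMILES parser; the function name `smiles_to_graph` and the supported subset are assumptions. Note how branches need a stack and ring closures need a lookup table, none of which is visible in the string itself.

```python
def smiles_to_graph(smiles):
    """Recover atoms and bonds from a tiny SMILES subset.

    Supported (assumption, for illustration only): single-letter atoms
    B, C, N, O, S, P, F, I; '=' and '#' bonds; branches '(' ')'; ring digits 1-9.
    Aromatic lowercase atoms, charges, brackets, two-letter elements, and
    bond symbols placed before a ring-opening digit are not handled.
    """
    atoms, bonds = [], []
    previous = None      # index of the atom the next atom attaches to
    branch_stack = []    # attachment points saved at each '('
    open_rings = {}      # ring digit -> atom index waiting for closure
    pending_order = 1
    for ch in smiles:
        if ch in "BCNOSPFI":
            atoms.append(ch)
            current = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, current, pending_order))
            previous, pending_order = current, 1
        elif ch == "=":
            pending_order = 2
        elif ch == "#":
            pending_order = 3
        elif ch == "(":
            branch_stack.append(previous)
        elif ch == ")":
            previous = branch_stack.pop()
        elif ch.isdigit():
            if ch in open_rings:
                # Second occurrence of the digit closes the ring.
                bonds.append((open_rings.pop(ch), previous, pending_order))
                pending_order = 1
            else:
                open_rings[ch] = previous
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, bonds


ring_atoms, ring_bonds = smiles_to_graph("C1CCCCC1")   # cyclohexane
acid_atoms, acid_bonds = smiles_to_graph("CC(=O)O")    # acetic acid
print(ring_bonds)
print(acid_bonds)
```

An LLM given SMILES has to carry out the equivalent of this bookkeeping implicitly before it can answer a structural question; a graph-explicit input hands it the `atoms` and `bonds` lists directly.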

Methodology: MoleCode

The authors introduce MoleCode, an LLM-native, training-free, graph-explicit molecular language designed to make molecular topology directly readable and operable within the language context.

Grammar and Primitives: MoleCode is built on a simple Subgraph–Node–Edge grammar:
- Subgraph: Defines structural scopes (e.g., a whole molecule, a repeat unit, or a Markush scaffold).
- Node: Represents typed entities (atoms or higher-level chemical groups) with persistent identifiers that remain stable across reasoning steps.
- Edge: Encodes explicit relations (bonds) between nodes, including bond order (single, double, triple, aromatic).
Explicit Representation: Unlike SMILES, where topology is inferred from sequence position, MoleCode states connectivity directly. For example, an atom is declared with an ID, and bonds are explicitly linked to source and target IDs.
Deterministic Conversion: MoleCode supports bidirectional, deterministic conversion with standard formats (SMILES, MOL files) without loss of structural information, using RDKit for parsing and serialization.
Extensibility: The grammar extends beyond small molecules to:
- Polymers: Represented as explicit subgraphs with multiplicity operators (e.g., $\times n$) rather than expanded chains.
- Markush Structures: Variable substituents and logical relations are encoded as explicit nodes and edges.
- Reaction Mechanisms: Represented as sequences of graph transformations (states and electron-transfer paths).
- Multimodal Documents: Parses interleaved text and images from patents and research articles into unified structural graphs.
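The Subgraph–Node–Edge grammar described above can be sketched as a small Python data model. This is a sketch under stated assumptions: the class and field names (`Node`, `Edge`, `Subgraph`, `multiplicity`) and the line-based text produced by `render` are invented for illustration, since this summary does not reproduce MoleCode's concrete syntax.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    node_id: str   # persistent identifier, e.g. "a1"
    kind: str      # element symbol or higher-level group label


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    order: str = "single"   # single, double, triple, or aromatic


@dataclass
class Subgraph:
    name: str
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)
    multiplicity: int = 1   # values above 1 mark a repeat unit

    def render(self):
        """Serialize to an illustrative line-based text form (not the paper's syntax)."""
        header = f"subgraph {self.name}"
        if self.multiplicity != 1:
            header += f" x{self.multiplicity}"
        lines = [header]
        lines += [f"  node {n.node_id} {n.kind}" for n in self.nodes]
        lines += [f"  edge {e.source} {e.target} {e.order}" for e in self.edges]
        return "\n".join(lines)


formaldehyde = Subgraph(
    name="formaldehyde",
    nodes=[Node("a1", "C"), Node("a2", "O")],
    edges=[Edge("a1", "a2", "double")],
)
text = formaldehyde.render()
print(text)
```

The design choice worth noticing is that every bond names its two endpoints by ID, so connectivity never depends on where a line appears in the text.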

Key Contributions

A New Representation Paradigm: MoleCode shifts molecular representation from implicit sequential syntax to explicit, editable graph objects within the LLM context window.
Training-Free Integration: It requires no modification of LLM weights; it functions as a prompt engineering strategy that unlocks structural intelligence in existing frontier models.
Agentic Workflow Enablement: The explicit nature of MoleCode allows for localized, auditable graph operations, enabling coding agents to iteratively reason, edit, and validate molecular structures.
Unified Interface: It provides a single language for diverse chemical objects, from small molecules to polymers, reaction mechanisms, and complex scientific documents.
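The "validate" step of such an agentic loop could, for instance, include a valence check over the explicit graph. The sketch below is an assumption about what a check of this kind might look like, not the authors' implementation: the `MAX_VALENCE` table is a simplification for neutral atoms, and `valence_violations` is a hypothetical helper.

```python
# Simplified maximum valences for neutral atoms (assumption; ignores charges
# and expanded-valence elements).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}


def valence_violations(atoms, bonds):
    """Return {atom_id: bond order sum} for atoms exceeding their valence limit."""
    used = {atom_id: 0 for atom_id in atoms}
    for a, b, order in bonds:
        used[a] += order
        used[b] += order
    return {
        atom_id: total
        for atom_id, total in used.items()
        if total > MAX_VALENCE[atoms[atom_id]]
    }


atoms = {1: "C", 2: "O", 3: "O"}
ok_bonds = [(1, 2, 2), (1, 3, 2)]               # O=C=O
bad_bonds = [(1, 2, 2), (1, 3, 2), (2, 3, 1)]   # extra O-O bond overloads both oxygens

print(valence_violations(atoms, ok_bonds))
print(valence_violations(atoms, bad_bonds))
```

Because the violation report is keyed by atom ID, an agent can be told exactly which atoms to revisit instead of regenerating the whole molecule.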

Results

The authors evaluated MoleCode across three frontier LLMs (DeepSeek-R1, Gemini-2.5-Flash, Gemini-3-Pro) and multiple benchmarks:

Molecular Reasoning: MoleCode consistently outperformed SMILES and SELFIES, particularly in tasks requiring structural generalization rather than memorization.
- Unfamiliar Molecules: Accuracy on novel molecules remained stable (~76–80%) with MoleCode, whereas SMILES accuracy dropped significantly (to ~20%) as familiarity decreased.
- Complexity Scaling: As molecular size and complexity increased (e.g., carbon count >50), SMILES performance degraded monotonically, while MoleCode maintained high accuracy.
- Task Gains: Reaction prediction accuracy for Gemini-3-Pro improved from 58.8% (SMILES) to 95.0% (MoleCode); molecular formula prediction rose from 58.0% to 90.0%.
Molecular Optimization: In goal-directed optimization (LogP and solubility), MoleCode enabled more chemically interpretable and localized edits.
- Models using MoleCode achieved higher mean property improvements (e.g., +1.15 LogP for Gemini-2.5-Flash vs. 0.0 with SMILES).
- Edits preserved structural similarity (Tanimoto similarity) while targeting specific property changes, whereas SMILES often produced structurally degraded candidates.
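For readers unfamiliar with the metric, Tanimoto similarity between two feature sets is the size of their intersection divided by the size of their union. The Python sketch below uses made-up feature labels purely for illustration; in practice the features are bits of a molecular fingerprint, and the paper's exact fingerprint settings are not stated in this summary.

```python
def tanimoto(features_a, features_b):
    """Tanimoto (Jaccard) similarity between two feature sets: |A & B| / |A | B|."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 1.0   # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


# Hypothetical substructure features for a molecule before and after one local edit.
before = {"C-C", "C-O", "C=O", "ring6"}
after = {"C-C", "C-N", "C=O", "ring6"}

similarity = tanimoto(before, after)
print(round(similarity, 2))
```

A localized edit changes few features, so the score stays high; a wholesale rewrite of the structure drives it toward zero.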
Inference Efficiency: While MoleCode inputs are longer, they shift the inference cost profile.
- Token Allocation: MoleCode reduces the need for long Chain-of-Thought (CoT) reasoning traces devoted to reconstructing topology. CoT token cost scaled sublinearly ($C^{0.52}$) with molecular size, compared to superlinear growth for SMILES ($C^{1.65}$).
- Productivity: Longer reasoning with MoleCode correlated positively with optimization success, whereas longer reasoning with SMILES did not.
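The practical meaning of the two reported exponents can be worked out with a short calculation. The sketch below assumes only that CoT cost is proportional to molecular size raised to the reported exponent; the proportionality constants are not given in this summary, so it compares relative growth within each representation, not absolute token counts. The function name `growth_factor` is hypothetical.

```python
def growth_factor(size_ratio, exponent):
    """Factor by which a cost proportional to size**exponent grows
    when the size is multiplied by size_ratio."""
    return size_ratio ** exponent


MOLECODE_EXPONENT = 0.52   # reported CoT scaling exponent with MoleCode
SMILES_EXPONENT = 1.65     # reported CoT scaling exponent with SMILES

for ratio in (2, 4, 10):
    mc = growth_factor(ratio, MOLECODE_EXPONENT)
    sm = growth_factor(ratio, SMILES_EXPONENT)
    print(f"size x{ratio}: MoleCode CoT x{mc:.2f}, SMILES CoT x{sm:.2f}")
```

Under these exponents, doubling the molecular size multiplies the CoT cost by roughly 1.4 with MoleCode but by roughly 3.1 with SMILES, and the gap widens as molecules grow.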
Polymers and Higher-Order Structures:
- Polymers: MoleCode maintained near-perfect accuracy in carbon counting and editing for polymers with >250 repeat units, where full-chain SMILES collapsed.
- Markush & Mechanisms: MoleCode substantially outperformed E-SMILES in parsing Markush structures (accuracy rising from 38.1% to 84.0%) and improved accuracy in representing reaction mechanisms and electron-transfer paths.
- Multimodal Parsing: MoleCode was used to parse complex, newly published research articles and patent disclosures into unified structural graphs, preserving the relations between text and images.

Significance and Claims

The paper argues that the interface between scientific objects and LLMs should not treat structure as something to be decoded from text. When the object of reasoning is relational, the structure itself should be part of the language.

Reframing Representation: The authors claim that representations shape reasoning capabilities. By making topology explicit, MoleCode allows LLMs to operate directly on atoms and bonds, reallocating computational effort from structural reconstruction to chemically meaningful operations.
General Principle: This approach suggests a broader principle for scientific language interfaces: compressing relational objects into implicit sequences forces models to recover structure before using it. Exposing structure directly allows LLMs to apply reasoning capabilities more reliably across structured scientific domains.
Practical Impact: The integration of MoleCode into AtomFlow (an agentic system) demonstrates how graph-explicit languages can move LLM-based chemistry from molecule-level prompting to atom-level interaction, supporting tasks like natural-language editing and retrosynthesis planning.

The authors remain modest, noting that MoleCode does not create chemical knowledge that a model lacks and that smaller models may still generate invalid structures. They suggest that the next step is incorporating graph-explicit representations into pretraining to learn structured domains from the outset.
