MoleCode unlocks structural intelligence in large language models

MoleCode introduces a training-free, graph-explicit molecular language that replaces implicit linear representations like SMILES with explicit structural relations, enabling large language models to directly reason about and manipulate molecular topology for improved performance in complex chemical tasks.

Original authors: Zhiyuan Yan, Chen Liu, Boxuan Zhao, Kaiqing Lin, Jixiang Zhao, Yimi Wang, Liuzhenghao Lv, Hao Li, Shanzhuo Zhang, Li Yuan, Fanyang Mo

Published 2026-05-19

Original paper dedicated to the public domain under CC0 1.0 (http://creativecommons.org/publicdomain/zero/1.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Problem: The "Mystery Box" of Molecules

Imagine you are an architect trying to build a house, but instead of giving you a blueprint with clear walls, doors, and windows, someone hands you a single, long sentence that describes the house.

"Start with a brick, go left, turn right, put a window here, then a door, then a loop of bricks that connects back to the start..."

This is how current AI models (large language models, or LLMs) usually see molecules. The standard way to write a molecule is called SMILES (Simplified Molecular Input Line Entry System). It is a compact string of text that leaves the molecule's connections (which atoms are bonded to which, and where rings close) implicit in a single line of characters.

When an AI tries to understand a molecule written in SMILES, it has to do a lot of mental gymnastics. It has to read the sentence, pause, and say, "Wait, let me reconstruct the whole house in my head before I can tell you if the roof is safe or if I can add a new room." This "reconstruction" step is slow, prone to errors, and gets much harder when the house (molecule) is huge or the AI has never seen that specific design before.
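To make the "reconstruction" step concrete, here is a minimal Python sketch that recovers the bonds of cyclohexane from its SMILES string, C1CCCCC1. The helper tiny_smiles_bonds is hypothetical and handles only a toy subset of SMILES (single-letter atoms and single-digit ring closures, with no branches, charges, or bond symbols); a real SMILES parser is far more involved. The point it illustrates is that the bond closing the ring is never written out directly: it has to be inferred by matching the two "1" digits at opposite ends of the string.

```python
def tiny_smiles_bonds(smiles):
    """Recover atoms and bonds from a toy SMILES subset.

    Assumption: single-letter atoms, single-digit ring closures,
    no branches, charges, or explicit bond symbols.
    """
    atoms = []        # element symbol per atom, indexed from 0
    bonds = []        # (atom_index, atom_index) pairs
    open_rings = {}   # ring-closure digit -> atom index that opened it
    prev = None       # index of the most recently read atom

    for ch in smiles:
        if ch.isalpha():
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx))  # implicit bond to the previous atom
            prev = idx
        elif ch.isdigit():
            if ch in open_rings:
                # Second occurrence of the digit: close the ring.
                bonds.append((open_rings.pop(ch), prev))
            else:
                # First occurrence: remember where the ring started.
                open_rings[ch] = prev
        else:
            raise ValueError(f"unsupported character in toy parser: {ch!r}")
    return atoms, bonds


atoms, bonds = tiny_smiles_bonds("C1CCCCC1")
print(atoms)
print(bonds)  # the last pair, (0, 5), is the ring-closing bond
```

The ring-closing bond between the first and last carbon only appears after the whole string has been read, which is the kind of bookkeeping a model must do implicitly when it reads SMILES.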

The Solution: MoleCode (The "Lego Blueprint")

The researchers introduced a new language called MoleCode. Think of MoleCode not as a sentence, but as a digital Lego instruction manual or a spreadsheet.

Instead of a long, confusing sentence, MoleCode lists every single piece (atom) and every connection (bond) explicitly.

  • Atom 1 is a Carbon.
  • Atom 2 is an Oxygen.
  • Connection: Atom 1 is linked to Atom 2.

In this format, the "blueprint" is right there in front of the AI. It doesn't need to guess or reconstruct the shape; the shape is already visible and editable.
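The atom list and connection list above map naturally onto a small data structure. The Python sketch below is only an illustration of the idea of a graph-explicit representation: the Atom and Bond classes, their field names, and the neighbors helper are assumptions made for this example, not MoleCode's actual syntax. It encodes the heavy atoms of ethanol (hydrogens omitted).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    id: int
    element: str


@dataclass(frozen=True)
class Bond:
    a: int          # id of the first atom
    b: int          # id of the second atom
    order: int = 1  # 1 = single bond


# Ethanol heavy atoms: C-C-O (hypothetical encoding, hydrogens omitted)
atoms = [Atom(1, "C"), Atom(2, "C"), Atom(3, "O")]
bonds = [Bond(1, 2), Bond(2, 3)]


def neighbors(atom_id, bonds):
    """Return the ids of atoms bonded to atom_id, read straight off the bond list."""
    out = []
    for bond in bonds:
        if bond.a == atom_id:
            out.append(bond.b)
        elif bond.b == atom_id:
            out.append(bond.a)
    return out


print(neighbors(2, bonds))  # [1, 3]
```

Nothing has to be reconstructed here: the question "what is atom 2 attached to?" is answered by a lookup in the bond list.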

What Happened When They Tried It?

The team tested this new language on several tasks using top-tier AI models. Here is what they found, using simple comparisons:

1. Solving Puzzles (Reasoning)

  • The Old Way: When asked to count the windows in a complex, unfamiliar house, the AI using SMILES often got lost in the long sentence and gave the wrong answer.
  • The MoleCode Way: With MoleCode, the AI could just look at the list of parts and count them instantly. The AI got significantly better at tasks like predicting chemical reactions or counting atoms, especially for complex or unfamiliar molecules.
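As a rough illustration of why counting becomes easy once the parts are listed explicitly, the sketch below counts elements and rings for the heavy atoms of phenol. The dictionary-and-tuple layout and the function names element_counts and ring_count are assumptions for this example, not the format used in the paper. The ring count uses a standard graph identity: independent rings = bonds - atoms + connected components.

```python
from collections import Counter

# Phenol heavy atoms (hypothetical encoding): a six-carbon ring with an oxygen on atom 1
elements = {1: "C", 2: "C", 3: "C", 4: "C", 5: "C", 6: "C", 7: "O"}
bonds = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (1, 7)]


def element_counts(elements):
    """Count atoms per element by reading the atom list directly."""
    return Counter(elements.values())


def ring_count(elements, bonds):
    """Independent rings = bonds - atoms + connected components."""
    parent = {a: a for a in elements}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in bonds:
        parent[find(a)] = find(b)

    components = len({find(a) for a in elements})
    return len(bonds) - len(elements) + components


print(element_counts(elements))
print(ring_count(elements, bonds))  # 1
```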

2. Renovating Houses (Optimization)

  • The Old Way: If you asked the AI to "make this house more energy-efficient," it sometimes tore down the whole structure and built something totally different, or it made changes that broke the house.
  • The MoleCode Way: Because the AI could see exactly which "brick" (atom) was where, it made small, precise changes (like swapping a window for a better one) that improved the house without breaking the structure. It made smarter, safer edits.
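A small, local edit of this kind can be sketched in a few lines of Python. The swap_element function and the data layout are hypothetical, chosen for illustration rather than taken from the paper, and the sketch does no valence or chemistry checking. It swaps the oxygen in ethanol's heavy-atom skeleton for a nitrogen and leaves every bond untouched.

```python
def swap_element(elements, bonds, atom_id, new_element):
    """Return an edited copy in which one atom's element changes and all bonds are kept.

    Assumption: no valence checking; this only shows that the edit is local.
    """
    if atom_id not in elements:
        raise KeyError(f"no atom with id {atom_id}")
    edited = dict(elements)
    edited[atom_id] = new_element
    return edited, list(bonds)


# Ethanol heavy atoms: C-C-O
elements = {1: "C", 2: "C", 3: "O"}
bonds = [(1, 2), (2, 3)]

new_elements, new_bonds = swap_element(elements, bonds, atom_id=3, new_element="N")

# List exactly which atoms differ between the original and the edited structure.
changed = [a for a in elements if elements[a] != new_elements[a]]
print(changed)             # [3]
print(new_bonds == bonds)  # True
```

Because atoms have stable ids, it is easy to verify after the fact that only the intended "brick" changed.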

3. Thinking Faster (Efficiency)

  • The Old Way: The AI spent most of its "thinking time" just trying to figure out what the molecule looked like. It was like a student spending 10 minutes drawing the map before solving the math problem.
  • The MoleCode Way: The AI spent less time drawing the map because the map was already there. Even though the MoleCode "instructions" take longer to read, the AI needed less time to think, so the whole process was faster and more accurate.

4. Building Skyscrapers (Polymers)

  • The Old Way: Polymers are like giant chains of repeating links. Writing them out in a sentence (SMILES) creates a massive, unreadable block of text. The AI would get confused and fail.
  • The MoleCode Way: MoleCode treats these chains like a "Repeat this block 100 times" instruction. The AI could handle these giant, repetitive structures reliably, whereas the old method collapsed under the weight of the long text.
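The "repeat this block" idea can be sketched as follows. The expand_repeat function, its head and tail parameters, and the list-based layout are assumptions for illustration, not the paper's actual polymer notation. A two-carbon repeat unit is described once and then expanded 100 times, with each copy's tail atom bonded to the next copy's head atom.

```python
def expand_repeat(unit_elements, unit_bonds, head, tail, n):
    """Expand a repeat unit n times, linking each copy's tail to the next copy's head.

    head and tail are 0-based indices into unit_elements (hypothetical convention).
    """
    size = len(unit_elements)
    elements, bonds = [], []
    for k in range(n):
        offset = k * size
        elements.extend(unit_elements)
        # Copy the unit's internal bonds, shifted to this copy's atom indices.
        bonds.extend((a + offset, b + offset) for a, b in unit_bonds)
        if k > 0:
            # Link the previous copy's tail to this copy's head.
            bonds.append((offset - size + tail, offset + head))
    return elements, bonds


# A polyethylene-like repeat unit: two carbons joined by one bond
unit_elements = ["C", "C"]
unit_bonds = [(0, 1)]

elements, bonds = expand_repeat(unit_elements, unit_bonds, head=0, tail=1, n=100)
print(len(elements), len(bonds))  # 200 199
```

The compact description (one unit plus a repeat count) stays the same size no matter how long the chain gets; only the expanded form grows.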

5. Reading Complex Documents

  • The researchers also showed that MoleCode works for more than just single molecules. It can read scientific papers and patents that mix text with diagrams, turning them into a single, organized graph. It can even handle "Markush structures" (chemical formulas with variable parts, like "add any fruit here"), which are very hard for standard text formats to describe.
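One way to picture a Markush structure in a graph-explicit setting is a scaffold with labeled placeholder slots plus a list of allowed fillers for each slot. The sketch below is a simplified illustration under that assumption: the R1/R2 labels, the choices table, and enumerate_markush are hypothetical, and each slot is filled by a single atom rather than a whole substituent group.

```python
from itertools import product

# Scaffold with two variable positions (hypothetical encoding)
scaffold = {1: "C", 2: "C", 3: "R1", 4: "R2"}
bonds = [(1, 2), (1, 3), (2, 4)]

# Allowed fillers for each placeholder
choices = {"R1": ["F", "Cl"], "R2": ["O", "N", "S"]}


def enumerate_markush(scaffold, choices):
    """Yield one concrete atom table per combination of placeholder fillers."""
    slots = [atom_id for atom_id, label in scaffold.items() if label in choices]
    for combo in product(*(choices[scaffold[s]] for s in slots)):
        filled = dict(scaffold)
        filled.update(zip(slots, combo))
        yield filled


structures = list(enumerate_markush(scaffold, choices))
print(len(structures))  # 6
print(structures[0])
```

The bond list is shared by every enumerated structure; only the contents of the labeled slots vary.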

The Big Takeaway

The main lesson of this paper is about how we talk to AI about science.

Currently, we force AI to translate scientific shapes into text, and then translate that text back into shapes in its mind. This paper argues that if the object we are studying is a structure (like a molecule), we should give the AI a structural language to work with.

By switching from "mystery sentences" (SMILES) to "explicit blueprints" (MoleCode), the AI stops wasting energy guessing what the molecule looks like and starts using its brain to actually solve chemical problems.

Note on Limitations: The paper clarifies that MoleCode doesn't magically give the AI new chemical knowledge it didn't already have. If the AI doesn't know chemistry, it still won't know chemistry. But it does allow the AI to use the knowledge it has much more effectively. The new language is also longer to write than the old one, but the authors argue the trade-off is worth it because the AI spends less effort reconstructing the structure and more on solving the problem.
