Fine-tuning DeepSeek-OCR-2 for Molecular Structure Recognition

This paper introduces MolSeek-OCR, a DeepSeek-OCR-2 adaptation for Optical Chemical Structure Recognition that employs a two-stage progressive fine-tuning strategy on synthetic and patent data to achieve competitive sequence-level accuracy, though it still lags behind state-of-the-art image-to-graph models and fails to improve further with reinforcement or data-curation post-training.

Original authors: Haocheng Tang, Xingyu Dang, Junmei Wang

Published 2026-04-07

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you have a massive library of old chemistry textbooks and patent documents. These books are filled with beautiful, hand-drawn or computer-generated pictures of molecules (the tiny building blocks of life and medicine). To a computer, these pictures are just a jumble of lines and circles. To a scientist, they are complex instructions.

The Problem:
Computers are great at reading text, but they struggle to look at a picture of a molecule and write down its "secret code" (called a SMILES string), a compact line of text such as CCO for ethanol that tells a computer exactly how that molecule is built. Previous attempts to teach computers this skill were like trying to teach a toddler to read a novel by just showing them the cover; they often got stuck or gave up.
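To make the "secret code" concrete, here is a minimal sketch of how a SMILES string breaks into the small pieces a model has to write out one at a time. The regular expression and the function name `tokenize_smiles` are illustrative assumptions, not the tokenizer used in the paper, and the pattern covers only common SMILES symbols.

```python
import re

# Illustrative pattern (an assumption, not the paper's tokenizer): bracket atoms,
# two-letter halogens, common atoms, aromatic atoms, ring-closure digits,
# and bond / branch symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[=#\-+\\/:.()~@*$]"
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens; raise if any character is unrecognised."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in {smiles!r}")
    return tokens


print(tokenize_smiles("CC(=O)O"))   # acetic acid
print(tokenize_smiles("c1ccccc1"))  # benzene
```

Every one of these tokens has to be correct for the whole string to describe the right molecule.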

The Solution: "MolSeek-OCR"
The researchers in this paper took a very smart, pre-trained AI (called DeepSeek-OCR-2) that was already an expert at reading documents and taught it specifically to read chemistry drawings. They called their new model MolSeek-OCR.

Here is how they did it, using some simple analogies:

1. The Two-Step Dance (The Training Strategy)

The researchers tried to teach the AI all at once, but the AI got confused and the training crashed. So, they invented a two-step dance:

  • Step 1: The "Training Wheels" Phase (LoRA):
    Imagine you are teaching someone to drive a race car. You don't let them touch the engine or the transmission yet. You just let them practice steering and braking.
    In this step, the researchers only tweaked a tiny, specific part of the AI's brain (using a technique called LoRA). This allowed the AI to learn how to look at a molecule and start guessing the code without breaking its existing ability to understand language.

  • Step 2: The "Full Engine" Phase (Progressive Fine-Tuning):
    Once the driver was comfortable, they let them touch the engine. But they didn't let them overhaul the whole car at once.
    They held the "eyes" of the AI (the part that sees the image) steady and let the "brain" (the part that writes the code) learn more deeply. The trick was to update the visual part slowly, if at all, and the writing part quickly, so the AI didn't forget how to see while learning how to write.
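The "training wheels" idea in Step 1 can be sketched in a few lines. LoRA leaves the original weight matrix W frozen and learns two small matrices A and B, so the layer computes W x + (alpha / r) * B (A x), where r is the adapter rank. The code below is a toy illustration in plain Python; the matrix sizes, the `alpha` value, and the function names are assumptions, not settings from the paper.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha):
    """Return W x + (alpha / r) * B (A x), where r is the adapter rank."""
    r = len(A)                          # A has shape (r, d_in)
    base = matvec(W, x)                 # frozen pretrained path
    update = matvec(B, matvec(A, x))    # B has shape (d_out, r)
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def param_counts(d_in, d_out, r):
    """Trainable parameters: full fine-tuning vs. a rank-r LoRA adapter."""
    return d_in * d_out, r * (d_in + d_out)


W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]  # frozen weights, shape (2, 3)
A = [[0.5, -0.5, 1.0]]                    # trainable, shape (1, 3)
B_zero = [[0.0], [0.0]]                   # B is commonly initialised to zero
x = [1.0, 2.0, 3.0]

# With B at zero the adapter contributes nothing, so the output equals W x.
print(lora_forward(W, A, B_zero, x, alpha=2.0))
# Hypothetical layer size, chosen only to show the scale of the saving.
print(param_counts(4096, 4096, 8))
```

Because B starts at zero, the adapted model initially behaves exactly like the pretrained one, which is one reason this kind of first stage is gentle on what the model already knows.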

2. The Practice Grounds (The Data)

To make the AI a pro, they didn't just show it perfect, computer-generated drawings. They gave it a mixed diet:

  • The "Video Game" Level: Perfectly clean, synthetic drawings (like a video game rendering).
  • The "Real World" Level: Scanned images from old patents and journals. These are messy! They have coffee stains, weird fonts, and lines that are too thick or too thin.
    By training on both, the AI learned to recognize molecules whether they were drawn on a pristine whiteboard or scribbled on a crumpled napkin.
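One simple way to feed a model this kind of mixed diet is to fix the share of real-world examples in every training batch. The sketch below assumes that approach; the function name `mixed_batch`, the 25% share, and the example records are all hypothetical, since this summary does not state how the paper mixes its data.

```python
import random


def mixed_batch(synthetic, real, batch_size, real_fraction, rng):
    """Draw one training batch with a fixed share of real-world examples."""
    n_real = round(batch_size * real_fraction)
    n_synth = batch_size - n_real
    # Sample with replacement so small datasets can still fill a batch.
    batch = rng.choices(real, k=n_real) + rng.choices(synthetic, k=n_synth)
    rng.shuffle(batch)
    return batch


# Hypothetical records: (source, image file, SMILES label).
synthetic = [("synthetic", f"render_{i}.png", "CCO") for i in range(100)]
real = [("real", f"patent_scan_{i}.png", "c1ccccc1") for i in range(20)]

batch = mixed_batch(synthetic, real, batch_size=8, real_fraction=0.25,
                    rng=random.Random(0))
print(sum(1 for source, _, _ in batch if source == "real"), "real examples of", len(batch))
```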

3. The Results: Good, but Not Perfect

When they tested MolSeek-OCR, it performed well: it was competitive with the best "Image-to-Text" models currently available. It could look at a messy patent drawing and type out exactly the correct chemical code about 70-75% of the time.
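A score like this is typically an exact-match rate: a prediction counts only if the whole code matches the reference. The sketch below is a minimal version under stated assumptions. The function name and the `canonicalize` hook are hypothetical; in practice a cheminformatics toolkit would normally rewrite both strings into one canonical form first, because the same molecule can be written in more than one way.

```python
def exact_match_accuracy(predictions, references, canonicalize=lambda s: s):
    """Fraction of predictions whose (canonicalised) string equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        canonicalize(p) == canonicalize(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


predictions = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
references = ["CCO", "c1ccccc1", "CC(O)=O", "CCC"]

# Plain string comparison: the third pair is the same molecule (acetic acid)
# written two ways, so it is counted as a miss without canonicalisation.
print(exact_match_accuracy(predictions, references))
```

With the default identity `canonicalize`, two of the four pairs match; a real canonicaliser would also credit the third.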

However, there is a catch:
There is another type of AI (like MolScribe) that doesn't just "read" the picture like a book; it "rebuilds" the molecule as a graph, placing each atom and each bond like the pieces of a puzzle.

  • MolSeek-OCR is like a translator: It looks at the picture and guesses the words.
  • MolScribe is like an architect: It looks at the picture and draws the blueprints from scratch.

The "Architect" (MolScribe) is still better at this specific job. A chemical code has to be right character by character, so a model that writes it out one piece at a time can make a single small slip that turns the answer into a different molecule.
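A rough back-of-the-envelope calculation shows why small slips add up. If each token is right with probability p, and errors are assumed to be independent (a simplification, not a claim from the paper), the chance that an n-token string is entirely right is p to the power n. For scale, CCO is ethanol while CCN is ethylamine: one changed token, a different molecule.

```python
def sequence_accuracy(per_token_accuracy, num_tokens):
    """Probability that every token is right, assuming independent errors."""
    return per_token_accuracy ** num_tokens


# Hypothetical per-token accuracy of 99% across increasing string lengths.
for n in (10, 30, 50):
    print(n, "tokens:", round(sequence_accuracy(0.99, n), 3))
```

Even at 99% per token, longer strings are noticeably less likely to come out fully correct, which is the weakness of writing the code piece by piece.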

4. The "Reinforcement" Experiment (Why it didn't work)

The researchers tried one more thing. They tried to use a "reward system" (like training a dog with treats). They told the AI: "If you guess a molecule that is chemically valid, even if the code isn't perfect, you get a treat."

  • The Result: The AI got better at understanding the shape of the molecule, but it got worse at typing the exact code. It was like a student who learned the concept of a math problem but kept making typos in the final answer. Since the goal was to get the exact code right, this method didn't help.
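The "treat" idea can be sketched as a reward function that pays in full for an exact match and pays a smaller amount for an answer that merely looks valid. Everything here is an assumption for illustration: the reward values, the function names, and especially `looks_well_formed`, which only checks balanced parentheses as a toy stand-in for a real chemical validity check.

```python
def looks_well_formed(smiles):
    """Toy stand-in for a validity check: non-empty with balanced parentheses."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and bool(smiles)


def reward(prediction, reference, is_valid=looks_well_formed, partial=0.3):
    """Full reward for an exact match, partial reward for a valid-looking miss."""
    if prediction == reference:
        return 1.0
    return partial if is_valid(prediction) else 0.0


print(reward("CCO", "CCO"))   # exact match
print(reward("CCN", "CCO"))   # valid-looking but the wrong molecule
print(reward("CC(O", "CCO"))  # malformed string
```

Under a scheme like this, a model can keep collecting treats for valid-looking answers that are still wrong, which is consistent with the outcome described above: exact-code accuracy did not improve.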

The Bottom Line

The researchers successfully taught a general document-reading AI to read chemistry, creating MolSeek-OCR, a tool that can help digitize old chemical knowledge. However, for the most complex and precise tasks, the graph-based method of "rebuilding the molecule from scratch" is still the gold standard.

In short: They taught a smart robot to read chemistry books. It's very good, but specialist models that rebuild the molecule piece by piece still get more of the details exactly right.
