Fine-tuning DeepSeek-OCR-2 for Molecular Structure… — Plain-Language Explanation

Imagine you have a massive library of old chemistry textbooks and patent documents. These books are filled with beautiful, hand-drawn or computer-generated pictures of molecules (the tiny building blocks of life and medicine). To a computer, these pictures are just a jumble of lines and circles. To a scientist, they are complex instructions.

The Problem:
Computers are great at reading text, but they struggle to look at a picture of a molecule and instantly write down the "secret code" (called a SMILES string) that tells a computer exactly how that molecule is built. Previous attempts to teach computers this skill were like trying to teach a toddler to read a novel by just showing them the cover; they often got stuck or gave up.

The Solution: "MolSeek-OCR"
The researchers in this paper took a very smart, pre-trained AI (called DeepSeek-OCR-2) that was already an expert at reading documents and taught it specifically to read chemistry drawings. They called their new model MolSeek-OCR.

Here is how they did it, using some simple analogies:

1. The Two-Step Dance (The Training Strategy)

The researchers first tried to teach the AI everything at once, but the training would not settle down and the results were poor. So they split the job into a two-step dance:

Step 1: The "Training Wheels" Phase (LoRA):
Imagine you are teaching someone to drive a race car. You don't let them touch the engine or the transmission yet. You just let them practice steering and braking.
In this step, the researchers only tweaked a tiny, specific part of the AI's brain (using a technique called LoRA). This allowed the AI to learn how to look at a molecule and start guessing the code without breaking its existing ability to understand language.
Step 2: The "Full Engine" Phase (Progressive Fine-Tuning):
Once the driver was comfortable, they let them touch the engine. But they didn't let them overhaul the whole car at once.
They kept the very first layer of the AI's "eyes" (the part that chops the image into pieces) frozen and steady, and let everything above it learn more deeply. They also used a special trick: the rest of the visual part was taught slowly and the writing part quickly, ensuring the AI didn't forget how to see while learning how to write.

2. The Practice Grounds (The Data)

To make the AI a pro, they didn't just show it perfect, computer-generated drawings. They gave it a mixed diet:

The "Video Game" Level: Perfectly clean, synthetic drawings (like a video game rendering).
The "Real World" Level: Images taken from real patents. These are messy! They have uneven scan quality, lines that are too thick or too thin, and the peculiar drawing conventions that patents use.
By training on both, the AI learned to recognize molecules whether they were rendered cleanly by software or pulled from an imperfect patent scan.

3. The Results: Good, but Not Perfect

When they tested MolSeek-OCR, it held its own against the best "Image-to-Text" models currently in existence. It typed out exactly the right chemical code about 74% of the time on one set of clean synthetic drawings (Indigo) and about 63% of the time on one set of real-world drawings (CLEF).

However, there is a catch:
There is another type of AI (like MolScribe) that doesn't just "read" the picture like a book; it "rebuilds" the molecule like a puzzle, placing each atom and each bond explicitly.

MolSeek-OCR is like a translator: It looks at the picture and guesses the words.
MolScribe is like an architect: It looks at the picture and draws the blueprints from scratch.

The "Architect" (MolScribe) is still better at this specific job because chemical structures are so complex that guessing the code word-by-word often leads to small, fatal errors.

4. The "Reinforcement" Experiment (Why it didn't work)

The researchers tried one more thing. They tried to use a "reward system" (like training a dog with treats). They told the AI: "If the molecule you guess is the right molecule, even if the code isn't written exactly the way we expected, you get a treat."

The Result: The AI sometimes got better at capturing the overall structure of the molecule, but it got worse at typing the exact code. It was like a student who learned the concept of a math problem but kept making typos in the final answer. Since the benchmark only counts the exact code as correct, this method didn't help.

The Bottom Line

The researchers successfully taught a general document-reading AI to read chemistry, creating a tool called MolSeek-OCR. It performs about as well as the best models that read a drawing straight into text. However, for the most complex and precise tasks, the specialized method of "rebuilding the molecule atom by atom" is still the gold standard.

In short: They taught a smart robot to read chemistry books. It's very good, but a specialist model is still better at getting the details 100% perfect.

1. Problem Statement

Optical Chemical Structure Recognition (OCSR) is the critical task of converting 2D molecular diagrams from printed literature (patents, journals) into machine-readable formats like SMILES strings or molecular graphs.

The Challenge: While Vision-Language Models (VLMs) have shown promise in general OCR, their direct application to OCSR is difficult. Standard full-parameter supervised fine-tuning (SFT) often fails to converge or perform well on this specific domain.
The Gap: Existing specialized models (e.g., MolScribe) use image-to-graph approaches that explicitly predict atoms and bonds, outperforming image-to-sequence models. However, there is a need to adapt powerful, general-purpose document VLMs (like DeepSeek-OCR-2) to handle molecular recognition without relying on complex, multi-stage pipelines.

2. Methodology

The authors propose MolSeek-OCR, a fine-tuned version of DeepSeek-OCR-2, utilizing a novel two-stage progressive supervised fine-tuning strategy to overcome training instabilities.

A. Task Formulation

The task is framed as image-conditioned SMILES generation. Given a molecular image and a fixed instruction prompt, the model autoregressively generates the corresponding SMILES string.
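
The formulation above can be illustrated with a minimal greedy decoding loop. This is only a sketch: `toy_scores` is a hypothetical stand-in for the real model, and the image name, the prompt text, the vocabulary, and the `<eos>` token are all assumptions made for illustration, not details taken from the paper.

```python
from typing import Callable, Dict, List

EOS = "<eos>"  # hypothetical end-of-sequence token


def greedy_decode(
    next_token_scores: Callable[[str, str, List[str]], Dict[str, float]],
    image_id: str,
    prompt: str,
    max_len: int = 64,
) -> str:
    """Generate a SMILES string one token at a time.

    At every step the scoring function sees the image, the fixed prompt,
    and all tokens produced so far, and the highest-scoring token is kept.
    """
    tokens: List[str] = []
    for _ in range(max_len):
        scores = next_token_scores(image_id, prompt, tokens)
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        tokens.append(best)
    return "".join(tokens)


def toy_scores(image_id: str, prompt: str, prefix: List[str]) -> Dict[str, float]:
    """Stand-in for a trained model: replays a fixed answer for one known image."""
    target = {"ethanol.png": ["C", "C", "O"]}[image_id]
    wanted = target[len(prefix)] if len(prefix) < len(target) else EOS
    vocab = ["C", "O", "N", EOS]
    return {tok: (1.0 if tok == wanted else 0.0) for tok in vocab}


smiles = greedy_decode(toy_scores, "ethanol.png", "Convert the molecule to SMILES.")
print(smiles)
```

Because each token is conditioned on everything generated before it, a single wrong token early in the sequence affects all later steps.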

B. Two-Stage Training Strategy

Direct full-parameter fine-tuning was found to fail. Instead, the authors implemented:

Stage 1: Parameter-Efficient Fine-Tuning (LoRA):
- Goal: Adapt the text generation pathway and the cross-modal alignment interface between the visual encoder and the decoder.
- Configuration: LoRA modules are applied to main attention/feed-forward projections and visual-language projection layers.
- Data: 192k samples (64k each from three sources: MolScribe-style synthetic, ChemDraw-style synthetic, and realistic USPTO-MOL images).
Stage 2: Progressive Full-Parameter Fine-Tuning:
- Goal: Refine higher-level modules while preserving low-level visual stability.
- Configuration:
  - Freezing: The lowest-level visual tokenizer and input token embeddings are frozen.
  - Optimization: The LM-as-vision-encoder, compression/projection interface, and autoregressive decoder are updated.
  - Split Learning Rates: A smaller learning rate is used for the visual branch, and a larger rate for the language generation branch.
- Data: Expanded to 800k samples (300k MolScribe-style, 300k ChemDraw-style, 200k realistic USPTO-MOL).
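
The two stages above can be sketched in plain Python. This is an illustrative sketch, not the authors' code. `lora_forward` shows the standard LoRA update, in which a frozen weight matrix is augmented by a scaled low-rank product. `stage2_param_groups` shows one way to freeze some parameters and give the rest two different learning rates. The module name prefixes, the learning-rate values, and the decision to place the projection interface in the slower group are all assumptions, since the summary does not specify them.

```python
from typing import Dict, List, Sequence


def matvec(M: Sequence[Sequence[float]], v: Sequence[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha: float, r: int) -> List[float]:
    """Stage 1 idea: y = W x + (alpha / r) * B (A x).

    W stays frozen; only the small matrices A (r x d_in) and B (d_out x r)
    would be trained.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


# Hypothetical parameter-name prefixes, chosen only for this sketch.
FROZEN_PREFIXES = ("visual_tokenizer.", "token_embeddings.")
VISUAL_PREFIXES = ("vision_encoder.", "projector.")


def stage2_param_groups(
    param_names: Sequence[str], visual_lr: float, language_lr: float
) -> List[Dict[str, object]]:
    """Stage 2 idea: drop frozen parameters, split the rest into two groups."""
    visual: List[str] = []
    language: List[str] = []
    for name in param_names:
        if name.startswith(FROZEN_PREFIXES):
            continue  # frozen: receives no updates at all
        if name.startswith(VISUAL_PREFIXES):
            visual.append(name)
        else:
            language.append(name)
    return [
        {"params": visual, "lr": visual_lr},      # smaller rate
        {"params": language, "lr": language_lr},  # larger rate
    ]


names = [
    "visual_tokenizer.conv.weight",
    "token_embeddings.weight",
    "vision_encoder.layer0.attn.weight",
    "projector.linear.weight",
    "decoder.layer0.mlp.weight",
]
# The learning-rate values here are placeholders, not values from the paper.
groups = stage2_param_groups(names, visual_lr=1e-6, language_lr=1e-5)
```

With `B` initialized to zeros, `lora_forward` returns exactly the frozen model's output, which is why LoRA training starts from the pre-trained behavior. The list-of-dictionaries layout mirrors the per-group options that PyTorch optimizers accept, with tensors in place of the names used here.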

C. Dataset Construction

To ensure robustness against diverse drawing styles and document artifacts, the training corpus combines:

Synthetic Data: Rendered from PubChem structures using two styles:
- MolScribe-like: Stronger appearance variation and perturbations.
- ChemDraw-like: Cleaner, fewer perturbations.
Realistic Data: Images extracted from USPTO patents (USPTO-MOL), containing real-world artifacts like imperfect scan quality, non-uniform line thickness, and specific patent drawing conventions.
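
The arithmetic behind the two training mixtures can be checked with a short sketch. The sample counts are the ones reported for Stage 1 and Stage 2. The dictionary keys and the weighted sampler are hypothetical illustrations and do not describe the authors' data loader.

```python
import random
from typing import Dict

# Sample counts as reported for each stage; the key names are made up here.
STAGE1_MIX = {
    "molscribe_synthetic": 64_000,
    "chemdraw_synthetic": 64_000,
    "uspto_mol": 64_000,
}
STAGE2_MIX = {
    "molscribe_synthetic": 300_000,
    "chemdraw_synthetic": 300_000,
    "uspto_mol": 200_000,
}


def mixture_fractions(mix: Dict[str, int]) -> Dict[str, float]:
    """Return each source's share of the whole mixture."""
    total = sum(mix.values())
    return {name: count / total for name, count in mix.items()}


def sample_source(mix: Dict[str, int], rng: random.Random) -> str:
    """Pick a data source with probability proportional to its sample count."""
    names = list(mix)
    return rng.choices(names, weights=[mix[n] for n in names], k=1)[0]


print(sum(STAGE1_MIX.values()), sum(STAGE2_MIX.values()))
print(mixture_fractions(STAGE2_MIX))
print(sample_source(STAGE2_MIX, random.Random(0)))
```

Stage 1 is an even three-way split of 192k samples. In Stage 2 the realistic patent images make up 25% of the 800k samples and each synthetic style makes up 37.5%.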

D. Post-Training Exploration

The authors investigated Reinforcement Learning (GSPO) and Representation Finetuning (ReFT) to improve performance. However, these methods failed to improve strict sequence-level fidelity. While they sometimes improved graph-level consistency, they degraded the exact SMILES matching accuracy required for the benchmark.
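
The distinction between graph-level consistency and exact SMILES matching can be shown with a toy example. The sketch below handles only a deliberately tiny SMILES subset (single uppercase-letter atoms and parenthesized branches, with no rings, bond symbols, charges, or aromatic atoms), which is an assumption made to keep it self-contained. A real evaluation would normally rely on a cheminformatics toolkit such as RDKit instead.

```python
from typing import Dict, List, Optional, Tuple


def parse_acyclic_smiles(smiles: str) -> Tuple[List[str], Dict[int, List[int]]]:
    """Parse the toy subset into atom labels and an adjacency list."""
    atoms: List[str] = []
    adj: Dict[int, List[int]] = {}
    prev: Optional[int] = None
    stack: List[Optional[int]] = []
    for ch in smiles:
        if ch == "(":
            stack.append(prev)   # remember the branch point
        elif ch == ")":
            prev = stack.pop()   # return to the branch point
        elif ch.isalpha() and ch.isupper():
            idx = len(atoms)
            atoms.append(ch)
            adj[idx] = []
            if prev is not None:  # bond the new atom to the previous one
                adj[idx].append(prev)
                adj[prev].append(idx)
            prev = idx
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, adj


def canonical_tree(atoms: List[str], adj: Dict[int, List[int]]) -> str:
    """Order-independent encoding of a labeled tree: try every root, keep the smallest."""
    def encode(node: int, parent: int) -> str:
        children = sorted(encode(n, node) for n in adj[node] if n != parent)
        return atoms[node] + "(" + "".join(children) + ")"
    return min(encode(root, -1) for root in adj)


def same_graph(a: str, b: str) -> bool:
    """True when both strings describe the same molecular graph (toy subset only)."""
    return canonical_tree(*parse_acyclic_smiles(a)) == canonical_tree(*parse_acyclic_smiles(b))


predicted, reference = "OCC", "CCO"       # two ways of writing ethanol
print(predicted == reference)             # exact string match
print(same_graph(predicted, reference))   # graph-level match
```

Here the two strings differ character by character even though they describe the same molecule, so an objective that rewards graph-level agreement can score a prediction as correct while a strict string comparison scores it as wrong.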

3. Key Contributions

Adaptation of DeepSeek-OCR-2: Successfully adapted a state-of-the-art document VLM for molecular recognition by formulating it as an image-to-SMILES task.
Novel Training Strategy: Introduced a two-stage progressive SFT approach (LoRA, then selective full-parameter fine-tuning) with split learning rates to stabilize training and prevent catastrophic forgetting of visual features.
Comprehensive Benchmarking: Evaluated the model against a wide range of datasets, including synthetic (Indigo, ChemDraw), realistic (USPTO, CLEF, Staker, UOB, ACS), and perturbed (noisy) versions.
Analysis of Limitations: Provided empirical evidence that reinforcement-style post-training fails to maintain the strict sequence-level fidelity required for exact SMILES matching in VLMs, highlighting a fundamental gap between graph-equivalent structures and serialized string correctness.

4. Results

The performance was evaluated based on Exact Matching Accuracy (the generated SMILES must exactly match the ground truth).
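
A minimal sketch of this metric is given below. It compares strings literally. Whether the benchmark normalizes or canonicalizes SMILES before comparison is not stated in this summary, so no such step is assumed, and the example predictions are invented for illustration.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that are identical to their reference string."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(pred == ref for pred, ref in zip(predictions, references))
    return hits / len(references)


# Invented example: the second prediction is the same molecule written
# differently, yet it still counts as a miss under literal comparison.
preds = ["CCO", "OCC", "c1ccccc1"]
refs = ["CCO", "CCO", "c1ccccc1"]
accuracy = exact_match_accuracy(preds, refs)
print(f"{accuracy:.3f}")
```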

Comparison with Image-to-Sequence Models:
- MolSeek-OCR achieved competitive results, broadly comparable to DECIMER (the best-performing Image-to-Sequence baseline).
- Example Performance: On the Indigo (synthetic) dataset, MolSeek-OCR achieved 74.3% accuracy vs. DECIMER's 69.6%. On CLEF (realistic), it achieved 63.3% vs. DECIMER's 62.7%.
Comparison with Image-to-Graph Models:
- MolSeek-OCR remains inferior to state-of-the-art Image-to-Graph models like MolScribe.
- Example Performance: On Indigo, MolScribe achieved 97.5% vs. MolSeek-OCR's 74.3%. On CLEF, MolScribe achieved 88.9% vs. MolSeek-OCR's 63.3%.
Failure of Post-Training: GSPO and ReFT attempts resulted in a decrease in exact-match accuracy, suggesting that gains in graph-level consistency do not translate into better serialized string generation in this context.

5. Significance and Conclusion

Feasibility of VLMs for OCSR: The paper demonstrates that general-purpose document VLMs can be effectively adapted for chemical structure recognition, achieving performance on par with specialized Image-to-Sequence models without needing complex graph-prediction heads.
Architectural Insight: The success of the two-stage training strategy (freezing low-level vision, optimizing high-level alignment) provides a blueprint for fine-tuning VLMs on specialized scientific tasks where direct full-parameter tuning fails.
Limitation Identification: The study highlights a critical bottleneck: while VLMs can learn the semantic structure of molecules, they struggle with the syntactic precision required for exact SMILES string generation compared to models that explicitly predict geometric graphs.
Future Direction: The results suggest that for high-fidelity OCSR, explicitly modeling geometric layouts (Image-to-Graph) remains superior to autoregressive text generation, though VLMs offer a promising, unified approach for multimodal chemical reasoning.

Availability: The code, datasets, and parameters are publicly available on GitHub (MolSeek-OCR).

Fine-tuning DeepSeek-OCR-2 for Molecular Structure Recognition