Tokenization for Molecular Foundation Models

Original authors: Alexius Wadell, Anoushka Bhutani, Venkatasubramanian Viswanathan

Published 2026-01-29

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Technical Summary: Tokenization for Molecular Foundation Models

Problem Statement
Accurate prediction of chemical properties is critical for industries ranging from energy storage to pharmaceutical discovery. While transformer architectures have revolutionized Natural Language Processing (NLP), their application to molecular foundation models faces a fundamental bottleneck: tokenization. Current molecular models predominantly rely on "Atom-wise" tokenization, where Simplified Molecular Input Line Entry System (SMILES) strings are split into atom-level tokens using fixed vocabularies.

The primary limitation of this approach is its inability to fully cover the OpenSMILES specification. Atom-wise tokenizers treat bracketed atoms (which encode isotopes, chiral centers, charges, and explicit hydrogen counts) as single, indivisible tokens. To cover every possible permutation of these features, a vocabulary would require over 28 trillion tokens. Consequently, existing models use vocabularies of fewer than 3,000 tokens, leaving significant coverage gaps. When they encounter a bracketed atom outside the vocabulary, these closed-vocabulary tokenizers fall back to a generic unknown token [UNK], which can obscure critical chemical information such as chirality or isotopic composition. Furthermore, existing open-vocabulary alternatives (such as BPE-based schemes) often suffer from ambiguity, conflating distinct chemical entities into the same token (e.g., Sc read as a sulfur atom followed by an aromatic carbon vs. Sc read as the element scandium).
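The coverage gap described above can be reproduced with a few lines of Python. This is a minimal sketch, not the tokenizer of any particular model: the regular expression is a simplified atom-wise pattern and `VOCAB` is a hypothetical, deliberately tiny closed vocabulary, both chosen only for illustration.

```python
import re

# Simplified atom-wise pattern (an assumption for illustration): a bracketed
# atom is kept as ONE token; everything else is split into atoms, ring
# closures, bonds, and branch symbols.
ATOM_WISE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#$:/\\().+\-@~*]")

# Hypothetical closed vocabulary, far smaller than any real one.
VOCAB = {"C", "N", "O", "c", "(", ")", "=", "1", "[C@@H]", "[NH4+]"}

def atom_wise_tokenize(smiles):
    """Split a SMILES string atom-wise and map unseen tokens to [UNK]."""
    return [tok if tok in VOCAB else "[UNK]" for tok in ATOM_WISE.findall(smiles)]

# L-alanine: every token is in the vocabulary.
print(atom_wise_tokenize("N[C@@H](C)C(=O)O"))
# The same molecule with a carbon-13 label: the bracketed atom is unseen,
# so isotope, chirality, and hydrogen count all collapse into [UNK].
print(atom_wise_tokenize("N[13C@@H](C)C(=O)O"))
```

In the second call the element, the isotope, and the chirality tag are all lost at once, because the closed vocabulary can only accept or reject the bracketed atom as a whole.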

Methodology
The authors propose a new tokenization framework, Smirk, and a compressed variant, Smirk-GPE, designed to achieve complete coverage of the OpenSMILES specification while maintaining computational efficiency.

Smirk Tokenization: This scheme employs a two-stage, character-level decomposition of SMILES strings based on the glyphs defined by the OpenSMILES specification.
- Stage 1: Decomposition into atoms (e.g., OC[C@@H][OH] $\rightarrow$ O C [C@@H] [OH]).
- Stage 2: Decomposition of bracketed atoms into constituent glyphs (e.g., [C@@H] $\rightarrow$ [ C @ @ H ]).
- This approach resolves otherwise ambiguous sequences (e.g., Sc, a sulfur atom followed by an aromatic carbon, vs. [Sc], the element scandium) by treating the brackets and the symbols inside them as distinct tokens. The resulting vocabulary is fixed at 165 tokens, requires no training, and guarantees that any OpenSMILES-encoded molecule can be tokenized without using an [UNK] token.
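The two stages can be sketched in Python as follows. This is an illustrative approximation rather than the authors' implementation: both regular expressions are simplified assumptions, the function name `smirk_like_tokenize` is hypothetical, digits are split one character at a time, and multi-character chirality classes are not treated specially.

```python
import re

# Stage 1 (simplified): split into atom-level tokens; bracketed atoms stay whole.
STAGE1 = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[=#$:/\\().\-~*]")
# Stage 2 (simplified): split the inside of a bracketed atom into glyphs:
# element symbols, single digits, and the @ + - : * symbols.
STAGE2 = re.compile(r"[A-Z][a-z]?|[a-z]{1,2}|\d|[@+\-:*]")

def smirk_like_tokenize(smiles):
    """Two-stage glyph-level tokenization sketch (skips unmatched characters)."""
    tokens = []
    for atom in STAGE1.findall(smiles):
        if atom.startswith("["):
            # Brackets become their own tokens, so "[Sc]" never collides
            # with an unbracketed "S" followed by "c".
            tokens.append("[")
            tokens.extend(STAGE2.findall(atom[1:-1]))
            tokens.append("]")
        else:
            tokens.append(atom)
    return tokens

print(smirk_like_tokenize("OC[C@@H][OH]"))
# ['O', 'C', '[', 'C', '@', '@', 'H', ']', '[', 'O', 'H', ']']
print(smirk_like_tokenize("CSc1ccccc1"))  # sulfur bonded to an aromatic ring
print(smirk_like_tokenize("[Sc]"))        # scandium: ['[', 'Sc', ']']
```

Because every bracketed atom is built from a small, fixed set of glyphs, no combination of isotope, chirality, hydrogen count, or charge needs its own vocabulary entry.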
Smirk-GPE (Glyph Pair Encoding): To address the increased sequence length (fertility) caused by fully decomposing bracketed atoms, the authors introduce Smirk-GPE. This variant applies a Byte-Pair Encoding (BPE)-like compression strategy to the glyph tokens. Unlike standard BPE, which merges strings, Smirk-GPE learns merge rules on token IDs, so that a merged token (e.g., a sulfur glyph followed by a carbon glyph) remains distinct from an atomic symbol that happens to be spelled the same way (e.g., scandium).
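The idea of merging on token IDs rather than strings can be sketched as below. This is a generic BPE-style merge loop written for illustration, not the Smirk-GPE training code; the function names and the toy ID assignments are hypothetical.

```python
from collections import Counter

def apply_merge(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_merges(sequences, num_merges, next_id):
    """Learn BPE-style merge rules over lists of integer token IDs."""
    sequences = [list(seq) for seq in sequences]
    merges = {}
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent ID pair
        merges[best] = next_id            # the merged token gets a fresh ID
        sequences = [apply_merge(seq, best, next_id) for seq in sequences]
        next_id += 1
    return merges

# Hypothetical glyph IDs: C=0, S=1, c=2, "["=3, Sc (scandium)=4, "]"=5
corpus = [[1, 2], [1, 2, 0], [3, 4, 5]]
merges = learn_merges(corpus, num_merges=1, next_id=6)
print(merges)  # {(1, 2): 6}
```

In this toy run the merged pair (sulfur, aromatic carbon) receives ID 6, while scandium keeps ID 4. A string-level merge would instead produce the text "Sc" for both, which is the ambiguity the ID-based formulation avoids.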
Evaluation Framework:
- Intrinsic Metrics: The authors evaluated tokenizers using fertility (mean sequence length), normalized entropy (compression efficiency), token imbalance, and the frequency of the [UNK] token.
- Low-Cost Proxy: Recognizing that training full transformer models for every tokenizer is computationally expensive, the authors utilized n-gram models as a proxy. They trained n-gram models on 1.6 billion SMILES strings and measured cross-entropy loss and information loss (via KL-divergence) to estimate downstream performance.
- Extrinsic Validation: To validate the n-gram proxy, the authors pre-trained 18 encoder-only RoBERTa models (from scratch) using 11 different tokenizers and three molecular encodings. These models were fine-tuned on six regression and seven classification tasks from MoleculeNet and tmQM.
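Three of the intrinsic metrics are simple to compute from a tokenized corpus. The sketch below assumes common definitions rather than the paper's exact ones: fertility as mean tokens per molecule, normalized entropy as the Shannon entropy of the unigram token distribution divided by the log of the vocabulary size, and [UNK] frequency as the fraction of emitted tokens that are [UNK]. The function name is hypothetical.

```python
import math
from collections import Counter

def intrinsic_metrics(tokenized, vocab_size, unk="[UNK]"):
    """Compute fertility, normalized entropy, and [UNK] rate.

    `tokenized` is a list of token lists, one per molecule.
    """
    counts = Counter(tok for mol in tokenized for tok in mol)
    total = sum(counts.values())
    # Shannon entropy (in nats) of the unigram token distribution.
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return {
        "fertility": total / len(tokenized),
        "normalized_entropy": entropy / math.log(vocab_size),
        "unk_rate": counts[unk] / total,
    }

# Toy corpus: two molecules, four distinct tokens, one of them [UNK].
metrics = intrinsic_metrics([["C", "O"], ["N", "[UNK]"]], vocab_size=4)
print(metrics)
```

For this toy corpus the fertility is 2.0 and one token in four is [UNK]; the normalized entropy is at its maximum because all four vocabulary entries are used equally often.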

Key Results

Coverage: Smirk and Smirk-GPE are the only tokenizers evaluated that achieve 100% coverage of the OpenSMILES specification, eliminating the use of the [UNK] token. In contrast, existing chemistry-specific tokenizers (including SPE, APE, and various BPE variants) emit the [UNK] token with non-negligible frequency (up to ~50% on the tmQM dataset).
Information Loss: Tokenizers with limited coverage exhibit substantial information loss, particularly on datasets rich in transition metals and stereochemistry (e.g., tmQM). For instance, the MoLFormer tokenizer incurs a loss of 40.3 nats/molecule on tmQM due to unknown tokens, whereas Smirk variants mitigate this degradation.
Performance Correlation: The study found a strong linear correlation between n-gram metrics (cross-entropy and information loss) and the downstream performance of transformer-based models. This validates the use of n-grams as a low-cost proxy for evaluating tokenizer quality.
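To make the n-gram proxy concrete, here is a minimal bigram model that scores a tokenized molecule by its cross-entropy in nats per molecule. It is a sketch under stated assumptions, not the authors' setup: the order (n = 2), the add-one smoothing, the boundary markers, and all function names are illustrative choices.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # hypothetical sequence boundary markers

def train_bigram(corpus):
    """Count bigrams and their contexts over a list of token lists."""
    contexts, bigrams = Counter(), Counter()
    for tokens in corpus:
        padded = [BOS, *tokens, EOS]
        contexts.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    # Tokens that can be predicted: everything seen, plus end-of-sequence.
    vocab = {tok for tokens in corpus for tok in tokens} | {EOS}
    return contexts, bigrams, len(vocab)

def cross_entropy(model, tokens):
    """Negative log-likelihood of one molecule in nats (add-one smoothing)."""
    contexts, bigrams, v = model
    padded = [BOS, *tokens, EOS]
    return -sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + v))
        for a, b in zip(padded, padded[1:])
    )

model = train_bigram([["C", "C", "O"], ["C", "O"]])
seen = cross_entropy(model, ["C", "O"])    # matches the training data
unseen = cross_entropy(model, ["O", "C"])  # token order never observed
print(seen, unseen)
```

The sequence resembling the training data receives the lower cross-entropy. Comparing such scores across tokenizers is far cheaper than pre-training a transformer for each one, which is the motivation for using n-gram models as a proxy.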
Downstream Impact:
- Smirk showed a positive effect on pretraining quality and downstream performance on the tmQM dataset.
- On MoleculeNet tasks (dominated by small organic molecules), Smirk performed similarly to standard Atom-wise tokenization.
- Tokenizers with poor coverage (SPE/APE) negatively impacted both pretraining and downstream performance relative to the baseline.
- The choice of molecular encoding (SMILES vs. SELFIES) was found to have a negligible impact compared to the choice of tokenizer.

Significance and Claims
The paper argues that a foundation model for chemistry must encode the entire breadth of chemical space to avoid obscuring critical features. The authors claim that current tokenizers inadvertently obscure atom-level information (such as the stereochemistry of cisplatin or specific isotopes), and that the resulting information loss is not merely theoretical: it affects clinically and industrially relevant molecules.

The significance of this work lies in:

Robustness: Demonstrating that open-vocabulary tokenizers (Smirk/Smirk-GPE) provide robust coverage of chemical space, preventing the loss of information associated with unknown tokens.
Efficiency: Establishing that n-gram models can serve as a reliable, low-cost proxy for evaluating tokenizer performance, reducing the computational burden of hyperparameter tuning and model selection.
Interpretability: Highlighting that Smirk allows researchers to directly manipulate the information-rich content of bracketed atoms, expanding on the interpretability benefits of Atom-wise tokenization while removing the risk of out-of-vocabulary errors.

The authors conclude that while current benchmarks (like MoleculeNet) may not fully expose the deficiencies of limited-coverage tokenizers due to a lack of diversity in elements and stereochemistry, transitioning to tokenizers capable of encoding the entirety of chemical space is necessary for reliable molecular foundation models. They encourage the community to rigorously assess benchmark scopes and expand datasets to include diverse chemical features.
