Pushing the limits of one-dimensional NMR spectroscopy… — Plain-Language Explanation

Original authors: Frank Hu, Jonathan M. Tubb, Dimitris Argyropoulos, Sergey Golotvin, Mikhail Elyashberg, Grant M. Rotskoff, Matthew W. Kanan, Thomas E. Markland

Published 2026-06-10


Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a detective trying to solve a mystery, but instead of finding fingerprints or a witness, you only have a single, blurry photograph of the suspect's shadow. Your job is to reconstruct the suspect's entire face, body, and clothing just from that one shadow.

This is essentially what chemists face when they try to figure out the structure of a new molecule using only 1D NMR spectroscopy.

The Impossible Puzzle

In the world of chemistry, a molecule is like a complex Lego structure. For a medium-sized molecule (one with about 36 to 40 "heavy" atoms like carbon, nitrogen, or oxygen), there are more possible ways to snap those Legos together than there are grains of sand on all the beaches on Earth. The paper estimates this number to be between $10^{20}$ and $10^{60}$.

Traditionally, figuring out which specific Lego structure you have using only a simple 1D NMR "shadow" (a spectrum) was considered impossible. It's like trying to guess the exact arrangement of the bricks just by looking at a single, flat shadow. Usually, chemists need more clues, like 2D NMR (which gives a more detailed, two-dimensional map of how atoms are connected) or the exact list of ingredients (the molecular formula) to solve the puzzle.

The AI Detective

The researchers in this paper built a super-smart AI detective (a "Transformer" model, the same type of technology behind many modern chatbots) that can solve this puzzle using only the 1D NMR shadow.

Here is how they trained it, using a clever two-step process:

Step 1: Learning the Language of Shapes (Pre-training)
Before the AI could look at the NMR shadows, they taught it a different game. They gave it "Morgan fingerprints"—which are like digital barcodes that describe the small pieces (fragments) of a molecule—and asked the AI to build the full Lego structure from those barcodes.

The Analogy: Imagine teaching a child to build a house by showing them a list of bricks (windows, doors, walls) and asking them to assemble the house.
The Result: The AI became a master builder. It could look at a list of fragments and correctly reconstruct the full house 97.8% of the time.
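
To make the "barcode" idea concrete, here is a minimal Python sketch of a Morgan-style circular fingerprint on a toy molecular graph. It is an illustration only, not the paper's code: real Morgan fingerprints (for example, as implemented in cheminformatics toolkits) also encode bond orders, charges, ring membership, and more. The function name `circular_fingerprint`, the CRC32 hashing, and the 2048-bit width are all assumptions chosen for this example.

```python
import zlib


def circular_fingerprint(atoms, bonds, radius=2, n_bits=2048):
    """Toy Morgan-style fingerprint (hypothetical helper, not the paper's code).

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs; bond order is ignored in this sketch
    Returns the set of "on" bit positions in an n_bits-wide barcode.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Radius 0: each atom is identified only by its element symbol.
    ids = [zlib.crc32(symbol.encode()) for symbol in atoms]
    on_bits = {identifier % n_bits for identifier in ids}

    # Each round folds the neighbours' identifiers into every atom's own
    # identifier, so the described fragment grows by one bond per round.
    for _ in range(radius):
        new_ids = []
        for i in range(len(atoms)):
            environment = (ids[i], tuple(sorted(ids[j] for j in neighbors[i])))
            new_ids.append(zlib.crc32(repr(environment).encode()))
        ids = new_ids
        on_bits |= {identifier % n_bits for identifier in ids}
    return on_bits


# Heavy-atom skeleton of ethanol: C-C-O
ethanol_bits = circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
print(sorted(ethanol_bits))
```

Because neighbour identifiers are sorted before hashing, the barcode does not depend on the order in which the atoms are listed. That property is what lets a fingerprint describe the fragments of a molecule rather than one particular way of writing it down.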

Step 2: The Real Test (Spectrum to Structure)
Once the AI was a master builder, they taught it the real task: looking at the NMR "shadow" and guessing the Lego structure directly.

They didn't give it the list of ingredients (the molecular formula).
They didn't give it a 2D NMR map.
They only gave it the 1D NMR spectrum.

The Results: Solving the Unsolvables

The AI did far better than expected on a task long considered impossible:

Accuracy: For molecules with up to 40 heavy atoms, the correct structure appeared among the AI's top 15 guesses about 60% of the time.
The "Shadow" vs. The "Map": Even if the AI didn't get the exact right answer, it was usually very close. If it guessed wrong, the structure it suggested was often 82% similar to the real molecule. It's like the detective guessing the suspect is wearing a red hat instead of a blue one, but getting the rest of the outfit right.
One Eye is Enough: Surprisingly, the AI could do most of this work using only the Hydrogen (1H) NMR spectrum, without needing the Carbon (13C) data. It still got the right answer 46.6% of the time in its top 15 guesses.
Real-World Adaptability: The AI was trained on computer simulations, but the researchers showed it could be "fine-tuned" with just 50 real-world experimental spectra. Even with this tiny amount of real data, its top-15 accuracy on real spectra rose from 0% to 21.5%.
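
The "top 15 guesses" figures above are top-k accuracies. The sketch below shows one way such a score could be computed. It assumes each guess and each target is already written as a canonical string, so plain string equality is a fair comparison; a real evaluation would need to canonicalize the SMILES strings first. The function name and the toy data are hypothetical.

```python
def top_k_accuracy(ranked_guesses, targets, k=15):
    """Fraction of targets that appear among the first k ranked guesses.

    ranked_guesses: one list of candidate strings per spectrum, best first
    targets: the true structure string for each spectrum
    """
    if len(ranked_guesses) != len(targets):
        raise ValueError("need exactly one guess list per target")
    hits = sum(
        target in guesses[:k]
        for guesses, target in zip(ranked_guesses, targets)
    )
    return hits / len(targets)


# Toy data: three "spectra", each with a ranked list of candidate structures.
guesses = [["CCO", "CO"], ["CC", "CCC"], ["C"]]
truth = ["CO", "CCCC", "C"]
print(top_k_accuracy(guesses, truth, k=1))  # only the third target is ranked first
print(top_k_accuracy(guesses, truth, k=2))  # the first target is now found as well
```

A top-15 accuracy of about 60% therefore means that for roughly 6 out of every 10 spectra, the true structure is somewhere in the list of 15 candidates. It does not mean that 60% of the candidates are correct.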

Why This Matters

Think of the chemical space as a library with $10^{60}$ books. Finding the one specific book you need by reading just the cover (the 1D NMR spectrum) was thought to be impossible. This AI doesn't find the book every time, but it narrows the search down to a small stack of 15 books, and about 6 times out of 10 the book you want is in that stack.

The paper concludes that this tool could let scientists skip, at least for a first pass, the expensive and time-consuming steps of collecting more complex data. It acts as a powerful filter, rapidly narrowing the astronomically many possible chemical structures down to a manageable few, all based on the simplest, most common data available in a chemistry lab.

Technical Summary: Pushing the Limits of One-Dimensional NMR Spectroscopy for Automated Structure Elucidation Using Artificial Intelligence

Problem Statement
One-dimensional (1D) NMR spectroscopy is a primary tool for characterizing organic compounds. However, determining a molecule's full structure (formula and connectivity) from 1D ¹H and/or ¹³C NMR spectra alone, a task known as de novo structure generation, is traditionally considered intractable for molecules with more than a few atoms. The cause is the combinatorial explosion of chemical space: the number of possible structures for molecules with up to 36 non-hydrogen atoms ranges from $10^{20}$ to $10^{60}$. Existing computer-assisted structure elucidation (CASE) approaches typically require additional data (e.g., 2D NMR, HR-MS, molecular formulas) or rely on matching against candidate libraries, which limits their applicability to novel compounds or situations where such context is unavailable. Current machine learning methods often fail to address the full spectrum-to-structure task without intermediate steps or extensive conditioning information.

Methodology
The authors propose an end-to-end deep learning framework based on transformer architectures to solve the spectrum-to-structure and spectrum-to-substructure tasks using only 1D ¹H and ¹³C NMR spectra, without requiring the molecular formula or other contextual data.

Pretraining (Substructure-to-Structure): The framework utilizes a pretraining phase where a transformer model learns to reconstruct SMILES strings from Morgan fingerprints (binary vectors representing molecular substructures). This task teaches the model the semantics and syntactic validity of molecular representations. The model was trained on 88 million unique SMILES strings from PubChem (as of February 2025), each containing up to 40 heavy atoms and built from the elements C, N, O, H, B, P, S, Si, F, Br, Cl, and I.
Multitask Architecture: The pretrained weights are transferred to initialize the structure elucidation branch of a multitask model.
- Input: The model takes 1D ¹H NMR spectra (encoded via a convolutional neural network) and ¹³C NMR chemical shifts (embedded representation).
- Processing: A combined latent representation is fed into two parallel branches:
  - A substructure elucidation branch (4-layer transformer encoder) that predicts the probability of specific molecular fragments being present.
  - A structure prediction branch (8-layer encoder-decoder transformer) that autoregressively generates the SMILES string.
Training Data: The multitask model was trained on a curated set of 2 million molecules (selected from the 88M pool to ensure diversity and prevent data leakage) with forward-simulated ¹H and ¹³C NMR spectra generated using ACD/Labs predictors.
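
Generating a SMILES string autoregressively means emitting it one token at a time, each conditioned on the tokens already produced. The sketch below shows how a SMILES string might be split into tokens and turned into (prefix, next token) training pairs. The regular expression, the `<bos>`/`<eos>` markers, and both function names are assumptions made for illustration; this summary does not describe the tokenization the authors actually used.

```python
import re

# Hypothetical token pattern covering common SMILES syntax.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"           # bracket atoms such as [Si] or [NH4+]
    r"|Br|Cl"               # two-letter atoms, matched before single letters
    r"|[BCNOPSFI]"          # one-letter aliphatic atoms
    r"|[bcnops]"            # aromatic atoms
    r"|%\d{2}|\d"           # ring-closure labels
    r"|[=#\-+\\/().:~@*$]"  # bonds, branches, and other symbols
)


def tokenize(smiles):
    """Split a SMILES string into tokens, rejecting unrecognized characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


def next_token_pairs(tokens):
    """List the (prefix, next token) pairs an autoregressive decoder learns from."""
    sequence = ["<bos>"] + tokens + ["<eos>"]
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


print(tokenize("CC(=O)Cl"))  # acetyl chloride
for prefix, target in next_token_pairs(tokenize("CO")):
    print(prefix, "->", target)
```

Matching `Br` and `Cl` before the single-letter atoms keeps a chlorine from being read as a carbon followed by a stray `l`.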

Key Results

Substructure-to-Structure Performance: The pretraining model achieved a Top-15 accuracy of 97.8% in reconstructing SMILES strings from Morgan fingerprints for molecules up to 40 heavy atoms. Even for the largest molecules (40 heavy atoms), accuracy remained high (88.8%), and incorrect predictions showed high Tanimoto similarity (average MTS of 0.82) to the target, indicating the model recovers substantial structural information even when failing exact reconstruction.
Spectrum-to-Structure Performance: The multitask framework achieved a Top-15 structure accuracy of 60.4% on the test set using only ¹H and ¹³C NMR spectra. This performance was maintained across the full range of molecule sizes (10–40 heavy atoms), despite the chemical space growing by over 30 orders of magnitude within this range.
- Using only ¹H NMR spectra resulted in a Top-15 accuracy of 46.6%.
- Using only ¹³C NMR spectra resulted in a Top-15 accuracy of 19.4%.
- Pretraining improved the Top-15 structure accuracy by 22 percentage points compared to training from random initialization.
Elemental Coverage: The model successfully generalized to elements beyond C, N, O, and H, including P, S, Si, B, and halogens. While accuracy varied by element (e.g., higher for S, lower for P due to valency diversity), the model demonstrated the ability to predict structures containing rare elements (e.g., B, I) with accuracies exceeding 20%.
Substructure Prediction: The model achieved an F1 score of 0.84 for substructure prediction. Predictions were highly confident, with 98.1% of probabilities falling outside the 0.1–0.9 range.
Experimental Validation: When fine-tuned on a small set of 50 experimental spectra from the BMRB, the model achieved a Top-15 structure accuracy of 21.5% on experimental test data, a significant improvement from 0.0% zero-shot accuracy, while retaining its performance on simulated data.
Candidate Generation: In cases where the exact structure was not predicted, the model's best incorrect prediction was often closer to the target molecule than any molecule found in the 85M-molecule PubChem training set (Top-1 position in 32.2% of failures for 40-heavy-atom systems).
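
Two of the metrics quoted above, Tanimoto similarity and the F1 score, can both be computed from sets of fingerprint bits. The sketch below is a minimal per-molecule illustration. It assumes fingerprints are represented as sets of "on" bit positions, and it does not reproduce how the paper aggregates these scores over its test set; the function names are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity: shared on-bits divided by all on-bits in either set."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    # Two empty fingerprints are treated as identical by convention here.
    return len(a & b) / len(union) if union else 1.0


def f1_score(predicted_bits, true_bits):
    """F1 score for substructure prediction on a single molecule."""
    predicted, actual = set(predicted_bits), set(true_bits)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


predicted = {1, 2, 3}
target = {2, 3, 4}
print(tanimoto(predicted, target))  # 2 shared bits out of 4 distinct bits
print(f1_score(predicted, target))
```

A Tanimoto similarity of 1.0 means the two fingerprints are identical, so a value of 0.82 for a wrong prediction indicates that most of the predicted substructures are shared with the target.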

Significance and Claims
The paper claims that this framework overcomes the combinatorial scaling of chemical space to enable automated de novo structure generation using only routine 1D NMR data. By leveraging insights from natural language processing and transformer architectures, the authors demonstrate that it is possible to predict the correct molecule with 60.4% accuracy within the first 15 predictions for systems with up to 40 heavy atoms.

The authors position this work as a foundational step toward fully automated structure elucidation. They argue that the framework:

Removes the bottleneck of requiring complex 2D NMR or molecular formulas for initial structure generation.
Provides a computationally efficient alternative to brute-force search or iterative genetic algorithms.
Offers a "foundational model" capability, where pretraining on large datasets allows for effective fine-tuning on small experimental datasets.
Generates high-quality candidate molecules that can constrain the chemical search space even when the exact structure is not immediately identified, potentially serving as seeds for more exhaustive search-based methods or CASE tools.

The authors acknowledge remaining challenges, including stereochemical determination and the gap between simulated and experimental data, but assert that their approach provides a robust foundation for scaling automated elucidation across the drug-like chemical space.
