Toward Controllable Catalyst Inverse Design via Large-Scale Autoregressive Pretraining

This paper introduces a large-scale autoregressive generative model pretrained on 133 million catalyst structures that enables controllable inverse design by generating valid catalysts with specific categorical and continuous properties, thereby significantly improving screening efficiency for reaction-targeted discovery.

Original authors: Dong Hyeon Mok, Jonggeol Na, Seoin Back

Published 2026-06-17

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to invent a new type of engine part, but instead of building it out of metal, you are building it out of atoms. In the world of chemistry, finding the perfect arrangement of atoms to make a "catalyst" (a substance that speeds up chemical reactions) is like looking for a needle in a haystack the size of the universe.

Traditionally, scientists have used a "trial-and-error" approach. They guess a shape, test it, and if it fails, they try again. Later, they used computers to screen millions of guesses, but this is still slow and expensive because the computer has to check every single possibility one by one.

This paper introduces a new tool called CatGPT (Catalyst Generative Pretrained Transformer). Think of it not as a calculator that checks answers, but as a creative chef who has read every cookbook in the world and can now invent new recipes that are very likely to be edible and, when asked, tailored to a specific taste.

Here is how the paper explains this breakthrough, broken down into simple concepts:

1. The "Chef" Needs to Read the Menu First (Pretraining)

Before the chef can cook a specific dish, they need to understand the basics of cooking. The researchers fed their AI model 133 million different catalyst structures. This is like the chef reading 133 million cookbooks to learn the "grammar" of atoms: which atoms like to hang out together, how they bond, and what shapes are physically possible.

  • The Result: The model learned the rules of chemistry so well that it can now generate new structures that are physically valid (atoms aren't crashing into each other) 98% of the time.
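
To make "atoms aren't crashing into each other" concrete, here is a minimal Python sketch of one simple validity check: flag a structure if any two atoms sit closer than a chosen cutoff. The function names, the assumption that coordinates are Cartesian and in ångströms, and the 0.5 Å cutoff are all illustrative choices, not the criterion used in the paper. The sketch also ignores periodic boundaries, which a real check on a slab would need to handle.

```python
import itertools
import math


def min_pair_distance(positions):
    """Smallest distance between any two atoms.

    positions: list of (x, y, z) tuples, assumed Cartesian and in ångströms.
    """
    return min(math.dist(a, b) for a, b in itertools.combinations(positions, 2))


def is_physically_plausible(positions, min_allowed=0.5):
    """True if no two atoms overlap.

    min_allowed is a hypothetical cutoff chosen for illustration only.
    """
    return min_pair_distance(positions) >= min_allowed


def validity_rate(structures, min_allowed=0.5):
    """Fraction of generated structures that pass the overlap check."""
    valid = sum(is_physically_plausible(s, min_allowed) for s in structures)
    return valid / len(structures)
```

Running `validity_rate` over a batch of generated structures gives the kind of percentage quoted above, although the paper's own definition of validity may include further checks.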

2. Ordering a Specific Dish (Conditional Generation)

In the past, if you asked this chef to cook, they might just make any random dish. But scientists need specific things: "I need a catalyst that works with this specific gas" or "I need one that binds with this specific energy level."

The researchers taught the model to listen to two types of orders:

  • The "Category" Order: Like saying, "I want a pizza with mushrooms and cheese." The model learned to generate structures with the requested chemical ingredients (adsorbates and compositions) about 93% of the time.
  • The "Number" Order: Like saying, "I want the pizza to be exactly 12 inches in diameter." This is harder because numbers are continuous. The researchers built a special "numerical ear" into the model's brain. Now, if you say, "I need a binding energy of -1.5," the model tries to cook a structure that matches that number.
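
The two kinds of orders can be pictured as two different inputs to the model. The sketch below is a guess at the general shape of such an interface, not the paper's implementation: the `Condition` class, the token spellings such as `<ads:OH>`, and the `<num>` placeholder are all hypothetical. Categorical requests become ordinary tokens, while the numeric target is kept as a real number so the model can read it through a separate numeric pathway instead of as text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Condition:
    """A hypothetical 'order' for the generator."""
    adsorbate: str                         # categorical: which molecule binds
    elements: List[str]                    # categorical: catalyst composition
    target_energy: Optional[float] = None  # continuous: desired binding energy


def build_prefix(cond: Condition) -> Tuple[List[str], List[float]]:
    """Turn an order into (tokens, numeric_values).

    Tokens carry the categorical conditions. Each "<num>" token marks a slot
    whose value is supplied in numeric_values, in the same order.
    """
    tokens = ["<bos>", f"<ads:{cond.adsorbate}>"]
    tokens += [f"<el:{e}>" for e in sorted(cond.elements)]
    numeric_values: List[float] = []
    if cond.target_energy is not None:
        tokens.append("<num>")
        numeric_values.append(cond.target_energy)
    return tokens, numeric_values
```

For example, `build_prefix(Condition("OH", ["Pt", "Ni"], -1.5))` yields the tokens for the adsorbate and the two elements plus one `<num>` slot holding -1.5. An autoregressive model would then continue the sequence with the structure itself.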

3. The "Magic" of the Recipe Book (The Results)

The paper claims this new chef is a massive improvement over previous methods:

  • Efficiency: If you were looking for a catalyst with a specific energy level, the old way was like pulling books off a library shelf at random and hoping one is the book you need. You might find it 5% of the time. This new model finds it 20% of the time, a four-fold improvement in hit rate. In practice, scientists need to test roughly four times fewer candidates to find a match, so less time is wasted on bad guesses.
  • Precision: When the researchers asked the model to make a catalyst for a specific reaction (like splitting water or reducing oxygen), the model successfully generated candidates that were much closer to the "perfect" target than random guessing.
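
The arithmetic behind the four-fold figure is short enough to write out. The sketch below assumes each generated candidate is an independent try with the same chance of hitting the target, so the expected number of tries per hit is one divided by the hit rate. The 5% and 20% values are the ones quoted above; the function name is illustrative.

```python
def expected_candidates_per_hit(hit_rate):
    """Expected number of candidates to test before one matches the target.

    Assumes independent tries with a constant success probability
    (the mean of a geometric distribution, 1 / p).
    """
    if not 0 < hit_rate <= 1:
        raise ValueError("hit_rate must be in (0, 1]")
    return 1.0 / hit_rate


baseline_rate = 0.05  # hit rate quoted for the older screening approach
model_rate = 0.20     # hit rate quoted for the new model

baseline_cost = expected_candidates_per_hit(baseline_rate)
model_cost = expected_candidates_per_hit(model_rate)
reduction = baseline_cost / model_cost  # how many times fewer candidates
```

Under these assumptions the baseline needs about 20 candidates per hit and the model about 5. This translates into a four-fold saving in time only if each candidate costs the same to evaluate in both cases.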

4. Learning New Cuisines with Limited Ingredients (Foundation Model)

What if the chef needs to cook a dish they've never seen before, like a "Single Atom Catalyst" (a very rare type of structure)? Usually, you would need thousands of examples to teach a chef a new cuisine.

The researchers tested if their model could learn these rare dishes with very little data. They found that because the model had already read the "133 million cookbooks" during pretraining, it could adapt to these new, rare styles of cooking very quickly. It performed much better than a chef who tried to learn the new style from scratch with only a few recipes.

The Limitations (What the Chef Can't Do Yet)

The paper is honest about what the model cannot do:

  • The Vocabulary Limit: The chef can only use ingredients they have seen in the 133 million cookbooks. If you ask for an element that never appears in the training data, the model has no way to represent it and cannot generate it reliably.
  • The "Stability" Puzzle: While the model can build a great-looking "slab" (the surface of the catalyst), it's sometimes hard to know exactly what the "bulk" (the solid block underneath) looks like. It's like building a beautiful house facade but not knowing if the foundation is solid without doing extra work.

The Bottom Line

This paper presents a tool that moves catalyst discovery from "searching for a needle in a haystack" to "asking a master chef to cook exactly what you need." By training on a massive amount of data and teaching the AI to listen to specific numerical and categorical instructions, the researchers have created a system that can generate high-quality, target-specific catalysts much faster than ever before.
