PLaID++: A Preference Aligned Language Model for… — Plain-Language Explanation

Original authors: Andy Xu, Rohan Desai, Larry Wang, Ethan Ritz, Gabriel Hope

Published 2026-06-12

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a master chef trying to invent a new, delicious, and safe recipe. You have a giant cookbook (a database of known materials) and a very smart, but slightly chaotic, sous-chef (an AI language model). Your goal isn't just to copy an existing recipe; you want the AI to invent brand new recipes that are safe to eat (stable) and taste unique (novel).

This paper introduces PLaID++, a new way to train that AI sous-chef to be a better recipe inventor. Here is how it works, broken down into simple concepts:

1. The Problem: The "Copycat" Trap

The researchers set out to teach the AI to design crystal structures (the atomic-scale arrangements that underlie materials used in batteries and solar cells).

The Old Way: They taught the AI to list the exact 3D coordinates of every single atom, like writing down the GPS location of every grain of salt in a shaker.
The Issue: When they tried to "reward" the AI for making good crystals, it got lazy. It started memorizing a few "perfect" recipes and just repeating them over and over. In AI terms, this is called mode collapse. It stopped being creative and just copied what it knew worked, ignoring the vast universe of other possibilities.

2. The Solution: The "Symmetry Shortcut" (Wyckoff Text)

To fix the copycat problem, the researchers changed how they asked the AI to write the recipes.

The Analogy: Instead of listing every single brick in a castle, they taught the AI to describe the blueprint.
How it works: Crystals have hidden patterns called symmetries (like a snowflake where one arm looks like the others). The researchers used a special text format built on Wyckoff positions, which label the symmetry-distinct sites in a crystal. Instead of saying "put a carbon atom here, and another carbon atom there," the AI just says, "Put a carbon atom in this specific spot, and the symmetry rules will automatically fill in the rest of the pattern."
The Result: This is like giving the AI a magic stamp. It makes the instructions shorter, faster to read, and forces the AI to understand the rules of the crystal rather than just memorizing coordinates. This stopped the "copycat" behavior and encouraged the AI to explore new, valid designs.

3. The Training: The "Taste-Test" Loop (RLIP)

Once the AI had the right blueprint format, they needed to teach it which recipes were actually good. They used a method called Reinforcement Learning from Interatomic Potentials (RLIP).

The Analogy: Imagine the AI generates 100 new recipes. A super-fast computer "taste-test" (called a Machine Learning Interatomic Potential) checks them.
- If a recipe is unstable (it would fall apart), it gets a "thumbs down."
- If it's stable and unique, it gets a "thumbs up."
The Process: The researchers didn't just show the AI the "thumbs up" recipes. They showed it pairs: "Here is a good recipe (Winner) and here is a bad one (Loser)." The AI learns to prefer the Winner.
The Secret Sauce: To keep the AI from getting too confident and repeating the same "perfect" recipe, they turned up the "chaos dial" (sampling temperature) slightly with every round of training. This forced the AI to keep exploring slightly different variations, ensuring a diverse menu of new materials.

4. The Results: A Better Chef

The paper claims that this new system (PLaID++) is significantly better than previous methods:

More Stable: It creates materials that are less likely to fall apart (thermodynamically stable).
More Unique: It invents structures that haven't been seen before, rather than just copying old ones.
Faster: It generates these materials much faster than older, complex 3D models.
Versatile: It works well whether you ask it to invent any new material (unconditional) or ask it to invent a material with a specific shape or symmetry (conditional).

Summary

In short, the researchers took a smart AI, taught it to speak the "language of symmetry" (Wyckoff text) instead of just listing coordinates, and then trained it using a "taste-test" loop that rewards it for finding stable, unique, and novel materials. The result is an AI that acts like a creative, reliable chef, capable of inventing new materials for things like better batteries and solar cells without getting stuck in a rut.

Technical Summary: PLaID++: A Preference Aligned Language Model for Targeted Inorganic Materials Design

Problem Statement

The discovery of new solid-state materials is hindered by the immense scale of chemical space, where previous explorations have uncovered only a fraction of potential stable inorganic compounds. While generative models like Variational Autoencoders (VAEs) and Diffusion Models have been applied to generate stable structures, they often face challenges regarding computational efficiency, the explicit encoding of crystallographic symmetry, and the ability to satisfy specific constraints without mode collapse.

Furthermore, while Reinforcement Learning from Verifiable Rewards (RLVR) has improved correctness in Large Language Models (LLMs), scientific material design often requires generating a diverse array of candidates satisfying constraints (e.g., stability, novelty, specific symmetry) rather than a single "correct" answer. Naive application of preference optimization to coordinate-based crystal representations has been observed to lead to mode collapse, where models generate stable but repetitive structures, failing to explore the chemical space effectively.

Methodology

The authors introduce PLaID++, a framework that combines a novel text representation for crystals with a Reinforcement Learning from Interatomic Potentials (RLIP) approach based on Direct Preference Optimization (DPO).

1. Wyckoff-Based Text Representation
To address the limitations of coordinate-based representations, the authors propose a compact, symmetry-informed text representation using Wyckoff positions.

Mechanism: Instead of listing all atomic coordinates, the model generates text encoding the space group and the fractional coordinates of atoms within the asymmetric unit. The full crystal structure is implicitly defined by applying symmetry operations.
Benefits: This representation reduces token count (14% reduction on the MP-20 dataset), improves computational efficiency, and forces the model to generalize from physical priors. By tying atoms to Wyckoff sites, local changes propagate through symmetry operations, mitigating the mode collapse observed in coordinate-based RL training.
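
The idea of letting symmetry fill in the rest of the structure can be illustrated with a minimal Python sketch. It is not the authors' encoding: the choice of space group P2/m, the `expand_orbit` helper, and the sign-flip representation of the symmetry operations are all assumptions made for illustration. The sketch only shows how one representative site expands into its full set of symmetry-equivalent positions.

```python
# Symmetry operations of space group P2/m (No. 10, unique axis b), written as
# sign flips on the fractional coordinates (x, y, z). This group has no
# translational parts, which keeps the illustration short.
P2_M_OPS = [
    (1, 1, 1),     # identity:       ( x,  y,  z)
    (-1, 1, -1),   # two-fold axis:  (-x,  y, -z)
    (-1, -1, -1),  # inversion:      (-x, -y, -z)
    (1, -1, 1),    # mirror plane:   ( x, -y,  z)
]


def expand_orbit(site, ops=P2_M_OPS, decimals=6):
    """Apply every symmetry operation to one representative site and return
    the distinct symmetry-equivalent positions, wrapped into [0, 1)."""
    orbit = []
    for signs in ops:
        # Flip signs, wrap into the unit cell, and round away float noise.
        image = tuple(
            round((s * c) % 1.0, decimals) % 1.0 for s, c in zip(signs, site)
        )
        if image not in orbit:
            orbit.append(image)
    return orbit


general = expand_orbit((0.1, 0.2, 0.3))  # site with no special symmetry
on_axis = expand_orbit((0.0, 0.2, 0.0))  # site lying on the two-fold axis
print(len(general), len(on_axis))        # prints: 4 2
```

A site in a general position expands to four atoms, while a site on the two-fold axis expands to only two, so a single line of text per representative site is enough to pin down the whole orbit.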

2. Reinforcement Learning from Interatomic Potentials (RLIP)
The authors adapt Direct Preference Optimization (DPO) to align the LLM with physical properties.
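
For reference, the standard DPO objective from the language-model alignment literature can be written as follows. This is the generic loss, not necessarily the exact variant used in the paper. Here $x$ denotes the prompt (for example, an unconditional or space-group-conditioned generation request), $y_w$ and $y_l$ are the preferred and rejected crystal strings, $\beta$ is a scaling hyperparameter, and $\sigma$ is the logistic function.

$$
\mathcal{L}_{DPO}(\pi_{\theta}; \pi_{ref}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{ref}(y_l \mid x)} \right) \right]
$$

Minimizing this loss raises the likelihood of $y_w$ relative to $y_l$ while keeping $\pi_{\theta}$ close to the reference policy $\pi_{ref}$.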

Reward Signal: They utilize Machine Learning Interatomic Potentials (MLIPs), specifically EquiformerV2 (eqV2) and eSEN, to predict the energy above the convex hull ($E_{hull}$) of the relaxed structures.
Preference Pairs: The training dataset consists of preference pairs $(y_w, y_l)$ categorized by:
- Stability: Stable ($E_{hull} \le 0$), metastable ($0 < E_{hull} \le 0.08$ eV/atom), and unstable ($E_{hull} > 0.08$ eV/atom).
- Novelty/Uniqueness: Distinguishing between crystals that are unique relative to the generation set and novel relative to the training data.
- Space Group Conditioning: Generating structures that match specific target space groups.
Iterative Training: The model undergoes iterative DPO in which the reference policy for each round is the policy produced by the previous round, $\pi_{ref} = \pi_{\theta_{t-1}}$. To prevent entropy collapse and maintain diversity, the sampling temperature is dynamically increased across iterations.
Unified Training: The framework jointly optimizes for unconditional generation and conditional generation (specific space groups), demonstrating that training signals from one task benefit the other, particularly in data-sparse regimes.
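
The stability tiers and the rising sampling temperature described above can be sketched in Python. This is a hedged illustration under stated assumptions, not the authors' pipeline: the `Sample` class, the all-cross-tier pairing rule in `build_pairs`, and the linear schedule in `temperature_for` (including its `start` and `step` values) are hypothetical. Only the $E_{hull}$ thresholds come from the text.

```python
from dataclasses import dataclass
from itertools import product

STABLE_MAX = 0.0        # E_hull <= 0 counts as stable
METASTABLE_MAX = 0.08   # 0 < E_hull <= 0.08 counts as metastable (eV/atom)


@dataclass(frozen=True)
class Sample:
    text: str      # generated crystal string (hypothetical placeholder)
    e_hull: float  # energy above hull, assumed to come from an MLIP


def stability_tier(e_hull):
    """Return 2 for stable, 1 for metastable, 0 for unstable."""
    if e_hull <= STABLE_MAX:
        return 2
    if e_hull <= METASTABLE_MAX:
        return 1
    return 0


def build_pairs(samples):
    """Pair each sample (winner) with every sample from a strictly lower
    stability tier (loser). The pairing rule is an assumption."""
    pairs = []
    for winner, loser in product(samples, repeat=2):
        if stability_tier(winner.e_hull) > stability_tier(loser.e_hull):
            pairs.append((winner.text, loser.text))
    return pairs


def temperature_for(iteration, start=1.0, step=0.1):
    """Linearly increasing sampling temperature; values are placeholders."""
    return start + step * iteration


samples = [Sample("A", -0.01), Sample("B", 0.05), Sample("C", 0.20)]
for round_index in range(3):
    pairs = build_pairs(samples)
    print(round_index, temperature_for(round_index), pairs)
```

In a real loop the samples would be regenerated each round from the updated model at the new temperature. Novelty, uniqueness, and space-group match could be folded into the tier in the same way.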

Key Contributions

RLIP Framework: Introduction of a diversity-aware reinforcement learning framework for fine-tuning LLMs using interatomic potentials as reward signals.
Symmetry-Informed Representation: Development of a novel Wyckoff-based text encoding that is compact, performant, and physically motivated, effectively preventing mode collapse during preference optimization.
Unified Training Efficacy: Demonstration that unified training across conditional and unconditional tasks is mutually beneficial in data-sparse regimes, achieving state-of-the-art results in both settings.

Results

Experiments were conducted on the MP-20 dataset (45,231 inorganic metastable crystalline materials) using a Qwen-2.5 7B base model.

Unconditional Generation: PLaID++ achieved a 22.27% stability rate and a 7.74% S.U.N. (Stable, Unique, Novel) rate. This represents a $\sim$50% relative improvement in S.U.N. rate over the best prior methods (e.g., Jointly-trained ADiT at 5.3% S.U.N.).
Conditional Generation: For space-group conditioned tasks, PLaID++ improved the S.S.U.N. (Symmetry, Stable, Unique, Novel) rate by an average of 47% over the base Wyckoff model. Notably, joint training (unconditional + conditional) outperformed models trained on conditional data alone, especially for space groups with low sample counts (<400).
Multi-Objective Generation: When extended to include bulk modulus (>325 GPa) as a third objective, joint preference optimization generated $\sim$40% more S.U.N. crystals satisfying the target compared to optimizing for bulk modulus alone.
Validation: Stability and S.U.N. rates were validated using Density Functional Theory (DFT) on a subset of 1,000 structures, yielding a 19.1% stability rate and 13% S.U.N. rate, consistent with MLIP predictions.
Efficiency: PLaID++ generates 10,000 crystals in approximately 23 minutes on a single NVIDIA H100 GPU, yielding 27.17 S.U.N. crystals per minute, which is 5x faster than FlowLLM.
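
The S.U.N. rate and the relative improvement quoted for unconditional generation follow from simple arithmetic, sketched here in Python. The function names and the boolean-flag input format are assumptions. In practice the stable, unique, and novel flags would come from energy calculations and structure matching, which are not shown.

```python
def sun_rate(flags):
    """Fraction of generated crystals that are stable, unique, and novel.

    flags: list of (stable, unique, novel) booleans, one tuple per crystal.
    """
    if not flags:
        return 0.0
    hits = sum(1 for stable, unique, novel in flags if stable and unique and novel)
    return hits / len(flags)


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# Toy batch: only the first crystal satisfies all three criteria.
toy = [(True, True, True), (True, False, True), (False, True, True), (True, True, False)]
print(sun_rate(toy))  # prints: 0.25

# S.U.N. rates quoted in the text: 7.74% for PLaID++ and 5.3% for ADiT.
print(f"{relative_improvement(7.74, 5.3):.0%}")  # prints: 46%
```

With the two quoted rates, the relative gain comes to about 46%, which the summary rounds to roughly 50%.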

Significance

The paper claims that PLaID++ demonstrates the potential of adapting post-training techniques from natural language processing to materials design. By incorporating inherent crystal symmetries and feedback from MLIPs, the method significantly increases the rate of generating thermodynamically stable, unique, and novel materials. The work suggests that reinforcement learning can effectively guide generative models toward chemically useful structures without requiring massive amounts of labeled data, paving the way for targeted and efficient discovery of novel materials for applications such as solar cells, batteries, and carbon capture. The authors note that while current random search methods have less than a 1% success rate for identifying stable materials, PLaID++ represents a significant acceleration toward real-world utility.
