SCALAR: Quantifying Structural Hallucination, Consistency, and Reasoning Gaps in Materials Foundation Models

This paper introduces SCALAR, a benchmark that evaluates how materials foundation models handle geometric scale generalization and structural reasoning across diverse nanoparticle structures. It finds that explicit physics-grounded reasoning can reduce hallucinations and errors, but often at the cost of output consistency and validity.

Original authors: Can Polat, Erchin Serpedin, Mustafa Kurban, Hasan Kurban

Published 2026-02-02

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you have a master architect who is incredibly good at reading blueprints for perfect, infinite skyscrapers. This architect (a type of AI called a "foundation model") can tell you everything about the building's materials, strength, and design just by looking at the blueprint.

But here's the catch: The architect has never been asked to design a tiny model of that skyscraper made of LEGOs, nor have they been asked to figure out what the original skyscraper looks like just by holding a single LEGO brick in their hand.

The paper introduces a new test called SCALAR to see if these AI architects can handle the jump from "infinite skyscraper" to "tiny LEGO model" without losing their minds.

The Core Problem: The "Hallucination" Trap

In the world of AI, a "hallucination" isn't just making something up; it's confidently stating something that sounds right but breaks the laws of physics.

Think of it like this: If you ask a human to imagine a perfect sphere made of water, they know it's round. If you ask them to imagine a cube of water, they might hesitate because water doesn't naturally form cubes. But if you ask an AI to imagine a "cubic water crystal" and it confidently says, "Yes, the corners are sharp and the density is high," it has hallucinated. It ignored the fact that water molecules don't work that way.

The paper argues that current AI models are great at describing the "infinite" version of a material (the bulk crystal) but often fail miserably when asked to describe the "finite" version (a tiny nanoparticle). They might get the numbers right but violate the underlying rules of how atoms stick together.

How the Test Works (The Three Challenges)

The researchers built a dataset of roughly 100,000 structures, ranging from a few atoms to over 18,000 atoms. They then put the AI through three specific tests:

The "Zoom-Out" Test (CIF to Property):
- The Setup: You give the AI the blueprint of a perfect crystal (the "Unit Cell").
- The Task: The AI must predict the properties of a tiny, chopped-off piece of that crystal (a "Nanoparticle").
- The Twist: The AI has to figure out how the properties change as the piece gets bigger or smaller.
- The Result: Many AIs got the basic math right but failed to understand the trend. They couldn't consistently say, "As the piece gets bigger, the density should stay the same," or "As it gets smaller, the surface area changes."

The "Think Aloud" Test (Chain-of-Thought):
- The Setup: The researchers told the AI: "Don't just give me the answer; explain your reasoning step-by-step using physics."
- The Result: This was a double-edged sword. Sometimes, forcing the AI to "think" made it more accurate. But often, it made the AI less consistent. It would give a great explanation in one try, and a completely different (and wrong) explanation in the next try, even for the exact same question. It's like a student who can solve a math problem perfectly if they write it down, but gets confused if they have to explain why they did it.

The "Reverse Detective" Test (Inverse Retrieval):
- The Setup: You give the AI a set of properties (e.g., "This material is heavy, has a specific volume, and is very dense").
- The Task: The AI must pick the correct blueprint from a lineup of candidates.
- The Result: Some AIs were surprisingly good at this, acting like detectives. However, others picked the wrong blueprint even when their description of the material was physically plausible. They found a "near miss" that sounded right but was actually the wrong material.

The Big Discovery: Accuracy is a Lie

The most important finding of the paper is that you cannot trust an AI just because it gets the right number.

Imagine a student taking a test.

Student A gets 90% of the answers right but changes their answer every time you ask the same question.
Student B gets 85% right, but their answers are always consistent and follow a logical pattern.

Current benchmarks usually just look at the score (90% vs 85%). This paper says, "Wait a minute! Student A is unreliable because they can't keep their story straight."

The researchers found that when they tested the AI on "Out-of-Distribution" data (sizes the AI hadn't seen before), the AI's ability to stay consistent and follow physical laws collapsed, even if its raw accuracy numbers looked okay.

The Takeaway

The paper concludes that we need a new way to measure AI in science. We can't just ask, "Is the answer right?" We have to ask:

"Is the answer consistent?"
"Does it follow the laws of physics?"
"Does it hallucinate when the size of the object changes?"

The SCALAR benchmark is a tool designed to catch these "smart but crazy" moments before we trust these AI models to design real-world materials for things like batteries or medicine. It's a reality check to ensure that when an AI talks about atoms, it's actually talking about atoms, not just making up a story that sounds scientific.

Technical Summary: SCALAR Benchmark for Materials Foundation Models

Problem Statement

Large language models (LLMs) and foundation models are increasingly applied to materials science reasoning. However, their behavior under physically structured distribution shifts—specifically changes in structural scale—remains poorly understood. While models may accurately predict properties for perfect bulk crystals (represented by unit cells), they often fail when tasked with reasoning about derived finite structures (nanoparticles) that break translational invariance.

Current evaluations typically focus on task accuracy or formatting correctness and rarely assess cross-scale consistency. This oversight allows models to produce outputs that appear locally plausible but violate global physical invariants (e.g., crystallographic symmetry, conservation constraints, and scale-dependent geometric relations). Such violations constitute a principled form of structural hallucination: confident predictions that are physically incorrect because they break underlying invariants under distribution shift. Diagnosing these failures requires datasets that pair representations of the same object across multiple scales with controlled out-of-distribution (OOD) splits, and such datasets are lacking.

Methodology: The SCALAR Framework

The authors introduce SCALAR (Structural Consistency And Logic Across Regimes), a benchmark designed to evaluate geometric scale generalization and its connection to structural hallucination, consistency, and reasoning.

1. Dataset Construction

The dataset is derived from DFT-relaxed unit cells of chemically diverse crystalline materials (41 unique elements, including hydrogen-rich systems for energy storage).

Phase I (Nanoparticle Construction): Starting from a primitive unit cell, a $20 \times 20 \times 20$ supercell is generated. Finite nanoparticles are created by "spherical carving," retaining atoms within a sphere of radius $R$ centered at an origin. Radii range from $10$ to $30$ Å, creating structures spanning from a few atoms to over 18,000 atoms (totaling $\approx 100,000$ structures).
Phase II (Rotation Sampling): To mitigate orientation bias, structures are augmented with rigid rotations sampled on $SO(3)$ using unit quaternions. A greedy sampler ensures minimum geodesic spacing between rotations.
Phase III (Split-Aware Partitioning): The dataset is split into Training, In-Distribution (ID) Test, and Out-of-Distribution (OOD) Test sets.
- ID/OOD Separation: Rotations in ID and OOD sets are strictly separated from the training set by exclusion margins ($\epsilon_{\mathrm{ID}} = 8^\circ$, $\epsilon_{\mathrm{OOD}} = 8^\circ$) and specific spacing parameters.
- Radius Partitioning: Training includes radii (in Å) $\{12, 14, 16, 18, 21, 23, 25, 26, 28\}$. The ID test uses $\{13, 15, 17, 20, 24, 27\}$, while the OOD test uses the extreme radii $\{10, 11, 29, 30\}$ to probe scale extrapolation.
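The three phases above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline. It assumes a simple-cubic lattice with one atom per cell (real unit cells have a basis and general lattice vectors), and the function names, the rejection-style greedy sampler, and the spacing value used in the test are hypothetical choices. Only the radius sets are taken from the text.

```python
import math
import random

# Radius sets (in Å) as listed in the split description above.
RADIUS_SPLITS = {
    "train": {12, 14, 16, 18, 21, 23, 25, 26, 28},
    "id_test": {13, 15, 17, 20, 24, 27},
    "ood_test": {10, 11, 29, 30},
}


def split_for_radius(radius):
    """Return the split name for a radius, or None if it is not listed."""
    for name, radii in RADIUS_SPLITS.items():
        if radius in radii:
            return name
    return None


def carve_sphere(a, n_cells, radius):
    """Phase I: keep sites of an n_cells^3 supercell within `radius` of the origin.

    Assumption: simple-cubic lattice, one atom per cell, lattice constant `a` in Å.
    """
    half = n_cells // 2
    r2 = radius * radius
    kept = []
    for i in range(-half, half):
        for j in range(-half, half):
            for k in range(-half, half):
                x, y, z = i * a, j * a, k * a
                if x * x + y * y + z * z <= r2:
                    kept.append((x, y, z))
    return kept


def random_unit_quaternion(rng):
    """Uniform random rotation: normalize four independent Gaussian draws."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)


def geodesic_angle_deg(q1, q2):
    """Rotation angle between two unit quaternions (q and -q are the same rotation)."""
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    return math.degrees(2.0 * math.acos(min(1.0, dot)))


def greedy_rotations(n_target, min_sep_deg, seed=0, max_tries=10000):
    """Phase II: accept a random rotation only if it is far enough from all accepted ones."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == n_target:
            break
        q = random_unit_quaternion(rng)
        if all(geodesic_angle_deg(q, p) >= min_sep_deg for p in accepted):
            accepted.append(q)
    return accepted
```

The same `geodesic_angle_deg` check could, under the same assumptions, enforce an exclusion margin between test and training rotations by rejecting any test rotation closer than the margin to a training rotation.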

2. Evaluation Tasks

SCALAR defines three tasks to probe different aspects of model capability:

CIF to Property Prediction: Models predict nanoparticle properties (density, volume, nearest-neighbor distance) given a Crystallographic Information File (CIF).
Chain-of-Thought (CoT) Reasoning: A variant of Task 1 requiring explicit, physics-grounded reasoning steps before the final prediction.
Inverse Retrieval: Given target properties, models must identify the correct crystal structure from a set of candidates.
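To make the inverse-retrieval task concrete, the sketch below scores a single model pick against a candidate lineup. It is an illustration under stated assumptions: the property names, the choice to normalize each difference by the target's magnitude, and the definition of regret as the gap to the closest candidate are all hypothetical, since the summary does not spell out the exact formulas.

```python
import math


def normalized_l2(target, proposed):
    """L2 distance after scaling each property difference by the target's magnitude.

    Assumption: every target value is nonzero.
    """
    return math.sqrt(
        sum(((proposed[k] - target[k]) / abs(target[k])) ** 2 for k in target)
    )


def score_pick(target, candidates, picked, true_name):
    """Score one retrieval answer.

    target:     dict of property name -> target value
    candidates: dict of candidate name -> dict of property values
    picked:     candidate name chosen by the model
    true_name:  name of the ground-truth candidate
    """
    dists = {name: normalized_l2(target, props) for name, props in candidates.items()}
    best = min(dists.values())
    return {
        "correct": picked == true_name,
        "distance": dists[picked],
        # Regret: how much farther the pick is than the closest candidate.
        "regret": dists[picked] - best,
    }


# Hypothetical lineup with made-up property values, for illustration only.
target = {"density": 5.0, "volume": 1000.0, "nn_distance": 2.5}
candidates = {
    "A": {"density": 5.0, "volume": 1000.0, "nn_distance": 2.5},
    "B": {"density": 5.5, "volume": 1000.0, "nn_distance": 2.5},
}
result = score_pick(target, candidates, picked="B", true_name="A")
```

A pick can be wrong while still having a small distance, which is the "near miss" case: the chosen structure is not the right one, but its properties are close to the target.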

3. Metrics

Outputs are evaluated via structured metrics capturing:

Hallucination Rate: Frequency of predictions violating physical constraints (e.g., negative densities) or self-consistency failures.
Consistency: Standard deviation of numeric predictions across $N=5$ independent queries.
Reasoning Quality: Spearman rank correlation between predicted property changes across radii and ground-truth deltas.
Accuracy: Mean Absolute Error (MAE) for numeric predictions.
Physical Distance & Regret: For inverse retrieval, normalized $L_2$ distance between target and proposed property vectors, and the suboptimality of the selected candidate.
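The numeric metrics listed above could be computed roughly as follows. This is a sketch, not the benchmark's reference code: the function names are hypothetical, the use of the sample (rather than population) standard deviation is an assumption, and the hallucination check covers only the negative-density example plus non-finite values.

```python
import math
import statistics


def consistency(repeated_predictions):
    """Spread of repeated answers to the same query (sample standard deviation)."""
    return statistics.stdev(repeated_predictions)


def mae(predicted, truth):
    """Mean absolute error between paired predictions and ground truth."""
    return statistics.fmean(abs(p - t) for p, t in zip(predicted, truth))


def hallucination_rate(densities):
    """Fraction of predicted densities that are negative or not finite."""
    bad = sum(1 for d in densities if not math.isfinite(d) or d < 0)
    return bad / len(densities)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def deltas(values_by_radius):
    """Change in a property between consecutive radii."""
    return [b - a for a, b in zip(values_by_radius, values_by_radius[1:])]


def reasoning_quality(predicted_by_radius, truth_by_radius):
    """Rank agreement between predicted and true property changes across radii."""
    return spearman(deltas(predicted_by_radius), deltas(truth_by_radius))
```

Because Spearman correlation depends only on ranks, a model can score well on `reasoning_quality` while having a large `mae`, and the reverse, which is why the two are reported separately.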

Key Results

Experiments across diverse foundation models (including GPT-5 Mini, o3-mini, Grok, Claude, and LLaMA variants) reveal significant findings:

Scale-Dependent Breakdowns: Geometric scale shifts expose systematic failures in physical reasoning and cross-scale consistency that are not apparent from aggregate accuracy alone. Hallucination rates and inconsistency increase sharply under OOD scales, even when numeric error degrades only modestly.
Model-Dependent Variability: Performance is highly model-dependent. For instance, in inverse retrieval, Grok 4.1 Fast achieved high top-1 accuracy ($0.808$ ID, $0.793$ OOD), while others like Claude 3 Haiku showed significant physical distance errors.
Material-Specific Sensitivity: Errors are not uniform; they are strongly structure-dependent. Some materials (e.g., $\mathrm{LiCaH_3}$) show massive relative error increases ($>30\%$) or sign reversals in OOD regimes, while others remain stable.
Chain-of-Thought (CoT) Trade-offs: CoT prompting yields heterogeneous results. While it often reduces numeric errors and hallucination rates, it frequently destabilizes consistency or degrades reasoning quality for specific models. Gains in intermediate explanations do not reliably translate to more stable or physically consistent predictions.
Inverse Retrieval Limitations: High accuracy in selecting the correct candidate does not guarantee physical fidelity. Some models maintain low physical distance (near-miss errors) despite moderate retrieval accuracy, while others fail to align with physical properties entirely.
Baseline Comparisons: Both text-based LLMs and geometry-native graph neural networks (e.g., SchNet, E(3)NN) exhibit substantial scale-dependent degradation, with GNNs showing MAPE increases from $\approx 100\%$ to $>300\%$ in OOD regimes. Analytical baselines based on volume scaling laws perform well on intensive properties but fail on extensive ones.
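As a rough illustration of what a volume-scaling baseline and the MAPE metric look like, consider the sketch below. It is not the paper's baseline. It assumes the simplest continuum estimate, in which the atom count of a spherical particle is the sphere volume times the bulk number density, so the count grows as $R^3$ while an intensive quantity such as density is taken to equal its bulk value. All function names are hypothetical.

```python
import math


def bulk_number_density(atoms_per_cell, cell_volume):
    """Atoms per cubic Å in the infinite crystal."""
    return atoms_per_cell / cell_volume


def estimated_atom_count(radius, atoms_per_cell, cell_volume):
    """Continuum estimate: sphere volume times bulk number density (scales as R^3)."""
    sphere_volume = (4.0 / 3.0) * math.pi * radius ** 3
    return sphere_volume * bulk_number_density(atoms_per_cell, cell_volume)


def mape(predicted, truth):
    """Mean absolute percentage error, in percent. Assumes nonzero truth values."""
    errors = [abs(p - t) / abs(t) for p, t in zip(predicted, truth)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical cubic cell: one atom, lattice constant 4 Å, so cell volume 64 Å^3.
n_small = estimated_atom_count(10.0, atoms_per_cell=1, cell_volume=64.0)
n_large = estimated_atom_count(20.0, atoms_per_cell=1, cell_volume=64.0)
```

Doubling the radius multiplies the estimated atom count by eight. A continuum estimate like this ignores the discreteness of the lattice near the surface, which is one general reason such estimates can deviate from exact counts for small particles.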

Significance and Claims

The paper claims that geometric scale generalization cannot be inferred from accuracy alone.

Principled Hallucination: The authors argue that errors violating global physical invariants under structural scale shifts constitute a specific, principled form of hallucination that requires targeted diagnosis.
Diagnostic Value: SCALAR provides a "principled lens" for diagnosing geometric generalization failures that are invisible to model-averaged accuracy metrics. It highlights that surface-level error metrics systematically understate failures in structural and physical reasoning.
Limitations of Current Prompts: The study demonstrates that CoT prompting is not a monotonic or universally beneficial intervention; it introduces trade-offs where improved reasoning may come at the cost of consistency.
Future Direction: By framing hallucination as a consequence of geometric and chemical inconsistency, SCALAR offers a setting for diagnosing and mitigating failures under realistic structural distribution shifts, paving the way for more reliable deployment of foundation models in materials science.

The authors explicitly note limitations, including the dataset's focus on deterministic supercell expansion (ignoring disorder/defects), the use of classical geometric computations rather than quantum observables, and the potential for CoT formatting instability to complicate attribution.