🔬 materials science

SCALAR: Quantifying Structural Hallucination, Consistency, and Reasoning Gaps in Materials Foundation Models

This paper introduces SCALAR, a benchmark that evaluates how materials foundation models handle geometric scale generalization and structural reasoning across diverse nanoparticle structures. It finds that explicit physics-grounded reasoning can reduce hallucinations and errors but often compromises output consistency and validity.

Original authors: Can Polat, Erchin Serpedin, Mustafa Kurban, Hasan Kurban

Published 2026-02-02

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you have a master architect who is incredibly good at reading blueprints for perfect, infinite skyscrapers. This architect (a type of AI called a "foundation model") can tell you everything about the building's materials, strength, and design just by looking at the blueprint.

But here's the catch: The architect has never been asked to design a tiny model of that skyscraper made of LEGOs, nor have they been asked to figure out what the original skyscraper looks like just by holding a single LEGO brick in their hand.

The paper introduces a new test called SCALAR to see if these AI architects can handle the jump from "infinite skyscraper" to "tiny LEGO model" without losing their minds.

The Core Problem: The "Hallucination" Trap

In the world of AI, a "hallucination" isn't just making something up; it's confidently stating something that sounds right but breaks the laws of physics.

Think of it like this: If you ask a human to imagine a perfect sphere made of water, they know it's round. If you ask them to imagine a cube of water, they might hesitate because water doesn't naturally form cubes. But if you ask an AI to imagine a "cubic water crystal" and it confidently says, "Yes, the corners are sharp and the density is high," it has hallucinated. It ignored the fact that water molecules don't work that way.

The paper argues that current AI models are great at describing the "infinite" version of a material (the bulk crystal) but often fail miserably when asked to describe the "finite" version (a tiny nanoparticle). They might get the numbers right but violate the underlying rules of how atoms stick together.
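
To make "violating the rules of how atoms stick together" concrete, here is a minimal sketch of one possible plausibility check: flagging a structure in which two atoms sit impossibly close together. The function names and the 0.7 angstrom threshold are illustrative assumptions, not taken from the paper or the SCALAR benchmark.

```python
import itertools
import math

def min_pair_distance(coords):
    """Smallest distance between any two atoms (coordinates in angstroms)."""
    return min(math.dist(a, b) for a, b in itertools.combinations(coords, 2))

def looks_physical(coords, min_allowed=0.7):
    """Hypothetical sanity check: reject structures with overlapping atoms.

    The 0.7 angstrom cutoff is an assumed placeholder, not a value from the paper.
    """
    return min_pair_distance(coords) >= min_allowed

# Two toy structures: one with reasonable spacing, one with atoms nearly on top of each other.
plausible = [(0.0, 0.0, 0.0), (2.5, 0.0, 0.0), (0.0, 2.5, 0.0)]
overlapping = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 2.5, 0.0)]

print(looks_physical(plausible))
print(looks_physical(overlapping))
```

A model could report a sensible-looking density for the second structure and still fail a check like this, which is the kind of gap the paper is concerned with.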

How the Test Works (The Three Challenges)

The researchers built a massive dataset of 100,000 structures, ranging from a few atoms to over 18,000 atoms. They then put the AI through three specific tests:

  1. The "Zoom-Out" Test (CIF to Property):

    • The Setup: You give the AI the blueprint of a perfect crystal (the "Unit Cell").
    • The Task: The AI must predict the properties of a tiny, chopped-off piece of that crystal (a "Nanoparticle").
    • The Twist: The AI has to figure out how the properties change as the piece gets bigger or smaller.
    • The Result: Many AIs got the basic math right but failed to understand the trend. They couldn't consistently say, "As the piece gets bigger, the density should stay the same," or "As it gets smaller, the surface area changes."
  2. The "Think Aloud" Test (Chain-of-Thought):

    • The Setup: The researchers told the AI: "Don't just give me the answer; explain your reasoning step-by-step using physics."
    • The Result: This was a double-edged sword. Sometimes, forcing the AI to "think" made it more accurate. But often, it made the AI less consistent. It would give a great explanation in one try, and a completely different (and wrong) explanation in the next try, even for the exact same question. It's like a student who sometimes writes out an excellent solution but tells a different story every time you ask them to show their work.
  3. The "Reverse Detective" Test (Inverse Retrieval):

    • The Setup: You give the AI a set of properties (e.g., "This material is heavy, has a specific volume, and is very dense").
    • The Task: The AI must pick the correct blueprint from a lineup of candidates.
    • The Result: Some AIs were surprisingly good at this, acting like detectives. However, others picked the wrong blueprint even when their description of the material was physically plausible. They found a "near miss" that sounded right but was actually the wrong material.
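
The size trend behind the "Zoom-Out" test can be illustrated with a toy model. The sketch below carves spherical "nanoparticles" out of an idealized simple cubic lattice and measures what share of atoms sit on the surface. Everything here (the lattice type, the function names, the neighbor-counting rule) is an assumption chosen for illustration; it is not how the SCALAR dataset was built.

```python
import itertools

def carve_sphere(radius_cells):
    """Sites of a simple cubic lattice (spacing 1) that fall inside a sphere."""
    r = range(-radius_cells, radius_cells + 1)
    return {
        (x, y, z)
        for x, y, z in itertools.product(r, repeat=3)
        if x * x + y * y + z * z <= radius_cells ** 2
    }

def surface_fraction(atoms):
    """Share of atoms missing at least one of their 6 nearest neighbors."""
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    surface = sum(
        1
        for (x, y, z) in atoms
        if any((x + dx, y + dy, z + dz) not in atoms for dx, dy, dz in steps)
    )
    return surface / len(atoms)

# Larger particles have a smaller share of surface atoms, so they behave more like the bulk.
for radius in (1, 2, 4, 8):
    atoms = carve_sphere(radius)
    print(radius, len(atoms), round(surface_fraction(atoms), 3))
```

In this toy model the surface share shrinks as the particle grows, which is the kind of trend a model would need to reproduce consistently, not just for one particle size.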

The Big Discovery: Accuracy is a Lie

The most important finding of the paper is that you cannot trust an AI just because it gets the right number.

Imagine a student taking a test.

  • Student A gets 90% of the answers right but changes their answer every time you ask the same question.
  • Student B gets 85% right, but their answers are always consistent and follow a logical pattern.

Current benchmarks usually just look at the score (90% vs 85%). This paper says, "Wait a minute! Student A is unreliable because they can't keep their story straight."

The researchers found that when they tested the AI on "Out-of-Distribution" data (sizes the AI hadn't seen before), the AI's ability to stay consistent and follow physical laws collapsed, even if its raw accuracy numbers looked okay.
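
The Student A versus Student B comparison can be sketched in a few lines. The two functions below are simple stand-ins for "accuracy" and "consistency" over repeated runs of the same questions; they are assumed definitions for illustration, not the metrics defined in the paper, and the answer data is made up.

```python
from collections import Counter

def accuracy(runs, truth):
    """Fraction of all individual answers that match the reference answer."""
    total = sum(len(answers) for answers in runs.values())
    correct = sum(a == truth[q] for q, answers in runs.items() for a in answers)
    return correct / total

def consistency(runs):
    """Average share of repeated answers that agree with the most common one."""
    shares = [
        Counter(answers).most_common(1)[0][1] / len(answers)
        for answers in runs.values()
    ]
    return sum(shares) / len(shares)

# Made-up example: each question is asked three times.
truth = {"q1": "A", "q2": "B"}
student_a = {"q1": ["A", "A", "C"], "q2": ["B", "D", "B"]}  # often right, but wavers
student_b = {"q1": ["A", "A", "A"], "q2": ["C", "C", "C"]}  # steady, sometimes wrong

for name, runs in (("A", student_a), ("B", student_b)):
    print(name, round(accuracy(runs, truth), 2), round(consistency(runs), 2))
```

Reporting both numbers side by side shows what a single accuracy score hides: the student with the higher score here is also the one who changes answers between attempts.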

The Takeaway

The paper concludes that we need a new way to measure AI in science. We can't just ask, "Is the answer right?" We have to ask:

  • "Is the answer consistent?"
  • "Does it follow the laws of physics?"
  • "Does it hallucinate when the size of the object changes?"

The SCALAR benchmark is a tool designed to catch these "smart but crazy" moments before we trust these AI models to design real-world materials for things like batteries or medicine. It's a reality check to ensure that when an AI talks about atoms, it's actually talking about atoms, not just making up a story that sounds scientific.