LLM-Guided Open Hypothesis Learning from Autonomous… — Plain-Language Explanation

Original authors: Boris Slautin, Utkarsh Pratiush, Yu Liu, Kamyar Barakati, Sergei Kalinin

Published 2026-05-11

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine a scientist working in a lab with a super-powerful microscope. In the past, this scientist would have to decide exactly what to measure, run the test, look at the results, and then decide what to do next. This is slow and relies heavily on the scientist's own intuition.

In recent years, scientists have built "self-driving" labs. These are like autonomous cars for science: the computer controls the microscope, runs experiments, and tweaks the settings to find the best results as fast as possible. However, there's a catch: these self-driving labs are usually very good at optimizing (finding the best setting) but terrible at discovering new laws. They can tell you "this voltage makes the biggest dot," but they can't tell you why or write down a new rule of physics that explains it. They are stuck inside a box of ideas the human programmer gave them.

This paper introduces a new system that breaks out of that box. It teaches the computer not just to find the best answer, but to invent new theories based on what it sees.

Here is how the system works, using a simple analogy:

The Two-Brain System

Think of this new system as a team of two very different robots working together on a puzzle.

1. The "Pattern Finder" (Symbolic Regression)
Imagine a robot that is incredibly good at math but has no common sense. You give it a few scattered data points (like a few dots on a graph), and it starts calling out dozens of different math formulas that could connect those dots.

What it does: It generates wild guesses like "The size of the dot equals the voltage times the square root of time" or "The size equals the voltage plus a random number."
The Problem: Because it has no common sense, it might suggest formulas that are mathematically perfect but physically impossible (like saying a dot gets smaller when you turn up the power). It's like a student who memorized a math textbook but doesn't understand how the real world works.

2. The "Physics Professor" (The Large Language Model)
Now, imagine a second robot that is a super-smart physics professor. This robot has read every physics textbook ever written. It doesn't do the math itself; instead, it acts as a judge.

What it does: It looks at the dozens of wild formulas generated by the "Pattern Finder" and says, "Wait a minute. That formula says the dot grows backward in time? That's impossible. Throw it out."
The Magic: It ranks the formulas based on whether they make sense in the real world. It picks the ones that follow the rules of physics (like "dots should get bigger with more voltage") and explains why they are good.

The Experiment: Growing Tiny Electric Bubbles

To test this, the researchers used a special microscope to poke a thin film of a material called PZT (a ferroelectric ceramic whose built-in electric polarization can be flipped by a voltage). When they zap it with electricity, a tiny "bubble" of flipped polarization grows.

The Goal: They wanted to find the rule that explains how big that bubble gets based on how long they zap it and how hard they zap it.
The Process:
1. Start: They started with just five measurements at randomly chosen zap settings.
2. The Loop:
  - The "Pattern Finder" looked at the five results and wrote down 50 possible math rules.
  - The "Physics Professor" read them, gave them scores, and picked the best one.
  - The computer then used that best rule to decide where to zap next to learn more.
  - They did this 10 times, adding five new measurements each round, for 55 in total.

The Result: From Guessing to Understanding

At the very beginning, the "Pattern Finder" was confused. It suggested silly rules, like "The bubble size depends only on time, not voltage." The "Physics Professor" gave these low scores and said, "No, that doesn't make sense."

As the experiment continued and the computer gathered more data, the "Pattern Finder" started suggesting smarter rules. Finally, the "Physics Professor" picked a winner: a rule that said the bubble grows based on both the voltage and the time, specifically following a pattern where growth slows down over time (like a "creep" motion).

Why is this a big deal?
In previous experiments, scientists had to tell the computer, "Here are three possible rules; pick the best one." The computer just chose from the list.
In this new experiment, the computer created the rule itself from the data, and the "Physics Professor" confirmed it was physically sensible. The system didn't just find the best setting; without being handed a list of options, it arrived at a growth law that matches what earlier studies had found.

The Bottom Line

This paper shows a way to turn autonomous science from a "search engine" (which just finds the best answer in a list) into a "scientist" (which can propose and check candidate laws of physics). By combining a math-bot that generates ideas with an AI-bot that checks if those ideas make sense, the system can learn complex physical rules on its own, starting from almost nothing.

Technical Summary: LLM-Guided Open Hypothesis Learning from Autonomous Scanning Probe Microscopy Experiments

Problem Statement
Autonomous experimentation has established itself as a paradigm for closed-loop optimization in microscopy and materials science, primarily utilizing Bayesian optimization (BO) to tune instrument parameters or discover structure-property relationships within fixed search spaces. However, current workflows are fundamentally limited by their reliance on predefined objective functions and hypothesis spaces. While these systems efficiently converge to optima within human-specified models, they lack the mechanism to generate new physical laws or interpret experimental data through open hypothesis generation. Existing "hypothesis learning" approaches typically select among a predefined set of human-designed models rather than deriving relationships directly from data. This creates a gap in hierarchical autonomous systems where interpretable knowledge must be generated, evaluated, and propagated across decision layers without being constrained by prior human model classes.

Methodology
The authors propose an "open hypothesis learning" framework that integrates symbolic regression with Large Language Model (LLM)-based physical evaluation to transition from closed-loop optimization to open hypothesis discovery. The workflow consists of three core components:

Symbolic Regression for Hypothesis Generation:
Instead of fitting parameters within a fixed model structure, the system employs symbolic regression (using the PySR library) to search the space of mathematical expressions directly. Given sparse experimental data, this module generates a Pareto front of candidate analytical expressions that balance predictive accuracy (loss) against model complexity. This process is purely data-driven, producing explicit candidate equations without incorporating physical constraints during the generation phase.
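The idea of a Pareto front over accuracy and complexity can be sketched without running a symbolic regression library at all. The snippet below is a minimal illustration, not PySR's implementation: the `Candidate` type, the expressions, and all complexity and loss values are made up for the example.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    expression: str   # human-readable formula
    complexity: int   # e.g. number of nodes in the expression tree
    loss: float       # e.g. mean squared error on the data


def pareto_front(candidates):
    """Keep candidates that no other candidate beats on both complexity and loss."""
    front = []
    best_loss = float("inf")
    # Walk from simplest to most complex; keep an expression only if it
    # achieves a strictly lower loss than every simpler expression.
    for cand in sorted(candidates, key=lambda c: (c.complexity, c.loss)):
        if cand.loss < best_loss:
            front.append(cand)
            best_loss = cand.loss
    return front


# Hypothetical candidates with invented scores, for illustration only.
candidates = [
    Candidate("c0", 1, 0.90),
    Candidate("a*V", 3, 0.20),
    Candidate("a*t", 3, 0.70),
    Candidate("a*V + b", 5, 0.25),
    Candidate("V*(a*log(t) + b)", 8, 0.05),
]
front = pareto_front(candidates)
for cand in front:
    print(cand)
```

In this toy set, `a*t` and `a*V + b` drop out because a simpler or equally simple expression already fits better. The surviving expressions are the ones that would be handed to the physical evaluation step.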
LLM-Based Physical Evaluation:
To address the limitation that statistically optimal expressions may lack physical validity, the framework introduces a structured LLM evaluator. The LLM is provided with the candidate equations, variable definitions, and a context-rich summary of established physical principles (specifically regarding ferroelectric domain growth, generated via a deep-research pipeline). The LLM acts as a validator rather than a generator, scoring candidates based on:
- Non-triviality: Dependence on relevant variables (e.g., both voltage and time).
- Monotonicity: Consistency with expected trends.
- Scaling Behavior: Alignment with known regimes (e.g., thermodynamic equilibrium, kinetic growth, creep).
- Pathology: Absence of singularities or unphysical divergences.
  The output includes a quantitative plausibility score and a natural language reasoning trace explaining the assessment.
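Several of the criteria above can also be probed numerically, which may help make them concrete. The sketch below is a rule-based stand-in, not the LLM evaluator described in the paper: the function name, the grid, the tolerance, and the toy score (fraction of checks passed) are all assumptions, and the voltage-only coefficient is invented. The coupled expression uses the coefficients quoted later in this summary, with the logarithm assumed to be natural.

```python
import numpy as np

# Grid spanning the ranges quoted in the text: 1-10 V and 0.01-10 s.
V = np.linspace(1.0, 10.0, 25)
T = np.logspace(-2, 1, 25)
VV, TT = np.meshgrid(V, T, indexing="ij")  # axis 0 = voltage, axis 1 = time


def check_candidate(model, tol=1e-12):
    """Return pass/fail flags for one candidate r(V, t) evaluated on the grid."""
    with np.errstate(all="ignore"):
        r = np.broadcast_to(np.asarray(model(VV, TT), dtype=float), VV.shape)
        flags = {
            # Pathology: no NaN or infinite values anywhere in the range.
            "finite": bool(np.all(np.isfinite(r))),
            # Non-triviality: the output must change along each input axis.
            "depends_on_V": bool(np.any(r.max(axis=0) - r.min(axis=0) > tol)),
            "depends_on_t": bool(np.any(r.max(axis=1) - r.min(axis=1) > tol)),
            # Monotonicity: the radius should not shrink with voltage or time.
            "grows_with_V": bool(np.all(np.diff(r, axis=0) >= -tol)),
            "grows_with_t": bool(np.all(np.diff(r, axis=1) >= -tol)),
        }
    return flags


def toy_score(flags):
    """Fraction of checks passed; a crude placeholder for a plausibility score."""
    return sum(flags.values()) / len(flags)


voltage_only = lambda V, t: 0.01 * V                                # made-up coefficient
coupled = lambda V, t: V * (0.000786 * np.log(t) + 0.00781)         # form quoted in the text
singular = lambda V, t: V * np.log(t - 1.0)                         # undefined for t <= 1 s

for name, model in [("voltage_only", voltage_only), ("coupled", coupled), ("singular", singular)]:
    flags = check_candidate(model)
    print(name, toy_score(flags), flags)
```

Checks like these cover only non-triviality, monotonicity, and pathology. Judging scaling behavior against known growth regimes, and explaining the verdict in words, is the part the framework delegates to the LLM.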
Adaptive Experimental Design:
The selected hypothesis guides the next experimental step. The system calculates residuals between the selected model and experimental data. A Gaussian Process (GP) is trained on these residuals to model uncertainty. Bayesian Optimization (BO) with an Upper Confidence Bound (UCB) acquisition function is then applied to the residual model to select new measurement conditions. This targets regions where the current hypothesis is incomplete, iteratively refining the model.
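The residual-driven step can be sketched with a small Gaussian Process written directly in NumPy. This is an illustrative sketch under several assumptions, not the authors' code: the squared-exponential kernel, its hyperparameters, the use of absolute residuals, the inputs rescaled to the unit square, and all data values are invented for the example.

```python
import numpy as np


def rbf_kernel(A, B, length_scale=0.2, variance=1.0):
    """Squared-exponential kernel between two sets of row vectors."""
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)


def gp_posterior(X_train, y_train, X_query, noise=1e-4, length_scale=0.2, variance=1.0):
    """Posterior mean and standard deviation of a zero-mean GP at X_query."""
    K = rbf_kernel(X_train, X_train, length_scale, variance) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_query, length_scale, variance)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = variance - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))


def pick_next(X_train, residuals, X_pool, beta=2.0):
    """Upper Confidence Bound: favor large predicted residual or high uncertainty."""
    mean, std = gp_posterior(X_train, residuals, X_pool)
    ucb = mean + beta * std
    return int(np.argmax(ucb)), ucb


# Hypothetical measured conditions, rescaled to [0, 1] x [0, 1], and the
# absolute misfit of the currently selected hypothesis at each of them.
X_train = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0]])
abs_residuals = np.array([0.01, 0.02, 0.01])

# Candidate conditions for the next measurement: a coarse grid.
grid = np.linspace(0.0, 1.0, 11)
X_pool = np.array([[v, t] for v in grid for t in grid])

best_index, ucb = pick_next(X_train, abs_residuals, X_pool)
next_point = X_pool[best_index]
print("next (scaled) condition:", next_point)
```

With all existing data clustered in one corner and small residuals there, the uncertainty term dominates and the sketch proposes a condition far from what has been measured. As residuals grow in some region, the mean term starts to pull sampling toward places where the current hypothesis fits poorly.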

Experimental Implementation
The framework was demonstrated on autonomous Piezoresponse Force Microscopy (PFM) experiments investigating ferroelectric domain switching in a lead zirconate titanate (PZT) thin film.

Setup: Voltage pulses (1–10 V) with varying durations (0.01–10 s) were applied to a fixed location to induce domain switching. The resulting domain size (effective radius) was measured via DART-PFM.
Process: The experiment began with five random seed measurements. Over 10 iterative cycles, the system acquired 5 new data points per cycle based on the residual-driven BO strategy, resulting in a total of 55 measurements.
Evaluation: Candidate models were ranked by the LLM using a prompt engineered to enforce physical rules (e.g., requiring dependence on both $V$ and $t$).
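The measurement budget and the seed step described above can be written down in a few lines. This is a sketch under assumptions: the text gives the voltage and duration ranges and the counts, but not how the random seeds were drawn, so uniform sampling in voltage and log-uniform sampling in duration are guesses, and the function name is hypothetical.

```python
import numpy as np

V_RANGE = (1.0, 10.0)    # volts, from the text
T_RANGE = (0.01, 10.0)   # seconds, from the text
N_SEED, N_CYCLES, N_PER_CYCLE = 5, 10, 5


def sample_conditions(n, rng):
    """Draw n (voltage, duration) pairs; log-uniform durations are an assumption."""
    voltage = rng.uniform(V_RANGE[0], V_RANGE[1], size=n)
    log_t = rng.uniform(np.log10(T_RANGE[0]), np.log10(T_RANGE[1]), size=n)
    return np.column_stack([voltage, 10.0**log_t])


rng = np.random.default_rng(0)
seed_points = sample_conditions(N_SEED, rng)

# Five seeds plus ten cycles of five points each.
total_measurements = N_SEED + N_CYCLES * N_PER_CYCLE
print(seed_points)
print("total measurements:", total_measurements)
```

Sampling durations on a log scale is a natural choice when the range spans three decades, since a uniform draw would place almost every seed above 1 s.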

Key Results

Evolution of Hypotheses: Starting from five seed points, the symbolic regression initially produced trivial or physically incomplete models (e.g., constant or single-variable dependencies). As the dataset grew through adaptive sampling, the hypothesis space evolved to include coupled voltage-time expressions.
Selection Criteria Comparison:
- Data-Driven Selection: The standard PySR "best" criterion (balancing loss and complexity) favored a simple voltage-only linear model ($r \propto V$) in later iterations because it offered a favorable loss-to-complexity ratio, despite neglecting the time dependence critical for finite-time growth.
- LLM-Guided Selection: The LLM evaluator rejected the voltage-only model for violating the "non-triviality" rule (lack of time dependence). It selected a model of the form $r = V(0.000786 \log t + 0.00781)$.
Physical Consistency: The LLM-selected model was identified as consistent with "disorder-controlled kinetics" or "creep behavior," where domain wall velocity depends exponentially on the inverse electric field, leading to logarithmic time dependence. This result aligns with prior findings in the field but was derived here without pre-specifying the model class.
Predictive Performance: When evaluated on future data points (acquired after model selection), the LLM-selected models demonstrated predictive accuracy comparable to the minimum-loss models but with lower variance. Crucially, the LLM-selected models generalized better than the purely complexity-penalized PySR "best" models, which often underfit the physics.
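To get a feel for the LLM-selected expression, it can be evaluated directly. The sketch below assumes the logarithm is natural and that $V$ is in volts and $t$ in seconds; this summary does not state the log base or the units of $r$, so the outputs are in whatever units the fit used. The function name is hypothetical.

```python
import math


def domain_radius(voltage, duration):
    """Selected model from the text: r = V * (0.000786 * log(t) + 0.00781)."""
    return voltage * (0.000786 * math.log(duration) + 0.00781)


# The radius scales linearly with voltage at a fixed pulse duration.
r_5v = domain_radius(5.0, 1.0)
r_10v = domain_radius(10.0, 1.0)

# At a fixed voltage, each tenfold increase in duration adds the same increment.
step_first_decade = domain_radius(10.0, 1.0) - domain_radius(10.0, 0.1)
step_second_decade = domain_radius(10.0, 10.0) - domain_radius(10.0, 1.0)

print(r_5v, r_10v)
print(step_first_decade, step_second_decade)
```

The equal increment per decade of pulse duration is the logarithmic slowdown that the text associates with creep-like domain-wall motion, while the prefactor $V$ carries the voltage dependence that the voltage-only model captured on its own.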

Significance and Claims
The paper claims to establish a route for integrating symbolic regression, physical reasoning, and adaptive experimentation into hierarchical autonomous scientific workflows. The primary contribution is the demonstration that autonomous microscopy can move beyond optimizing within fixed search spaces to generating interpretable physical laws directly from data.

Key claims include:

Open Hypothesis Discovery: The framework enables the generation of candidate physical relationships without relying on a predefined set of human-designed model classes.
Physics-Aware Ranking: The LLM evaluator successfully distinguishes between statistically efficient but physically incomplete models and those that are interpretable and consistent with known mechanisms (e.g., identifying the creep regime).
Knowledge Propagation: The natural language reasoning provided by the LLM serves as a transferable knowledge artifact, allowing downstream agents to understand why a model was selected, rather than just receiving a numerical score.
Validation: The approach successfully identified a kinetic, creep-like domain-wall growth law in PZT, a result consistent with previous hypothesis-learning studies that relied on predefined models, thereby validating the "open" generation approach.

The authors note that while the current implementation uses symbolic regression for analytical forms, the broader framework is designed to accommodate numerical or simulation-based models in future iterations. The work positions this methodology as a step toward autonomous systems that actively participate in the discovery process rather than merely executing predefined optimization tasks.
