Towards Reasoning for PDE Foundation Models: A… — Plain-Language Explanation

Published 2026-01-26

Original authors: Siddharth Mansingh, James Amarel, Ragib Arnab, Arvind Mohan, Kamaljeet Singh, Gerd J. Kunde, Nicolas Hengartner, Benjamin Migliori, Emily Casleton, Nathan A. Debardeleben, Ayan Biswas, Diane Oyen, Earl Lawrence

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Picture: Teaching a Physics "Genius" to Think Before It Speaks

Imagine you have a very smart robot designed to predict how fluids (like air or water) move. This robot is a "Foundation Model" trained on physics equations. Usually, this robot works like a student taking a test: it looks at the starting situation, makes a guess for the next second, then uses that guess to predict the second after that, and so on.

The Problem: If the robot makes a tiny mistake in the first second, that mistake gets bigger and bigger with every step, like a snowball rolling down a hill. By the end of the simulation, the prediction is completely wrong. This is especially bad when the robot faces a new, tricky situation it hasn't seen before.

The Solution: The authors of this paper introduced a new way for the robot to "think" before it commits to an answer. Instead of just making one guess and moving forward, the robot generates many different possible futures at every single step. It then acts like a judge, picking the one future that looks the most physically realistic before moving to the next step.

They call this "Test-Time Compute" (TTC). It's like giving the robot a little more time to "think" during the exam, rather than just memorizing answers during study time.

How It Works: The "Choose Your Own Adventure" Strategy

To make this work, the researchers used two main tools:

1. The "Stochastic" Trick (Making the Robot Guess)

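Most physics models are deterministic, meaning if you give them the same input, they give the exact same output every time. To make the robot generate different guesses, the researchers used "dropout," a technique that randomly switches off parts of the network. Dropout is normally active only during training and disabled afterwards; here it stays turned on while the robot is making predictions, so every run through the model comes out slightly different.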

The Analogy: Imagine asking a chef to cook a dish. Usually, they follow the recipe exactly. Here, the researchers told the chef, "For this dish, you can randomly swap out a few ingredients or change the cooking time slightly." This forces the chef to create 10 slightly different versions of the dish instead of just one.
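To make the dropout trick concrete, here is a minimal sketch in plain NumPy. It is not the authors' code: the one-layer `toy_step` model, the dropout rate, and every name below are hypothetical stand-ins for the real Vision Transformer. The only point is that leaving dropout on at prediction time turns one input into many different outputs.

```python
import numpy as np

def dropout(x, rate, rng):
    """Inverted dropout: zero each entry with probability `rate`, rescale the rest."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def toy_step(state, weights, rng, rate=0.2):
    """Hypothetical one-layer 'model': linear map, dropout (kept on), tanh."""
    hidden = dropout(state @ weights, rate, rng)
    return np.tanh(hidden)

def sample_candidates(state, weights, n_candidates, seed=0):
    """Run the same input through the model several times with fresh dropout masks."""
    rng = np.random.default_rng(seed)
    return [toy_step(state, weights, rng) for _ in range(n_candidates)]

# Toy input and weights (assumed values, chosen only for illustration).
state = np.linspace(-1.0, 1.0, 16)
weights = np.random.default_rng(42).normal(size=(16, 16)) / 4.0

candidates = sample_candidates(state, weights, n_candidates=10)
print(len(candidates), "candidate next states from one input")
```

With `rate=0.0` the function collapses back to an ordinary deterministic model, which is the behavior the paragraph above describes as the usual case.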

2. The "Judge" (The Reward Model)

Once the robot generates 10 different guesses for the next second, it needs a way to pick the best one. They used two types of "Judges":

The Analytical Judge (The Rulebook): This judge checks the guesses against the strict laws of physics (like the Law of Conservation of Mass). If a guess says mass disappeared, the judge gives it a low score.
The Learned Judge (The Experienced Coach): This is a smaller AI trained to look at the guesses and say, "This one looks like a real fluid flow; that one looks weird." It learns from examples of good and bad predictions.
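As an illustration of the "rulebook" idea, the sketch below scores a guess by how much total mass changes between one snapshot and the next. It assumes a closed or periodic domain (so total mass should stay constant) and a uniform grid; the function names and the exact scoring formula are hypothetical, not the paper's reward functions, which also cover momentum and energy.

```python
import numpy as np

def total_mass(density, cell_area=1.0):
    """Total mass on a uniform grid: sum of density times cell area."""
    return float(np.sum(density) * cell_area)

def mass_reward(prev_density, next_density):
    """Negative relative change in total mass; 0.0 is a perfect score."""
    m0 = total_mass(prev_density)
    m1 = total_mass(next_density)
    return -abs(m1 - m0) / abs(m0)

# Toy snapshots (assumed values).
prev = np.full((4, 4), 2.0)

good = prev.copy()          # mass moves between two cells but none is lost
good[0, 0] += 0.5
good[1, 1] -= 0.5

bad = prev * 0.9            # ten percent of the mass has vanished

print(mass_reward(prev, good), mass_reward(prev, bad))
```

The guess that only moves mass around gets the top score, while the guess that loses mass is penalized in proportion to how much went missing.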

The Process:

The robot generates a set of possible next steps, 10 in this example (this number is the "branching factor").
The Judge scores all 10.
The robot picks the highest-scoring one and moves to the next second.
It repeats this until the simulation is done.
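The four steps above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: `propose` stands in for the stochastic model, `reward` for either judge, and the toy noise model and mass-based score are assumptions made only so the example runs.

```python
import numpy as np

def greedy_ttc_rollout(initial_state, propose, reward, n_steps, branching_factor):
    """At each step, sample several candidates and keep the best-scoring one."""
    state = initial_state
    trajectory = [state]
    for _ in range(n_steps):
        # Generate `branching_factor` candidate next states.
        candidates = [propose(state) for _ in range(branching_factor)]
        # Let the judge score each candidate against the current state.
        scores = [reward(state, c) for c in candidates]
        # Keep the highest-scoring candidate and move on.
        state = candidates[int(np.argmax(scores))]
        trajectory.append(state)
    return trajectory

# Toy stand-ins (hypothetical): a noisy "model" and a mass-conservation judge.
rng = np.random.default_rng(0)

def propose(state):
    return state + rng.normal(scale=0.01, size=state.shape)

def reward(state, candidate):
    return -abs(candidate.sum() - state.sum())

trajectory = greedy_ttc_rollout(np.ones(8), propose, reward,
                                n_steps=5, branching_factor=10)
print(len(trajectory), "states, including the starting one")
```

Setting `branching_factor=1` recovers the ordinary one-guess-per-step rollout, so the extra cost of this strategy is roughly the branching factor times the cost of a normal prediction, plus the judge's scoring.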

The Results: Smarter with Less Data

The researchers tested this on complex fluid simulations (like shockwaves and swirling vortices). Here is what they found:

Better Accuracy: By using this "think before you speak" method, the robot made far fewer mistakes over long periods. The more guesses it generated (the higher the "branching factor"), the better it performed.
Small Models, Big Wins: They achieved these results using a relatively small model (about 5 million parameters). Comparable existing models range from about 21 million up to 700 million parameters.
Data Efficiency: This is the biggest win. Usually, to teach a model a new task, you need thousands of examples. With this method, the model reached its target accuracy on a new task after fine-tuning on only 6.25% (one sixteenth) of the data that the same model needs without the "thinking" step.
Analogy: Imagine a student who usually needs to read 100 textbooks to pass a test. With this new "thinking" strategy, they only needed to read about 6 textbooks and still got an A+.

What They Did NOT Claim

It is important to stick to what the paper actually says:

They did not claim this works for medical diagnoses or clinical uses.
They did not claim this replaces all other physics simulation methods.
They did not claim the model is "human-like" in its reasoning; it is simply a mathematical way to select the best candidate solution based on physical rules.

Summary

The paper introduces a method where a physics AI model pauses to generate multiple possibilities at every step, uses a "judge" to pick the one that obeys the laws of physics best, and then proceeds. This allows smaller, cheaper models to perform better and learn from far less data than before, effectively giving them the ability to "reason" through complex problems without needing to be retrained from scratch.

Technical Summary: Towards Reasoning for PDE Foundation Models

Problem Statement
Partial Differential Equations (PDEs) are fundamental to computational science but remain computationally expensive to solve. While PDE Foundation Models (FMs) offer a promising alternative to traditional numerical methods, they face two critical limitations:

Error Accumulation in Autoregressive Rollouts: Existing models suffer from compounding errors and distribution shifts, particularly during long time-horizon predictions and in out-of-distribution (OOD) scenarios.
Data and Compute Inefficiency: Current approaches rely heavily on extensive fine-tuning datasets, which are often unavailable or prohibitively expensive to generate in real-world applications. Furthermore, large models require significant computational resources, limiting their utility in safety-critical contexts where efficiency is paramount.

The paper posits that the "reasoning" strategies recently successful in Large Language Models (LLMs)—such as Chain-of-Thought and Tree-of-Thought—could be adapted to PDEs. However, unlike LLMs where reasoning involves subjective solution spaces, PDEs offer objective physical constraints. The challenge is to define "reasoning" in this context as the systematic use of inference-time computation to evaluate, compare, and select among multiple candidate solutions guided by a reward signal, without requiring additional training data or massive parameter scaling.

Methodology
The authors introduce a Test-Time Compute (TTC) framework, described as the first of its kind for PDE foundation models. The core approach involves generating multiple candidate predictions at each inference step and selecting the most promising one based on a reward model.

Base Architecture: The foundation model is a Vision Transformer (ViT) adapted for image-to-image translation of fluid dynamics states. The authors utilize three variants (ViT-3, ViT-5, ViT-7) corresponding to different patch sizes (3x3, 5x5, 7x7) to better approximate PDE operators.
Inducing Stochasticity: Unlike standard deterministic PDE models, this framework requires stochasticity to generate multiple candidates for beam-search-style selection. The authors achieve this by keeping dropout active during inference, allowing the model to sample different dropout masks and produce diverse predictions for the same input.
Reward Models: Two types of reward models are employed to evaluate the quality of candidate predictions (specifically, the transition from time $t$ to $t+1$):
1. Analytical Reward Models (ARMs): These are hand-crafted functions based on explicit physical conservation laws (mass, momentum, and energy). They calculate the deviation from conservation principles to assign a reward score.
2. Learned Process Reward Models (PRMs): These are neural networks trained via contrastive learning to predict the quality of a next-step snapshot. The PRM is trained on triplets of predictions (maximum, median, and minimum quality based on Mean Squared Error against ground truth) using a triplet margin loss. Notably, PRMs are trained on a fraction of the data (12.5% of original samples) and are sized similarly to the foundation model itself.
Inference Algorithm: The system employs a Greedy Selection Strategy. At each timestep, the base model generates $B$ candidate predictions (where $B$ is the branching factor). The reward model scores each candidate, and the one with the highest score is selected to proceed to the next timestep. This process repeats until the final time horizon is reached.
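The PRM training signal described above can be sketched in two pieces: ranking candidates by MSE to pick the best, median, and worst, and the standard triplet margin loss, $\max(0,\ d(a,p) - d(a,n) + m)$. This summary does not say which of the three predictions plays the anchor, positive, or negative role, nor what embedding or margin is used, so the code leaves that mapping open. All names are hypothetical.

```python
import numpy as np

def pick_triplet(candidates, target):
    """Return the (best, median, worst) candidates ranked by MSE against `target`.

    For an even number of candidates the upper median is used (an assumption).
    """
    errors = [float(np.mean((c - target) ** 2)) for c in candidates]
    order = np.argsort(errors)
    best, median, worst = order[0], order[len(order) // 2], order[-1]
    return candidates[best], candidates[median], candidates[worst]

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss with Euclidean distance."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return float(max(0.0, d_pos - d_neg + margin))

# Toy data (assumed values): ground truth is all zeros.
target = np.zeros(4)
candidates = [np.full(4, 0.3), np.full(4, 0.1), np.full(4, 0.2)]
best, median, worst = pick_triplet(candidates, target)

# Generic loss check on made-up 2-D embeddings.
loss_far = triplet_margin_loss(np.array([0.0, 0.0]), np.array([0.0, 1.0]),
                               np.array([0.0, 3.0]))
loss_near = triplet_margin_loss(np.array([0.0, 0.0]), np.array([0.0, 1.0]),
                                np.array([0.0, 1.5]))
print(loss_far, loss_near)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, and positive otherwise, which is what pushes a learned judge to separate good predictions from bad ones.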

Key Contributions

Novel TTC Framework: The paper introduces the first test-time computation strategy for PDE FMs, demonstrating that inference-time scaling can improve accuracy without additional training data.
Sample Efficiency: The proposed method achieves state-of-the-art downstream accuracy after fine-tuning on only 6.25% of the training data required by an equivalent baseline FM without TTC.
Parameter Efficiency: The approach utilizes a compact foundation model of approximately 5 million parameters, a significant reduction compared to existing PDE models which range from 21M to 0.7 billion parameters.
Learned PRMs for PDEs: The introduction of Process Reward Models tailored for PDEs, which are trained efficiently on limited data and outperform analytical reward functions in many scenarios.

Results
The method was evaluated on the PDEGym benchmark, specifically focusing on compressible Euler equations (CE) involving complex phenomena like shocks and vortex structures.

Pretraining Performance: On pretraining datasets (RP, CRP, Gauss, KH), increasing the branching factor ($B$) led to monotonic improvements in Mean Squared Error (MSE). Process Reward Models (PRMs) consistently outperformed Analytical Reward Models (ARMs), with sample gains reaching up to ~25% in certain tasks.
Downstream Generalization: The framework demonstrated robustness on OOD downstream tasks (RM and RPUI). While ARM performance sometimes degraded (potentially due to conservation violations in the training data), PRMs provided consistent improvements.
Data Efficiency: A model fine-tuned on a small number of trajectories ($n_1$) using TTC with a high branching factor approached the performance of a model fine-tuned on a much larger dataset ($n_2$) with standard inference ($B=1$).
Physical Consistency: The TTC approach improved adherence to mass and energy conservation laws during inference, although momentum conservation improvements were less consistent due to biases in the ground truth data.

Significance and Claims
The paper positions this work as a foundational first step toward advanced reasoning algorithms for PDE modeling, rather than a definitive solution.

Paradigm Shift: It suggests a shift from relying solely on model capacity and training data to leveraging inference-time computation. This aligns with the "bitter lesson" of AI, where scalable systems rely on computation rather than handcrafted knowledge.
Practical Impact: By enabling high accuracy with smaller models and sparse data, the method addresses the critical bottleneck of data scarcity in scientific applications where high-fidelity simulations are expensive.
Future Directions: The authors frame this as an early exploration similar to the early era of LLM reasoning models. They suggest that while this current work uses reward-model-driven self-evaluation, it paves the way for fully adaptive, reinforcement-learning-based reasoning algorithms. The paper explicitly notes that the definition of "reasoning" in PDEs requires further philosophical and technical scrutiny, distinguishing it from human reasoning by the presence of objective physical benchmarks.

Towards Reasoning for PDE Foundation Models: A Reward-Model-Driven Inference-Time-Scaling Algorithm