FEM-Bench: A Structured Scientific Reasoning Benchmark for Evaluating Code-Generating LLMs

This paper introduces FEM-Bench, a structured benchmark based on computational mechanics tasks designed to rigorously evaluate the ability of large language models to generate scientifically valid finite element method code, revealing that even state-of-the-art models struggle to consistently solve these nontrivial problems.

Original authors: Saeed Mohammadzadeh, Erfan Hamdi, Joel Shor, Emma Lejeune

Published 2026-06-01 (author reviewed)

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to teach a brilliant, well-read robot how to be a structural engineer. You don't just want it to write code that looks like it works; you want it to write code that actually understands the laws of physics, like gravity, tension, and how materials bend.

This paper introduces FEM-Bench, a "final exam" designed specifically to test if Large Language Models (LLMs)—the AI brains behind tools like ChatGPT—can do this kind of serious scientific engineering.

Here is a breakdown of the paper using simple analogies:

1. The Problem: The "Calculator" vs. The "Engineer"

Think of current AI models as incredibly fast calculators. If you ask them to write a simple program to add numbers or sort a list, they are great. But if you ask them to simulate how a bridge collapses under a heavy truck, they often fail.

Why? Because building a physics simulation isn't just about writing code; it's about:

  • Understanding the rules: Knowing exactly how forces move through a beam.
  • Connecting the dots: Taking tiny pieces of a puzzle (small parts of a structure) and snapping them together perfectly to make a whole picture.
  • Checking the work: Writing a test to prove the simulation isn't lying.
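To make "connecting the dots" concrete, here is a minimal sketch of the assembly step for the simplest possible structure: a straight bar, fixed at one end and pulled at the other. It is not taken from the paper or from FEM-Bench; the function names (`bar_element_stiffness`, `assemble_global_stiffness`, `solve_fixed_bar`) and the material values are assumptions chosen for illustration, and the code is kept dependency-free on purpose.

```python
import math


def bar_element_stiffness(E, A, length):
    """2x2 stiffness matrix of a two-node axial bar element."""
    k = E * A / length
    return [[k, -k], [-k, k]]


def assemble_global_stiffness(n_elem, E, A, total_length):
    """Add each element's 2x2 block into the global matrix."""
    n_nodes = n_elem + 1
    K = [[0.0] * n_nodes for _ in range(n_nodes)]
    k_e = bar_element_stiffness(E, A, total_length / n_elem)
    for e in range(n_elem):  # element e connects nodes e and e + 1
        for a in range(2):
            for b in range(2):
                K[e + a][e + b] += k_e[a][b]
    return K


def solve_linear_system(A_mat, rhs):
    """Gaussian elimination with partial pivoting, working on copies."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A_mat)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def solve_fixed_bar(n_elem, E, A, total_length, P):
    """Fix node 0, pull the last node with force P, return nodal displacements."""
    K = assemble_global_stiffness(n_elem, E, A, total_length)
    # Apply the fixed support by dropping row and column 0.
    K_free = [row[1:] for row in K[1:]]
    F_free = [0.0] * (n_elem - 1) + [P]
    return [0.0] + solve_linear_system(K_free, F_free)


if __name__ == "__main__":
    # Illustrative values (assumed): steel-like bar, 2 m long, 1 kN tip load.
    E, A, L, P = 200e9, 1e-4, 2.0, 1000.0
    u = solve_fixed_bar(4, E, A, L, P)
    # "Checking the work": the tip should stretch by P*L/(E*A).
    assert math.isclose(u[-1], P * L / (E * A), rel_tol=1e-9)
```

The final assertion plays the role of the third bullet: the simulation is compared against a textbook closed-form answer rather than trusted on sight.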

The authors realized there was no standard "driver's test" for AI in this specific field. Existing tests check if AI can write a website or solve a math riddle, but not if it can build a scientifically valid model of the physical world.

2. The Solution: FEM-Bench (The "Driving Test")

The authors created FEM-Bench, a collection of 33 specific challenges based on a first-year graduate course in computational mechanics.

  • The Analogy: Imagine a driving test. You don't just ask the driver to "drive." You ask them to parallel park, merge onto a highway, and navigate a roundabout.
  • The Tasks: In FEM-Bench, the "driving" involves things like:
    • Calculating how a 3D beam bends when you push it.
    • Turning a smooth, continuous shape (like a curved bridge) into a digital grid of tiny triangles (called "meshing").
    • Solving complex equations to see if a structure will buckle (collapse) under pressure.
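As a rough illustration of the meshing task in the list above, the sketch below cuts a plain rectangle (not a curved shape, which would need more machinery) into a regular grid of triangles. The helper names `mesh_rectangle` and `triangle_area` are hypothetical and are not part of FEM-Bench.

```python
def mesh_rectangle(width, height, nx, ny):
    """Split a rectangle into 2 * nx * ny triangles on a regular grid.

    Returns (nodes, triangles): nodes are (x, y) tuples, triangles are
    tuples of three node indices in counter-clockwise order.
    """
    nodes = [(i * width / nx, j * height / ny)
             for j in range(ny + 1) for i in range(nx + 1)]
    triangles = []
    for j in range(ny):
        for i in range(nx):
            n0 = j * (nx + 1) + i   # lower-left corner of the grid cell
            n1 = n0 + 1             # lower-right
            n2 = n0 + nx + 1        # upper-left
            n3 = n2 + 1             # upper-right
            triangles.append((n0, n1, n3))
            triangles.append((n0, n3, n2))
    return nodes, triangles


def triangle_area(nodes, tri):
    """Signed area; positive when the nodes are ordered counter-clockwise."""
    (x0, y0), (x1, y1), (x2, y2) = (nodes[n] for n in tri)
    return 0.5 * ((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))


if __name__ == "__main__":
    nodes, triangles = mesh_rectangle(2.0, 1.0, nx=4, ny=3)
    total = sum(triangle_area(nodes, t) for t in triangles)
    # A valid mesh covers the shape exactly once, with no inverted triangles.
    assert all(triangle_area(nodes, t) > 0 for t in triangles)
    assert abs(total - 2.0 * 1.0) < 1e-12
```

Checking that every triangle has positive area and that the areas sum to the area of the rectangle is a typical sanity check for this kind of code.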

3. The Twist: Two Parts to the Test

The benchmark doesn't just ask the AI to write the code. It asks for two things:

  1. The Code: The actual simulation program.
  2. The Test: A set of "check-up" rules (unit tests) that the AI must write to prove its own code works.

The Metaphor: It's like asking a student to not only build a bridge out of popsicle sticks but also to write a checklist proving the bridge won't fall down. If the student builds a bridge that looks cool but collapses when you put a weight on it, they fail. If they build a bridge that holds, but they can't write a test to prove it, they also fail.
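The pairing of "code" and "test" might look like the following sketch. It uses the standard textbook stiffness matrix of a 2D Euler-Bernoulli beam element, which is a simplification of the 3D beams mentioned earlier, and three physics-based checks. The function and test names are assumptions for illustration, not the benchmark's actual tasks.

```python
import math


def beam_element_stiffness(E, I, L):
    """4x4 stiffness matrix of a 2D Euler-Bernoulli beam element.

    Degree-of-freedom order: [w1, theta1, w2, theta2]
    (deflection and rotation at each end).
    """
    c = E * I / L**3
    return [
        [12 * c, 6 * L * c, -12 * c, 6 * L * c],
        [6 * L * c, 4 * L**2 * c, -6 * L * c, 2 * L**2 * c],
        [-12 * c, -6 * L * c, 12 * c, -6 * L * c],
        [6 * L * c, 2 * L**2 * c, -6 * L * c, 4 * L**2 * c],
    ]


def matvec(K, u):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(k_ij * u_j for k_ij, u_j in zip(row, u)) for row in K]


def test_symmetry():
    K = beam_element_stiffness(210e9, 8e-6, 2.0)
    for i in range(4):
        for j in range(4):
            assert math.isclose(K[i][j], K[j][i])


def test_rigid_body_motion_produces_no_force():
    L = 2.0
    K = beam_element_stiffness(210e9, 8e-6, L)
    translation = [1.0, 0.0, 1.0, 0.0]
    rotation = [0.0, 1.0, L, 1.0]  # small rigid rotation about node 1
    for mode in (translation, rotation):
        forces = matvec(K, mode)
        assert all(abs(f) < 1e-6 * K[0][0] for f in forces)


def test_cantilever_matches_closed_form():
    E, I, L, P = 210e9, 8e-6, 2.0, 1000.0
    K = beam_element_stiffness(E, I, L)
    # Clamp node 1 (w1 = theta1 = 0) and solve the remaining 2x2 system
    # [a b; c d] [w2, theta2] = [P, 0] with Cramer's rule.
    a, b, c, d = K[2][2], K[2][3], K[3][2], K[3][3]
    w2 = (P * d) / (a * d - b * c)
    assert math.isclose(w2, P * L**3 / (3 * E * I), rel_tol=1e-9)


if __name__ == "__main__":
    test_symmetry()
    test_rigid_body_motion_produces_no_force()
    test_cantilever_matches_closed_form()
```

Each test encodes a physical fact rather than a hard-coded number: the matrix must be symmetric, moving the beam without deforming it must cost no force, and a clamped beam with a tip load must deflect by the closed-form value PL^3/(3EI).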

4. The Results: The AI is Smart, But Not There Yet

The authors ran the top 10 AI models (including the newest ones from Google, OpenAI, and Anthropic) through this exam. Here is what they found:

  • The Easy Stuff: The AIs are great at the basics. They can easily handle simple, straight-line problems (like a single wooden beam). It's like they can parallel park perfectly.
  • The Hard Stuff: When the problems get complex—like dealing with twisting forces, curved shapes, or predicting when a structure will buckle—the AIs start to stumble.
    • The "Knowledge Gap": Sometimes the AI simply didn't know the specific formula for a complex physical phenomenon. It was like a driver who knows how to drive a car but doesn't know the rules of a roundabout.
    • The "Assembly Gap": Sometimes the AI knew the pieces but couldn't put them together correctly. It was like having all the Lego instructions but snapping the wrong bricks together.
    • The "Testing Gap": Even when the AI wrote a perfect simulation, it often failed to write the tests to prove it was correct. Writing the "checklist" was harder than building the "bridge."
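To give a sense of what a "specific formula" can look like: this summary does not say which formulas the models were missing, so the following is a textbook illustration rather than something taken from the paper. Linear buckling analysis is commonly posed as a generalized eigenvalue problem, where K is the elastic stiffness matrix, K_g is the geometric stiffness matrix for a unit compressive reference load (sign conventions vary between textbooks), and the smallest eigenvalue λ is the critical load factor. For a single pin-ended column of length L, this reduces to Euler's classical critical load.

```latex
\left( \mathbf{K} - \lambda\, \mathbf{K}_g \right) \boldsymbol{\phi} = \mathbf{0},
\qquad
P_{\mathrm{cr}} = \frac{\pi^{2} E I}{L^{2}}
```

A model that does not recall the form of K_g cannot reach the eigenvalue problem at all, however well it handles the rest of the code.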

The Score:

  • The best model (Gemini 3 Pro) got about 90% of the simple tasks right.
  • However, on the hardest tasks (those requiring complex physics without help), no model could solve them consistently.
  • Interestingly, the AI was often better at writing the code than writing the tests to verify that code.

5. The "Cheat Sheet" Experiment

The researchers tried to see if they could help the AI by giving it a "cheat sheet" (a system prompt with extra instructions).

  • Result: When they gave the AI the specific, complex formulas it was missing, it suddenly got much better at solving the hard problems.
  • The Lesson: The AI isn't "stupid"; it just lacks specific, deep knowledge about certain physics formulas. It can't "invent" the math of a collapsing bridge on the fly, but if you hand it the formula, it applies it far more reliably.

Summary

FEM-Bench is a reality check for AI in science. It shows that while AI is getting very good at general coding, it still struggles to be a reliable, independent engineer for complex physical problems. It can follow instructions and build simple models, but it cannot yet reliably reason through the deep, messy, and precise laws of physics required to simulate the real world without human help.

The paper concludes that we need benchmarks like this to track progress. As AI gets smarter, the "driving test" will need to get harder to keep measuring real improvement.
