MaterialFigBENCH: benchmark dataset with figures for evaluating college-level materials science problem-solving abilities of multimodal large language models

Imagine you have a brilliant student named MaterialFigBench. This student is a "Multimodal Large Language Model" (MLLM)—basically, a super-smart AI that can read text and look at pictures.

The researchers behind this paper wanted to test if this student is actually smart or just good at memorizing flashcards.

Here is the story of their experiment, explained simply.

1. The Test: "Don't Just Guess!"

Usually, when we test AI, we ask it text-based questions like, "What is the capital of France?" The AI knows this because it read it a million times in its training data.

But in Materials Science (the study of what things are made of and how they behave), the answers are often hidden inside pictures:

Phase Diagrams: Maps showing how metals melt and mix.
Stress-Strain Curves: Graphs showing how much you can stretch a rubber band before it snaps.
Microscopes: Zoomed-in photos of crystal structures.

The researchers created a special test called MaterialFigBench. They took 137 college-level problems from textbooks.

The Catch: They took the original textbook pictures, changed the names of the metals (e.g., changing "Copper" to "Mystery Metal X"), and tweaked the lines on the graphs.
The Goal: To force the AI to actually look at the picture and do the math, rather than just recalling a fact it memorized.

2. The Results: The "Cheat Sheet" Problem

When they ran the test on top AI models (like GPT-4o, GPT-o1, and GPT-5), they found something surprising.

The AI was cheating.

The Scenario: The test asked, "Look at this graph for Mystery Metal X and tell me the melting point."
The AI's Reaction: Instead of looking at the graph, the AI thought, "Hmm, 'Mystery Metal X' sounds like Carbon in an Iron alloy. I've seen this a million times. The answer is 0.76%."
The Result: The AI got the answer right, but for the wrong reason. It ignored the picture and relied on its memory.

It's like a student taking a math test where the teacher changes the numbers in the problem. The student ignores the new numbers, remembers the answer to the old problem from last year, and writes that down. They get the grade, but they didn't learn anything.

3. Where Did the AI Struggle?

When the AI couldn't cheat (because the problem was too unique or required precise measurement), it struggled badly.

The "Pixel Peeping" Problem: The AI is terrible at reading exact numbers off a graph. If a line is at 45.3 on the chart, the AI might guess 45 or 46. In science, that small difference matters.
The "Significant Digits" Issue: The AI often gave answers that were too vague (e.g., "58,000" instead of "57,915") or too precise (giving 10 decimal places when the graph only had 2). It's like a chef guessing "a pinch of salt" when the recipe requires exactly 3.2 grams.
The "Geometry" Gap: Tasks requiring the AI to measure distances on a map or count angles in a crystal structure were very hard. It's like asking a human to estimate the distance between two stars just by looking at a blurry photo.

4. The Good News

The AI is getting better, but slowly.

Newer models (like GPT-5) did better than older ones (GPT-4).
They are getting slightly better at reading specific types of graphs (like Arrhenius plots, which look like straight lines on a log scale).
However, they still haven't mastered the "general skill" of looking at a new, weird graph and figuring it out from scratch.

5. The Big Lesson

The paper concludes that current AI is not truly "seeing" the world.

It's like a parrot that has memorized the entire dictionary. If you ask it a question about a picture, it might answer correctly by matching keywords, but it doesn't actually understand the visual relationship between the objects.

Why does this matter?
If we want AI to help scientists discover new materials (like better batteries or stronger bridges), the AI needs to be able to look at a new experiment's data and reason about it, not just guess based on old data.

MaterialFigBench is like a "honesty test" for AI. It strips away the cheat sheets and forces the AI to prove it can actually read the charts. Right now, the AI is passing the test by cheating, but this benchmark helps researchers figure out how to teach it to actually look.

Here is a detailed technical summary of the paper "MaterialFigBENCH: benchmark dataset with figures for evaluating college-level materials science problem-solving abilities of multimodal large language models."

1. Problem Statement

While Large Language Models (LLMs) have shown promise in materials science, existing benchmarks primarily rely on textual inputs or treat scientific data (like crystal structures) as structured text (e.g., SMILES, CIF files) rather than visual images. There is a critical gap in evaluating Multimodal LLMs (MLLMs) on their ability to interpret scientific figures (phase diagrams, stress-strain curves, Arrhenius plots, microscopy images) which are indispensable for solving university-level materials science problems. Current models often rely on memorized domain knowledge rather than genuine visual reasoning, leading to inflated accuracy scores that do not reflect true multimodal capabilities.

2. Methodology

A. Dataset Construction (MaterialFigBENCH)

The authors constructed a benchmark dataset consisting of 137 free-response problems derived from standard university-level materials science textbooks.

Selection Criteria: Problems were selected where figures are indispensable for the solution. Text-only problems were excluded.
Figure Modification: To prevent models from relying on memorized specific material data (e.g., the standard Cu-Ag or Fe-C phase diagrams), all figures were redrawn by the authors.
- Specific element names were replaced with hypothetical symbols (e.g., "Cu-Ag" $\rightarrow$ "MA-MB").
- Visual parameters (slopes, intersections, line thicknesses) were altered while preserving the underlying physical logic.
- Original problem texts were rewritten to reference these new figures and clarify instructions (e.g., specifying units, asking for multiple answers).
Answer Validation:
- Textual Answers: Evaluated for exact matches.
- Numerical Answers: Since reading values from images introduces human-level uncertainty, expert-defined answer ranges were established for numerical questions. A model's response is considered correct if it falls within this range.
Categories: The dataset covers seven categories:
1. Crystal orientation and indexing.
2. Stress–strain curves and hardness.
3. Hypothetical alloy phase diagrams.
4. Fe–C (Iron-Carbon) phase diagrams.
5. Fe–C phase transformations and microstructures.
6. Hypothetical oxide system phase diagrams.
7. Electrical conductivity of semiconductors.

B. Evaluation Protocol

The dataset was tested against a suite of state-of-the-art MLLMs:

ChatGPT Variants: ChatGPT-4o, o1, o3-mini, o3-mini-high, 5, 5-auto, 5-thinking.
OpenAI API Models: GPT-4o, GPT-o1, GPT-5.
Testing Conditions:
- Problems were submitted with figure attachments.
- For API models, multiple responses (up to 10) were generated per problem to assess consistency.
- Control Test: Models were also tested without figure attachments to determine if they were "hallucinating" correct answers based on pre-trained knowledge rather than visual input.
Judgment Criteria: Correctness was judged based on exact matches for text, and range inclusion for numerical values. Significant digit handling was strictly enforced; answers with incorrect precision were marked wrong.

3. Key Contributions

MaterialFigBENCH Dataset: The first benchmark specifically designed to evaluate figure-based reasoning in materials science, featuring 137 modified problems with expert-defined answer ranges.
Methodological Rigor: Introduction of a systematic approach to obfuscate memorized knowledge by redrawing figures with hypothetical symbols, forcing models to rely on visual interpretation.
Comprehensive Evaluation: A large-scale evaluation of the latest generation of MLLMs (including the GPT-5 family), analyzing performance across different problem types and model versions.
Diagnostic Analysis: Identification of specific failure modes, such as the inability to handle significant digits and the tendency to bypass visual inputs entirely.

4. Key Results

A. Overall Performance

Low Accuracy: Even the most advanced models (GPT-5, ChatGPT-5-thinking) achieved relatively low accuracy (approx. 40–55%), indicating that current MLLMs struggle significantly with scientific figure interpretation.
Model Trends: Generally, accuracy improved with model updates (e.g., GPT-5 > GPT-4o), but this was not uniform. Notably, ChatGPT-o1 outperformed ChatGPT-5 in some metrics, and ChatGPT-o3-mini variants performed worse than o1, suggesting that scaling does not linearly improve visual reasoning for this domain.

B. The "Memorization vs. Vision" Phenomenon

Fe-C Phase Diagrams: Models achieved high accuracy on Fe-C problems even without attaching the figure. They inferred the hypothetical element "X" as Carbon and used memorized solubility limits (e.g., 0.022 wt% C in ferrite) rather than reading the provided image.
Hypothetical Systems: When figures were modified (e.g., MA-MB diagrams), models often failed, proving they lack generalizable visual reasoning skills for phase diagrams.
No-Figure Tests: In tests where figures were omitted, many models (especially 4o and o1) still generated answers by guessing based on training data, whereas GPT-5 was more likely to admit it could not see the figure.

C. Specific Weaknesses

Significant Digits: A major failure point. Higher-tier models (like GPT-o1 and GPT-5) frequently reduced numerical precision (e.g., reporting 2 significant digits instead of 3), leading to incorrect judgments despite correct calculations.
Geometric Interpretation: Tasks requiring distance measurement on phase diagrams or crystallographic indexing (Categories 1 and 2) showed the lowest success rates.
Arrhenius Plots: These showed the most improvement with newer models, suggesting that specific plot types are easier for models to learn than complex phase diagrams.

D. Consistency

There was a "U-shaped" distribution in accuracy: models either got a problem completely right or completely wrong, with few intermediate successes.
Disagreements between ChatGPT interfaces and API versions were observed but decreased with newer model versions.

5. Significance and Implications

Benchmarking Reality: The study reveals that high accuracy scores in general benchmarks often mask a lack of genuine visual understanding. Models are "shortcutting" by recalling training data rather than analyzing inputs.
Domain-Specific Needs: Scientific reasoning requires quantitative precision (significant digits) and geometric reasoning that current general-purpose MLLMs lack.
Future Directions:
- Training: Future models need targeted training on scientific visual abstractions (e.g., phase boundaries, logarithmic slopes) rather than just general image recognition.
- Evaluation: Benchmarks must enforce strict constraints (like hypothetical elements and answer ranges) to prevent memorization from inflating scores.
- Reliability: For materials science applications, current MLLMs are not yet robust enough for reliable, autonomous problem-solving without human verification, particularly regarding numerical precision.

In conclusion, MaterialFigBENCH provides a critical diagnostic tool showing that while MLLMs are improving, they remain far from possessing the robust, figure-grounded reasoning capabilities required for professional materials science research.