MaterialFigBench: benchmark dataset with figures for evaluating college-level materials science problem-solving abilities of multimodal large language models

The paper introduces MaterialFigBench, a benchmark dataset of 137 university-level materials science problems requiring figure interpretation, which reveals that despite improvements in multimodal large language models, they still struggle with genuine visual reasoning and quantitative analysis, often relying on memorized knowledge rather than accurately reading provided diagrams.

Michiko Yoshitake, Yuta Suzuki, Ryo Igarashi, Yoshitaka Ushiku, Keisuke Nagato

Published Fri, 13 Ma

Imagine you have a brilliant student. This student is a "Multimodal Large Language Model" (MLLM): basically, a super-smart AI that can read text and look at pictures.

The researchers behind this paper wanted to test if this student is actually smart or just good at memorizing flashcards.

Here is the story of their experiment, explained simply.

1. The Test: "Don't Just Guess!"

Usually, when we test AI, we ask it text-based questions like, "What is the capital of France?" The AI knows this because it read it a million times in its training data.

But in Materials Science (the study of what things are made of and how they behave), the answers are often hidden inside pictures:

  • Phase Diagrams: Maps showing how metals melt and mix.
  • Stress-Strain Curves: Graphs showing how much you can stretch a rubber band before it snaps.
  • Micrographs: Zoomed-in microscope photos of crystal structures.

The researchers created a special test called MaterialFigBench. They took 137 college-level problems from textbooks.

  • The Catch: They took the original textbook pictures, changed the names of the metals (e.g., changing "Copper" to "Mystery Metal X"), and tweaked the lines on the graphs.
  • The Goal: To force the AI to actually look at the picture and do the math, rather than just recalling a fact it memorized.
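The perturbation idea above can be sketched in a few lines of code. This is an illustrative mock-up, not the authors' actual pipeline: the function name, the rename map, and the 10% rescaling range are all assumptions.

```python
# Hypothetical sketch of benchmark perturbation (NOT the paper's real code):
# rename the material in the problem text and rescale the plotted values,
# so a memorized textbook answer no longer matches the figure.
import random

def perturb_problem(text: str, curve: list[float], rename: dict[str, str],
                    scale_range=(0.9, 1.1), seed=0) -> tuple[str, list[float]]:
    """Anonymize material names and tweak the graph's data points."""
    rng = random.Random(seed)
    for original, alias in rename.items():
        text = text.replace(original, alias)   # e.g. "Copper" -> "Mystery Metal X"
    scale = rng.uniform(*scale_range)          # shift the curve by up to ~10%
    return text, [round(y * scale, 2) for y in curve]

question, curve = perturb_problem(
    "Find the melting point of Copper from the graph.",
    [1084.6, 1080.0, 1075.2],
    {"Copper": "Mystery Metal X"},
)
print(question)  # Find the melting point of Mystery Metal X from the graph.
```

The key design point is that the text and the figure are perturbed together, so the only way to answer correctly is to actually read the (new) figure.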

2. The Results: The "Cheat Sheet" Problem

When they ran the test on top AI models (like GPT-4o, o1, and GPT-5), they found something surprising.

The AI was cheating.

  • The Scenario: The test showed a phase diagram for "Mystery Metal X" and asked the AI to read a composition off the graph.
  • The AI's Reaction: Instead of looking at the graph, the AI thought, "Hmm, 'Mystery Metal X' looks like carbon in an iron alloy. I've seen this diagram a million times. The answer is 0.76%."
  • The Result: The AI gave the original textbook answer, not the answer for the tweaked graph. It ignored the picture and relied on its memory.

It's like a student taking a math test where the teacher changes the numbers in the problem. The student ignores the new numbers, remembers the answer to the old problem from last year, and writes that down. The answer looks confident, but it's wrong for the new numbers, and it proves the student didn't actually work the problem.

3. Where Did the AI Struggle?

When the AI couldn't cheat (because the problem was too unique or required precise measurement), it struggled badly.

  • The "Pixel Peeping" Problem: The AI is terrible at reading exact numbers off a graph. If a line is at 45.3 on the chart, the AI might guess 45 or 46. In science, that small difference matters.
  • The "Significant Digits" Issue: The AI often gave answers that were too vague (e.g., "58,000" instead of "57,915") or too precise (giving 10 decimal places when the graph only had 2). It's like a chef guessing "a pinch of salt" when the recipe requires exactly 3.2 grams.
  • The "Geometry" Gap: Tasks requiring the AI to measure distances on a map or count angles in a crystal structure were very hard. It's like asking a human to estimate the distance between two stars just by looking at a blurry photo.
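The "significant digits" mismatch is easy to make concrete. The helper below (our own illustration; the function name and the numbers are not from the paper) rounds a value to a chosen number of significant figures, showing how "58,000" and "57,915" are the same reading reported at different precision.

```python
# Illustration of the significant-digits issue: the same chart reading
# reported too vaguely (2 sig figs) vs. at the graph's resolution (5 sig figs).
from math import floor, log10

def round_sig(value: float, sig: int) -> float:
    """Round a value to `sig` significant figures."""
    if value == 0:
        return 0.0
    digits = sig - 1 - floor(log10(abs(value)))
    return round(value, digits)

reading = 57915.0
print(round_sig(reading, 2))  # 58000.0 -> too vague
print(round_sig(reading, 5))  # 57915.0 -> matches the graph's precision
```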

4. The Good News

The AI is getting better, but slowly.

  • Newer models (like GPT-5) did better than older ones (GPT-4).
  • They are getting slightly better at reading specific types of graphs (like Arrhenius plots, which look like straight lines on a log scale).
  • However, they still haven't mastered the "general skill" of looking at a new, weird graph and figuring it out from scratch.
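Why are Arrhenius plots "straight lines on a log scale"? The rate law is k = A·exp(-Ea/(R·T)), so plotting ln(k) against 1/T gives a line with slope -Ea/R, and two points read off the plot recover the activation energy Ea. The sketch below uses made-up numbers purely for illustration:

```python
# Recovering activation energy Ea from two points on an Arrhenius plot.
# ln(k) = ln(A) - (Ea/R) * (1/T), so the slope of ln(k) vs 1/T is -Ea/R.
from math import exp, log

R = 8.314  # gas constant, J/(mol*K)

def activation_energy(T1, k1, T2, k2):
    """Ea (J/mol) from two (temperature, rate constant) readings."""
    slope = (log(k2) - log(k1)) / (1 / T2 - 1 / T1)
    return -slope * R

# Synthesize two points from an assumed Ea = 80 kJ/mol, then recover it:
Ea_true, A = 80_000.0, 1e7
k = lambda T: A * exp(-Ea_true / (R * T))
Ea_est = activation_energy(300.0, k(300.0), 400.0, k(400.0))
print(round(Ea_est))  # ~80000
```

Reading this slope is exactly the kind of quantitative figure interpretation the benchmark tests: the arithmetic is trivial, but only if the two points are read off the axes accurately.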

5. The Big Lesson

The paper concludes that current AI is not truly "seeing" the world.

It's like a parrot that has memorized the entire dictionary. If you ask it a question about a picture, it might answer correctly by matching keywords, but it doesn't actually understand the visual relationship between the objects.

Why does this matter?
If we want AI to help scientists discover new materials (like better batteries or stronger bridges), the AI needs to be able to look at a new experiment's data and reason about it, not just guess based on old data.

MaterialFigBench is like a "honesty test" for AI. It strips away the cheat sheets and forces the AI to prove it can actually read the charts. Right now, the AI is passing the test by cheating, but this benchmark helps researchers figure out how to teach it to actually look.