Imagine you have hired a team of 25 brilliant, super-fast interns (these are the Large Language Models, or LLMs) to help you with materials science. Your goal is to see if they can act as reliable scientists: predicting how strong a material is, what color it might be, or finding facts about it in a giant library.
The researchers at MIT put these interns through a rigorous test to answer a simple question: "Can we trust these AI interns to do real science?"
Here is what they found, explained through some everyday analogies.
1. The Two Types of Tasks: "Trivia" vs. "Math"
The researchers gave the interns two very different kinds of homework:
- The Trivia Tasks (Symbolic): "What crystal system does this material belong to?" or "Fill in the blank: Titanium Dioxide is known for being ______."
- The Math Tasks (Numerical): "Predict the exact energy gap of this material" or "Calculate the dielectric constant."
The Big Discovery: The AI behaves completely differently depending on which type of homework it's doing.
For Trivia (Symbolic Tasks): The "Confused Student"
- Before Training: The raw AI models were like students who hadn't studied the textbook. They guessed wildly. If you asked them the same question 10 times, they gave 10 different, random answers. They were unreliable and inconsistent.
- After Training (Fine-tuning): When the researchers gave them specific study materials (fine-tuning), the students suddenly "got it." They stopped guessing randomly and started giving the same, correct answer every time.
- The Lesson: For facts and categories, training works wonders. It turns a confused guesser into a reliable fact-checker.
For Math (Numerical Tasks): The "Overconfident Liar"
- Before Training: This is where it gets scary. The raw AI models didn't guess randomly; they guessed precisely. If you asked them for a number, they gave a very specific one (like "4.23 eV"), and it was the exact same wrong number every single time you asked (a check you can reproduce yourself; see the sketch after this list).
- The Problem: They were confidently wrong. They sounded like experts, but they were hallucinating.
- After Training: Training helped them get closer to the right number, but they still sometimes gave the same wrong answer with high confidence.
- The Lesson: You cannot trust an AI just because it sounds sure of itself. In math tasks, a "confident" answer might still be a lie.
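The repeat-the-question check is easy to reproduce in spirit: ask the model the same thing many times and count how many different answers come back. Here is a minimal Python sketch; `ask_model` is a hypothetical stand-in (a random stub here, so the script runs end to end) for whatever LLM client you actually use, and the prompt is illustrative rather than taken from the paper.

```python
import random
from collections import Counter

def ask_model(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM call; this stub just guesses at random
    # so the script runs on its own. Swap in your own API client here.
    return random.choice(["cubic", "tetragonal", "orthorhombic", "hexagonal"])

def consistency_check(prompt: str, n_trials: int = 10) -> Counter:
    """Ask the same question n_trials times and tally the distinct answers."""
    answers = [ask_model(prompt).strip().lower() for _ in range(n_trials)]
    return Counter(answers)

counts = consistency_check("What crystal system does rutile TiO2 belong to?")
print(f"{len(counts)} distinct answers:", dict(counts))
```

For a symbolic task, a well-trained model should collapse to a single repeated answer. For a numerical task, keep the lesson above in mind: one value repeated ten times can still be confidently wrong, so repetition alone is not evidence of correctness.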
2. The "Brain vs. Mouth" Bottleneck
The researchers did something clever. They didn't just listen to what the AI said (the text output); they looked inside the AI's "brain" (its internal representations, the hidden layers) while it was thinking.
- The Analogy: Imagine a student taking a test. They know the answer deep down (in their brain), but when they try to write it down on the paper (the text output), they mess up the handwriting or forget a decimal point.
- The Finding: For some properties (like "bandgap"), the AI's brain actually held the correct answer perfectly. But when the AI tried to speak it out loud, it degraded the information.
- The Solution: Instead of asking the AI to "write an essay" with the answer, the researchers found they could just "read the student's mind" (extract that internal data) and feed it into a simple, separate predictor, the equivalent of a pocket calculator, to get the answer. This was often more accurate than letting the AI speak. (A rough sketch of this idea follows this list.)
- The Catch: This only worked for some properties. For others (like "dielectric constant"), the AI's brain didn't even hold the right answer, so reading its mind didn't help.
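To make the "mind reading" idea concrete, here is a rough Python sketch of a linear probe: embed each material description with a model that exposes its hidden layers, then fit a simple regressor on those embeddings to predict a numeric property. The model name (`distilbert-base-uncased`), the descriptions, and the band-gap values are placeholders for illustration, not the paper's actual setup.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import Ridge

MODEL_NAME = "distilbert-base-uncased"  # placeholder: any model that exposes hidden states
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden layer: a fixed-size peek at the model's 'brain'."""
    with torch.no_grad():
        output = encoder(**tokenizer(text, return_tensors="pt"))
    return output.last_hidden_state.mean(dim=1).squeeze(0)

# Toy training data: (material description, band gap in eV). Values are rough placeholders.
train = [
    ("rutile TiO2, a wide-gap oxide", 3.0),
    ("silicon, a classic elemental semiconductor", 1.1),
    ("gallium arsenide, a III-V semiconductor", 1.4),
]
X = torch.stack([embed(text) for text, _ in train]).numpy()
y = [gap for _, gap in train]

# The "simple calculator" bolted onto the brain: a plain ridge regression.
probe = Ridge(alpha=1.0).fit(X, y)
query = embed("germanium, a narrow-gap semiconductor").numpy().reshape(1, -1)
print("probed band-gap estimate (eV):", probe.predict(query)[0])
```

The catch from the paper applies here too: a probe like this can only recover information the hidden layers actually contain, so it helps for properties like the band gap but not for ones the model never internalised.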
3. The "Silent Update" Problem
Finally, the researchers tracked the performance of the famous "GPT" models over 18 months.
- The Analogy: Imagine you hire a tutor for your child. You test them today, and they get an A. You test them again in six months, and they get a C. You didn't change the tutor; the tutoring company just quietly swapped the tutor for a different one, or changed the textbook, without telling you.
- The Finding: The performance of these AI models fluctuated wildly (by 9% to 43%) over time. One day, a model might be great at predicting material strength; the next day, after a silent software update by the company, it might be terrible.
- The Warning: If scientists use these AI tools for long-term research, their results might not be reproducible because the "tool" they used yesterday is not the same tool they are using today.
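One practical way to guard against this is simply to record, next to every prediction, exactly which model produced it and when. Below is a minimal sketch of such a log; the field names and the pinned model identifier are illustrative, not a prescribed schema.

```python
import datetime
import json

record = {
    # Pin a dated snapshot if your provider offers one, rather than a floating
    # alias that can be silently swapped out underneath you.
    "model": "gpt-4-0613",            # illustrative identifier, not a recommendation
    "temperature": 0.0,
    "prompt": "What is the band gap of rutile TiO2 in eV?",
    "answer": "3.0",
    "queried_on": datetime.date.today().isoformat(),
}

# Append-only log so every number in a paper can be traced back to the exact
# model version and settings that produced it.
with open("llm_predictions.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```

Re-running the same prompts against the same pinned snapshot later is the closest you can get to a reproducibility check when the underlying service keeps changing.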
Summary: What Should We Take Away?
- Trust but Verify: If an AI is doing a "fact" task, training it makes it reliable. If it's doing a "math" task, be very careful—even if it sounds confident, it might be wrong.
- Look Inside the Box: Sometimes, the best way to get an answer from an AI isn't to ask it to talk, but to peek at its internal data.
- The "Moving Target" Risk: Using AI for science is tricky because the AI changes under the hood. Scientists need to be careful to document exactly which version of the AI they used, or their results might not hold up later.
In short, these AI models are powerful tools, but they are not magic. They have specific strengths, specific weaknesses, and they can be surprisingly unpredictable. To use them safely in science, we need to understand how they think, not just what they say.