Can MLLMs Read Students' Minds? Unpacking Multimodal Error Analysis in Handwritten Math

This paper introduces ScratchMath, a benchmark dataset of 1,720 handwritten math samples from Chinese students, designed to evaluate and improve multimodal large language models' ability to diagnose and explain student errors. The results reveal a significant performance gap between current AI systems and human experts.

Dingjie Song, Tianlong Xu, Yi-Fan Zhang, Hang Li, Zhiling Yan, Xing Fan, Haoyang Li, Lichao Sun, Qingsong Wen

Published 2026-03-27

Imagine you are a teacher grading a stack of math homework. But instead of neat, typed answers, you have to look at messy, handwritten scribbles on scratch paper. You aren't just checking if the final number is right; you have to play detective to figure out why the student got it wrong. Did they misunderstand the question? Did they mess up the multiplication? Did they just forget to convert grams to kilograms?

This is exactly the challenge researchers tackled in a new paper called "Can MLLMs Read Students' Minds?" (or more formally, Unpacking Multimodal Error Analysis in Handwritten Math).

Here is the story of their work, broken down into simple concepts:

1. The Problem: The "Examinee" vs. The "Teacher"

Imagine a super-smart AI robot that is great at taking tests. It can look at a math problem and instantly give you the correct answer. This is what current AI models (called Multimodal Large Language Models, or MLLMs) are good at. They are like star students who always get an A.

But the researchers wanted to know: Can this star student become a teacher?
Can it look at a wrong answer, read the messy handwriting, and explain why the student made a mistake?

  • The Gap: Most AI is trained to solve the problem, not to diagnose the error. It's like asking a race car driver to fix a broken engine just because they know how to drive fast. They might not know why the engine failed.

2. The Solution: "ScratchMath" (The New Training Ground)

To teach these AI models how to be teachers, the researchers built a new "gym" called ScratchMath.

  • The Dataset: They collected 1,720 real examples of math problems from Chinese elementary and middle school students. These aren't clean textbook problems; they are photos of actual, messy scratch paper with crossed-out numbers, weird symbols, and confusing layouts.
  • The Two Tasks:
    1. The Detective (Explanation): The AI has to write a paragraph explaining exactly what went wrong (e.g., "The student forgot to divide by 1000 to get kilograms").
    2. The Classifier (Categorization): The AI has to pick a label for the mistake from a list (e.g., "Calculation Error" or "Misunderstood the Question").
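To make the two tasks concrete, here is a minimal sketch of what a ScratchMath-style record and its scoring might look like. The field names, category labels, and matching rule are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical sketch of a ScratchMath-style record.
# Field names and category labels are illustrative, not the paper's actual schema.
sample = {
    "image": "scratch_paper_0042.png",  # photo of the student's messy work
    "problem": "A brick weighs 2500 g. How many kg is that?",
    "student_answer": "2500 kg",
    "gold_explanation": "The student forgot to divide by 1000 "
                        "to convert grams to kilograms.",
    "gold_category": "Unit Conversion Error",
}

# Task 1 (explanation) asks the model for free-form text like gold_explanation.
# Task 2 (categorization) asks it to pick one label from a fixed list:
ERROR_CATEGORIES = [
    "Calculation Error",
    "Misunderstood the Question",
    "Unit Conversion Error",
]

def score_categorization(predicted: str, gold: str) -> bool:
    """Categorization reduces to label matching (case-insensitive here)."""
    return predicted.strip().lower() == gold.strip().lower()

print(score_categorization("unit conversion error", sample["gold_category"]))
```

Explanation quality is harder to score than a label match, which is one reason the two tasks are kept separate.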

3. The Experiment: Who Passed the Test?

The researchers put 16 different AI models through this "ScratchMath" test. They compared:

  • Open-Source Models: Like free, community-built tools (think of them as talented volunteers).
  • Proprietary Models: Like expensive, corporate super-computers (think of them as elite private detectives).

The Results:

  • The Gap: Even the best AI models scored much lower than human teachers. They struggled to read messy handwriting and follow the student's logic.
  • The Winners: The expensive, proprietary models (like o4-mini and Gemini) did the best, but they still made mistakes.
  • The Surprise: The "Reasoning" models (AI designed to think step-by-step) were surprisingly good at explaining the errors, even if they weren't perfect at classifying them.

4. Where Did the AI Get Stuck? (The "Hallucinations")

The researchers found that the AI failed in specific, funny, and frustrating ways:

  • The "Bad Handwriting" Problem: If a student wrote a "1" that looked like an "l" or a "7", the AI would get confused. It's like trying to read a doctor's prescription when the handwriting is terrible.
  • The "Magic Guess": Sometimes, the AI would just make up a reason for the error that wasn't true. This is called hallucination. It's like a detective accusing the suspect of stealing the cookie when the suspect actually just dropped it.
  • The "Unit Confusion": In one telling example from the dataset, a student calculated the weight of a brick in grams but forgot to convert it to kilograms. The AI often missed this subtle but critical step, just like a human might if they were rushing.
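The unit-confusion slip is easy to show with arithmetic. The brick dimensions and density below are made-up numbers (the paper's full problem text isn't given here); the point is only the missing divide-by-1000:

```python
# Illustrative numbers only -- the actual brick problem isn't reproduced here.
volume_cm3 = 20 * 10 * 5            # brick dimensions in cm -> 1000 cm^3
density_g_per_cm3 = 1.8             # assumed density
weight_g = volume_cm3 * density_g_per_cm3   # 1800.0 grams

student_answer_kg = weight_g        # the slip: grams reported as if kilograms
correct_answer_kg = weight_g / 1000 # the missing step -> 1.8 kg

print(weight_g, student_answer_kg, correct_answer_kg)
```

A diagnostic model has to notice that the computation itself is fine and that only the final conversion step was skipped.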

5. Why Does This Matter?

Imagine a future where every student has a personal AI tutor.

  • Today: The AI might say, "You got this wrong. The answer is 5."
  • Tomorrow (with this research): The AI could say, "You got this wrong because you multiplied the length and width but forgot to multiply by the height. Also, your handwriting made the '3' look like an '8', which threw off your calculation."

This paper is a crucial step toward that future. It proves that while AI is getting better at "reading minds" (understanding student errors), it still has a long way to go before it can replace a human teacher's intuition.

In a nutshell: The researchers built a test to see if AI can grade messy math homework and explain mistakes. The AI is getting there, but it still needs to learn how to read bad handwriting and think like a teacher, not just a calculator.
