Human-in-the-Loop LLM Grading for Handwritten Mathematics Assessments

This paper presents a scalable, human-in-the-loop workflow for grading handwritten mathematics assessments that leverages LLMs to reduce grading time by approximately 23% while maintaining fairness and accuracy comparable to manual grading.

Arne Vanhoyweghen, Vincent Holst, Melika Mobini, Lukas Van de Voorde, Tibo Vanleke, Bert Verbruggen, Brecht Verbeken, Andres Algaba, Sam Verboven, Marie-Anne Guerry, Filip Van Droogenbroeck, Vincent Ginis

Published 2026-03-16

Imagine you are a teacher with 100 students. Every week, you give them a short, handwritten math quiz. You want to give them feedback quickly so they can learn, but grading 200 pages of messy handwriting takes hours. It's slow, tiring, and by the time you hand the papers back, the students have already forgotten the lesson.

Now, imagine a new tool: a super-smart AI robot that can read handwriting, understand math, and grade papers in seconds. But here's the catch: AI can sometimes make silly mistakes, get confused by bad handwriting, or be too generous.

This paper is about a team of researchers who built a "Human-in-the-Loop" system. Think of it not as replacing the teacher with a robot, but as giving the teacher a super-powered co-pilot.

Here is how their system works, explained through simple analogies:

1. The Problem: The "Handwriting Mountain"

Teachers are drowning in a mountain of handwritten papers.

  • The Old Way: Teachers climb the mountain alone, grading every single paper. It takes forever.
  • The AI Risk: If you just let an AI grade everything, it might get tricked by a student who writes "5" when they meant "S," or it might hallucinate a correct answer where none exists.
  • The New Threat: Students are now using AI to do their homework at home, so teachers are forced to give in-class handwritten tests to see what students actually know. This creates more grading work, not less.

2. The Solution: The "Assembly Line" Workflow

The researchers built a factory line for grading that combines human brains with AI speed.

Step 1: The Blueprint (The Grading Key)
Before the AI sees a single paper, the teachers create a very strict "recipe" for grading.

  • Analogy: Imagine you are baking a cake. You don't just tell the robot, "Make it taste good." You give it a recipe: "If the cake is golden brown, give 2 points. If it has a crack in the middle, subtract 1 point."
  • The researchers found that if the instructions are vague, the AI gets confused. They had to write extremely detailed instructions so the AI knew exactly what to look for.
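The paper does not publish its exact rubric format, but the idea of a machine-readable "recipe" can be sketched as structured data. The question, criteria, and point values below are invented for illustration:

```python
# Hypothetical machine-readable grading key for one quiz question.
# Each criterion spells out exactly what to look for, leaving the
# grader (human or LLM) no room for vague judgment calls.
rubric = {
    "question": "Solve x^2 - 5x + 6 = 0.",
    "max_points": 3,
    "criteria": [
        {"check": "Factors correctly into (x - 2)(x - 3)", "points": 2},
        {"check": "States both roots, x = 2 and x = 3", "points": 1},
        {"check": "Sign error in the factoring step", "points": -1},
    ],
}

def max_score(rubric):
    """Upper bound on the score: sum of positive criteria, capped at max_points."""
    positive = sum(c["points"] for c in rubric["criteria"] if c["points"] > 0)
    return min(positive, rubric["max_points"])
```

The point of the structure is that every deduction is named in advance, so two graders (or five AI runs) applying the same key should land on the same score.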

Step 2: The Privacy Shield
Before the AI sees the paper, the system takes a photo of the student's answer, cuts out their name, and hides their ID.

  • Analogy: It's like sending a letter to a judge with the sender's name blacked out. The AI only sees the math, not who wrote it. This keeps things fair and private.

Step 3: The "Five Judges" Rule
The AI doesn't just grade the paper once. It grades the same paper five times.

  • Analogy: Imagine you flip a coin once and it lands on heads. Is that a fluke? Maybe. But if you flip it five times and get heads every time, you can be much more confident the result is not a fluke.
  • The AI acts like five different graders. If all five agree, great. If they disagree wildly, the system flags it as "suspicious."
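The paper's exact aggregation rule isn't reproduced here, but the "five judges" idea can be sketched as: grade the same answer five times, take the median as the provisional score, and flag the paper when the judges spread too far apart. The threshold value below is an assumption for illustration:

```python
from statistics import median

def aggregate_scores(scores, spread_threshold=1.0):
    """Combine several independent gradings of the same answer.

    Returns (provisional_score, flagged). The paper is flagged as
    "suspicious" when the judges disagree by more than the threshold.
    The median resists a single outlier judge; the threshold here is
    illustrative, not taken from the paper.
    """
    spread = max(scores) - min(scores)
    return median(scores), spread > spread_threshold

# All five judges agree: accept the score.
score, flagged = aggregate_scores([2, 2, 2, 2, 2])   # -> (2, False)

# Judges disagree wildly: flag for human review.
score, flagged = aggregate_scores([0, 2, 2, 2, 3])   # -> (2, True)
```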

Step 4: The Human Safety Net
This is the most important part. The AI gives a "provisional" grade, but a human teacher must look at it before it's final.

  • Analogy: Think of the AI as a very fast, very confident intern. The intern does 90% of the work in 10 minutes. The teacher (the boss) walks by, checks the intern's work, and says, "Yes, this looks right," or "Whoa, you missed a step here, let me fix it."
  • Every provisional grade still passes in front of the teacher, but the teacher spends real effort only on the tricky, flagged cases and can quickly confirm the easy ones. They don't have to start from scratch.
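The safety net amounts to a simple triage step: sort graded papers into a quick-confirm pile and a careful-review pile. A minimal sketch, assuming each paper carries a provisional score and a disagreement flag (the field names are invented):

```python
def triage(papers):
    """Split AI-graded papers into a quick-confirm pile and a careful-review pile.

    Every paper is still seen by the teacher; flagged papers (where the
    AI judges disagreed) go into the pile that gets a close look.
    """
    quick_confirm = [p for p in papers if not p["flagged"]]
    careful_review = [p for p in papers if p["flagged"]]
    return quick_confirm, careful_review

papers = [
    {"id": "A", "provisional": 3, "flagged": False},
    {"id": "B", "provisional": 1, "flagged": True},
]
quick, careful = triage(papers)  # -> paper A to confirm, paper B to review
```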

3. What Happened? (The Results)

The researchers tested this in real university math classes. Here is what they found:

  • Speed: Grading became 23% faster. It's like the teacher got a part-time assistant who did the heavy lifting.
  • Fairness: The AI's grades were actually more consistent than human graders were with one another. Humans get tired and might grade the 50th paper differently from the 1st; the AI stays the same.
  • Accuracy: The AI made mistakes, but they were rare. Because of the "Human Safety Net," those mistakes were caught before the students saw them.
  • The "Outlier" Problem: Sometimes the AI gets too excited and gives a perfect score to a messy answer. The system is designed to catch these "happy accidents" and flag them for the human to review.

4. The Big Takeaway

The paper argues that we shouldn't ask, "Can AI replace teachers?"
Instead, we should ask, "How can AI help teachers do their job better?"

The Final Metaphor:
Think of grading like driving a car.

  • Manual Grading: You are driving a manual car up a steep, rocky hill. You have to shift gears, steer, and brake yourself the whole way. You get tired.
  • Full AI Grading: You are in a self-driving car, but the roads are foggy and the AI might drive you off a cliff.
  • This Paper's System: You are driving a car with Cruise Control and Lane Assist. The AI handles the speed and keeps you in the lane (doing the boring, repetitive work). But you are still holding the steering wheel. You are ready to take over if the road gets weird or if the AI tries to drive into a tree.

Conclusion:
By using AI as a "co-pilot" rather than a replacement, teachers can give students faster, fairer, and more consistent feedback without burning out. The AI handles the volume; the human handles the judgment.
