Human-in-the-Loop LLM Grading for Handwritten Mathematics Assessments

This paper presents a scalable, human-in-the-loop workflow for grading handwritten mathematics assessments that leverages LLMs to reduce grading time by approximately 23% while maintaining fairness and accuracy comparable to manual grading.

Arne Vanhoyweghen, Vincent Holst, Melika Mobini, Lukas Van de Voorde, Tibo Vanleke, Bert Verbruggen, Brecht Verbeken, Andres Algaba, Sam Verboven, Marie-Anne Guerry, Filip Van Droogenbroeck, Vincent Ginis

Published 2026-03-16

Imagine you are a teacher with 100 students. Every week, you give them a short, handwritten math quiz. You want to give them feedback quickly so they can learn, but grading 200 pages of messy handwriting takes hours. It's slow, tiring, and by the time you hand the papers back, the students have already forgotten the lesson.

Now, imagine a new tool: a super-smart AI robot that can read handwriting, understand math, and grade papers in seconds. But here's the catch: AI can sometimes make silly mistakes, get confused by bad handwriting, or be too generous.

This paper is about a team of researchers who built a "Human-in-the-Loop" system. Think of it not as replacing the teacher with a robot, but as giving the teacher a super-powered co-pilot.

Here is how their system works, explained through simple analogies:

1. The Problem: The "Handwriting Mountain"

Teachers are drowning in a mountain of handwritten papers.

  • The Old Way: Teachers climb the mountain alone, grading every single paper. It takes forever.
  • The AI Risk: If you just let an AI grade everything, it might get tricked by a student who writes "5" when they meant "S," or it might hallucinate a correct answer where none exists.
  • The New Threat: Students are now using AI to do their homework at home, so teachers are forced to give in-class handwritten tests to see what students actually know. This creates more grading work, not less.

2. The Solution: The "Assembly Line" Workflow

The researchers built a factory line for grading that combines human brains with AI speed.

Step 1: The Blueprint (The Grading Key)
Before the AI sees a single paper, the teachers create a very strict "recipe" for grading.

  • Analogy: Imagine you are baking a cake. You don't just tell the robot, "Make it taste good." You give it a recipe: "If the cake is golden brown, give 2 points. If it has a crack in the middle, subtract 1 point."
  • The researchers found that if the instructions are vague, the AI gets confused. They had to write extremely detailed instructions so the AI knew exactly what to look for.
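The paper does not publish its exact rubric format, but the idea of a machine-readable "recipe" can be sketched as structured data. The question, criteria, and point values below are invented for illustration:

```python
# Hypothetical machine-readable grading key for one quiz question.
# Each criterion spells out exactly what to look for, leaving the
# grader (human or LLM) no room for vague judgment calls.
rubric = {
    "question": "Solve x^2 - 5x + 6 = 0.",
    "max_points": 3,
    "criteria": [
        {"check": "Factors correctly into (x - 2)(x - 3)", "points": 2},
        {"check": "States both roots, x = 2 and x = 3", "points": 1},
        {"check": "Sign error in the factoring step", "points": -1},
    ],
}

def max_score(rubric):
    """Upper bound on the score: sum of positive criteria, capped at max_points."""
    positive = sum(c["points"] for c in rubric["criteria"] if c["points"] > 0)
    return min(positive, rubric["max_points"])
```

The point of the structure is that every deduction is named in advance, so two graders (or five AI runs) applying the same key should land on the same score.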

Step 2: The Privacy Shield
Before the AI sees the paper, the system takes a photo of the student's answer, cuts out their name, and hides their ID.

  • Analogy: It's like sending a letter to a judge with the sender's name blacked out. The AI only sees the math, not who wrote it. This keeps things fair and private.

Step 3: The "Five Judges" Rule
The AI doesn't just grade the paper once. It grades the same paper five times.

  • Analogy: Imagine you flip a coin once and it lands on heads. Is that a fluke? Maybe. But if you flip it five times and get heads every time, you can be much more confident the result is not a fluke.
  • The AI acts like five different graders. If all five agree, great. If they disagree wildly, the system flags it as "suspicious."
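The paper's exact aggregation rule isn't reproduced here, but the "five judges" idea can be sketched as: grade the same answer five times, take the median as the provisional score, and flag the paper when the judges spread too far apart. The threshold value below is an assumption for illustration:

```python
from statistics import median

def aggregate_scores(scores, spread_threshold=1.0):
    """Combine several independent gradings of the same answer.

    Returns (provisional_score, flagged). The paper is flagged as
    "suspicious" when the judges disagree by more than the threshold.
    The median resists a single outlier judge; the threshold here is
    illustrative, not taken from the paper.
    """
    spread = max(scores) - min(scores)
    return median(scores), spread > spread_threshold

# All five judges agree: accept the score.
score, flagged = aggregate_scores([2, 2, 2, 2, 2])   # -> (2, False)

# Judges disagree wildly: flag for human review.
score, flagged = aggregate_scores([0, 2, 2, 2, 3])   # -> (2, True)
```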

Step 4: The Human Safety Net
This is the most important part. The AI gives a "provisional" grade, but a human teacher must look at it before it's final.

  • Analogy: Think of the AI as a very fast, very confident intern. The intern does 90% of the work in 10 minutes. The teacher (the boss) walks by, checks the intern's work, and says, "Yes, this looks right," or "Whoa, you missed a step here, let me fix it."
  • Every provisional grade still passes in front of the teacher, but the teacher spends real effort only on the tricky, flagged cases and can quickly confirm the easy ones. They don't have to start from scratch.
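The safety net amounts to a simple triage step: sort graded papers into a quick-confirm pile and a careful-review pile. A minimal sketch, assuming each paper carries a provisional score and a disagreement flag (the field names are invented):

```python
def triage(papers):
    """Split AI-graded papers into a quick-confirm pile and a careful-review pile.

    Every paper is still seen by the teacher; flagged papers (where the
    AI judges disagreed) go into the pile that gets a close look.
    """
    quick_confirm = [p for p in papers if not p["flagged"]]
    careful_review = [p for p in papers if p["flagged"]]
    return quick_confirm, careful_review

papers = [
    {"id": "A", "provisional": 3, "flagged": False},
    {"id": "B", "provisional": 1, "flagged": True},
]
quick, careful = triage(papers)  # -> paper A to confirm, paper B to review
```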

3. What Happened? (The Results)

The researchers tested this in real university math classes. Here is what they found:

  • Speed: Grading became 23% faster. It's like the teacher got a part-time assistant who did the heavy lifting.
  • Fairness: The AI's grades were actually more consistent than human graders were with one another. Humans get tired and might grade the 50th paper differently from the 1st; the AI stays the same.
  • Accuracy: The AI made mistakes, but they were rare. Because of the "Human Safety Net," those mistakes were caught before the students saw them.
  • The "Outlier" Problem: Sometimes the AI gets too excited and gives a perfect score to a messy answer. The system is designed to catch these "happy accidents" and flag them for the human to review.

4. The Big Takeaway

The paper argues that we shouldn't ask, "Can AI replace teachers?"
Instead, we should ask, "How can AI help teachers do their job better?"

The Final Metaphor:
Think of grading like driving a car.

  • Manual Grading: You are driving a manual car up a steep, rocky hill. You have to shift gears, steer, and brake yourself the whole way. You get tired.
  • Full AI Grading: You are in a self-driving car, but the roads are foggy and the AI might drive you off a cliff.
  • This Paper's System: You are driving a car with Cruise Control and Lane Assist. The AI handles the speed and keeps you in the lane (doing the boring, repetitive work). But you are still holding the steering wheel. You are ready to take over if the road gets weird or if the AI tries to drive into a tree.

Conclusion:
By using AI as a "co-pilot" rather than a replacement, teachers can give students faster, fairer, and more consistent feedback without burning out. The AI handles the volume; the human handles the judgment.
