Imagine you are a teacher grading a stack of hand-drawn physics diagrams. Some show forces acting on a box; others show electrical circuits. You need to give specific, helpful feedback to every student, but you have hundreds of papers and only a few hours.
This is the problem Sketch2Feedback tries to solve. It's a new computer system designed to look at student sketches, find mistakes, and write a helpful note back to the student.
Here is the simple breakdown of how it works, using some everyday analogies.
The Problem: The "Overconfident AI"
Currently, we have powerful AI models (like the ones that can chat with you) that can look at a picture and describe it. But these AIs have a bad habit: they hallucinate.
Think of an AI like a very confident but slightly distracted student. If you show it a drawing of a car, it might say, "I see a car, a tree, and a dog running behind it." But there is no dog! It just imagined the dog because it sounds like a normal thing to see. In a classroom, if an AI tells a student, "You forgot to draw the dog," the student gets confused and loses trust in the teacher (or the computer).
The Solution: The "Grammar-in-the-Loop" Factory
The authors built a new system called Sketch2Feedback. Instead of letting the AI guess what's in the picture, they built a factory line with four specific stations. The AI is only allowed to speak after the previous stations have proven a mistake actually exists.
Here are the four stations:
The Detective (Hybrid Perception):
First, a set of classic, rule-based computer tools scans the drawing. It doesn't "guess"; it measures. It looks for arrows, lines, and shapes, then reports, "I see a red arrow here," or "I see a battery symbol there."
- Analogy: This is like a metal detector at an airport. It beeps if it finds metal. It doesn't know what the metal is, just that something is there.
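The paper's actual perception tools aren't detailed here, but the spirit of "measure, don't guess" can be sketched with a toy classifier. Everything below (the function name, the stroke format, the 5-pixel threshold) is illustrative, not from the paper:

```python
# Hypothetical sketch of Stage 1: rule-based perception.
# A real system would run image processing on pixels; here each pen
# stroke is assumed to be pre-digitized as a list of (x, y) points.

def classify_stroke(points):
    """Label a stroke as a 'line' or a 'closed_shape' by measuring it."""
    x0, y0 = points[0]
    x1, y1 = points[-1]
    gap = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
    # A stroke whose endpoints nearly meet is treated as a closed shape.
    return "closed_shape" if gap < 5 else "line"

strokes = [
    [(0, 0), (10, 0), (10, 10), (0, 10), (1, 1)],  # roughly closed: a box
    [(5, 10), (5, 30)],                             # open: an arrow shaft
]
print([classify_stroke(s) for s in strokes])  # ['closed_shape', 'line']
```

The point is that the output is a measured fact about geometry, never an interpretation the model "imagined."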
The Architect (Symbolic Graph):
The system takes the list of things the Detective found and builds a map. It connects the dots: "The arrow is touching the box," or "The wire connects the battery to the lightbulb."
- Analogy: This is like a construction foreman drawing a blueprint based on what the workers found on site.
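"Connecting the dots" can be sketched as building an adjacency map from detected parts. The data structures and the distance test below are assumptions for illustration, not the paper's actual representation:

```python
# Hypothetical sketch of Stage 2: turning detected primitives into a graph.
# "Touching" is judged by a simple endpoint-distance threshold, a
# stand-in for whatever geometric tests the real system uses.

def touches(a, b, tol=3):
    """Two primitives touch if any pair of endpoints is within tol."""
    return any(
        abs(pa[0] - pb[0]) + abs(pa[1] - pb[1]) <= tol
        for pa in a["endpoints"] for pb in b["endpoints"]
    )

primitives = [
    {"id": "battery", "endpoints": [(0, 0), (0, 10)]},
    {"id": "wire1",   "endpoints": [(0, 10), (20, 10)]},
    {"id": "bulb",    "endpoints": [(20, 10), (20, 0)]},
]

# Build the "blueprint": which symbols connect to which.
graph = {p["id"]: [] for p in primitives}
for i, a in enumerate(primitives):
    for b in primitives[i + 1:]:
        if touches(a, b):
            graph[a["id"]].append(b["id"])
            graph[b["id"]].append(a["id"])

print(graph)
# {'battery': ['wire1'], 'wire1': ['battery', 'bulb'], 'bulb': ['wire1']}
```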
The Rulebook (Constraint Checking):
This is the most important part. The system compares the map against the "Answer Key" (the scenario). It asks strict questions: "Did the student draw a force pushing down? No. Is that a mistake? Yes."
- Crucial Rule: The system only flags errors that the Rulebook confirms. If the Rulebook doesn't see a mistake, the system stays silent.
- Analogy: This is like a strict editor who refuses to let the writer publish a sentence unless it's grammatically correct.
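The "stays silent" rule is the whole safety story, so it is worth seeing in miniature. The answer-key format and error codes below are invented for illustration:

```python
# Hypothetical sketch of Stage 3: checking the student's graph against
# an answer key. Only confirmed rule violations become errors.

answer_key = {
    "required_connections": [("battery", "bulb"), ("battery", "ground")],
}

student_graph = {
    "battery": ["bulb"],
    "bulb": ["battery"],
    # No ground symbol drawn at all.
}

def check(graph, key):
    errors = []
    for a, b in key["required_connections"]:
        if b not in graph.get(a, []) and a not in graph.get(b, []):
            errors.append(f"missing_connection:{a}-{b}")
    return errors  # an empty list when every rule is satisfied: silence

print(check(student_graph, answer_key))  # ['missing_connection:battery-ground']
```

If the drawing satisfies every rule, `check` returns an empty list, and nothing downstream has anything to say.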
The Translator (The AI):
Finally, the AI (a Visual Language Model) gets the list of verified mistakes. Its job is just to translate "Missing ground wire" into a friendly sentence: "Hey, you forgot to connect the ground wire. Try adding a line to the earth symbol."
- The Safety Net: Because the AI only gets the list of real mistakes, it cannot make up fake ones. It's like a translator who is only allowed to translate words that are actually on the page.
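The paper uses a Visual Language Model for this phrasing step; a plain template table (entirely made up here) demonstrates the same safety property, because the translator can only speak about errors it was handed:

```python
# Hypothetical sketch of Stage 4: verified errors in, friendly text out.
# A real system would prompt a VLM; templates make the constraint visible.

TEMPLATES = {
    "missing_connection": "Hey, it looks like your {a} isn't connected "
                          "to the {b}. Try adding a wire between them.",
}

def translate(verified_errors):
    messages = []
    for err in verified_errors:
        kind, _, detail = err.partition(":")
        a, _, b = detail.partition("-")
        # Unrecognized error kinds are skipped, never embellished.
        if kind in TEMPLATES:
            messages.append(TEMPLATES[kind].format(a=a, b=b))
    return messages

# An empty error list yields empty feedback: the translator cannot
# hallucinate a mistake the Rulebook never confirmed.
print(translate([]))  # []
print(translate(["missing_connection:battery-ground"]))
```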
The Results: A Tale of Two Subjects
The researchers tested this on two types of drawings: Free-Body Diagrams (physics forces) and Circuit Diagrams (wiring). The results were surprising and mixed:
- On Physics Diagrams (Forces): The "Old Way" (just letting a big AI look at the picture) actually did better. Why? Because physics forces are about spatial relationships and "vibes" that are hard to measure with strict rules. The big AI could "feel" the mistake better than the rule-based factory.
- On Circuit Diagrams (Wiring): The "Grammar Factory" crushed it. Circuits are logical. A wire is either connected or it isn't. The rule-based system was perfect at finding missing connections, and because it followed the rules, it gave perfectly actionable advice (5 out of 5 stars). The big AI, however, got confused and hallucinated a lot of fake errors.
The Big Win: Knowing Why You Failed
The most important discovery wasn't just about who won, but how they failed.
In the Circuit tests, the Grammar Factory made a lot of mistakes (false positives: it flagged errors that weren't there). But because the system is built in stages, the researchers could pinpoint exactly where it went wrong.
- They found the AI wasn't lying.
- They found the "Detective" (Stage 1) was seeing shadows and thinking they were wires.
- Because they knew the problem was in Stage 1, they could fix just that part without rebuilding the whole system.
In contrast, with the big "End-to-End" AI, if it makes a mistake, you have no idea if it was because it didn't see the picture, didn't understand the physics, or just got confused. It's a "black box."
The Bottom Line
Sketch2Feedback is a smart way to build AI for schools. It trades "guessing everything" for "being 100% sure about what it says."
- Pros: It never invents fake mistakes (once the perception part is fixed), and it's easy to debug if it does make a mistake.
- Cons: It relies on the "Detective" being good at seeing the drawing. If the drawing is messy, the system might miss things.
The authors conclude that there is no "one size fits all" AI yet. For some subjects, a big, smart AI is best. For others, a strict, rule-based factory is better. The future likely lies in combining them, using the strengths of both to help students learn.