Text to Automata Diagrams: Comparing TikZ Code Generation with Direct Image Synthesis

This study evaluates the effectiveness of vision-language and large language models in converting scanned student-drawn automata diagrams into TikZ code, finding that while direct image-to-text generation often yields errors, human-corrected descriptions significantly improve the accuracy of the resulting digital diagrams for educational applications like automated grading.

Ethan Young, Zichun Wang, Aiden Taylor, Chance Jewell, Julian Myers, Satya Sri Rajiteswari Nimmagadda, Anthony White, Aniruddha Maiti, Ananya Jana

Published 2026-03-10

Imagine you are a teacher grading a stack of homework. The students have drawn complex maps of "robot brains" (called automata diagrams) on paper. These drawings are messy, scribbled in pencil, and look different for every student. You want to turn these messy sketches into clean, perfect digital diagrams so you can grade them automatically or show them on a screen.

This paper is about a team of researchers trying to build a robot translator to do this job. They wanted to see if AI could look at a messy student drawing, describe it in words, and then turn those words back into a perfect digital drawing.

Here is how they did it, explained with some everyday analogies:

1. The Goal: The "Rosetta Stone" for Robot Brains

The researchers wanted to create a pipeline (a step-by-step process) that works like this:

  1. Input: A messy, hand-drawn sketch of a robot brain.
  2. Step A: An AI "Eye" looks at the sketch and writes a description in plain English.
  3. Step B: A human teacher reads that description and fixes any mistakes (like a proofreader).
  4. Step C: A second AI "Hand" takes the description and writes code (called TikZ) that draws a perfect, clean version of the diagram.
  5. Output: A digital diagram that looks just like what the student intended to draw, but without the messy pencil marks.
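The five steps above can be sketched as a tiny program. Everything here is a hypothetical placeholder: the function names, the stubbed descriptions, and the generated TikZ are illustrations of the pipeline's shape, not the authors' actual implementation (a real system would call a vision-language model in `describe_image` and a large language model in `generate_tikz`).

```python
# Sketch of the paper's pipeline. All names and behaviors are
# hypothetical stand-ins, not the authors' real code.

def describe_image(image, exam_question=None):
    """Step A, the AI "Eye": describe the sketch in plain English.
    Stubbed with a canned description; a real system would call a
    vision-language model here."""
    desc = "two states; arrow from q0 to q1 labeled 'a'; q0 is the start state"
    if exam_question:
        desc += f" (described with exam context: {exam_question!r})"
    return desc

def human_correct(description):
    """Step B, the human proofreader, stubbed as a trivial fix."""
    return description.replace("arrow", "directed arrow")

def generate_tikz(description):
    """Step C, the AI "Hand": turn the description into TikZ code
    (stubbed with a fixed two-state diagram)."""
    return ("% generated from: " + description + "\n"
            "\\begin{tikzpicture}[->, auto]\n"
            "  \\node[state, initial] (q0) {$q_0$};\n"
            "  \\node[state, right=of q0] (q1) {$q_1$};\n"
            "  \\path (q0) edge node {a} (q1);\n"
            "\\end{tikzpicture}")

def pipeline(image, exam_question=None, human_in_loop=True):
    """Run the full sketch-to-TikZ pipeline.
    human_in_loop=True is Path B below; False is the raw Path A."""
    desc = describe_image(image, exam_question)
    if human_in_loop:
        desc = human_correct(desc)
    return generate_tikz(desc)

tikz = pipeline("scan_042.png", exam_question="accept an even number of 1s")
```

The only structural difference between the two experimental paths in the next section is whether `human_correct` runs.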

2. The Experiment: The "Blind Taste Test"

The researchers tested two different paths to see which one worked better:

  • Path A (The Raw AI): The AI looks at the drawing, writes a description, and immediately passes it to the "Hand" AI to draw the code.
  • Path B (The Human Editor): The AI looks at the drawing, writes a description, a human reads it and fixes the errors (like "Wait, that arrow is pointing the wrong way!"), and then passes the corrected description to the "Hand" AI.

They also tested two different ways of asking the AI to describe the drawing:

  • The "Blind" Prompt: Just showing the picture.
  • The "Context" Prompt: Showing the picture plus the original exam question (e.g., "Draw a machine that counts even numbers").
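To make the two prompt styles concrete, here is one hypothetical way they could differ; the paper's exact prompt wording is not reproduced here, so treat the strings below purely as an illustration of "blind" versus "with context."

```python
# Hypothetical illustration of the two prompt styles; the authors'
# actual prompt text is not shown in this summary.

def build_prompt(with_context, exam_question=None):
    base = ("Describe this hand-drawn automaton: list every state, "
            "mark the start and accepting states, and give each "
            "transition as 'from --label--> to'.")
    if with_context and exam_question:
        # "Context" prompt: prepend the original exam question
        return f"Exam question: {exam_question}\n{base}"
    return base  # "Blind" prompt: the image alone

blind = build_prompt(with_context=False)
context = build_prompt(with_context=True,
                       exam_question="Draw a machine that counts even numbers")
```

The extra line of context gives the model a hypothesis about what the messy strokes are supposed to mean, which is why (as the results below show) it reduces description errors.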

3. The Big Problems: Where the AI Got Lost

The researchers found that the AI "Eye" is good at seeing shapes, but it often misses the logic.

  • The Analogy: Imagine the AI is a tourist taking a photo of a city map. It sees the lines and the dots, but it doesn't understand that the red line is a highway and the blue line is a river. It might say, "There is a line here," when it should say, "There is a highway connecting these two cities."
  • The Result: When the AI wrote the description without human help, it often missed arrows, got the directions wrong, or forgot which "city" was the starting point.

4. The Two Ways to Draw: "Painting" vs. "Blueprints"

The researchers tried two methods to turn the text back into a picture:

  1. Direct Image Synthesis: Asking the AI to just "paint" a picture based on the text. This is like asking an artist to draw a map from a story. It's fast, but the artist might get the details wrong.
  2. TikZ Code Generation: Asking the AI to write code (a set of strict instructions) that a computer uses to draw the map. This is like giving a robot a blueprint. If the blueprint is right, the robot builds it perfectly.

The Surprise: The "Blueprint" method (TikZ code) worked much better than the "Painting" method. Even if the text description had small errors, the code-based approach was more consistent and accurate.
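To show what the "blueprint" output looks like, here is a minimal TikZ diagram of a simple automaton, using the standard `automata` and `positioning` TikZ libraries. This is a generic example of the target format, not a diagram from the paper itself.

```latex
\documentclass{standalone}
\usepackage{tikz}
\usetikzlibrary{automata, positioning}
\begin{document}
% DFA accepting binary strings with an even number of 1s
\begin{tikzpicture}[shorten >=1pt, node distance=2.5cm, auto]
  \node[state, initial, accepting] (q0) {$q_0$};
  \node[state, right=of q0]        (q1) {$q_1$};
  \path[->]
    (q0) edge[bend left]  node {1} (q1)   % reading a 1 flips parity
    (q1) edge[bend left]  node {1} (q0)
    (q0) edge[loop above] node {0} (q0)   % reading a 0 keeps parity
    (q1) edge[loop above] node {0} (q1);
\end{tikzpicture}
\end{document}
```

Because every state, arrow, and label is an explicit instruction, small wording slips in the text description tend to produce locally wrong but still well-formed diagrams, whereas a "painted" image can drift everywhere at once.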

5. The Verdict: Humans Are Still the Boss

Here is what they discovered:

  • AI alone is messy: When the AI tried to describe the drawing on its own, it made too many mistakes to be useful for grading.
  • Human editing is magic: When a human took 5 minutes to fix the AI's description, the final result was almost perfect.
  • Context helps: If you tell the AI what the student was supposed to draw (the exam question), it makes fewer mistakes.
  • Code is king: Turning text into code (TikZ) is a more reliable way to recreate diagrams than asking the AI to just "draw" the image directly.

Why Does This Matter?

Think of this as building a super-automated teaching assistant.

  • For Teachers: Instead of squinting at 50 messy pencil drawings at 2 AM, the system could turn them into clean digital maps, highlight where the student made a mistake (like a missing arrow), and give a grade.
  • For Students: It could give instant feedback: "Hey, you drew the start circle here, but the rules say it should be there."

In short: The AI is a great assistant, but it's not ready to work alone yet. It needs a human to double-check its notes, and it works best when it's building a blueprint (code) rather than trying to paint a picture.