Imagine you are a scientist trying to explain a complex idea, like how a virus spreads or how a new engine works. You have the words, but you need a perfect diagram to go with them. In the academic world, the "gold standard" for drawing these diagrams is a programming language called TikZ. It is the "LaTeX" of drawing: incredibly precise, but notoriously difficult to learn, like trying to paint a masterpiece by speaking only in code.
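To see why TikZ is both precise and intimidating, here is a minimal, illustrative example (not taken from the paper) that draws a red "gear" driving a blue "wheel"; any standard LaTeX distribution with TikZ will compile it:

```latex
\documentclass[tikz]{standalone}
\begin{document}
\begin{tikzpicture}
  % A red "gear" node on the left
  \node[draw=red, circle, thick, minimum size=1cm] (gear) at (0,0) {gear};
  % A blue "wheel" node on the right
  \node[draw=blue, circle, thick, minimum size=1cm] (wheel) at (3,0) {wheel};
  % An arrow showing the gear driving the wheel
  \draw[->, thick] (gear) -- (wheel) node[midway, above] {drives};
\end{tikzpicture}
\end{document}
```

Every coordinate, color, and arrowhead must be spelled out in code like this, which is exactly why vague AI-generated code so often fails to compile or draws the wrong thing.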
Previously, if you asked a smart AI (a Large Language Model) to draw these diagrams for you, it usually failed. It would either write code that didn't work, draw the wrong shapes, or get lost in a loop of nonsense.
Enter TikZilla, a new project by researchers at the University of Technology Nuremberg. They built a system that can turn your text descriptions into perfect scientific diagrams, and they did it by solving three major problems.
Here is how they did it, explained with some everyday analogies:
1. The Problem: The "Bad Teacher" and the "Tiny Library"
Imagine trying to learn a new language (TikZ) by reading a library that only has 50 books, and half of them are written in gibberish. That was the state of AI training data for scientific diagrams.
- The Old Data: Previous datasets were small and full of errors. The "captions" (descriptions) were often vague, like saying "a picture of a machine" instead of "a red gear turning a blue wheel."
- The Result: The AI was like a student trying to learn from a bad teacher. It would guess, hallucinate, and produce broken code.
2. The Solution: Building a "Super-Library" (DaTikZ-V4)
The researchers decided to build a massive, high-quality library.
- Scavenger Hunt: They went out and collected over 2 million examples of TikZ code from scientific papers (arXiv), coding projects (GitHub), and forums.
- The "Fix-It" Crew: A lot of this code was broken (it wouldn't compile). Instead of throwing it away, they used a powerful AI to act as a mechanic. They fed the broken code and the error messages to the AI, which "repaired" the code so it worked.
- The "Translator" Crew: They realized the original descriptions were too simple. So, they used advanced Vision AI (VLMs) to look at the diagrams and write rich, detailed descriptions. Instead of "a graph," the AI now wrote, "A blue line starts at zero, curves up to the right, and peaks at 50."
- The Result: They created DaTikZ-V4, a dataset four times larger than anything before, filled with clean code and perfect descriptions.
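The "fix-it crew" idea, compile, catch the error, and hand both the broken code and the error message back to a model, can be sketched as a simple loop. This is a toy illustration, not the authors' pipeline: `toy_compile` and `toy_repair` are stand-ins for a real LaTeX compiler and a real LLM, here reduced to checking and fixing unmatched `\begin{...}`/`\end{...}` pairs.

```python
import re

def toy_compile(code):
    """Stand-in for a real LaTeX compiler: succeeds only if every
    \\begin{...} has a matching \\end{...}. Returns (ok, error_message)."""
    begins = re.findall(r"\\begin\{(\w+)\}", code)
    ends = re.findall(r"\\end\{(\w+)\}", code)
    for env in begins:
        if begins.count(env) != ends.count(env):
            return False, f"missing \\end{{{env}}}"
    return True, ""

def toy_repair(code, error):
    """Stand-in for the LLM 'mechanic': reads the error message and
    appends the missing \\end{...}."""
    env = error.split("\\end{")[1].rstrip("}")
    return code + f"\n\\end{{{env}}}"

def repair_loop(code, max_attempts=3):
    """The fix-it crew's loop: compile, and on failure feed the code
    plus its error back to the repair step, up to max_attempts times."""
    for _ in range(max_attempts):
        ok, err = toy_compile(code)
        if ok:
            return code
        code = toy_repair(code, err)
    return None  # give up: this sample stays out of the dataset

broken = "\\begin{tikzpicture}\n\\draw (0,0) -- (1,1);"
fixed = repair_loop(broken)
```

The key design choice mirrored here is that the error message itself is part of the repair input, so nothing usable gets thrown away just because it failed on the first try.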
3. The Training: From "Textbook Learning" to "Real-World Practice"
They trained their new models, called TikZilla (based on the Qwen family of AI), in two distinct stages:
Stage 1: The Classroom (Supervised Fine-Tuning)
They taught the AI the rules of the language. This is like a student memorizing grammar and vocabulary from a textbook. The AI learned how to write TikZ code that actually compiles.
- Analogy: Learning the rules of chess so you don't move the rook diagonally.
Stage 2: The Coach (Reinforcement Learning)
This is the secret sauce. Just knowing the rules isn't enough; you need to know whether your move is good.
- They built a special Reward Model (a "Coach"). This coach doesn't just read the code; it looks at the picture the code produces.
- If the AI writes code that creates a picture looking like the original, the Coach gives a high score. If the picture is wrong (e.g., the arrow points the wrong way), the Coach gives a low score.
- The AI tries again and again, learning from the Coach's feedback until it gets it right.
- Analogy: A student learning to draw. First, they learn how to hold a pencil (Stage 1). Then, they draw a picture, and a teacher compares it to the original photo and says, "Your nose is too big, try again." The student keeps trying until the drawing matches the photo perfectly.
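The "Coach" idea, score the picture, not the code, can be sketched as follows. This is a deliberately tiny stand-in, not the paper's reward model: images are represented as small 0/1 pixel grids, the reward is just the fraction of matching pixels, and the "try again" step is reduced to best-of-n rejection sampling rather than a full reinforcement-learning update.

```python
def image_reward(candidate, reference):
    """Toy 'coach': compare the rendered picture to the reference
    pixel by pixel and return the fraction that match (0.0 to 1.0)."""
    matches = sum(
        1 for row_c, row_r in zip(candidate, reference)
        for c, r in zip(row_c, row_r) if c == r
    )
    total = len(reference) * len(reference[0])
    return matches / total

def best_of_n(attempts, reference):
    """One RL-flavored step: score several attempted drawings and keep
    the one the coach rates highest."""
    return max(attempts, key=lambda img: image_reward(img, reference))

# Hypothetical example: three attempted 2x2 "drawings" vs. the original.
reference = [[1, 0], [0, 1]]
attempts = [
    [[0, 0], [0, 0]],  # blank page: reward 0.5
    [[1, 0], [0, 0]],  # close, one pixel off: reward 0.75
    [[1, 0], [0, 1]],  # matches the reference: reward 1.0
]
best = best_of_n(attempts, reference)
```

The point the analogy makes is visible in the scoring: the coach never inspects the TikZ source at all, only how close the resulting picture is to the original, so "arrow points the wrong way" costs reward even when the code compiles cleanly.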
4. The Results: The Little Engine That Could
The results are impressive.
- Small but Mighty: TikZilla comes in small sizes (3 billion and 8 billion parameters). For context, the "giants" like GPT-4o or GPT-5 are much larger. Yet, TikZilla outperformed GPT-4o and matched the performance of the massive GPT-5.
- Reliability: While other AIs often produce code that crashes (doesn't compile), TikZilla produces working code 95-98% of the time.
- Human Approval: When human experts rated the diagrams, TikZilla scored higher than the big commercial models, producing images that were ready for publication.
Why This Matters
This isn't just about drawing pretty pictures. It's about democratizing science.
- For Scientists: You can now describe your experiment in plain English, and a small, open-source AI will generate the professional-grade diagram for your paper. No more spending hours learning complex coding languages.
- For the Future: It proves that you don't need a massive, expensive "super-computer" AI to do complex tasks. With the right data and the right training method (the "Coach"), small, open-source models can beat the giants.
In short, TikZilla is like giving every scientist a personal, expert illustrator who speaks their language, learned from millions of examples, and is constantly coached to get better.