Do Large Language Models Understand Data Visualization Principles?

This paper presents the first systematic evaluation of large language and vision-language models as flexible validators of data visualization principles. It finds they can detect and fix design violations, but a persistent gap with symbolic solvers remains, along with a striking asymmetry: the models are better at correcting errors than at reliably detecting them.

Martin Sinnona, Valentin Bonas, Viviana Siless, Emmanuel Iarussi

Published 2026-02-24

Imagine you are a chef trying to teach a very smart, but inexperienced, sous-chef how to cook a perfect meal. You have a strict recipe book (the Data Visualization Principles) that says things like, "Never use red for a cold dish," or "Always list ingredients from smallest to largest."

For years, we've had a Robot Chef (the old Symbolic Systems) that checks the recipe. It's incredibly precise because it follows a rigid, mathematical rulebook. But here's the catch: to teach the robot, a human expert has to write out every single rule in a complex computer language. If you want to add a new rule, you have to hire a programmer to rewrite the robot's brain. It's accurate, but it's slow and inflexible.

Now, enter the New Sous-Chef: a Large Language Model (LLM). This is an AI that has read millions of cookbooks and knows what "good food" feels like. It doesn't need a rigid rulebook; it just needs you to tell it, "Hey, make sure this dish follows the rules."

This paper is the ultimate taste test. The researchers wanted to see: Can this new AI Sous-Chef actually understand the rules of good cooking, or is it just guessing?

The Experiment: The "Chart" Kitchen

The researchers set up two kitchens to test the AI:

  1. The Synthetic Kitchen (The Practice Range): They generated 2,000 fake charts (recipes) using a computer. They deliberately messed them up in specific ways (e.g., using the wrong color for a category, or cutting off the bottom of a graph). They knew exactly which rules were broken because a computer (the Robot Chef) had written the recipe.
  2. The Real Kitchen (The Restaurant): They took 300 real charts that humans had actually made and published online. They checked these against the rules too.
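The synthetic side can be pictured as a generator that starts from a clean chart spec and injects exactly one known violation, so the ground-truth label comes for free. Here is a minimal sketch in Python; the spec fields and the two rules are illustrative assumptions, not the paper's actual catalogue:

```python
import copy

# A tiny chart spec, loosely styled after Vega-Lite (illustrative only).
BASE_SPEC = {
    "mark": "bar",
    "encoding": {
        "x": {"field": "category", "type": "nominal"},
        "y": {"field": "value", "type": "quantitative",
              "scale": {"zero": True}},
    },
}

# rule id -> mutation that breaks the spec in exactly that way
VIOLATIONS = {
    "bar-zero-baseline":
        lambda s: s["encoding"]["y"]["scale"].update({"zero": False}),
    "ordered-color-for-nominal-field":
        lambda s: s["encoding"].update(
            {"color": {"field": "category", "type": "ordinal"}}),
}

def make_violated_spec(rule_id):
    """Copy the clean spec, inject one violation, return (spec, label)."""
    spec = copy.deepcopy(BASE_SPEC)
    VIOLATIONS[rule_id](spec)
    return spec, rule_id
```

Because the generator knows which mutation it applied, every practice chart comes pre-labeled with the rule it breaks, which is what makes grading the AI on 2,000 charts feasible.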

They asked the AI two main questions:

  • The Inspector: "Look at this recipe. Did I break any rules?"
  • The Fixer: "You broke a rule. Now, rewrite the recipe to fix it without breaking anything else."
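In code, the two roles are simply two different prompts over the same chart spec. A hedged sketch, with `call_llm` standing in for whatever chat-model client you use; the prompt wording is an assumption, not the paper's:

```python
import json

def build_inspector_prompt(spec: dict, rules: list[str]) -> str:
    """'Did I break any rules?' -- ask the model to act as a linter."""
    return (
        "You are a data-visualization linter. List every one of these "
        f"rules that the chart below violates: {', '.join(rules)}\n\n"
        f"Chart spec:\n{json.dumps(spec, indent=2)}"
    )

def build_fixer_prompt(spec: dict, broken_rule: str) -> str:
    """'Rewrite the recipe' -- ask for a corrected spec, nothing more."""
    return (
        f"The chart below violates the rule '{broken_rule}'. Return a "
        "corrected JSON spec that fixes it without changing anything "
        f"else.\n\nChart spec:\n{json.dumps(spec, indent=2)}"
    )

def call_llm(prompt: str) -> str:
    """Placeholder: swap in a real client (OpenAI, Gemini, a local model)."""
    raise NotImplementedError
```

Note the asymmetry baked into the prompts: the Inspector must name what is wrong, while the Fixer only has to produce something right.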

The Results: The Good, The Bad, and The Weird

Here is what they found, translated into everyday terms:

1. The AI is a decent Inspector, but not perfect.
When the AI looked at the fake charts, the best models (like Gemini-2.5-Flash) got about 68% of the rules right. That's like a student getting a B- on a test. They were good at spotting obvious mistakes (like a bar chart that looks like a line), but they struggled with subtle, tricky rules (like "don't use color to show order").

  • The Analogy: The AI can tell you the soup is too salty, but it might miss that the chef forgot to peel the carrots.

2. Showing the AI the Picture didn't help much.
The researchers gave some AI models both the text recipe and a picture of the finished dish. They hoped seeing the "ugly" chart would help the AI spot the error.

  • The Result: It helped a little, but far less than hoped. The models still leaned mostly on the text recipe rather than the visual "vibe" of the chart. It's like handing a taste-tester a photo of the dish: nice to have, but they still judge mostly from the written description.

3. The "Fixer" is surprisingly better than the "Inspector."
This was the most surprising twist! When the AI was asked to detect a mistake, it was okay. But when asked to fix the mistake, it got much better (up to 94% success rate).

  • The Analogy: Imagine a student who struggles to identify why a sentence is grammatically wrong. But if you say, "Rewrite this sentence to be correct," they suddenly write a perfect sentence. The AI is better at doing the right thing than explaining what was wrong.
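One way to make "fix it without breaking anything else" measurable is to re-run a symbolic checker on the model's output: the fix counts only if the target rule now passes and no previously satisfied rule regresses. A toy sketch; the two rule checkers here are invented for illustration:

```python
# Hypothetical symbolic checkers; each returns True when the spec complies.
RULES = {
    "bar-zero-baseline":
        lambda s: s["encoding"]["y"]["scale"].get("zero", False),
    "no-dual-axis":
        lambda s: "y2" not in s["encoding"],
}

def fix_succeeded(original: dict, fixed: dict, target_rule: str) -> bool:
    """A fix counts only if the target rule now passes and every rule
    that passed before still passes (no collateral damage)."""
    if not RULES[target_rule](fixed):
        return False
    return all(check(fixed) for check in RULES.values()
               if check(original))
```

Scoring this way lets the rigid Robot Chef grade the flexible Sous-Chef: the LLM proposes the repair, and the symbolic rulebook verifies it.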

4. Open Source vs. The Big Brands.
The "Big Brand" models (like GPT-4 and Gemini) were generally better than the "Open Source" models (the free, community-built ones). However, the best open-source model was catching up fast, showing that you don't always need the most expensive tool to get a good meal.

The Big Takeaway

The paper concludes that AI is a promising new tool for checking our data charts, but it's not ready to replace the expert human or the rigid robot just yet.

  • The Promise: AI can act as a flexible, conversational editor. You can say, "Make this chart follow the rules," and it will likely do a great job fixing it.
  • The Limit: It still gets confused by the subtle, nuanced rules of human perception. It's like a sous-chef who knows how to chop vegetables perfectly but doesn't quite understand why a certain garnish looks unappetizing.

In short: We are no longer just building robots that follow rules; we are teaching them to understand the spirit of the rules. They are getting there, but they still need a human chef to double-check the final dish.
