RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation

🎨 The Big Idea: From "Drawing a Stick Figure" to "Architecting a Skyscraper"

Imagine you have a robot assistant that is really good at drawing. If you show it a picture of a simple stick figure, it can instantly write the code to draw that stick figure perfectly. This is what current AI models (Vision-Language Models) are great at today.

But what happens if you show the robot a picture of a complex, multi-story skyscraper with intricate blueprints, hundreds of windows, and specific data about who lives on each floor?

The Problem: The paper argues that while these AI robots are excellent at drawing stick figures (simple charts), they completely fall apart when asked to build skyscrapers (complex, real-world data visualizations). They get lost in the details, mix up the floors, or forget to add the windows.

🛠️ What Did the Researchers Do?

The team created a new, super-tough test called RealChart2Code. Think of it as a "Final Exam" for AI artists, but instead of asking them to draw a smiley face, they have to recreate a complex, multi-panel dashboard using real, messy data.

Here are the three main challenges they gave the AI:

1. The Copycat (Chart Replication):
   - The Task: "Here is a picture of a complex chart. Write the code to make it look exactly like this."
   - The Catch: The AI has to guess the data and the logic just by looking at the pixels. It's like trying to reverse-engineer a cake recipe just by looking at a photo of the finished cake.
2. The Chef (Chart Reproduction):
   - The Task: "Here is a picture of a chart AND the raw ingredients (the actual data files). Write the code to cook this exact dish."
   - The Catch: This is harder because the data is huge and messy (like a warehouse full of ingredients). The AI has to know how to chop, mix, and arrange thousands of data points to match the picture.
3. The Editor (Chart Refinement):
   - The Task: "Here is a chart with some mistakes (e.g., the colors are wrong, or the title is missing). Fix it based on my instructions."
   - The Catch: This simulates a real conversation. You tell the AI, "Make the bars blue," and it fixes that but accidentally breaks the legend. Then you say, "Fix the legend," and it breaks the colors again. The AI struggles to keep the whole picture in mind while making small changes.

📉 What Did They Find? (The "Reality Check")

The researchers tested 14 of the smartest AI models available (including big names like Claude, GPT, and Gemini). Here is what happened:

- The "Easy Mode" Trap: On simple tests (like drawing a single bar chart), these AIs scored near 100%. They looked like geniuses.
- The "Hard Mode" Crash: When they took the RealChart2Code test, their scores dropped sharply. One large open-source model went from roughly 85% on existing tests to 3.6 out of 10.
  - Analogy: It's like a student who gets an A+ on a spelling test but fails a math exam. The AI can memorize patterns, but it can't reason through complex structures.
- The "Rich vs. Open" Gap: The expensive, proprietary models (like Claude-Opus) did the best, but they still struggled. The free, open-source models did significantly worse, often failing to even get the code to run without errors.

🧩 Why Is This Hard? (The "Jenga Tower" Problem)

The paper explains that these AIs fail for two main reasons:

- They Lose the Big Picture: When building a complex chart with 10 different sub-charts, the AI focuses on one small part (like a single bar) and forgets how it fits into the whole grid. It's like trying to build a Jenga tower by only looking at one block at a time; eventually, the whole thing collapses.
- They Hallucinate: The AI often invents code that looks real but doesn't exist (like calling a library function that isn't real). It's like a chef saying, "I'll add some magic dust," when the recipe actually calls for salt.

🚀 Why Does This Matter?

This paper is a wake-up call. It tells us that while AI is amazing at simple tasks, it is not yet ready to replace human data scientists for complex work.

For the Future: We can't just train AI on more "easy" examples. We need to teach them how to handle messy, real-world data and how to plan complex layouts.
The Takeaway: If you ask an AI to draw a simple graph today, it will do a great job. But if you ask it to build a complex business dashboard from raw data, you still need a human expert to double-check the work.

In short: RealChart2Code is the benchmark that finally stopped the AI from bragging about its drawing skills and forced it to admit it still needs to learn how to be an architect.

1. Problem Statement

While Vision-Language Models (VLMs) have demonstrated strong capabilities in generating code for simple, single-panel charts, their ability to handle complex, multi-panel visualizations derived from authentic, large-scale real-world data remains largely unassessed.

Limitations of Existing Benchmarks: Current benchmarks (e.g., Plot2Code, ChartMimic) rely heavily on synthetic data or pre-existing chart-code pairs from the internet. They often feature simple, single-panel layouts and lack the complexity of real-world data science workflows.
The Gap: There is a critical need for a benchmark that evaluates:
1. Generation from raw, large-scale datasets (not just images).
2. Handling of intricate composite layouts (multi-subplots, nested grids).
3. Iterative refinement capabilities in a multi-turn conversational setting (debugging and modifying code based on user feedback).
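
To make "nested grids" concrete, here is a minimal Matplotlib sketch of the kind of composite layout described in item 2. It is an illustration only, not a chart from the benchmark: the panel arrangement, the function name `build_nested_layout`, and all plotted values are made up for this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec


def build_nested_layout():
    """Build a figure with a 2x2 outer grid whose bottom-right cell is subdivided."""
    fig = plt.figure(figsize=(10, 6))
    outer = GridSpec(2, 2, figure=fig, width_ratios=[2, 1], hspace=0.4, wspace=0.3)

    # Tall panel spanning both rows of the left column.
    ax_main = fig.add_subplot(outer[:, 0])
    ax_main.plot([0, 1, 2, 3], [1, 3, 2, 5])
    ax_main.set_title("Trend (placeholder data)")

    # Ordinary panel in the top-right cell.
    ax_top = fig.add_subplot(outer[0, 1])
    ax_top.bar(["a", "b", "c"], [3, 1, 2])
    ax_top.set_title("Ranking")

    # Nested grid: split the bottom-right cell into two side-by-side panels.
    inner = outer[1, 1].subgridspec(1, 2, wspace=0.5)
    ax_hist = fig.add_subplot(inner[0, 0])
    ax_hist.hist([1, 2, 2, 3, 3, 3, 4], bins=4)
    ax_scatter = fig.add_subplot(inner[0, 1])
    ax_scatter.scatter([1, 2, 3], [2, 1, 3])

    fig.suptitle("Composite layout sketch")
    return fig
```

Calling `build_nested_layout().savefig("layout.png")` writes the figure to disk. A model replicating a chart like this has to recover both the outer grid and the subdivision of a single cell, which is where the spatial layout criterion applies.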

2. Methodology: RealChart2Code Benchmark

The authors introduce RealChart2Code, a large-scale benchmark comprising 2,896 instances grounded in authentic datasets. The construction follows a rigorous four-stage pipeline:

A. Data Curation

Source: Datasets were collected and filtered from Kaggle, starting with over 8,000 datasets (30 billion rows) and narrowing down to 1,036 high-quality datasets (approx. 860 million rows).
Diversity: Covers 8 high-level domains (Finance, Health, Tech, etc.) and 35 sub-topics.
Chart Types: Includes 50 distinct chart types and 7 high-level visualization intents (Correlation, Deviation, Ranking, Distribution, Composition, Change, Groups). All visualizations are designed to be complex, featuring composite charts or multi-panel layouts.

B. Task Definition

The benchmark evaluates models on three distinct tasks (illustrated in Figure 2 of the paper):

Chart Replication (Image $\to$ Code): The model receives only a chart image and must reverse-engineer the visualization code.
Chart Reproduction (Image + Data $\to$ Code): The model receives the chart image, raw data files (CSV), and metadata. It must generate code that correctly processes the actual data to reproduce the chart.
Chart Refinement (Multi-turn Dialogue): The model is presented with a flawed chart/code pair and a natural language instruction describing the required change (e.g., "add a violin plot," "fix axis labels"). This tests iterative debugging and context retention.
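
One way to see how the three tasks differ is by which inputs each one hands to the model. The sketch below encodes that as a small data structure. The class and field names (`TaskInput`, `chart_image`, `flawed_code`, and so on) are hypothetical and chosen for illustration; they are not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Task(Enum):
    REPLICATION = "replication"    # image -> code
    REPRODUCTION = "reproduction"  # image + data -> code
    REFINEMENT = "refinement"      # flawed chart/code + instruction -> fixed code


@dataclass
class TaskInput:
    """Hypothetical container for what a model receives in one benchmark instance."""
    task: Task
    chart_image: Optional[str] = None                    # path to the chart image
    data_files: list[str] = field(default_factory=list)  # raw CSV paths
    metadata: Optional[str] = None                       # dataset description
    flawed_code: Optional[str] = None                    # code with injected errors
    instruction: Optional[str] = None                    # natural language fix request


# Inputs each task needs, following the task descriptions above.
REQUIRED = {
    Task.REPLICATION: ["chart_image"],
    Task.REPRODUCTION: ["chart_image", "data_files", "metadata"],
    Task.REFINEMENT: ["chart_image", "flawed_code", "instruction"],
}


def missing_fields(item: TaskInput) -> list[str]:
    """Return the names of required inputs that are empty for this task."""
    return [name for name in REQUIRED[item.task] if not getattr(item, name)]
```

For example, `missing_fields(TaskInput(Task.REPRODUCTION, chart_image="chart.png"))` reports that the data files and metadata are absent, which is exactly what separates Reproduction from Replication.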

C. Ground Truth & Quality Control

Human-Authored Code: All ground-truth code was manually implemented by a team of 5 expert Python developers using Matplotlib and Seaborn. No model-generated code was used as ground truth to avoid bias.
Error Injection: For refinement tasks, errors were manually injected into the ground-truth code (e.g., wrong chart types, data mapping errors, overlapping elements) to create realistic debugging scenarios.

D. Evaluation Metrics

The evaluation framework uses a multi-agent judging panel to score outputs on a 3-point scale (0, 1, 2) across eight key criteria:

Chart Type Consistency
Spatial Layout Consistency (Grid structure, positioning)
Text Element Consistency (Titles, labels)
Axis Configuration (Scales, ranges, ticks)
Color Scheme Consistency
Style and Format (Fonts, markers, grids)
Component Completeness
Data Pattern Consistency (Visualized trends)
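
The rubric above can be sketched as a small scoring helper. This summary does not say how the judging panel's scores are combined, so the example assumes a plain mean over judges per criterion, followed by a total normalized to the range 0 to 1. The criterion keys and the function name `aggregate` are illustrative.

```python
from statistics import mean

# The eight criteria listed above, as illustrative dictionary keys.
CRITERIA = (
    "chart_type", "spatial_layout", "text_elements", "axis_configuration",
    "color_scheme", "style_format", "component_completeness", "data_pattern",
)
MAX_PER_CRITERION = 2  # each criterion is scored 0, 1, or 2


def aggregate(judge_scores: list[dict[str, int]]) -> dict[str, float]:
    """Average each criterion over judges and add a normalized total.

    Assumption: judges are combined by an unweighted mean.
    """
    result: dict[str, float] = {}
    for name in CRITERIA:
        values = [scores[name] for scores in judge_scores]
        if any(v not in (0, 1, 2) for v in values):
            raise ValueError(f"scores for {name!r} must be 0, 1, or 2")
        result[name] = mean(values)
    total = sum(result[name] for name in CRITERIA)
    result["normalized_total"] = total / (MAX_PER_CRITERION * len(CRITERIA))
    return result
```

With one judge giving 2 on every criterion and another giving 1, each criterion averages 1.5 and `normalized_total` is 0.75.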

Special Metric for Reproduction: Data Alignment (Code-level verification) ensures the generated code processes the raw data identically to the reference, rather than just looking similar visually.
Execution Pass Rate: Code must execute successfully in a sandboxed Docker environment (Python 3.13, 120s timeout) to be scored.
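
The execution check can be approximated locally with a subprocess and a timeout. This is a simplified stand-in for the Docker sandbox described above, not the authors' harness: it provides no isolation, and the function names `passes_execution` and `pass_rate` are made up for the sketch. Only the 120-second default comes from the text.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_execution(code: str, timeout_s: float = 120.0) -> bool:
    """Run generated code in a fresh interpreter; pass means exit code 0 in time."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,          # keep any output files inside the temp dir
                capture_output=True,  # swallow stdout/stderr
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # the child is killed once the timeout expires
        return result.returncode == 0


def pass_rate(snippets: list[str], timeout_s: float = 120.0) -> float:
    """Fraction of snippets that execute successfully."""
    if not snippets:
        return 0.0
    return sum(passes_execution(s, timeout_s) for s in snippets) / len(snippets)
```

A snippet that imports a nonexistent module fails this check in the same way a hallucinated library call would, so it never reaches visual scoring.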

3. Key Contributions

First Real-Data Benchmark: The first benchmark to systematically evaluate chart-to-code generation using authentic, large-scale raw data rather than synthetic or simplified data.
Multi-Task Framework: Introduces a comprehensive evaluation covering Replication, Reproduction (data-driven), and Refinement (interactive debugging), simulating real-world developer workflows.
Rigorous Evaluation Protocol: Establishes a multi-agent automated evaluation system with high agreement with human experts (Cohen's $\kappa \approx 0.83$) and a strict sandboxed execution environment.
Comprehensive Analysis: Provides the first large-scale analysis of 14 leading VLMs (5 proprietary, 9 open-weight) on complex visualization tasks.
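
For readers unfamiliar with the agreement statistic cited above, Cohen's $\kappa$ compares the observed agreement between two raters with the agreement expected by chance. The sketch below implements the standard unweighted form. Whether the paper uses a weighted or unweighted variant is not stated in this summary, so treat the choice as an assumption; the ratings in the usage note are invented.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_e == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

For instance, with automated scores `[0, 1, 2, 2, 1, 0, 2, 1]` and human scores `[0, 1, 2, 1, 1, 0, 2, 2]`, the raters agree on 6 of 8 items and the function returns 13/21, about 0.62.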

4. Experimental Results

The authors evaluated 14 models, including Claude-4.5-Opus, GPT-5.1, Gemini-3-Pro, Qwen3-VL, and InternVL.

Key Findings:

Performance Collapse: Models that excel on existing benchmarks (e.g., ChartMimic, Plot2Code) suffer a drastic performance drop on RealChart2Code.
- Example: Qwen3-VL-235B scored ~85% on existing benchmarks but dropped to 3.6/10 on RealChart2Code.
- Example: Claude-4.5-Opus (the best performer) scored 8.2/10, while the best open-weight model scored only 3.6/10.
Proprietary vs. Open-Weight: A significant capability gap exists. Proprietary models generally handle complex layouts and data logic better, while open-weight models struggle significantly with syntax and spatial reasoning.
Task Difficulty:
- Chart Reproduction is the hardest task, as it requires understanding raw data structures.
- Chart Refinement reveals a weakness in retaining context across turns; models often fix the requested error but introduce new errors in previously correct parts ("Regressive Editing").
Error Analysis:
- Open-weight models frequently fail due to Syntax/Execution Errors (hallucinating libraries) and Layout Failures (overlapping subplots).
- Proprietary models rarely fail on syntax but struggle with Data Mapping Errors (mapping data to the wrong axes) and with maintaining global consistency during refinement.
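
The "Regressive Editing" pattern noted above can be detected mechanically by comparing per-criterion scores before and after a refinement turn. The sketch below assumes such scores are available as dictionaries. The function name `diff_refinement` and the criterion keys are hypothetical, and this is not the paper's analysis code.

```python
def diff_refinement(
    before: dict[str, int],
    after: dict[str, int],
    targeted: set[str],
) -> dict[str, list[str]]:
    """Compare rubric scores across one refinement turn.

    'fixed' lists targeted criteria whose score improved.
    'regressed' lists untargeted criteria whose score dropped,
    i.e. things the edit broke that were not part of the request.
    """
    fixed = [c for c in before if c in targeted and after[c] > before[c]]
    regressed = [c for c in before if c not in targeted and after[c] < before[c]]
    return {"fixed": fixed, "regressed": regressed}
```

If a turn asks only for a color fix and the scores move from `{"color_scheme": 0, "text_elements": 2}` to `{"color_scheme": 2, "text_elements": 1}`, the result is `{"fixed": ["color_scheme"], "regressed": ["text_elements"]}`: the request was satisfied, but the legend or labels were damaged along the way.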

5. Significance and Future Directions

Revealing the "Complexity Gap": The study indicates that current state-of-the-art models are proficient at simple plotting but lack the robust reasoning required for complex, real-world data visualization. The authors attribute the "saturation" seen in previous benchmarks to task simplicity rather than to genuine mastery.
Guidance for Research: The results suggest that future VLMs need improvements in:
- Spatial Reasoning: Better handling of nested layouts and grid systems.
- Data Logic: Stronger integration of data processing logic with visual generation.
- Context Retention: Improved ability to modify code iteratively without breaking existing functionality.
Resource Release: The authors release the RealChart2Code dataset, code, and evaluation framework to the community to drive further research in multimodal code generation.

In conclusion, RealChart2Code serves as a critical stress test for the field, demonstrating that while VLMs have made progress, they are far from mastering the full spectrum of professional data visualization tasks involving real-world data complexity.