Factuality Matters: When Image Generation and Editing Meet Structured Visuals

Imagine you have a super-talented artist who can paint beautiful sunsets, portraits, and landscapes that look so real you could almost touch them. This artist is like today's top AI image generators. They are amazing at making things look pretty.

But ask this artist to draw a precise math graph, a complex circuit diagram, or a chart showing exact sales numbers, and they start to stumble. They might draw a bar chart where the bars are the wrong height, or a pie chart where the slices don't add up to 100%. They get the "vibe" right, but the facts are wrong.

This paper, titled "Factuality Matters," is like a report card for AI, pointing out that while our AI artists are great at making pretty pictures, they are terrible at making useful, factual diagrams. The authors decided to fix this by building a new school, a new textbook, and a new test for AI.

Here is the story of their solution, broken down into three simple parts:

1. The Problem: The "Pretty but Wrong" Artist

Current AI models are like students who are great at memorizing the look of a thing but bad at understanding the rules behind it.

Natural Images: If you ask for "a cat," the AI knows what a cat looks like.
Structured Images: If you ask for "a bar chart showing 2024 sales," the AI tries to guess what a chart looks like, often messing up the numbers, the labels, or the layout. It's like a chef who can make a cake look beautiful but forgets to put the sugar in.

2. The Solution: Building a "Code-Based" Training Camp

The authors realized that structured images (like charts and graphs) are actually just code in disguise. A chart isn't just a picture; it's a set of instructions (like a recipe) that tells a computer how to draw lines and colors.

To teach the AI properly, they didn't just show it pictures; they built a massive library of 1.3 million "Code-to-Image" pairs.

The Analogy: Imagine teaching a child to draw a map. Instead of just showing them a finished map, you give them the GPS coordinates and the rules for drawing roads.
The Process: They took millions of computer programs (code) that draw charts, ran them to make the images, and then asked a super-smart AI (GPT-5) to create "editing instructions."
- Example: "Change the red bar to blue" (Image instruction) matches perfectly with "Change the color code from #FF0000 to #0000FF" (Code instruction).
The Secret Sauce (Chain-of-Thought): They didn't just give the AI the answer. They forced it to write down its thinking process (like a student showing their work on a math test). This helps the AI understand why a chart looks the way it does, not just what it looks like.

3. The New Test: The "Fact-Checker" Exam

How do you grade an AI on a chart? You can't just ask, "Does this look nice?" You need to check if the data is true.

The Old Way: Ask an AI, "Is this a good chart?" (This is unreliable; the AI might just say "Yes" to be polite).
The New Way (StructBench & StructScore): The authors created a rigorous exam called StructBench.
- Instead of a simple "Pass/Fail," the system breaks the chart down into hundreds of tiny questions: "What is the number on the top bar?" "Is the title in blue?" "Does the axis start at zero?"
- It treats the AI like a student taking a multiple-choice test, checking every single fact. If the AI gets the numbers wrong, it fails, even if the picture looks pretty.

The Result: A Smarter, More Honest AI

The authors trained a new model using this special "Code + Thinking" method.

The Outcome: Their model became the best at editing and generating these tricky structured images, beating even the most expensive, closed-source systems (like the ones from Google or OpenAI).
The "Thinking" Trick: They found that if you let the AI "think out loud" (use a reasoning step) before it draws the picture, it gets much better at following instructions. It's like telling a painter, "First, plan the layout on paper, then start painting."

Why This Matters

This paper is a wake-up call. It says: "Stop just making pretty pictures. Start making pictures that are actually true."

By releasing their data, their model, and their test, they are handing the keys to the whole AI community. They are saying, "Here is the textbook, here is the exam, and here is the smartest student. Now, let's build AI that can help us with science, math, and business, not just make cool wallpapers."

In a nutshell: They taught AI to stop guessing and start calculating, turning it from a "pretty painter" into a "reliable engineer."

1. Problem Statement

While modern visual generation models (e.g., diffusion transformers) excel at creating aesthetically pleasing natural images, they struggle significantly with structured visuals such as charts, mathematical figures, diagrams, tables, and puzzles.

The Gap: Current models fail to maintain factual fidelity. Generating or editing these images requires not just aesthetic appeal but precise composition planning, text rendering (numbers, labels), and multimodal reasoning to understand quantitative relationships.
Limitations of Existing Work:
- Existing benchmarks focus on natural images, prioritizing aesthetics and coarse-grained instruction following.
- Existing datasets lack strict alignment between instructions and visual output, often resulting in hallucinations (e.g., wrong axis labels, incorrect data trends).
- Evaluation metrics such as CLIP score and PSNR are ill-suited to structured visuals: they measure coarse semantic similarity and pixel-level similarity, respectively, rather than factual accuracy.

2. Methodology

The authors propose a comprehensive framework comprising a large-scale dataset, a unified model architecture, and a rigorous evaluation benchmark.

A. Data Construction: Structured Image Dataset

The authors constructed a dataset of 1.3 million high-quality image pairs derived from executable drawing programs.

Code-Aligned Synthesis: Instead of collecting user prompts, they collected ~2M source code snippets (Python, LaTeX) for rendering structured images.
Editing Pipeline:
1. Salient Feature Extraction: GPT-5 analyzes the source image to identify a core domain-specific feature.
2. Dual Instruction Generation: GPT-5 generates a matched pair of instructions: a Code Editing Instruction (precise programmatic change) and an Image Editing Instruction (visual description).
3. Execution: The code is modified and re-rendered to produce the target image, ensuring a strictly verifiable state transition.
Chain-of-Thought (CoT) Annotations: Each sample is augmented with GPT-5 generated reasoning trajectories (input analysis, editing logic, target prediction) to support complex reasoning during training.
Filtering: Rule-based filters remove invalid renders, null edits, and low-information images.
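
The editing pipeline above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy `render` function (which turns a one-line "drawing program" into an SVG string), the string-replacement edit, and all names such as `build_pair` are assumptions made for the example. In the paper, GPT-5 proposes the instructions and real Python/LaTeX programs are re-rendered.

```python
# Toy "drawing program": defines the bars of a chart (hypothetical example data).
SOURCE_CODE = 'bars = [("Q1", 30, "#FF0000"), ("Q2", 50, "#00AA00")]'


def render(code: str) -> str:
    """Execute the toy program and return an SVG string (stand-in for a real renderer)."""
    scope = {}
    exec(code, scope)  # the program is expected to define `bars`
    rects = []
    for i, (label, value, color) in enumerate(scope["bars"]):
        rects.append(
            f'<rect x="{i * 40}" y="{100 - value}" width="30" '
            f'height="{value}" fill="{color}"><title>{label}</title></rect>'
        )
    return '<svg xmlns="http://www.w3.org/2000/svg">' + "".join(rects) + "</svg>"


def build_pair(code: str, code_edit: tuple, image_instruction: str):
    """Apply a code edit, re-render, and return an aligned sample (or None if filtered)."""
    old, new = code_edit
    target_code = code.replace(old, new)
    try:
        source_image = render(code)
        target_image = render(target_code)
    except Exception:
        return None  # rule-based filter: invalid render
    if source_image == target_image:
        return None  # rule-based filter: null edit, nothing changed
    return {
        "source_image": source_image,
        "target_image": target_image,
        "code_instruction": f"Replace {old} with {new}",
        "image_instruction": image_instruction,
    }


pair = build_pair(SOURCE_CODE, ("#FF0000", "#0000FF"), "Change the red bar to blue")
```

Because the target image is produced by running the edited code, the image instruction and the visual change cannot drift apart, which is the "strictly verifiable state transition" described in step 3.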

B. Model Architecture & Training

The authors trained a unified model based on FLUX.1 Kontext (a diffusion transformer) enhanced with a Vision-Language Model (VLM).

Architecture:
- Backbone: FLUX.1 Kontext.
- Connector: A lightweight MLP connector aligns features from Qwen-VL (which encodes the input images and text) with the diffusion backbone. The MLP was chosen over heavier transformer-based projectors to reduce overhead and stabilize optimization.
- Reasoner: An external reasoner (GPT-5) is used at inference time to decompose complex tasks.
Three-Stage Training Curriculum:
1. Unified Alignment: Aligns Qwen-VL features with the diffusion backbone using simple T2I/editing data (backbone frozen, connector trained).
2. Hybrid Visual Learning: Jointly fine-tunes the backbone and connector on the structured dataset mixed with natural image data. A mask-based loss strategy downweights background/unchanged regions to focus on structural changes.
3. Thinking Enhancement: Injects explicit CoT reasoning. The model learns to follow long-context reasoning instructions. At inference, an external reasoner analyzes the input and predicts the ideal target content, which guides the generator.
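
The mask-based loss in stage 2 can be illustrated as follows. This summary does not give the exact loss or weights, so everything here is an assumption for illustration: a pixel-space squared error (the real model presumably operates on diffusion latents), a mask derived from where source and target differ, and a hypothetical `background_weight` of 0.1.

```python
def change_mask(source, target, tol=0.0):
    """Return 1.0 where source and target differ by more than tol, else 0.0."""
    return [
        [1.0 if abs(s - t) > tol else 0.0 for s, t in zip(s_row, t_row)]
        for s_row, t_row in zip(source, target)
    ]


def masked_mse(pred, target, mask, background_weight=0.1):
    """Weighted mean squared error that downweights unchanged (background) pixels."""
    total = 0.0
    weight_sum = 0.0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            w = 1.0 if m else background_weight  # edited pixels keep full weight
            total += w * (p - t) ** 2
            weight_sum += w
    return total / weight_sum


# 2x2 grayscale toy images: only the bottom-right pixel is edited.
source = [[0.0, 0.0], [0.0, 0.0]]
target = [[0.0, 0.0], [0.0, 1.0]]
mask = change_mask(source, target)

miss_edit = [[0.0, 0.0], [0.0, 0.0]]        # wrong in the edited region
miss_background = [[1.0, 0.0], [0.0, 1.0]]  # equally wrong, but in the background

loss_edit = masked_mse(miss_edit, target, mask)
loss_background = masked_mse(miss_background, target, mask)
```

With these weights, an error of the same size costs ten times more inside the edited region than in the background, which pushes the model to get the structural change right rather than simply copying the input.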

C. Evaluation: StructBench & StructScore

StructBench: A benchmark with 1,714 challenging instances across six domains: Math, Graph, Chart, Puzzle, Science, and Table.
StructScore Metric: A novel evaluation protocol designed to reduce hallucinations.
- Process: Instead of binary Yes/No judging, a VLM generates atomic Question-Answer (Q&A) pairs based on the ground-truth image description.
- Scoring: The model under evaluation generates an image; the VLM then answers each question by inspecting that image, and its answers are compared with the ground-truth answers.
- Weighting: For editing tasks, the score is weighted heavily toward instruction following (90%) over visual consistency (10%) to prioritize factual changes over mere image preservation.
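
The scoring and weighting steps can be sketched as below. The 90%/10% split comes from the description above; the rest is assumed for illustration. In particular, answers are compared by exact match after simple normalization, whereas the actual protocol relies on a VLM, and the function names are hypothetical.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())


def qa_accuracy(predicted, ground_truth):
    """Fraction of atomic questions whose predicted answer matches the ground truth."""
    if not ground_truth or len(predicted) != len(ground_truth):
        raise ValueError("need one predicted answer per ground-truth answer")
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predicted, ground_truth))
    return correct / len(ground_truth)


def edit_score(instr_pred, instr_gt, cons_pred, cons_gt, w_instruction=0.9):
    """Combine instruction-following and visual-consistency accuracy for editing tasks."""
    instruction_acc = qa_accuracy(instr_pred, instr_gt)
    consistency_acc = qa_accuracy(cons_pred, cons_gt)
    return w_instruction * instruction_acc + (1.0 - w_instruction) * consistency_acc


# Toy example: the edit asked for a blue bar with value 40.
score = edit_score(
    instr_pred=["Blue", "35"],           # answers read off the generated image
    instr_gt=["blue", "40"],             # answers from the ground-truth description
    cons_pred=["2024 Sales", "zero"],    # questions about regions that should not change
    cons_gt=["2024 sales", "zero"],
)
```

Here one of the two instruction questions is answered correctly and both consistency questions are, so the score is 0.9 × 0.5 + 0.1 × 1.0 = 0.55. An image that preserved everything but ignored the edit could score at most 0.1.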

3. Key Contributions

First Systematic Investigation: The first comprehensive study dedicated to the generation and editing of structured visuals, addressing the specific challenges of factual accuracy.
Large-Scale Code-Aligned Dataset: A 1.3M dataset with strict code-image alignment and CoT reasoning annotations, enabling precise supervision for structured tasks.
Unified Model with Reasoning: A strong open-source model that integrates VLMs with diffusion backbones and leverages inference-time reasoning to achieve state-of-the-art performance.
StructBench & StructScore: A rigorous benchmark and metric that moves beyond pixel-level similarity to evaluate fine-grained factual correctness, showing high correlation with human preferences ($r > 0.9$).
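
For readers who want to run this kind of metric validation on their own data, the sketch below computes a correlation between metric scores and human ratings. The summary reports $r > 0.9$ without naming the coefficient; the Pearson correlation is assumed here, and the function name and inputs are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Usage: pass per-model metric scores and the matching human ratings, e.g.
# r = pearson_r(metric_scores, human_ratings)
```

A value near 1 means the metric ranks and spaces systems much as human raters do; it says nothing about whether either is calibrated in absolute terms.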

4. Results

The authors evaluated 15 models, including closed-source leaders such as GPT-Image and Nano Banana as well as open-source baselines.

Performance Gap: Even the best closed-source systems achieve only ~50-60% accuracy on structured tasks, indicating a significant gap in current capabilities.
Model Performance:
- The authors' model achieved the highest overall accuracy (55.98%) on StructEditBench, outperforming all baselines.
- It significantly outperformed the base FLUX.1 Kontext, highlighting the importance of the curated dataset.
Impact of Reasoning:
- Adding explicit reasoning trajectories (CoT) at inference time yielded consistent gains across diverse architectures.
- Models with "thinking" capabilities (e.g., Bagel-Think) or external reasoners significantly outperformed non-reasoning baselines, especially on complex tasks like chart-type conversion.
Data vs. Architecture: The results suggest that data quality and scale are the dominant drivers of performance in this domain, rather than specific architectural choices (e.g., VAE vs. SigLIP encoders).

5. Significance

Advancing Multimodal Foundations: This work bridges the gap between visual understanding (which is strong in structured domains) and visual generation (which is weak), moving toward truly unified foundation models.
Reliability for Scientific/Technical Use: By prioritizing factual fidelity over aesthetics, this research paves the way for AI tools that can reliably generate scientific diagrams, mathematical figures, and data visualizations.
New Evaluation Paradigm: StructScore and its atomic Q&A protocol provide a new standard for evaluating AI in domains where "looking good" is insufficient and "being right" is what counts.
Open Science: The release of the dataset, model, and benchmark encourages further research into structured visual generation, a previously underexplored but critical area.