DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation

Imagine you have a brilliant architect who can draw beautiful blueprints for a house (the visual design), but you need someone to actually build the house using specific bricks, mortar, and tools (the code).

For a long time, we've been testing AI builders by asking them to put up simple wooden shacks (basic HTML/CSS). But in the real world, modern houses are built with complex, specialized systems like smart plumbing (React), automated lighting (Vue), or high-tech security (Angular).

DesignBench is a new, rigorous "licensing exam" for AI builders (Multimodal Large Language Models) that checks whether they can actually build these modern, complex houses, not just the wooden shacks.

Here is a simple breakdown of what the paper does, using some everyday analogies:

1. The Problem: The Old Tests Were Too Easy

Previous tests for AI code generators were like asking a student to build a birdhouse. It was too simple.

They ignored the tools: They didn't test if the AI could use modern "construction kits" like React, Vue, or Angular.
They only tested the first step: They only checked if the AI could build the house from scratch. They didn't ask, "Can you paint the kitchen blue?" (Editing) or "Can you fix the leaky roof?" (Repairing).
They didn't look closely: They just said, "Looks good!" without checking if the wiring was safe or if the bricks were laid correctly.

2. The Solution: DesignBench (The Ultimate Construction Exam)

The researchers created DesignBench, a massive exam with 900 different construction challenges. It tests the AI in three specific ways:

Stage 1: The Blueprint (Generation)
- The Task: The AI sees a picture of a website and has to write the code to build it.
- The Twist: It has to build it using specific "kits" (React, Vue, Angular), not just basic wood and nails.
Stage 2: The Renovation (Edit)
- The Task: The house is built, but the owner says, "I don't like the red door; make it blue, and add a porch."
- The Twist: The AI has to find the exact spot in the code to change without breaking the rest of the house.
Stage 3: The Emergency Repair (Repair)
- The Task: The house has a problem. "The front door is stuck under the porch roof!"
- The Twist: The AI has to spot the bug and fix it, even if the instructions are vague.

3. The Results: The AI is Good, But Still a Rookie

The researchers tested 9 of the smartest AI models available (like GPT-4o, Claude, and Gemini). Here is what they found:

The "Big Kid" Advantage: Just like a bigger construction crew can handle more complex jobs, the larger AI models performed significantly better than the smaller ones.
The "Special Kit" Struggle: The AIs were great at building simple wooden shacks (Vanilla HTML). But when asked to use complex modern kits (especially Angular), they started making mistakes. They often forgot the specific rules of the kit, like using the wrong type of screw or forgetting to connect the wires.
The "Where?" Problem: When asked to edit or repair, the AIs often knew what to change but not where it was in the code. It's like an electrician who knows how to rewire a switch but can't find the right wall.
The "Text vs. Picture" Surprise: When asking the AI to edit or fix a page, giving it the code text alone worked better than giving it a picture of the page, or even the picture and the code together. It turns out that, for changing code, reading the manual (text) is more precise than looking at a photo of the broken part.

4. The Verdict: We Need Better Training

The paper concludes that while AI can already turn a blueprint into a simple structure, it still struggles to be a master builder with modern tools.

For Researchers: We need to teach these AIs more about specific construction kits (React, Vue, Angular) and how to use them efficiently, rather than just copying patterns.
For Developers: If you want to use AI to build websites, don't just say "Fix this." Be specific! Tell the AI exactly where to look and what to change, because it's still getting lost in the details.

In short: DesignBench is a reality check. It shows us that AI is ready to be a junior apprentice, but it's not quite ready to be the lead contractor on a complex skyscraper just yet.

1. Problem Statement

While Multimodal Large Language Models (MLLMs) have shown promise in converting visual designs to code, existing benchmarks for front-end code generation suffer from three critical limitations that prevent them from reflecting real-world development scenarios:

Lack of Framework Integration: Current benchmarks primarily focus on vanilla HTML/CSS, ignoring the dominant modern frameworks (React, Vue, Angular) used in industry.
Insufficient Task Coverage: Existing evaluations focus almost exclusively on the initial "Generation" phase. They neglect the iterative nature of development, specifically Design Editing (refining code based on instructions) and Design Repair (fixing visual bugs or layout issues).
Limited Evaluation Dimensions: Most benchmarks use unidimensional metrics (e.g., visual similarity) without analyzing task difficulty, input context variations, or deep code-level attributes like reusability and compilation success.

2. Methodology: DesignBench

The authors introduce DesignBench, a comprehensive benchmark designed to evaluate MLLMs across multiple frameworks and tasks.

A. Dataset Construction

Scale & Diversity: Contains 900 webpage samples spanning 11+ topics (e.g., e-commerce, news, blogs).
Frameworks: Covers Vanilla HTML/CSS, React, Vue, and Angular.
Tasks:
1. Design Generation ($T_G$): Generating code from a UI mockup image ($I \to C$).
2. Design Edit ($T_E$): Modifying existing code based on a UI image, original code, and natural language instructions ($(I_o, C_o, T) \to C_{new}$).
3. Design Repair ($T_R$): Fixing display issues in code based on a problematic image and code ($(C_p, I_p) \to C_r$).
Data Sources:
- Generation: GitHub projects and Top 500 global websites.
- Edit: Real-world interaction histories from Vercel's V0 and Vue0 platforms, filtered for clarity and quality by human annotators.
- Repair: Manually induced UI issues (e.g., occlusion, crowding, alignment) verified by PhD-level developers.
Annotation: Includes 9 edit types (Add, Change, Delete across 6 attributes) and 6 issue categories (Occlusion, Crowding, Text Overlap, Alignment, Color/Contrast, Overflow).
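
To make the three task signatures above concrete, the following is a minimal sketch of how one sample per task could be represented. All class and field names here are hypothetical and are not taken from the benchmark's released data format; only the frameworks, the six issue categories, and the input/output shape of each task come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Framework(Enum):
    VANILLA = "vanilla"
    REACT = "react"
    VUE = "vue"
    ANGULAR = "angular"


class IssueCategory(Enum):
    OCCLUSION = "occlusion"
    CROWDING = "crowding"
    TEXT_OVERLAP = "text_overlap"
    ALIGNMENT = "alignment"
    COLOR_CONTRAST = "color_contrast"
    OVERFLOW = "overflow"


@dataclass(frozen=True)
class GenerationSample:
    """T_G: I -> C (hypothetical field names)."""
    framework: Framework
    design_image: str      # path to the UI mockup I
    reference_code: str    # ground-truth code C


@dataclass(frozen=True)
class EditSample:
    """T_E: (I_o, C_o, T) -> C_new (hypothetical field names)."""
    framework: Framework
    original_image: str    # I_o
    original_code: str     # C_o
    instruction: str       # natural language instruction T
    reference_code: str    # C_new


@dataclass(frozen=True)
class RepairSample:
    """T_R: (C_p, I_p) -> C_r (hypothetical field names)."""
    framework: Framework
    problem_code: str      # C_p
    problem_image: str     # I_p
    issue: IssueCategory   # annotated issue category
    reference_code: str    # C_r
```

Keeping the three tasks as separate record types mirrors the fact that each one takes different inputs, so a harness cannot accidentally feed an edit instruction to a repair run.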

B. Evaluation Metrics

The benchmark employs a multi-dimensional evaluation strategy:

Visual Metrics: CLIP (semantic similarity) and SSIM (structural similarity).
Code Metrics:
- Compilation Success Rate (CSR): Percentage of code that compiles without errors.
- Code Modification Location Similarity (CMLS): Jaccard similarity of AST nodes modified (precision of where changes were made).
- Code Modification Content Similarity (CMCS): CodeBLEU score of the modified content (precision of what was changed).
MLLM-as-Judge: GPT-4o is used to score the quality of edits and repairs (0–10 scale), validated by human experts with high inter-annotator agreement (Kappa > 0.84).
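
Two of the code metrics above can be illustrated with a short sketch. This is not the authors' implementation: the summary describes CMLS over AST nodes, whereas the sketch below approximates "where the change was made" with modified line indices obtained from Python's `difflib`. The function names are hypothetical.

```python
import difflib
from typing import Iterable, Set


def compilation_success_rate(compiled_flags: Iterable[bool]) -> float:
    """CSR as a percentage: share of samples that compiled without errors."""
    flags = list(compiled_flags)
    if not flags:
        return 0.0
    return 100.0 * sum(1 for ok in flags if ok) / len(flags)


def modified_lines(original: str, revised: str) -> Set[int]:
    """0-based indices of lines in `original` touched by the edit.

    Line indices stand in for AST nodes here (an assumption of this sketch).
    """
    a = original.splitlines()
    b = revised.splitlines()
    touched: Set[int] = set()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "insert":
            touched.add(i1)  # pure insertion: record the insertion point
        else:
            touched.update(range(i1, i2))  # replaced or deleted lines
    return touched


def jaccard(a: Set[int], b: Set[int]) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def location_similarity(original: str, predicted: str, reference: str) -> float:
    """Jaccard overlap between the locations the model and the reference changed."""
    return jaccard(modified_lines(original, predicted),
                   modified_lines(original, reference))
```

A model that changes the right line plus one unrelated line scores 0.5 under this proxy, which is how a location metric penalizes edits that compile but spill outside the requested scope.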

C. Experimental Setup

Models: Evaluated 9 state-of-the-art MLLMs (e.g., Claude-3.7, GPT-4o, Gemini-2.0, Llama-3.2, Qwen2.5-VL, Pixtral).
Prompting: Framework-specific prompts were designed to guide models on syntax, component usage, and output formats.
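
As an illustration of what framework-specific prompting might look like, here is a minimal sketch. The prompt wording and the `build_generation_prompt` helper are invented for this example; the actual prompts used in the paper are not reproduced in this summary.

```python
# Hypothetical per-framework output instructions (not the paper's prompts).
FRAMEWORK_HINTS = {
    "vanilla": "Return a single HTML file with the CSS inlined in a <style> tag.",
    "react": "Return one functional component written in JSX with a default export.",
    "vue": "Return a Vue single-file component with <template>, <script>, and <style> blocks.",
    "angular": "Return an Angular component in TypeScript with its template and styles.",
}


def build_generation_prompt(framework: str) -> str:
    """Assemble a generation prompt for the given framework."""
    key = framework.lower()
    if key not in FRAMEWORK_HINTS:
        raise ValueError(f"Unsupported framework: {framework!r}")
    return "\n".join([
        "You are given a screenshot of a web page.",
        "Reproduce its layout and styling as closely as possible.",
        FRAMEWORK_HINTS[key],
        "Output only code, with no explanation.",
    ])
```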

3. Key Results & Findings

Performance Across Tasks (RQ1)

Top Performers: Claude-3.7, GPT-4o, Gemini-2.0, and Pixtral-124B consistently outperformed others.
Bottlenecks:
- Generation: Struggled with compilation errors and visual rendering inaccuracies.
- Edit/Repair: The primary bottleneck was code localization. Models often generated code that compiled but modified the wrong sections or failed to identify the specific UI issue to fix.

Framework Comparison (RQ2)

Vanilla vs. Frameworks: MLLMs perform best with Vanilla HTML/CSS.
Framework Difficulty: Performance drops significantly in frameworks. Angular showed the poorest performance (lowest compilation rates and CLIP scores), followed by React and Vue.
Syntax Issues: Models struggle with specific syntax: JSX parsing (React), template syntax (Vue), and TypeScript/Component architecture (Angular).

Impact of Difficulty & Context (RQ3 & RQ4)

Difficulty: Performance degrades as complexity increases (larger images, complex instructions, severe UI bugs). Small models suffer "catastrophic failures" on hard tasks.
Input Context: Surprisingly, Code-only input consistently outperformed Image-only input and Multimodal (Image + Code) inputs for Edit and Repair tasks. This suggests that for precise code modification, semantic code representation is more effective than visual cues for current MLLMs.
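
The three input settings compared above differ only in which artifacts are handed to the model. A minimal sketch, assuming a hypothetical `build_repair_inputs` helper and a plain dictionary as the request payload, could look like this:

```python
from enum import Enum
from typing import Dict


class ContextMode(Enum):
    CODE_ONLY = "code_only"
    IMAGE_ONLY = "image_only"
    MULTIMODAL = "multimodal"


def build_repair_inputs(mode: ContextMode, problem_code: str,
                        problem_image_path: str) -> Dict[str, str]:
    """Select which artifacts accompany the repair instruction."""
    inputs = {"instruction": "Fix the display issue in this page."}
    if mode in (ContextMode.CODE_ONLY, ContextMode.MULTIMODAL):
        inputs["code"] = problem_code
    if mode in (ContextMode.IMAGE_ONLY, ContextMode.MULTIMODAL):
        inputs["image"] = problem_image_path
    return inputs
```

Running the same sample through all three modes and comparing scores is the kind of ablation that yields the code-only result reported above.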

Limitations & Failure Analysis (RQ5 & RQ6)

Component-Based Design: MLLMs rarely use reusable, component-based patterns (e.g., v-for in Vue), instead hardcoding repetitive structures. Average adoption rates were low across all three frameworks (0.24% for React, 5% for Vue, 19% for Angular).
Issue Detection: MLLMs have poor accuracy (~27% average) in autonomously identifying UI display issues (e.g., occlusion, alignment).
Failure Types:
- Generation: Spatial reasoning errors (wrong size/position) and missing elements.
- Edit: Scope control issues (unnecessary modifications, partial edits).
- Repair: Fundamental inability to identify the defect ("No repair" attempts).
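
The component-reuse finding above can be approximated with a simple heuristic. The sketch below flags generated code that uses a list-rendering construct (`.map(` in React, `v-for` in Vue, `*ngFor` or `@for` in Angular) and reports the share of samples that do so. This regex check is an assumption of this example; the summary does not describe how the paper measured adoption.

```python
import re
from typing import Iterable

# Heuristic patterns for list-rendering constructs (assumed, not from the paper).
REUSE_PATTERNS = {
    "react": [r"\.map\s*\("],
    "vue": [r"\bv-for\s*="],
    "angular": [r"\*ngFor\s*=", r"@for\s*\("],
}


def uses_reuse_construct(framework: str, code: str) -> bool:
    """True if the code contains a list-rendering construct for the framework."""
    patterns = REUSE_PATTERNS[framework.lower()]
    return any(re.search(pattern, code) for pattern in patterns)


def adoption_rate(framework: str, samples: Iterable[str]) -> float:
    """Percentage of samples that use at least one reuse construct."""
    items = list(samples)
    if not items:
        return 0.0
    hits = sum(1 for code in items if uses_reuse_construct(framework, code))
    return 100.0 * hits / len(items)
```

For example, `<li v-for="item in items">{{ item }}</li>` is flagged, while three hand-written `<li>` elements with the same structure are not, which is the hardcoding behavior described above.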

4. Key Contributions

First Multi-Framework, Multi-Task Benchmark: Introduced DesignBench, the first benchmark to evaluate MLLMs on React, Vue, and Angular across Generation, Edit, and Repair tasks.
Comprehensive Evaluation: Provided a multi-dimensional analysis covering task difficulty, input modalities, and code-level metrics (correctness, reusability, localization).
Critical Insights: Revealed that current MLLMs lack framework-specific syntax understanding, fail to utilize component-based architectures, and struggle with code localization in iterative tasks.
Actionable Guidelines: Proposed strategies for researchers (enhance framework-specific training, improve visual-code fusion) and developers (provide explicit edit locations, decompose complex tasks).
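
The developer-facing guidance above (state the edit location explicitly and split complex requests into steps) can be sketched as a small prompt builder. The `EditStep` type, the `render_edit_prompt` helper, and the prompt wording are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class EditStep:
    file_path: str  # where to look
    target: str     # which element or component to change
    change: str     # what to change


def render_edit_prompt(steps: Iterable[EditStep]) -> str:
    """Turn a list of localized edits into a numbered, step-by-step prompt."""
    lines = ["Apply the following edits one at a time and leave all other code unchanged:"]
    for number, step in enumerate(steps, start=1):
        lines.append(f"{number}. In {step.file_path}, locate {step.target}; {step.change}")
    return "\n".join(lines)
```

A request such as "make the door blue and add a porch" then becomes two numbered steps, each naming a file and a target element, rather than one vague instruction.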

5. Significance

DesignBench shifts the paradigm of evaluating MLLMs from simple "image-to-code" generation to a realistic assessment of automated front-end engineering workflows. By highlighting the gap between current model capabilities and the requirements of modern, framework-based development, it provides a roadmap for future research to build more reliable, reusable, and context-aware AI coding assistants. The benchmark is open-sourced to foster further development in this field.