DOne: Decoupling Structure and Rendering for High-Fidelity Design-to-Code Generation

The paper introduces DOne, a framework that decouples structure understanding from element rendering to overcome the holistic limitations of Vision Language Models in Design-to-Code generation, achieving superior visual fidelity and productivity on the new HiFi2Code benchmark.

Xinhao Huang, Jinke Yu, Wenhao Xu, Zeyi Wen, Ying Zhou, Junzhuo Liu, Junhao Ji, Zulong Chen

Published 2026-04-03

Imagine you are a master architect who has drawn a beautiful, intricate blueprint for a new skyscraper. You hand this blueprint to a robot builder and say, "Build this exactly as I drew it."

In the world of web development, this is the challenge of Design-to-Code: turning a visual picture of a website into the actual code (HTML/CSS) that makes it work.

For a long time, the robots (AI models) trying to do this had a major problem. They were like a builder who looked at the whole skyscraper at once and got overwhelmed. They would guess the general shape but miss the details. They might build the walls in the wrong place, forget the windows, or replace a fancy glass door with a generic cardboard box.

The paper introduces a new system called DOne (Decoupling Structure and Rendering) to fix this. Here is how it works, explained with simple analogies:

1. The Problem: The "Holistic Bottleneck"

Imagine trying to describe a complex painting to a friend by looking at the whole thing at once. You might say, "It's a house with trees," but you'd forget the specific color of the front door or the pattern on the roof tiles.

Current AI models suffer from this "Holistic Bottleneck." They try to understand the entire website layout and generate the code in one giant leap. Because the task is too big, they get confused, leading to:

  • Broken Layouts: The menu ends up at the bottom of the page.
  • Missing Details: Logos and icons disappear or turn into gray boxes.
  • Generic Replacements: Instead of a specific product image, they just put a placeholder that says "Image."

2. The Solution: DOne's "Three-Step Chef" Approach

The authors realized that to build a perfect website, you shouldn't try to do everything at once. Instead, you should break the job down into three specialized steps, like a high-end restaurant kitchen.

Step 1: The "Floor Planner" (Layout Segmentation)

Before the chef starts cooking, the floor planner looks at the restaurant and draws lines to separate the kitchen, the dining area, and the bar.

  • What DOne does: Instead of looking at the whole website as one messy image, DOne uses a smart AI to slice the design into logical chunks (Header, Sidebar, Main Content, Footer).
  • The Analogy: It's like cutting a complex pizza into slices before eating it. This prevents the AI from getting overwhelmed by the whole pie at once.
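To make the slicing idea concrete, here is a toy sketch in Python. It is not DOne's segmenter: the summary only says a "smart AI" does the slicing, so this example swaps in a classical whitespace-based row split on a tiny text grid. The grid, the `.` blank marker, and the function name are all assumptions made up for illustration.

```python
from typing import List, Tuple


def split_rows_on_whitespace(page: List[str], blank: str = ".") -> List[Tuple[int, int]]:
    """Return (start_row, end_row_exclusive) for each band of non-blank rows.

    Stand-in for layout segmentation: a row made only of `blank`
    characters is treated as a gap between two layout regions.
    """
    bands = []
    start = None
    for y, row in enumerate(page):
        is_blank = all(ch == blank for ch in row)
        if not is_blank and start is None:
            start = y                      # a new region begins
        elif is_blank and start is not None:
            bands.append((start, y))       # the current region ends
            start = None
    if start is not None:
        bands.append((start, len(page)))   # region runs to the bottom edge
    return bands


# Hypothetical 6-row "design": H = header, S = sidebar, M = main, F = footer.
page = [
    "HHHHHHHH",
    "........",
    "SS.MMMMM",
    "SS.MMMMM",
    "........",
    "FFFFFFFF",
]
bands = split_rows_on_whitespace(page)
print(bands)
```

Each band can then be handled on its own, which is the point of the pizza analogy: later steps only ever see one slice at a time.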

Step 2: The "Inventory Manager" (Element Retrieval)

Once the slices are defined, the Inventory Manager goes through the kitchen and grabs every single specific ingredient: the fresh basil, the specific brand of cheese, the unique sauce.

  • What DOne does: It has a special tool that hunts down tiny, specific details like icons, logos, and buttons. It doesn't just guess; it finds the exact image file and saves it.
  • The Analogy: If the design has a tiny red heart icon, the Inventory Manager finds that exact heart and puts it in a box, ensuring it doesn't get lost or replaced with a generic "heart" emoji.
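A minimal sketch of the "put that exact heart in a box" idea, assuming the element's bounding box is already known. How DOne actually locates elements is not described in this summary; the pixel grid, the `heart_box` coordinates, and the `assets` dictionary below are hypothetical.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]          # (red, green, blue)
Image = List[List[Pixel]]             # rows of pixels
Box = Tuple[int, int, int, int]       # (left, top, right, bottom), right/bottom exclusive


def crop(image: Image, box: Box) -> Image:
    """Copy the pixels inside `box` out of `image` without altering them."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


WHITE: Pixel = (255, 255, 255)
RED: Pixel = (255, 0, 0)

# Hypothetical 6x4 design: white background with a 2x2 red "heart".
design: Image = [[WHITE] * 6 for _ in range(4)]
for y in (1, 2):
    for x in (2, 3):
        design[y][x] = RED

heart_box: Box = (2, 1, 4, 3)         # assumed to come from a detector

# The asset store keeps the original pixels, so the code generator can
# reference the real icon instead of substituting a generic one.
assets: Dict[str, Image] = {"heart_icon": crop(design, heart_box)}
print(assets["heart_icon"])
```

The design choice this illustrates is that the icon is copied, not redrawn: whatever ends up in the final page is the same pixels that were in the design.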

Step 3: The "Architect's Blueprint" (Schema-Guided Generation)

Now, the builder (the AI that writes the code) doesn't just guess. They are handed a strict, step-by-step blueprint (a JSON schema) that says: "Put the Header here, put the Red Heart icon inside the Header, and make sure the Sidebar is 200 pixels wide."

  • What DOne does: It creates a logical map of the website first. Then, it tells the code-writing AI to follow this map strictly.
  • The Analogy: Instead of telling the builder, "Build a house," you give them a detailed instruction manual: "First, build the foundation. Then, place the front door at coordinate X. Then, hang the specific red heart icon on the door."
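Here is a rough sketch of what such a blueprint could look like and how strictly following it removes guesswork. The field names (`tag`, `id`, `src`, `style`, `children`) are assumptions, not the paper's actual schema, and a small deterministic renderer stands in for the code-writing AI.

```python
import json
from html import escape

# Hypothetical blueprint: a header containing the heart icon, plus a 200px sidebar.
schema_json = """
{
  "tag": "div", "id": "page",
  "children": [
    {"tag": "header", "id": "header",
     "children": [{"tag": "img", "id": "heart", "src": "assets/heart_icon.png"}]},
    {"tag": "aside", "id": "sidebar", "style": "width: 200px"}
  ]
}
"""

VOID_TAGS = {"img"}  # HTML elements that take no closing tag


def render(node: dict, depth: int = 0) -> str:
    """Emit HTML for `node` and its children, exactly in schema order."""
    pad = "  " * depth
    attrs = "".join(
        f' {key}="{escape(str(node[key]), quote=True)}"'
        for key in ("id", "src", "style")
        if key in node
    )
    tag = node["tag"]
    if tag in VOID_TAGS:
        return f"{pad}<{tag}{attrs}>"
    children = node.get("children", [])
    if not children:
        return f"{pad}<{tag}{attrs}></{tag}>"
    inner = "\n".join(render(child, depth + 1) for child in children)
    return f"{pad}<{tag}{attrs}>\n{inner}\n{pad}</{tag}>"


markup = render(json.loads(schema_json))
print(markup)
```

Because the nesting comes from the schema rather than from the generator's guess, the heart icon can only end up inside the header, which is where the blueprint put it.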

3. The Result: A "High-Fidelity" Website

Because DOne separates the structure (where things go) from the rendering (what things look like), the generated page stays much closer to the original design.

  • Before: The AI might build a house that looks like a house, but the door is on the roof, and the windows are missing.
  • With DOne: The house looks exactly like the blueprint. The door is in the right place, the windows are clear, and the red heart icon is perfectly placed.

Why Does This Matter?

The authors also created a new "test" called HiFi2Code. Think of this as a super-hard driving test for AI. Previous tests were like driving in an empty parking lot; this new test is like driving in rush hour traffic with complex intersections.

When they tested DOne on this hard course:

  • Human developers got to a working website about 3 times faster.
  • The websites looked much more like the original designs (over 10% better than the best previous methods).

Summary

DOne is like hiring a team of specialists instead of one overworked generalist.

  1. One person slices the design into manageable pieces.
  2. One person collects all the tiny details.
  3. One person assembles the code based on a strict map.
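The three roles above can be pictured as three functions that hand data to each other. This is only a sketch of the data flow: every name here (`Region`, `segment`, `retrieve`, `generate`) is hypothetical, and each stage is a trivial stand-in for what would be a model or tool in the real system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Region:
    name: str
    elements: List[str] = field(default_factory=list)


def segment(design: Dict[str, List[str]]) -> List[Region]:
    """Stage 1 stand-in: one empty Region per named area of the design."""
    return [Region(name) for name in design]


def retrieve(design: Dict[str, List[str]], regions: List[Region]) -> List[Region]:
    """Stage 2 stand-in: attach each area's elements to its Region."""
    for region in regions:
        region.elements = list(design[region.name])
    return regions


def generate(regions: List[Region]) -> str:
    """Stage 3 stand-in: emit markup that follows the region map exactly."""
    lines = []
    for region in regions:
        lines.append(f'<section id="{region.name}">')
        lines.extend(f'  <img src="{name}.png">' for name in region.elements)
        lines.append("</section>")
    return "\n".join(lines)


# Hypothetical design: a header with two assets and an empty footer.
design = {"header": ["logo", "heart"], "footer": []}
page = generate(retrieve(design, segment(design)))
print(page)
```

Each stage only needs the previous stage's output, so any one of them could be swapped for a stronger component without touching the other two.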

By doing this, the AI stops guessing and starts building with precision, turning a detailed design into a website that closely matches it.
