Imagine you are a graphic designer. You've spent hours arranging text, images, and colors on a poster. You feel it looks great. But then, a client looks at it and says, "The text is too squished," or "It feels unbalanced."
For a long time, computers have been getting really good at making these posters automatically. But they've been terrible at judging them. They are like a robot that can paint a picture but doesn't understand why a human might prefer one painting over another. They often miss the subtle "vibe" of good design.
This paper, DesignSense, is like giving that robot a pair of human eyes and a human heart. Here is the story of how they did it, explained simply.
1. The Problem: The Robot's "Blind Spot"
Think of existing AI models as art critics who have only ever looked at photographs of nature (sunsets, cats, landscapes). They are experts at saying, "This photo of a cat is blurry," or "This sunset is too dark."
But graphic design isn't just about pretty pictures; it's about structure. It's about how a headline sits next to a logo, or how much space is between two paragraphs. The "nature" critics fail here because they don't understand the rules of layout. They can't tell the difference between a well-organized flyer and a messy one if the pictures inside look fine.
2. The Solution: Building a "Design School" (The Dataset)
To fix this, the researchers at Adobe built a massive training school for AI, called DesignSense-10k.
Instead of just showing the AI random pictures, they created a specific curriculum:
- The "Twin" Test: They took one design and created two slightly different versions of it (like changing the aspect ratio or moving a button).
- The Human Judges: They asked real humans to look at these pairs and vote: "Left is better," "Right is better," "Both are great," or "Both are terrible."
- The Secret Sauce: Most AI only learns "Left vs. Right." This dataset taught the AI that sometimes, both designs are bad (a crucial lesson!), and sometimes both are good. This helps the AI understand nuance, not just binary choices.
They used a clever 5-step assembly line to make these practice problems:
- Grouping: Like a teacher grouping students by subject, they grouped related design elements (e.g., a date and its location) so the AI didn't get overwhelmed.
- Prediction: They asked a smart AI to rearrange these groups into new layouts.
- Filtering: They threw out the messy, broken drafts.
- Clustering: They made sure they didn't just make 1,000 copies of the same poster; they ensured variety.
- Refinement: They used a super-smart AI to nudge the elements slightly so nothing looked crooked or overlapping.
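The five-step assembly line can be sketched as a simple pipeline of functions. Everything below is an illustrative stand-in: the grouping heuristic, the shuffle-based "prediction," and the trivial filters are placeholders for the real models the paper uses at each stage.

```python
import random

def group_elements(layout):
    # Step 1 (Grouping): bundle related elements so later steps work on
    # groups, not dozens of loose items (here: naive adjacent pairing).
    items = layout["elements"]
    return [tuple(items[i:i + 2]) for i in range(0, len(items), 2)]

def predict_variants(groups, n=4, seed=0):
    # Step 2 (Prediction): a model would rearrange groups into new
    # layouts; simulated here by shuffling group order.
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        shuffled = groups[:]
        rng.shuffle(shuffled)
        variants.append(shuffled)
    return variants

def filter_broken(variants):
    # Step 3 (Filtering): throw out messy, broken drafts (here: empties).
    return [v for v in variants if v]

def deduplicate(variants):
    # Step 4 (Clustering): keep variety, not a thousand copies.
    seen, unique = set(), []
    for v in variants:
        key = tuple(v)
        if key not in seen:
            seen.add(key)
            unique.append(v)
    return unique

def refine(variants):
    # Step 5 (Refinement): a strong model would nudge positions to fix
    # overlap and misalignment; passed through unchanged in this sketch.
    return variants

def build_twin_pairs(layout):
    groups = group_elements(layout)
    drafts = refine(deduplicate(filter_broken(predict_variants(groups))))
    # Pair each surviving variant with the original for a "twin" test.
    return [(groups, d) for d in drafts]

pairs = build_twin_pairs({"elements": ["date", "venue", "title", "logo"]})
print(len(pairs))
```

With two groups there are only two possible orderings, so deduplication collapses the four shuffled drafts down to at most two twin pairs.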
3. The New Teacher: The DesignSense Model
Once they had 10,000+ of these "Twin Tests" with human votes, they trained a new AI model called DesignSense.
Think of this new model as a Master Art Critic who has studied thousands of design books and has a sharp eye for balance.
- The Results: When they tested this new critic against well-known general-purpose AI models (like GPT-4o and Gemini), DesignSense crushed them, scoring 54% better at matching human preferences.
- The "Both Bad" Moment: The most impressive part? The big, famous AI models often got confused when both designs were terrible. They would guess randomly. DesignSense, however, confidently said, "Yeah, both of these are ugly," just like a human would.
4. Why Does This Matter? (The Real-World Impact)
You might ask, "So the AI is better at judging, but does it help make better designs?"
Yes. It's like having a coach during practice.
- Training the Generator: When they used DesignSense as a "coach" to train the AI that makes the posters, the posters got significantly better. The AI learned to avoid mistakes because it had a better teacher.
- The "Try Many, Pick Best" Trick: Imagine you ask a designer to make 10 different versions of a flyer. A human might pick the best one. DesignSense can do this instantly. It can generate 10 options, grade them all, and pick the winner. This "inference-time scaling" improved the quality of the final output by nearly 4%.
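The "try many, pick best" trick is just best-of-N selection with the critic as the judge. Here is a minimal sketch; the generator and scorer below are toy stand-ins (a dict with a made-up `whitespace_balance` field), not the paper's actual models.

```python
def best_of_n(generate, score, n=10):
    """Generate n candidate designs, score each with the critic,
    and return the highest-scoring one (inference-time scaling)."""
    candidates = [generate(i) for i in range(n)]
    return max(candidates, key=score)

# Stand-in generator: candidate i carries a fake quality signal.
fake_generate = lambda i: {"id": i, "whitespace_balance": (i * 37) % 100}
# Stand-in critic: prefers better whitespace balance.
fake_score = lambda design: design["whitespace_balance"]

winner = best_of_n(fake_generate, fake_score, n=10)
print(winner["id"])  # 8
```

The appeal of this trick is that it needs no retraining: a better judge immediately yields better final outputs from the same generator.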
The Big Picture
Before this paper, AI was like a student who could draw a stick figure but didn't know what "balance" meant. DesignSense gave that student a textbook, a practice exam, and a strict teacher.
Now, the AI doesn't just generate images; it understands design intent. It knows that a poster isn't just a collection of pixels, but a carefully arranged dance of space, text, and images. This means in the future, when you ask an AI to design a menu, a poster, or an ad, it won't just give you a design—it will give you a good one.