OSPO: Object-Centric Self-Improving Preference Optimization for Text-to-Image Generation

Imagine you have a very talented artist named MLLM (Multimodal Large Language Model). This artist is incredible at understanding what you say and can draw pictures based on your descriptions. However, they have an annoying habit: they are great at the big picture but terrible at the details.

If you ask them to draw "a red cat sitting on a blue mat," they might draw a cat, but it could be green, or the mat might be red, or they might accidentally draw a dog instead. They also struggle with counting (e.g., "three apples") or getting the spatial relationship right (e.g., "the apple is under the cup").

This paper introduces a new training method called OSPO (Object-centric Self-improving Preference Optimization) to fix this. Here is how it works, explained simply:

The Problem with Old Methods

Previously, to teach the artist to be better, humans (or other super-smart AIs) had to look at thousands of drawings, pick the "good" ones, and cross out the "bad" ones.

The Issue: This is expensive, slow, and requires hiring a huge team of critics. Also, the critics might not agree on what "good" means, leading to confusion.

The OSPO Solution: The Artist Becomes Their Own Teacher

OSPO is like giving the artist a magic mirror and a checklist so they can teach themselves without needing a human boss. It happens in five steps:

1. The Prompt Generator (The Idea Man)

First, the system creates a list of specific drawing requests. Instead of just "a cat," it creates requests like "a striped cat on a rug." It breaks these down into categories like colors, shapes, and positions.

2. The "What-If" Game (Perturbation)

This is the clever part. For every request, the system creates a "twisted" version of the request.

Original: "A red cat on a blue mat."
Twisted: "A blue cat on a red mat."
The system then asks the artist to draw both versions. Now, the artist has two drawings: one that matches the first description and one that matches the second.

3. The Magic Mask (Object Detection)

Here is where OSPO gets smart. Instead of just looking at the whole picture, the system uses a special tool (based on the artist's own internal "attention") to draw a mask around the specific objects mentioned (like the cat or the mat).

Think of this as the artist putting a magnifying glass over just the cat to see if it's actually red, ignoring the background. This ensures the artist focuses on the details that matter, not just the general vibe.

4. The Self-Quiz (VQA)

Before the artist gets to keep their drawings, they have to take a quiz about them. The system asks simple Yes/No questions based on the prompt:

"Is the cat red?"
"Is the mat blue?"
"Are there three apples?"
The artist answers these questions about their own drawing. If they drew a green cat, the honest answer to "Is the cat red?" is "No," and the system knows the drawing does not match the request. It filters out the bad drawings and keeps only the ones where the drawing and the description match closely enough.

5. The Self-Improvement Loop (Training)

Finally, the artist learns from the "Good" drawing vs. the "Bad" drawing. But here is the secret sauce: The "Object-Weighted Loss."

Imagine the artist is being graded. In the past, if they got the background right but the cat wrong, they might still get a decent grade.
With OSPO, the system says, "We don't care about the background right now. We only care about the cat. If the cat is wrong, you get a zero."
This forces the artist to focus intensely on getting the specific objects and their attributes (color, shape, position) correct.

Why is this a big deal?

No External Help Needed: The artist generates its own practice problems and grades its own homework. It doesn't need expensive human teachers.
Focus on Details: By using "masks" to zoom in on specific objects, it fixes the "hallucination" problem (drawing things that don't exist or getting colors wrong).
Better than the Pros: The paper shows that this self-taught artist actually draws better than some of the most famous, specialized drawing bots (like DALL-E 3 or SD-XL) on tests of fine-grained attributes such as color, shape, and texture.

The Analogy Summary

Think of the old way as a student who needs a teacher to grade every single essay they write. It's slow and expensive.

OSPO is like a student who:

Writes two versions of an essay (one correct, one with a typo).
Uses a highlighter to mark exactly where the typo is (the object mask).
Quizzes themselves to see if they can spot the error.
Learns specifically to fix that highlighted error, ignoring the rest of the page until they master it.

By doing this repeatedly, the student becomes a master of details without ever needing a teacher to look at their work.

1. Problem Statement

Despite the rapid advancement of Unified Multimodal Large Language Models (MLLMs) in both understanding and generating visual content, they struggle with fine-grained text-to-image (T2I) alignment. Specifically, current models often fail to faithfully render:

Precise object attributes (color, shape, texture).
Spatial relationships between objects.
Complex compositional instructions.

This leads to object hallucination, where models generate non-existent objects, omit described objects, or distort their attributes.

Existing solutions face three major bottlenecks:

Data Dependency: Traditional preference optimization methods (like DPO or PPO) rely on large, expensive datasets curated by humans or stronger AI models, which are difficult to scale for image generation.
Distribution Mismatch: These methods suffer from "off-policy" issues, where the external preference data does not match the model's own output distribution, leading to unstable optimization.
Lack of Granularity: Current self-improving frameworks (e.g., SILMM) generate preference data but lack explicit mechanisms to enforce object-level alignment, often resulting in noisy training signals (e.g., "Preference-Null" pairs where both images are equally bad, or "Preference-False" pairs where the better image is mislabeled).

2. Methodology: OSPO Framework

The authors propose OSPO (Object-centric Self-improving Preference Optimization), a fully self-contained, five-stage framework that enables MLLMs to autonomously generate high-quality, object-centric preference data and optimize themselves without external models or datasets.

Stage 1: Prompt Generation

The model generates a base set of training prompts categorized into four semantic types to ensure diversity:

Attribute: Color, Shape, Texture.
Layout: 2D and 3D Spatial relationships.
Non-spatial Relationship: Dynamic actions/states.
Complex Composition: Combinations of the above.

Stage 2: Prompt Perturbation and Densification

Instead of the standard "Best-of-N" sampling (which generates multiple images from one prompt and picks winners/losers), OSPO creates paired prompts to ensure controlled differences:

Perturbation: For each base prompt $x$, the model generates $N$ perturbed variants ($\tilde{x}$) using three strategies:
- Replace: Swaps an object/attribute with a new one.
- Swap: Exchanges positions of objects/attributes.
- Drop: Removes an object/attribute.
Densification: Both the original and perturbed prompts are "densified" (enriched with contextual details) to ensure they share the same global background but differ only in fine-grained object details. This creates a clean contrast for learning.
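The three perturbation strategies and the shared-context densification step can be illustrated with a small rule-based sketch. This is only an illustration under stated assumptions: in OSPO the MLLM itself writes the perturbed and densified prompts, whereas the code below uses a hypothetical structured prompt (a list of attribute/object pairs) and hypothetical helper names (`render`, `replace_attr`, `swap_attrs`, `drop_phrase`, `densify`) that do not come from the paper.

```python
from typing import List, Tuple

# Hypothetical structured form of a prompt: a list of (attribute, object) pairs.
Phrase = Tuple[str, str]


def render(phrases: List[Phrase], relation: str = "on") -> str:
    """Turn phrases back into text, e.g. 'a red cat on a blue mat'."""
    return f" {relation} ".join(f"a {attr} {obj}" for attr, obj in phrases)


def replace_attr(phrases: List[Phrase], index: int, new_attr: str) -> List[Phrase]:
    """Replace: substitute one attribute with a new one."""
    out = list(phrases)
    out[index] = (new_attr, out[index][1])
    return out


def swap_attrs(phrases: List[Phrase], i: int, j: int) -> List[Phrase]:
    """Swap: exchange the attributes of two objects."""
    out = list(phrases)
    out[i], out[j] = (out[j][0], out[i][1]), (out[i][0], out[j][1])
    return out


def drop_phrase(phrases: List[Phrase], index: int) -> List[Phrase]:
    """Drop: remove one object (and its attribute) from the prompt."""
    return [p for k, p in enumerate(phrases) if k != index]


def densify(prompt: str, context: str) -> str:
    """Append the same background context to a prompt (assumed form of densification)."""
    return f"{prompt}, {context}"


base = [("red", "cat"), ("blue", "mat")]
context = "in a sunlit living room"  # shared by both prompts in a pair

preferred_prompt = densify(render(base), context)
rejected_prompt = densify(render(swap_attrs(base, 0, 1)), context)
```

Because both prompts receive the same context string, the pair differs only in the object-level detail that was perturbed, which is the controlled contrast the stage aims for.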

Stage 3: Image and Object Mask Generation

Image Generation: The model generates candidate image pairs $(y_w, y_\ell)$ from the densified prompt pairs.
Object Mask Extraction: Crucially, the model extracts binary object masks directly from its internal attention weights (specifically from intermediate layers, excluding top/bottom $k$ layers to avoid noise). This identifies which visual tokens correspond to specific objects described in the prompt, eliminating the need for external segmentation models.
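A minimal sketch of the mask extraction idea follows. It assumes the attention from an object's text tokens to each image token has already been averaged over heads, giving one row of weights per layer. The function name `object_mask`, the averaging over intermediate layers, and the fixed `threshold` used for binarization are assumptions made for illustration; the summary above only states that intermediate layers are used and that the top and bottom $k$ layers are excluded.

```python
from typing import List


def object_mask(attn_per_layer: List[List[float]], k: int, threshold: float) -> List[int]:
    """Binary mask over image tokens from per-layer attention rows.

    attn_per_layer[l][t] is the (head-averaged) attention that the object's
    text tokens pay to image token t in layer l. The first and last k layers
    are skipped, the remaining layers are averaged, and the result is
    thresholded (the thresholding rule is an assumption of this sketch).
    """
    num_layers = len(attn_per_layer)
    if k < 0 or 2 * k >= num_layers:
        raise ValueError("k must leave at least one intermediate layer")

    middle = attn_per_layer[k:num_layers - k]
    num_tokens = len(middle[0])
    averaged = [
        sum(layer[t] for layer in middle) / len(middle)
        for t in range(num_tokens)
    ]
    return [1 if value >= threshold else 0 for value in averaged]


# Toy example: 4 layers, 4 image tokens; the outer layers are treated as noisy.
attn = [
    [0.9, 0.9, 0.9, 0.9],  # bottom layer, excluded when k=1
    [0.1, 0.7, 0.6, 0.1],
    [0.2, 0.8, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],  # top layer, excluded when k=1
]
mask = object_mask(attn, k=1, threshold=0.5)
```

In this toy input only the two middle image tokens receive consistently high attention in the intermediate layers, so they are the ones marked as belonging to the object.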

Stage 4: VQA-based Preference Pair Construction

To filter out noisy data (Preference-Null and Preference-False pairs), OSPO employs Self-VQA:

The model decomposes the prompt into atomic Yes/No questions regarding object attributes and relationships.
It evaluates each candidate image against these questions to calculate an alignment score ($S$).
Filtering: Pairs are discarded if the "preferred" image scores below a threshold or if the "non-preferred" image accidentally satisfies the prompt conditions.
Selection: The pair with the highest alignment score for the preferred image is selected for training.
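The filtering and selection logic above could look roughly like the following sketch. Several details are assumptions rather than facts from the paper: the alignment score is taken to be the fraction of "Yes" answers, a non-preferred image is treated as "accidentally satisfying the prompt" when its score reaches the same threshold, and the names `alignment_score`, `select_pair`, and the default `threshold` value are hypothetical.

```python
from typing import List, Optional, Tuple

# (pair_id, answers for preferred image, answers for non-preferred image).
# All answers are Yes/No replies to questions derived from the ORIGINAL prompt.
Candidate = Tuple[str, List[bool], List[bool]]


def alignment_score(answers: List[bool]) -> float:
    """Assumed score S: fraction of questions answered 'Yes'."""
    return sum(answers) / len(answers) if answers else 0.0


def select_pair(candidates: List[Candidate], threshold: float = 0.8) -> Optional[Tuple[str, float]]:
    """Drop noisy pairs, then return the pair whose preferred image scores highest."""
    best: Optional[Tuple[str, float]] = None
    for pair_id, answers_w, answers_l in candidates:
        s_w = alignment_score(answers_w)
        s_l = alignment_score(answers_l)
        if s_w < threshold:
            continue  # preferred image does not match the prompt well enough
        if s_l >= threshold:
            continue  # non-preferred image also satisfies the prompt: no real contrast
        if best is None or s_w > best[1]:
            best = (pair_id, s_w)
    return best


candidates = [
    ("a", [True, True, False], [False, False, True]),   # preferred too weak
    ("b", [True, True, True], [True, True, True]),      # non-preferred is also correct
    ("c", [True, True, True], [False, True, False]),    # clean contrast
    ("d", [True, True, True, True, False], [False] * 5),  # valid, but lower score
]
chosen = select_pair(candidates)
```

Pairs "a" and "b" are filtered out for the two reasons named in the filtering step, and "c" wins over "d" because its preferred image has the higher alignment score.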

Stage 5: Preference Optimization

The model is fine-tuned using a combined loss function:

Object-weighted SimPO Loss: A modification of the standard SimPO loss. It applies spatial weights based on the extracted object masks, forcing the model to focus gradients on object-relevant visual tokens rather than irrelevant background tokens.
- Formula: $\mathcal{L}_{\text{Obj-SimPO}} = -\mathbb{E}\left[\log \sigma\left(\frac{w(y_w)}{|y_w|} \log \pi_\theta(y_w \mid x) - \frac{w(y_\ell)}{|y_\ell|} \log \pi_\theta(y_\ell \mid x) - \gamma\right)\right]$
SFT Loss: A standard Supervised Fine-Tuning loss on the preferred image to maintain global structural coherence and stability.
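To make the combined objective concrete, here is a small numerical sketch for a single preference pair. It follows the formula above but fills in details that this summary does not specify, all of which are assumptions: the weight $w(y)$ is applied per image token (object tokens get `object_weight`, all others get 1), the SFT term is the mean negative log-likelihood of the preferred image, and the two terms are added with a hypothetical coefficient `sft_coef`.

```python
import math
from typing import List


def weighted_avg_logp(token_logps: List[float], mask: List[int], object_weight: float) -> float:
    """Length-normalized log-likelihood with object tokens up-weighted (assumed form of w)."""
    weights = [object_weight if m else 1.0 for m in mask]
    return sum(w * lp for w, lp in zip(weights, token_logps)) / len(token_logps)


def obj_simpo_loss(logps_w: List[float], mask_w: List[int],
                   logps_l: List[float], mask_l: List[int],
                   gamma: float = 0.5, object_weight: float = 2.0) -> float:
    """-log sigmoid(margin) for one (preferred, non-preferred) pair."""
    margin = (weighted_avg_logp(logps_w, mask_w, object_weight)
              - weighted_avg_logp(logps_l, mask_l, object_weight)
              - gamma)
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def sft_loss(logps_w: List[float]) -> float:
    """Mean negative log-likelihood of the preferred image tokens."""
    return -sum(logps_w) / len(logps_w)


def total_loss(logps_w, mask_w, logps_l, mask_l, sft_coef: float = 1.0) -> float:
    """Assumed combination: preference term plus a weighted SFT term."""
    return obj_simpo_loss(logps_w, mask_w, logps_l, mask_l) + sft_coef * sft_loss(logps_w)


# Toy pair: tokens 1 and 2 belong to the object; the non-preferred image
# is much less likely under the model exactly on those tokens.
mask = [0, 1, 1, 0]
logps_w = [-0.1, -0.2, -0.1, -0.2]
logps_l = [-0.1, -2.0, -1.5, -0.2]

loss_weighted = obj_simpo_loss(logps_w, mask, logps_l, mask, object_weight=2.0)
loss_uniform = obj_simpo_loss(logps_w, mask, logps_l, mask, object_weight=1.0)
```

With the object tokens up-weighted, the gap between the two images on those tokens counts for more in the margin, so the same pair yields a different loss than under uniform weighting. This is one way to read "focusing gradients on object-relevant visual tokens."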

3. Key Contributions

OSPO Framework: A novel, fully self-improving pipeline that constructs fine-grained, object-focused preference data without relying on external datasets or reward models.
Object-Centric Mechanism: The introduction of attention-based object masks and object-weighted loss, which explicitly guides the optimization toward correcting object-level hallucinations rather than just global image quality.
Data Construction Pipeline: A robust method involving prompt perturbation, densification, and Self-VQA filtering that significantly reduces noisy supervision compared to existing Best-of-N approaches.
State-of-the-Art Performance: Demonstrated ability to outperform specialized diffusion models and other self-improving MLLMs on compositional benchmarks.

4. Experimental Results

The authors evaluated OSPO on Janus-Pro-1B and Janus-Pro-7B backbones across three benchmarks: T2I-CompBench++, DPGBench, and GenEval.

Performance Gains:
- On T2I-CompBench++, OSPO significantly outperformed baselines (SILMM, SUDER) and even specialized diffusion models (SD-XL, DALL-E 3, FLUX.1) in the Attribute category (e.g., Janus-Pro-7B + OSPO scored 0.8567 vs. 0.7824 for SUDER).
- On GenEval, OSPO achieved the second-best overall score among unified MLLMs, showing strong generalization.
- On DPGBench, the 7B model achieved the highest score among Janus-Pro-based self-improving methods.
Qualitative Improvements: Visual examples show a marked reduction in object hallucination (e.g., correctly rendering specific colors, shapes, and spatial arrangements like "a red circle on a blue square").
Efficiency: OSPO achieves higher performance with lower computational cost compared to frameworks using GRPO or multiple reward models, primarily because it generates smaller, targeted candidate sets rather than relying on massive sampling.

5. Significance

This work addresses a critical bottleneck in generative AI: the inability of unified models to precisely follow complex, object-level instructions. By shifting from global image quality optimization to object-centric self-improvement, OSPO provides a scalable, cost-effective path to reducing hallucinations.

The significance lies in:

Autonomy: It proves that MLLMs can self-correct their generative flaws using only their own internal capabilities (understanding and generation) without human annotation or external tools.
Granularity: It moves beyond "is the image good?" to "is the specific object in the correct place with the correct color?", setting a new standard for fine-grained T2I generation.
Scalability: The method is highly efficient, making it feasible to train high-quality T2I models on limited compute resources compared to data-hungry alternatives.