MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models

This paper introduces MICON-Bench, a comprehensive benchmark for evaluating multi-image context generation in unified multimodal models, alongside a training-free Dynamic Attention Rebalancing (DAR) mechanism and an MLLM-driven evaluation framework to enhance cross-image coherence and reduce hallucinations.

Mingrui Wu, Hang Liu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

Published 2026-02-24

Imagine you are an art director at a busy studio. You have a team of incredibly talented AI artists (called Unified Multimodal Models). These artists are great at painting a picture based on a single description, like "a cat on a mat."

But now, you want them to do something much harder: Combine three different photos into one new, perfect scene.

You give them:

  1. A photo of a wolf.
  2. A photo of a teddy bear.
  3. A photo of a man.

You ask: "Put the wolf, the bear, and the man together in a museum."

Here's the problem: The AI artists often get confused. They might forget the bear, make the wolf look like a dog, or put the man floating in mid-air. They struggle to "remember" details from multiple photos at once.

This paper introduces two things to fix this mess: a test to see who is good at this, and a new trick to help them get better.


Part 1: The Test (MICON-Bench)

The Analogy: The "Six-Station Obstacle Course"

Previously, we only tested AI on simple tasks, like "Draw a cat." But to see if they can handle complex jobs, the authors built MICON-Bench. Think of this as a rigorous obstacle course with six specific stations. To pass, the AI must prove it can:

  1. Object Composition: "Put these three random things in a room together." (Can they fit them all in?)
  2. Spatial Composition: "Put the wolf on the left, the bear in the middle, and the man on the right." (Can they follow directions about where things go?)
  3. Attribute Disentanglement: "Take the cow from Photo A, paint it in the style of Photo B, and put it in the background of Photo C." (Can they mix and match features without getting confused?)
  4. Component Transfer: "Take the helmet from the girl in Photo A and put it on the boy in Photo B." (Can they swap specific parts?)
  5. FG/BG Composition: "Cut the person out of Photo A and paste them into the background of Photo B." (Can they blend edges smoothly?)
  6. Story Generation: "Here are three photos of a story so far. Draw the fourth photo showing what happens next." (Can they understand cause and effect?)

How they grade it:
Instead of a human looking at every single picture (which takes forever), they use a super-smart AI judge (an MLLM) to act as the referee. This judge checks a "checklist" (called Checkpoints) for every image:

  • Did it include the wolf? (Yes/No)
  • Is the wolf actually a wolf and not a dog? (Yes/No)
  • Is the lighting consistent? (Yes/No)

If the AI fails a "Hard Constraint" (like forgetting the wolf entirely), it gets a zero for that part. This creates a fair, automated score.
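The checklist-plus-hard-constraint grading can be sketched in a few lines. This is a minimal illustration, not the paper's actual evaluation code: the `Checkpoint` class, `score_image` function, and the averaging rule for soft checks are all assumptions; only the idea that a failed hard constraint zeroes the score comes from the paper.

```python
# Hypothetical sketch of checkpoint-based scoring with hard constraints.
from dataclasses import dataclass


@dataclass
class Checkpoint:
    question: str       # e.g. "Did it include the wolf?"
    passed: bool        # the MLLM judge's Yes/No verdict
    hard: bool = False  # hard constraint: failing it zeroes the whole score


def score_image(checkpoints: list[Checkpoint]) -> float:
    """Return a score in [0, 1]; any failed hard constraint forces 0."""
    if not checkpoints:
        return 0.0
    if any(cp.hard and not cp.passed for cp in checkpoints):
        return 0.0
    # Otherwise, the fraction of checkpoints the image passed.
    return sum(cp.passed for cp in checkpoints) / len(checkpoints)


checks = [
    Checkpoint("Did it include the wolf?", passed=True, hard=True),
    Checkpoint("Is the wolf actually a wolf and not a dog?", passed=True),
    Checkpoint("Is the lighting consistent?", passed=False),
]
print(score_image(checks))  # 2 of 3 checks passed -> 0.666...
```

Had the first (hard) checkpoint failed, the score would drop straight to zero regardless of the other answers.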


Part 2: The Fix (Dynamic Attention Rebalancing - DAR)

The Analogy: The "Spotlight Manager"

The authors noticed that when these AI artists try to look at three photos at once, their "attention" gets scattered.

  • The Problem: Imagine the AI is trying to listen to three people talking at once. Instead of focusing on the words, it gets distracted by the background noise, the color of the walls, or the person's shoes. It starts "hallucinating" (making things up) because it's looking at the wrong parts of the reference photos.

  • The Solution (DAR): The authors invented a "Spotlight Manager" called Dynamic Attention Rebalancing (DAR).

    • How it works: Before the AI starts painting, DAR looks at where the AI is currently "looking" (its attention map).
    • The Adjustment: If the AI is staring too hard at the background of the wolf photo (irrelevant), DAR dims that spotlight. If the AI is ignoring the wolf's face (relevant), DAR turns that spotlight up to maximum brightness.
    • The Result: The AI is forced to focus only on the important details (the wolf's face, the man's shirt) and ignore the noise.

Best part? You don't need to retrain the AI or teach it new lessons. It's a "plug-and-play" tool. You just turn it on during the generation process, and it instantly makes the AI smarter.
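The spotlight metaphor can be made concrete with a toy rebalancing step. This is a hypothetical sketch in the spirit of DAR, not the paper's method: the real mechanism operates on attention maps inside the model during generation, and the relevance mask, threshold, and boost/damp factors below are illustrative assumptions.

```python
# Toy illustration of attention rebalancing: boost attention on relevant
# reference-image tokens, damp it on irrelevant ones, then renormalize.
import numpy as np


def rebalance_attention(attn: np.ndarray, relevance: np.ndarray,
                        boost: float = 2.0, damp: float = 0.5) -> np.ndarray:
    """Reweight an attention distribution toward relevant tokens.

    attn:      attention weights over reference tokens, shape (n,), sums to 1
    relevance: per-token relevance in [0, 1] (e.g. wolf's face vs. background)
    """
    scale = np.where(relevance > 0.5, boost, damp)  # turn spotlights up or down
    rebalanced = attn * scale
    return rebalanced / rebalanced.sum()  # renormalize to a distribution


attn = np.array([0.5, 0.3, 0.2])        # model stares at background token 0
relevance = np.array([0.0, 1.0, 1.0])   # tokens 1 and 2 cover the subject
print(rebalance_attention(attn, relevance))  # [0.2, 0.48, 0.32]
```

After rebalancing, the irrelevant background token's weight falls from 0.5 to 0.2 while the subject tokens gain, which is the behavior the paper attributes to DAR at inference time.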


The Big Picture

What did they find?

  • The Test: Even the most advanced AI models today struggle with these multi-image tasks. They often mix up identities or get the spatial arrangement wrong.
  • The Fix: When the authors added their "Spotlight Manager" (DAR) to these models, the results improved dramatically. The AI became much better at keeping characters consistent, placing objects in the right spots, and blending images seamlessly.

In summary:
The paper says, "Hey, current AI is bad at combining multiple photos. We built a tough test to prove it (MICON-Bench), and we found a simple, free trick (DAR) that acts like a spotlight, forcing the AI to pay attention to the right things so it can finally do the job correctly."
