MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models

This paper introduces MICON-Bench, a comprehensive benchmark for evaluating multi-image context generation in unified multimodal models, alongside a training-free Dynamic Attention Rebalancing (DAR) mechanism and an MLLM-driven evaluation framework to enhance cross-image coherence and reduce hallucinations.

Mingrui Wu, Hang Liu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

Published 2026-02-24

Imagine you are an art director at a busy studio. You have a team of incredibly talented AI artists (called Unified Multimodal Models). These artists are great at painting a picture based on a single description, like "a cat on a mat."

But now, you want them to do something much harder: Combine three different photos into one new, perfect scene.

You give them:

  1. A photo of a wolf.
  2. A photo of a teddy bear.
  3. A photo of a man.

You ask: "Put the wolf, the bear, and the man together in a museum."

Here's the problem: The AI artists often get confused. They might forget the bear, make the wolf look like a dog, or put the man floating in mid-air. They struggle to "remember" details from multiple photos at once.

This paper introduces two things to fix this mess: a test to see who is good at this, and a new trick to help them get better.


Part 1: The Test (MICON-Bench)

The Analogy: The "Six-Station Obstacle Course"

Previously, we only tested AI on simple tasks, like "Draw a cat." But to see if they can handle complex jobs, the authors built MICON-Bench. Think of this as a rigorous obstacle course with six specific stations. To pass, the AI must prove it can:

  1. Object Composition: "Put these three random things in a room together." (Can they fit them all in?)
  2. Spatial Composition: "Put the wolf on the left, the bear in the middle, and the man on the right." (Can they follow directions about where things go?)
  3. Attribute Disentanglement: "Take the cow from Photo A, paint it in the style of Photo B, and put it in the background of Photo C." (Can they mix and match features without getting confused?)
  4. Component Transfer: "Take the helmet from the girl in Photo A and put it on the boy in Photo B." (Can they swap specific parts?)
  5. FG/BG Composition: "Cut the person out of Photo A and paste them into the background of Photo B." (Can they blend edges smoothly?)
  6. Story Generation: "Here are three photos of a story so far. Draw the fourth photo showing what happens next." (Can they understand cause and effect?)

How they grade it:
Instead of a human looking at every single picture (which takes forever), they use a super-smart AI judge (an MLLM) to act as the referee. This judge checks a "checklist" (called Checkpoints) for every image:

  • Did it include the wolf? (Yes/No)
  • Is the wolf actually a wolf and not a dog? (Yes/No)
  • Is the lighting consistent? (Yes/No)

If the AI fails a "Hard Constraint" (like forgetting the wolf entirely), it gets a zero for that part. This creates a fair, automated score.
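The checklist-plus-hard-constraint grading can be sketched in a few lines. This is a minimal illustration, not the paper's actual evaluation code: the `Checkpoint` class, `score_image` function, and the averaging rule for soft checks are all assumptions; only the idea that a failed hard constraint zeroes the score comes from the paper.

```python
# Hypothetical sketch of checkpoint-based scoring with hard constraints.
from dataclasses import dataclass


@dataclass
class Checkpoint:
    question: str       # e.g. "Did it include the wolf?"
    passed: bool        # the MLLM judge's Yes/No verdict
    hard: bool = False  # hard constraint: failing it zeroes the whole score


def score_image(checkpoints: list[Checkpoint]) -> float:
    """Return a score in [0, 1]; any failed hard constraint forces 0."""
    if not checkpoints:
        return 0.0
    if any(cp.hard and not cp.passed for cp in checkpoints):
        return 0.0
    # Otherwise, the fraction of checkpoints the image passed.
    return sum(cp.passed for cp in checkpoints) / len(checkpoints)


checks = [
    Checkpoint("Did it include the wolf?", passed=True, hard=True),
    Checkpoint("Is the wolf actually a wolf and not a dog?", passed=True),
    Checkpoint("Is the lighting consistent?", passed=False),
]
print(score_image(checks))  # 2 of 3 checks passed -> 0.666...
```

Had the first (hard) checkpoint failed, the score would drop straight to zero regardless of the other answers.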


Part 2: The Fix (Dynamic Attention Rebalancing - DAR)

The Analogy: The "Spotlight Manager"

The authors noticed that when these AI artists try to look at three photos at once, their "attention" gets scattered.

  • The Problem: Imagine the AI is trying to listen to three people talking at once. Instead of focusing on the words, it gets distracted by the background noise, the color of the walls, or the person's shoes. It starts "hallucinating" (making things up) because it's looking at the wrong parts of the reference photos.

  • The Solution (DAR): The authors invented a "Spotlight Manager" called Dynamic Attention Rebalancing (DAR).

    • How it works: Before the AI starts painting, DAR looks at where the AI is currently "looking" (its attention map).
    • The Adjustment: If the AI is staring too hard at the background of the wolf photo (irrelevant), DAR dims that spotlight. If the AI is ignoring the wolf's face (relevant), DAR turns that spotlight up to maximum brightness.
    • The Result: The AI is forced to focus only on the important details (the wolf's face, the man's shirt) and ignore the noise.

Best part? You don't need to retrain the AI or teach it new lessons. It's a "plug-and-play" tool. You just turn it on during the generation process, and it instantly makes the AI smarter.
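The spotlight metaphor can be made concrete with a toy rebalancing step. This is a hypothetical sketch in the spirit of DAR, not the paper's method: the real mechanism operates on attention maps inside the model during generation, and the relevance mask, threshold, and boost/damp factors below are illustrative assumptions.

```python
# Toy illustration of attention rebalancing: boost attention on relevant
# reference-image tokens, damp it on irrelevant ones, then renormalize.
import numpy as np


def rebalance_attention(attn: np.ndarray, relevance: np.ndarray,
                        boost: float = 2.0, damp: float = 0.5) -> np.ndarray:
    """Reweight an attention distribution toward relevant tokens.

    attn:      attention weights over reference tokens, shape (n,), sums to 1
    relevance: per-token relevance in [0, 1] (e.g. wolf's face vs. background)
    """
    scale = np.where(relevance > 0.5, boost, damp)  # turn spotlights up or down
    rebalanced = attn * scale
    return rebalanced / rebalanced.sum()  # renormalize to a distribution


attn = np.array([0.5, 0.3, 0.2])        # model stares at background token 0
relevance = np.array([0.0, 1.0, 1.0])   # tokens 1 and 2 cover the subject
print(rebalance_attention(attn, relevance))  # [0.2, 0.48, 0.32]
```

After rebalancing, the irrelevant background token's weight falls from 0.5 to 0.2 while the subject tokens gain, which is the behavior the paper attributes to DAR at inference time.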


The Big Picture

What did they find?

  • The Test: Even the most advanced AI models today struggle with these multi-image tasks. They often mix up identities or get the spatial arrangement wrong.
  • The Fix: When the authors added their "Spotlight Manager" (DAR) to these models, the results improved dramatically. The AI became much better at keeping characters consistent, placing objects in the right spots, and blending images seamlessly.

In summary:
The paper says, "Hey, current AI is bad at combining multiple photos. We built a tough test to prove it (MICON-Bench), and we found a simple, free trick (DAR) that acts like a spotlight, forcing the AI to pay attention to the right things so it can finally do the job correctly."
