PICS: Pairwise Image Compositing with Spatial Interactions

Imagine you are a digital chef trying to create the perfect sandwich. You have a slice of bread (the background), a piece of ham (Object A), and a slice of cheese (Object B).

The Old Way (The "Step-by-Step" Chef):
Most previous AI tools tried to build this sandwich one ingredient at a time. First, they'd put the ham on the bread. Then, they'd try to add the cheese.

The Problem: When the AI added the cheese, it often forgot the ham was already there. It might accidentally erase part of the ham, make the cheese float weirdly in the air, or blend the two ingredients into a mushy blob where they touch. It's like trying to add a second layer of frosting to a cake without realizing the first layer is already there; you end up with a mess.

The New Way (PICS: The "Team Chef"):
The paper introduces PICS (Pairwise Image Compositing with Spatial Interactions). Instead of adding ingredients one after another, PICS acts like a team of chefs working on the whole sandwich at the same time.

Here is how PICS works, broken down with simple analogies:

1. The "Traffic Cop" Strategy (Parallel Compositing)

Instead of putting the ham down first and then the cheese, PICS looks at the ham, the cheese, and the bread all at once. It asks: "Where do these things touch? Who is on top? Who is hidden?"

Analogy: Imagine a traffic cop at a busy intersection. Instead of letting cars go one by one (which causes pile-ups), the cop directs everyone simultaneously to ensure no two cars crash into each other. PICS ensures the ham and cheese know exactly how to sit next to each other without fighting for space.

2. The "Specialized Team" (Interaction Transformer)

Inside PICS, there is a smart brain called an Interaction Transformer. Think of this as a team of specialized workers, each with a specific job, guided by a map (the masks).

The Background Worker: Looks at the bread and says, "I'll keep the bread looking exactly like the bread."
The Ham Worker: Looks at the ham and says, "I'll make sure the ham looks like ham."
The Cheese Worker: Looks at the cheese and says, "I'll make sure the cheese looks like cheese."
The "Overlap" Worker (The Star of the Show): This is the most important part. When the ham and cheese overlap, a normal AI might just mash them together. The Overlap Worker acts like a smart referee. It looks at the scene and decides: "Okay, in this tiny spot, the cheese is slightly on top of the ham, so I'll show the cheese here. But right next to it, the ham is on top, so I'll show the ham."
- The Magic: It uses a technique called Adaptive Blending. Imagine a dimmer switch for light. The AI doesn't just choose "Ham" or "Cheese"; it smoothly fades between them based on who should be visible, creating a perfect, realistic edge where they touch.

3. The "3D Gym" (Geometry Augmentation)

Real life is 3D, but photos are 2D. Sometimes objects are tilted, rotated, or viewed from weird angles.

The Problem: If you train an AI only on flat, straight-on photos, it gets confused when you give it a photo of a cup lying on its side.
The Solution: PICS goes to a "3D Gym." During training, the AI is shown the same objects from many different angles (rotated, tilted, viewed from above). It's like a gymnast practicing on a balance beam so they don't fall over when the beam tilts. This makes PICS more robust to unusual poses; it has a much better idea of how a chair looks even when it's upside down or sideways.

Why Does This Matter?

No More "Ghosting": Old methods often left weird shadows or double edges where objects touched. PICS makes the edges crisp and real.
No More "Erasing": If you add a second object, PICS won't accidentally delete the first one.
Real Physics: It is built to respect how objects relate to each other. If a cup is inside a box, the box hides the bottom of the cup. If a person is leaning on a wall, the wall supports them. PICS is designed to get these relationships right.

The Bottom Line

PICS is like upgrading from a clumsy robot that stacks blocks one by one (often knocking them over) to a master architect who looks at the whole blueprint at once. It understands that in the real world, objects interact, overlap, and support each other. By modeling these relationships simultaneously, it creates digital images that look so real, you might forget they were made by a computer.

Here is a detailed technical summary of the paper "PICS: Pairwise Image Compositing with Spatial Interactions" (ICLR 2026).

1. Problem Statement

Current diffusion-based image compositing methods excel at single-turn edits (inserting one object into a background) but struggle significantly in multi-turn or sequential editing scenarios.

The Core Issue: When inserting multiple objects sequentially, subsequent insertions often overwrite previously generated content, leading to incoherent spatial relations, contact artifacts, and loss of physical consistency (e.g., objects floating, incorrect occlusion ordering, or distorted boundaries).
Root Cause: Existing methods typically treat object insertion as independent foreground-background tasks. They lack explicit modeling of object-object interactions (support, containment, occlusion, deformation) which are fundamental to spatial plausibility in real-world scenes.
Limitation of Sequential Approaches: Methods relying on the "Painter's Algorithm" (depth sorting) often fail because the first inserted object is treated as background in the second step, causing partial removal or distortion when the second object is added.
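For reference, the depth-sorted pasting that the Painter's Algorithm describes can be sketched in a few lines of NumPy. This is a toy illustration, not the implementation of any baseline: the function name `painters_composite`, the "larger depth is farther away" convention, and the hard-mask pasting are all assumptions made for the example. Diffusion-based pipelines regenerate pixels rather than paste them, but the sketch shows the same structural issue: each step overwrites whatever is already on the canvas, and the whole overlap is handed to a single object.

```python
import numpy as np

def painters_composite(background, layers):
    """Paste layers back to front.

    `layers` is a list of (depth, rgb, mask) tuples. Larger depth means
    farther from the camera (an assumed convention for this sketch).
    """
    canvas = background.copy()
    # Sort by depth only, farthest first, so nearer layers are painted last.
    for depth, rgb, mask in sorted(layers, key=lambda layer: layer[0], reverse=True):
        canvas[mask] = rgb[mask]  # overwrites anything painted earlier
    return canvas

h, w = 4, 6
background = np.zeros((h, w, 3), dtype=np.float32)
red = np.zeros_like(background)
red[..., 0] = 1.0
blue = np.zeros_like(background)
blue[..., 2] = 1.0

mask_a = np.zeros((h, w), dtype=bool)
mask_a[:, :4] = True   # object A covers the left four columns
mask_b = np.zeros((h, w), dtype=bool)
mask_b[:, 2:] = True   # object B covers the right four columns

out = painters_composite(background, [(2.0, red, mask_a), (1.0, blue, mask_b)])
overlap = mask_a & mask_b
```

In this toy case every pixel of the overlap ends up blue, because object B is nearer and is painted last; there is no way for A to win in part of the overlap and B in the rest.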

2. Methodology: PICS

The authors propose PICS (Pairwise Image Compositing with Spatial Interactions), a self-supervised paradigm that performs parallel pairwise compositing in a single pass. Instead of sequential insertion, PICS composites two objects into a background simultaneously while explicitly modeling their interactions.

A. Parallel Compositing Pipeline

Input Construction: The target image is decomposed into a masked background ($x_{bg}$) and two object segments ($x_a, x_b$) with binary masks ($m_a, m_b$).
Region Decomposition: The masks are logically partitioned into:
- Exclusive Regions: Areas covered only by object A or object B.
- Overlap Region: The intersection where objects interact ($m_{ab} = m_a \land m_b$).
- Background Region: The area covered by neither.
Latent Encoding: Objects and the background are encoded into latent codes ($c_a, c_b, z_{bg}$) using a VAE and a Shape Encoder.
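The region decomposition step is plain Boolean mask algebra, and a minimal NumPy sketch may make it concrete. The function name `decompose_regions` and the dictionary keys are hypothetical, chosen for this example; only the definition $m_{ab} = m_a \land m_b$ comes from the summary above.

```python
import numpy as np

def decompose_regions(m_a, m_b):
    """Split two binary masks into four mutually exclusive regions."""
    m_a = np.asarray(m_a, dtype=bool)
    m_b = np.asarray(m_b, dtype=bool)
    return {
        "only_a": m_a & ~m_b,        # exclusive region of object A
        "only_b": m_b & ~m_a,        # exclusive region of object B
        "overlap": m_a & m_b,        # m_ab, where the objects interact
        "background": ~(m_a | m_b),  # covered by neither object
    }

m_a = np.zeros((4, 6), dtype=bool)
m_a[:, :4] = True
m_b = np.zeros((4, 6), dtype=bool)
m_b[1:3, 2:] = True

regions = decompose_regions(m_a, m_b)
# Every pixel should belong to exactly one region.
coverage = sum(region.astype(int) for region in regions.values())
```

Because the four regions partition the image, each pixel can be routed to exactly one expert in the next stage.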

B. Interaction Transformer Block (The Core Innovation)

PICS replaces standard residual blocks in the diffusion backbone (U-Net) with Interaction Transformer Blocks that utilize a Mask-Guided Mixture-of-Experts (MoE) architecture. This allows the model to route different spatial regions to specialized experts:

Background Expert: Preserves the appearance of the background region.
Exclusive-Region Experts: For non-overlapping parts of each object, these experts inject object-specific appearance via cross-attention from the scene to the individual object code.
Overlap Expert (Adaptive $\alpha$-Blending): This is the critical component for handling occlusion and contact.
- Mechanism: It employs an attention-gated $\alpha$-blending strategy.
- Gating Query: A query ($q_g$) is derived from the background latent code to act as a "referee."
- Scoring: The model scores how well each object's aggregated code aligns with the background context.
- Adaptive Fusion: A mixing weight $\alpha$ is computed based on these scores (using a softmax with temperature $\tau$). This dynamically decides whether object A, object B, or a blend should dominate at each pixel in the overlap region.
- Result: This creates an order-agnostic mechanism that learns implicit occlusion semantics (who is in front) based on context rather than input order, ensuring boundary fidelity.
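The mask-guided routing and the gated blend described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the paper's implementation: the linear projection `w_q`, the scaled dot-product scoring, the pre-computed expert outputs `f_bg`, `f_a`, `f_b`, and the function names `gated_alpha` and `route_by_mask` are all hypothetical. The summary only specifies that $q_g$ comes from the background latent, that each object's code is scored against it, and that $\alpha$ comes from a softmax with temperature $\tau$.

```python
import numpy as np

def gated_alpha(z_bg, c_a, c_b, w_q, tau=1.0):
    """Per-location weight for object A from a two-way softmax with temperature tau."""
    q_g = z_bg @ w_q                        # (H, W, D) gating query from the background
    d = q_g.shape[-1]
    s_a = q_g @ c_a / np.sqrt(d)            # (H, W) score for object A
    s_b = q_g @ c_b / np.sqrt(d)            # (H, W) score for object B
    logits = np.stack([s_a, s_b], axis=-1) / tau
    logits -= logits.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights[..., 0]

def route_by_mask(f_bg, f_a, f_b, m_a, m_b, alpha):
    """Select one expert output per location according to the region masks."""
    only_a = (m_a & ~m_b)[..., None]
    only_b = (m_b & ~m_a)[..., None]
    overlap = (m_a & m_b)[..., None]
    blended = alpha[..., None] * f_a + (1.0 - alpha[..., None]) * f_b
    out = f_bg.copy()                       # background expert by default
    out = np.where(only_a, f_a, out)        # exclusive-region expert for A
    out = np.where(only_b, f_b, out)        # exclusive-region expert for B
    out = np.where(overlap, blended, out)   # overlap expert: alpha-blend
    return out

rng = np.random.default_rng(0)
h, w, d = 4, 6, 8
z_bg = rng.normal(size=(h, w, d))
c_a, c_b = rng.normal(size=d), rng.normal(size=d)
w_q = rng.normal(size=(d, d))
f_bg, f_a, f_b = (rng.normal(size=(h, w, d)) for _ in range(3))

m_a = np.zeros((h, w), dtype=bool)
m_a[:3, :4] = True
m_b = np.zeros((h, w), dtype=bool)
m_b[1:, 2:] = True

alpha = gated_alpha(z_bg, c_a, c_b, w_q, tau=0.5)
alpha_swapped = gated_alpha(z_bg, c_b, c_a, w_q, tau=0.5)
out = route_by_mask(f_bg, f_a, f_b, m_a, m_b, alpha)
```

In this sketch, swapping the two objects yields the complementary weight (`alpha_swapped` equals `1 - alpha`), so the blended result does not depend on which object is listed first. Lowering `tau` pushes $\alpha$ toward 0 or 1, which approximates a hard per-location occlusion decision.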

C. Geometry-Aware Augmentations

To improve robustness to pose variations, the training pipeline includes:

Multi-View Shape Prior: Uses a single-view reconstruction model (Zero123++) to render auxiliary views of objects, encoding them into a compact multi-view descriptor to capture 3D shape consistency.
In-Plane Rotation: Random rotations of object images and masks to handle 2D misalignment.
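The key requirement of the in-plane rotation augmentation is that the object image and its mask receive the same transform, so they stay aligned. The following is a minimal NumPy sketch under several assumptions: nearest-neighbour sampling, rotation about the image centre, zero fill outside the frame, and a hypothetical function name `rotate_pair`. The angle range is also an assumption; the summary does not state one.

```python
import numpy as np

def rotate_pair(image, mask, angle_deg):
    """Rotate an (H, W, C) image and its (H, W) mask by the same in-plane angle."""
    h, w = mask.shape
    theta = np.deg2rad(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    dy, dx = ys - cy, xs - cx
    # Inverse mapping: for each output pixel, look up the source pixel.
    src_y = np.rint(cy + np.cos(theta) * dy - np.sin(theta) * dx).astype(int)
    src_x = np.rint(cx + np.sin(theta) * dy + np.cos(theta) * dx).astype(int)
    valid = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)

    out_image = np.zeros_like(image)
    out_mask = np.zeros_like(mask)
    # The same source indices are used for both arrays, keeping them aligned.
    out_image[valid] = image[src_y[valid], src_x[valid]]
    out_mask[valid] = mask[src_y[valid], src_x[valid]]
    return out_image, out_mask

rng = np.random.default_rng(0)
mask = np.zeros((5, 5), dtype=bool)
mask[1:4, 0:2] = True
image = mask[..., None] * np.ones((5, 5, 3))   # object pixels are 1, the rest 0

angle = rng.uniform(-30.0, 30.0)               # assumed range, for illustration only
rot_image, rot_mask = rotate_pair(image, mask, angle)
quarter_image, quarter_mask = rotate_pair(image, mask, 90.0)
```

A production pipeline would more likely use a library routine with bilinear interpolation for the image and nearest-neighbour interpolation for the mask; the shared-index approach here is only meant to show why the pair stays consistent.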

3. Key Contributions

Parallel Compositing Paradigm: Shifts from sequential to parallel pairwise compositing, effectively eliminating artifacts caused by step-wise overwriting and ensuring consistent spatial interactions.
Interaction Transformer Block: Introduces a mask-guided MoE architecture with a dedicated Overlap Expert that uses adaptive $\alpha$-blending to resolve occlusion and contact boundaries dynamically.
Comprehensive Evaluation: Demonstrates superior performance across virtual try-on, indoor, and street scene settings, outperforming state-of-the-art baselines (e.g., AnyDoor, ObjectStitch, ControlCom) in both quantitative metrics and user studies.

4. Experimental Results

Datasets: Trained on a mixture of ~1M images (LVIS, Objects365, Cityscapes, etc.) and evaluated on LVIS validation and DreamBooth test sets.
Quantitative Performance:
- Achieved the best scores in PSNR, SSIM, and LPIPS on intersection regions (critical for occlusion handling).
- Outperformed baselines in FID and DreamSim (perceptual similarity) on the DreamBooth test set.
- While some baselines had slightly higher DINOv2 scores (due to edge map inputs), PICS maintained better scene consistency and object identity without relying on rigid structural constraints.
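To make "PSNR on intersection regions" concrete, here is a minimal sketch of a PSNR restricted to the pixels selected by a mask. It uses the standard definition $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$; how the paper actually crops or weights the intersection region is not stated in this summary, so the masking scheme and the function name `masked_psnr` are assumptions.

```python
import numpy as np

def masked_psnr(pred, target, mask, max_val=1.0):
    """PSNR in dB computed only over pixels where `mask` is True."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("mask selects no pixels")
    diff = pred[mask].astype(np.float64) - target[mask].astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")   # identical inside the mask
    return 10.0 * np.log10(max_val ** 2 / mse)

target = np.zeros((4, 6, 3))
pred = target.copy()
overlap = np.zeros((4, 6), dtype=bool)
overlap[1:3, 2:4] = True

pred[overlap] += 0.1   # uniform error of 0.1 inside the overlap
pred[0, 0] += 0.9      # large error outside the overlap, which is ignored
score = masked_psnr(pred, target, overlap)   # about 20 dB
```

Restricting the metric to the overlap keeps large, easy-to-match background areas from dominating the score, which is presumably why the intersection region is reported separately.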
Qualitative Performance:
- Occlusion Handling: Successfully generates realistic occlusions (e.g., a human sitting on a sofa) without the "ghosting" or "fusing" artifacts seen in sequential methods.
- Boundary Fidelity: Preserves fine-grained details at contact points (e.g., seams in virtual try-on).
- Multi-Object Scalability: The method extends to 3 and 4-object compositing, maintaining stability even in complex, entangled configurations.
User Study: In a study with 20 participants, PICS received the highest ratings for Realism and Consistency among the compared methods.

5. Significance and Impact

Solving the "Multi-Turn" Gap: PICS addresses a critical limitation in current generative editing: the inability to maintain global coherence when multiple objects are added. It provides a robust solution for complex scene synthesis.
Explicit Spatial Reasoning: By explicitly modeling pairwise interactions (support, containment, occlusion) rather than treating them as implicit side effects, the paper advances compositional reasoning in generative image models.
Practical Applications: The method is directly applicable to Virtual Try-On (handling garment overlaps), Image Editing (inserting multiple items), and Film Production (seamless integration of vintage footage or CGI elements).
Future Direction: The work suggests that future generative models must move beyond single-object conditioning to explicitly reason about the relational geometry between multiple entities to achieve true physical realism.

In summary, PICS treats the insertion of multiple objects as a unified, interactive problem rather than a sequence of independent tasks, which the reported results suggest yields more visually plausible and physically consistent composite images.