Referring Layer Decomposition

Imagine you have a photograph of a busy street scene. Right now, if you want to move a car or change the color of a tree, you have to use tools like Photoshop to carefully "cut" them out of the picture. But here's the problem: if the car is partially hidden behind a person, or if the tree is behind a fence, your cut-out is incomplete. You only get the visible parts. To fix the hidden parts, you'd have to manually paint them in, which is tedious and often looks fake.

This paper introduces a new way to think about images, called "Referring Layer Decomposition" (RLD).

Think of a digital photo not as a flat piece of paper, but as a stack of clear plastic sheets (like the layers in Photoshop). In this new system, the computer doesn't just see a flat image; it sees the individual "sheets" that make up the scene, including the parts you can't see because they are hidden behind other objects.

Here is a simple breakdown of what the researchers did:

1. The Problem: The "Flat" Photo

Current AI image generators are like artists who paint on a single canvas. If they paint a dog behind a fence, they paint only the slivers of the dog that show through the fence. If you ask them to "move the dog," they can't, because the dog isn't a separate object; it's just pixels on the canvas, and its hidden parts were never painted.

2. The Solution: The "Magic Stack"

The researchers created a system that takes a flat photo and magically separates it into transparent layers.

The Magic Trick: If you ask the system to "give me the layer for the brown horse," it doesn't just cut out the visible part of the horse. It imagines and reconstructs the parts of the horse hidden behind the fence, giving you a complete, transparent image of the whole horse.
The Control: You can tell the system what you want in three ways:
- Pointing: "That horse right there."
- Boxing: "The thing inside this square."
- Talking: "The brown and white horse."

3. The Dataset: "RefLade" (The Training Library)

To teach an AI to do this, you need a huge number of examples. But finding photos where the "hidden parts" are already known is nearly impossible because, well, they are hidden!

So, the team built a massive automated factory (a data engine) to create its own training data.

The Factory: It takes real photos, uses smart AI to guess what objects are hidden, and then "paints" the missing parts to create a perfect, complete layer.
The Scale: They built a library of 1.11 million of these "image + hidden-layer + instruction" sets. They call this dataset RefLade. It's like a massive library of "before and after" magic tricks that the AI can study.

4. The Model: "RefLayer" (The Student)

They trained a new AI model called RefLayer using this massive library.

How it works: You give it a photo and a command (like "give me the layer for the red car"). The model looks at the photo, figures out where the red car is, and then generates a transparent image of the entire car, filling in the parts that were blocked by other objects.
The Result: You get a clean, high-quality "sticker" of the car that you can move, resize, or edit anywhere, and it will look realistic because the AI "invented" the hidden parts based on what it learned.

5. Why This Matters (The Analogy)

Imagine you are building a model city.

Old Way: You have a photo of the city. If you want to move a building, you have to cut it out of the photo. If a tree is in front of it, you have to guess what the building looks like behind the tree. It's messy and often looks wrong.
New Way (RLD): The city is built on a stack of transparent sheets. The building is on one sheet, the tree is on another. You can lift the building sheet up, move it, and put it down somewhere else. The "hidden" parts of the building are already there on the sheet, ready to go.

Summary

This paper tackles the problem of editing complex scenes by teaching AI to see images as separate, complete objects rather than just a flat picture. The authors built a huge training library (RefLade) and a model (RefLayer) that can take a photo, follow your instructions, and hand you back a transparent "sticker" of any object in the scene, complete with the parts that were previously hidden.

This opens the door for much more realistic and flexible photo editing, video creation, and even generating new scenes where objects can be rearranged without looking fake.

1. Problem Definition

Current generative image models excel at synthesizing realistic images but typically operate on a "flat" pixel level, lacking explicit representations of individual objects, scene structure, or occluded regions. This makes selective editing, compositional generation, and maintaining semantic consistency difficult. While existing region-based editing techniques (using masks or boxes) can modify visible pixels, they fail to:

- Reconstruct occluded (hidden) parts of objects.
- Provide a complete, transparent representation (RGBA) of an object for reuse in other contexts.
- Support flexible, multi-modal user intent (combining text, points, boxes, or masks).

The paper introduces Referring Layer Decomposition (RLD), a novel task defined as predicting a complete, object-aware RGBA layer (including RGB content and an Alpha transparency mask) from a single RGB image, conditioned on flexible user prompts. The goal is to isolate specific objects (or backgrounds) and generate their full, unoccluded appearance, effectively separating the scene into composable, stackable units similar to Photoshop layers.
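
To make the "stackable RGBA layer" idea concrete, here is a minimal sketch of the standard "over" alpha-compositing operation, which is how an extracted layer would be placed back onto a scene. It assumes NumPy arrays with values in [0, 1] and non-premultiplied alpha. The function name `composite_over` and the toy pixel values are illustrative and do not come from the paper.

```python
import numpy as np


def composite_over(layer_rgba, background_rgb):
    """Blend an RGBA layer over an RGB background with the "over" operator.

    Assumes float arrays in [0, 1] and non-premultiplied alpha.
    """
    rgb = layer_rgba[..., :3].astype(np.float64)
    alpha = layer_rgba[..., 3:4].astype(np.float64)  # keep a trailing axis for broadcasting
    return alpha * rgb + (1.0 - alpha) * background_rgb.astype(np.float64)


# Toy 2x2 example: a black background and a layer with two red pixels.
background = np.zeros((2, 2, 3))
layer = np.zeros((2, 2, 4))          # fully transparent by default
layer[0, 0] = [1.0, 0.0, 0.0, 1.0]   # opaque red
layer[0, 1] = [1.0, 0.0, 0.0, 0.5]   # half-transparent red

out = composite_over(layer, background)
```

Because the layer carries its own alpha channel, repositioning an object amounts to shifting the layer array before compositing; the background underneath is untouched.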

2. Methodology

The paper proposes a comprehensive framework consisting of four components: a scalable data engine, a large-scale dataset, an evaluation protocol, and a baseline model.

A. The RefLade Dataset & Data Engine

To address the lack of high-quality training data for RLD, the authors developed a scalable, automated data engine to construct RefLade, a dataset of 1.11 million image-layer-prompt triplets.

Pipeline: The engine processes raw natural images through six stages:
1. Pre-filtering: Removes low-quality or cluttered images.
2. Scene Understanding: Uses an ensemble of detectors (RT-DETR, OWL-V2, GPT-4o grounding) to identify "interesting" objects.
3. Layer Completion: Uses depth estimation and generative inpainting (Bria AI) to reconstruct occluded regions of objects.
4. Post-completion: Refines masks and predicts high-fidelity alpha mattes.
5. Prompt Generation: Generates diverse referring expressions (spatial coordinates, bounding boxes, masks, and natural language descriptions) for each layer.
6. Post-filtering: Uses commercial Vision-Language Models (Gemini-2.0) and CLIP to evaluate fidelity, realism, and semantic consistency.
Statistics: The dataset contains ~430K images with ~1.11M layers (foreground and background), covering 12K object classes with a high occlusion rate (60.8%). It includes a 100K manually curated high-fidelity subset for fine-tuning.
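
One way to picture an image-layer-prompt triplet is as a small record. The sketch below is an assumption about how such a record could be organized, not the RefLade file format: the class `LayerSample`, its field names, the stored visible mask, and the definition of `occluded_fraction` (hidden area divided by full object area) are all hypothetical.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class LayerSample:
    """Hypothetical container for one image-layer-prompt triplet."""

    image: np.ndarray         # H x W x 3 source image
    layer_rgba: np.ndarray    # H x W x 4 completed layer (alpha in [0, 1])
    visible_mask: np.ndarray  # H x W bool, object pixels visible in the source image
    prompts: dict = field(default_factory=dict)  # e.g. text, box, point, mask

    def occluded_fraction(self, alpha_threshold=0.5):
        """Share of the completed object that is hidden in the source image."""
        full = self.layer_rgba[..., 3] > alpha_threshold
        full_area = int(full.sum())
        if full_area == 0:
            return 0.0
        hidden = full & ~self.visible_mask
        return float(hidden.sum()) / full_area


# Toy 4x4 example: the object spans one full row, but only 3 of 4 pixels are visible.
layer = np.zeros((4, 4, 4))
layer[1, :, 3] = 1.0
visible = np.zeros((4, 4), dtype=bool)
visible[1, :3] = True

sample = LayerSample(
    image=np.zeros((4, 4, 3)),
    layer_rgba=layer,
    visible_mask=visible,
    prompts={"text": "the brown horse"},
)
fraction = sample.occluded_fraction()
```

Under this assumed definition, a layer would count as occluded whenever `occluded_fraction` is greater than zero; the summary does not say how the reported occlusion rate is actually computed.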

B. Evaluation Protocol

The authors define a Human Preference Aligned (HPA) score to benchmark RLD, addressing the lack of standard metrics. The protocol evaluates three axes:

- Preservation ($S_{vis}$): Measures how well the visible parts of the object are preserved using LPIPS on the visible region.
- Completion ($S_{gen}$): Measures the semantic consistency of the reconstructed (occluded) parts using CLIP directional similarity.
- Faithfulness ($S_{fid}$): Measures the distributional similarity between the generated layer and the ground truth using FID after alpha blending.

Aggregation: These metrics are normalized and averaged to create the HPA score, which shows a strong correlation (Pearson 0.96) with human Elo rankings.
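
The "normalize and average" step can be sketched as follows. The summary does not state which normalization is used, so min-max scaling across the compared models is an assumption here, as are the function names. The only facts relied on are that LPIPS and FID are lower-is-better while CLIP similarity is higher-is-better. All numbers are toy values, not results from the paper.

```python
import math


def min_max(values, higher_is_better):
    """Scale values to [0, 1] so that 1 is always the best score."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def hpa_scores(lpips_vis, clip_gen, fid):
    """Average the three normalized axes, one score per model (assumed scheme)."""
    s_vis = min_max(lpips_vis, higher_is_better=False)
    s_gen = min_max(clip_gen, higher_is_better=True)
    s_fid = min_max(fid, higher_is_better=False)
    return [(a + b + c) / 3.0 for a, b, c in zip(s_vis, s_gen, s_fid)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy metrics for three hypothetical models (illustrative values only).
scores = hpa_scores(
    lpips_vis=[0.1, 0.2, 0.3],
    clip_gen=[0.3, 0.2, 0.1],
    fid=[10.0, 20.0, 30.0],
)
toy_elo = [1200.0, 1000.0, 800.0]  # made-up human rankings for the same models
r = pearson(scores, toy_elo)
```

Checking an aggregate score against human Elo rankings in this way is what a reported Pearson correlation refers to; the toy data above is perfectly ordered, so it says nothing about the real value.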

C. RefLayer (Baseline Model)

RefLayer is a diffusion-based baseline designed for prompt-conditioned layer decomposition.

Architecture: Built upon Stable Diffusion 3. Its VAE encoder maps both the input image and the rendered spatial prompt into latent space.
Prompt Encoding: Spatial prompts (points, boxes, masks) are unified into a color-coded RGB image (e.g., green for boxes, red for masks) and encoded into the latent space alongside the image.
Dual Decoders:
- A standard RGB Decoder reconstructs the visual content.
- A custom Alpha Decoder (a lightweight convolutional head) predicts the transparency mask directly from the latent space, isolating the alpha prediction task.
Training Strategy: The model is trained in a two-stage manner: pre-training on the large-scale (1M) noisy RefLade data, followed by fine-tuning on the high-quality (100K) RefLadeQ subset.
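
The prompt-encoding step above, where spatial prompts are drawn into a single color-coded RGB image, might look roughly like this. Only the green-for-boxes and red-for-masks convention comes from the summary. Drawing boxes as outlines, the blue color for points, and the function name `render_prompt` are assumptions made for illustration.

```python
import numpy as np

GREEN = (0, 255, 0)  # boxes, per the summary
RED = (255, 0, 0)    # masks, per the summary
BLUE = (0, 0, 255)   # points: assumed color, not stated in the summary


def render_prompt(height, width, box=None, mask=None, point=None):
    """Draw spatial prompts onto a black canvas the same size as the image."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    if mask is not None:
        canvas[mask] = RED  # mask is an H x W boolean array
    if box is not None:
        x0, y0, x1, y1 = box  # inclusive pixel coordinates
        canvas[y0, x0:x1 + 1] = GREEN      # top edge
        canvas[y1, x0:x1 + 1] = GREEN      # bottom edge
        canvas[y0:y1 + 1, x0] = GREEN      # left edge
        canvas[y0:y1 + 1, x1] = GREEN      # right edge
    if point is not None:
        x, y = point
        canvas[y, x] = BLUE
    return canvas


# Toy usage: an 8x8 prompt image containing a single box outline.
prompt_image = render_prompt(8, 8, box=(1, 1, 4, 4))
```

Rendering every spatial prompt type into the same image format means one encoder can handle points, boxes, and masks without separate input branches, which appears to be the motivation for the unified representation.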

3. Key Contributions

Task Formulation: Formalized Referring Layer Decomposition (RLD), the first task to unify object segmentation, amodal completion, and RGBA generation under flexible multi-modal prompts.
RefLade Dataset: Created the first large-scale benchmark (1.11M triplets) with a robust automated data engine, overcoming the reliance on synthetic data or small-scale human annotation.
Evaluation Protocol: Established the HPA score, a unified metric that aligns closely with human judgment across preservation, completion, and faithfulness.
Baseline Model: Introduced RefLayer, demonstrating that a simple diffusion-based architecture can achieve high visual fidelity and semantic alignment, with strong zero-shot generalization capabilities.

4. Experimental Results

Dataset Quality: Models trained on RefLade significantly outperform those trained on previous datasets (like MuLAn), even with fewer training samples, which points to the higher quality of RefLade's data.
Scaling Laws: Foreground decomposition performance scales positively with data quantity (up to 1M samples), while background decomposition benefits more from high-quality fine-tuning.
Prompt Sensitivity: Spatial prompts (masks, boxes) yield higher accuracy than text-only prompts. However, combining text with masks improves the model's ability to handle occlusions (completion) by providing semantic context.
Zero-Shot Generalization: RefLayer achieves state-of-the-art results on the COCOA amodal segmentation benchmark without ever being trained on it, demonstrating its ability to infer occluded object structures.
Human Evaluation: The best model (RefLayer trained on RefLade+Q) achieves a Passrate@10 of 79% for foreground layers and 74% for background layers.
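
The summary does not describe how RefLayer's output is scored on COCOA. One plausible route for the zero-shot amodal result above, sketched under that caveat, is to threshold the predicted alpha channel into an amodal mask and compare it with the annotated mask using intersection-over-union. The threshold of 0.5 and both function names are assumptions.

```python
import numpy as np


def alpha_to_amodal_mask(layer_rgba, threshold=0.5):
    """Binarize the alpha channel of a predicted layer into a full-object mask."""
    return layer_rgba[..., 3] > threshold


def mask_iou(pred, target):
    """Intersection-over-union of two boolean masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, target).sum()) / float(union)


# Toy 4x4 example: the prediction covers 2 rows, the annotation covers 3.
predicted_layer = np.zeros((4, 4, 4))
predicted_layer[:2, :, 3] = 1.0
annotated_mask = np.zeros((4, 4), dtype=bool)
annotated_mask[:3, :] = True

iou = mask_iou(alpha_to_amodal_mask(predicted_layer), annotated_mask)
```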

5. Significance and Impact

Bridging the Gap: RLD moves image editing from "pixel manipulation" to "object-centric composition," enabling workflows similar to professional graphic design tools but driven by AI.
Enabling New Applications: The technology supports full-scene decomposition agents, scalable RGBA image generation for training other models, and seamless integration with tasks like object removal, insertion, and repositioning.
Foundation for Future Research: By providing a standardized dataset, evaluation protocol, and baseline, the paper establishes a clear research trajectory for compositional understanding and controllable generative AI.
Limitations & Future Work: The current pipeline ignores complex physical effects (shadows, reflections) and operates at the instance level rather than the part level. Future work aims to address both limitations and explore more advanced architectures.

In summary, this paper takes a foundational step toward composable AI, turning static images into editable, semantically structured layers by combining large-scale data with diffusion models.