Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment

Imagine you are looking at a busy street scene through a window. You see a red car, a person walking a dog, a traffic light, and a coffee shop.

The Problem: The "Muddy Bucket" Approach
Current AI models trying to understand this scene often use a method called "Slot Attention." Think of this like having a team of workers (called "slots") who are supposed to pick up specific items from the street and put them in their own buckets.

However, in the old way, the workers were messy:

The Muddy Bucket (Entanglement): One worker might try to carry the red car, but they also accidentally grab the dog and a piece of the coffee shop sign. Their bucket is a muddy mix of everything. If you ask the AI to "show me just the car," it can't, because the car is mixed with the dog.
The Confused Worker (Weak Alignment): Sometimes, a worker picks up the entire street instead of just the car. Other times, they split the car into three different buckets. They don't know exactly where one object ends and another begins.

This makes it hard for the AI to do cool things like "remove the car but keep the dog" or "swap the red car for a blue truck."

The Solution: CODA (The Organized Warehouse)
The paper introduces a new system called CODA (Contrastive Object-centric Diffusion Alignment). It fixes the mess using two clever tricks:

Trick 1: The "Trash Can" Workers (Register Slots)

Imagine you have a team of workers, but you also hire a few extra workers whose only job is to be Trash Cans.

When the main workers are trying to pick up the "Red Car," they might get distracted by the background noise (the sky, the pavement, or the fact that the car is next to a tree).
Instead of forcing the main workers to hold onto this confusing background noise, they can just toss it into the Trash Can workers.
The Result: The main workers now hold only the clean, pure concept of the "Red Car." The Trash Can workers absorb all the leftover junk. This keeps the main buckets perfectly organized and separate.

Trick 2: The "Strict Manager" (Contrastive Alignment)

In the old system, the workers were just told, "Try to rebuild the street scene." They didn't get punished for being lazy or confused.

CODA adds a Strict Manager who uses a game of "Spot the Difference":

The manager shows a worker a bucket labeled "Red Car" and asks, "Does this look like the red car in the photo?"
Then, the manager shows them a bucket labeled "Red Car" but filled with "Dog" or "Coffee Shop" (a mismatch).
The manager says, "If you pick the wrong one, you get a penalty!"
The Result: The workers learn to be extremely precise. They realize, "Oh, I must only grab the red car, or I get in trouble." This forces them to align perfectly with the specific objects in the image.

Why This Matters (The Superpower)

Because the workers are now organized and precise, the AI gains a superpower: Compositional Editing.

Before: If you asked the AI to "remove the car," it might remove the car and the dog, or leave a weird hole in the sky.
With CODA: You can say, "Remove the car," and the AI knows exactly which bucket holds the car. It takes that bucket away, and the rest of the scene (the dog, the traffic light) stays perfectly intact. You can even swap the "Red Car" bucket with a "Blue Truck" bucket, and the AI generates a brand new, realistic image with a blue truck in that exact spot.

The Bottom Line

The authors built a system that teaches AI to look at a messy picture, sort every single object into its own clean, distinct box, and throw away the background noise. This allows the AI to not just see the world, but to understand it well enough to rearrange it, edit it, and imagine new scenes with perfect logic.

It's like going from a child dumping a whole box of LEGOs onto the floor, to a master builder who has sorted every brick by color and shape, ready to build anything they can imagine.

1. Problem Statement

Object-Centric Learning (OCL) aims to decompose complex scenes into structured, interpretable object representations (slots) to enable tasks like visual reasoning, causal inference, and compositional generation. While Slot Attention (SA) combined with pretrained Diffusion Models (e.g., Stable Diffusion) has shown promise, existing methods suffer from three critical limitations:

Slot Entanglement: Slots often encode features from multiple objects or fragments of objects rather than distinct entities. This leads to "unfaithful" single-slot generations (where generating an image from one slot produces a distorted mix of concepts) and hinders compositional editing.
Weak Alignment: Slots fail to consistently correspond to distinct image regions. This manifests as over-segmentation (splitting one object into multiple slots), under-segmentation (merging multiple objects), or inaccurate boundaries.
Text-Conditioning Bias: Pretrained diffusion models are heavily biased toward text prompts. When used as decoders for visual slots without modification, they prioritize language-driven semantics over the visual content encoded in the slots, degrading generation fidelity.

2. Methodology: CODA

The authors propose Contrastive Object-centric Diffusion Alignment (CODA), a framework that extends Slot Attention with a pretrained diffusion decoder. CODA introduces three core components to address the above challenges:

A. Register Slots (Mitigating Entanglement)

Concept: The model introduces register slots, which are input-independent, semantically empty vectors.
Mechanism: These slots act as "attention sinks." In cross-attention mechanisms, the softmax constraint forces attention weights to sum to one. When a query from the diffusion U-Net does not strongly match any semantic object slot, the attention mass would otherwise spread arbitrarily across semantic slots, causing entanglement. Register slots absorb this residual attention.
Implementation: Register slots are generated by encoding padding tokens through a frozen text encoder (e.g., CLIP). They are concatenated with semantic slots and fed into the diffusion model's cross-attention layers.
Benefit: This isolates semantic slots, forcing them to focus strictly on meaningful object-concept associations, thereby reducing interference and improving disentanglement.
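
The attention-sink effect described above can be illustrated with a toy calculation. This is a minimal sketch, not the paper's implementation: the vectors, the helper names (softmax, attention_over_slots), and the assumption that the register key happens to match a background-like query are all made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_over_slots(query, semantic_keys, register_keys=()):
    """Attention weights of one query over semantic keys plus optional register keys."""
    keys = list(semantic_keys) + list(register_keys)
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    n = len(semantic_keys)
    return weights[:n], weights[n:]

# Toy values (assumptions): two object-like keys and one register key.
semantic_keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
register_keys = [[0.0, 0.0, 2.0, 2.0]]
# A background-like query that matches neither semantic slot.
query = [0.0, 0.0, 2.0, 2.0]

# Without registers, softmax must still hand out all attention to the objects.
sem_only, _ = attention_over_slots(query, semantic_keys)
# With a register, most of the residual attention has somewhere else to go.
sem_with_reg, reg_weights = attention_over_slots(query, semantic_keys, register_keys)

print("semantic mass without register:", sum(sem_only))
print("semantic mass with register:", sum(sem_with_reg))
```

In this toy case the query splits its attention evenly across the two object slots when no register exists, even though it matches neither. Adding the register lets most of that mass drain away from the semantic slots.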

B. Finetuning Cross-Attention Projections (Mitigating Bias)

Problem: Pretrained diffusion models expect text embeddings. Using them directly for slot conditioning introduces a bias where the model ignores slot features in favor of its internal text priors.
Solution: Instead of training a diffusion model from scratch or adding heavy adapter layers, CODA finetunes only the key ( $K$ ), value ( $V$ ), and output projections in the cross-attention layers of the pretrained Stable Diffusion model.
Benefit: This lightweight adaptation aligns the slot representations with the visual content of the diffusion model, eliminating text-conditioning bias while preserving the generative power of the pretrained backbone.
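
One way to picture "finetune only the cross-attention K, V, and output projections" is as a filter over parameter names. The sketch below is an assumption-laden illustration: the parameter names follow a naming pattern (attn1 for self-attention, attn2 for cross-attention, to_q / to_k / to_v / to_out) commonly seen in open-source Stable Diffusion code, and the is_trainable helper is hypothetical. The paper's actual code may select parameters differently.

```python
def is_trainable(param_name):
    """Assumed rule: train only cross-attention key, value, and output projections."""
    in_cross_attention = ".attn2." in param_name
    is_target_projection = any(
        tag in param_name for tag in (".to_k.", ".to_v.", ".to_out.")
    )
    return in_cross_attention and is_target_projection

# Illustrative parameter names (assumed naming convention, not taken from the paper).
param_names = [
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_k.weight",      # self-attention
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_q.weight",      # cross-attn query
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_k.weight",      # cross-attn key
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_v.weight",      # cross-attn value
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_out.0.weight",  # cross-attn output
    "down_blocks.0.resnets.0.conv1.weight",                                   # convolution
]

trainable = [name for name in param_names if is_trainable(name)]
frozen = [name for name in param_names if not is_trainable(name)]

for name in trainable:
    print("train :", name)
for name in frozen:
    print("freeze:", name)
```

In a real training loop, the same predicate would typically decide which parameters receive gradients, leaving the rest of the pretrained backbone untouched.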

C. Contrastive Alignment Objective (Strengthening Representation)

Goal: To explicitly encourage slots to capture concepts present in the image and discourage overlap between slots.
Mechanism: The training objective combines the standard diffusion denoising loss ( $L_{dm}$ ) with a contrastive loss ( $L_{cl}$ ).
- $L_{dm}$ : Minimizes the error between predicted and true noise given aligned slots.
- $L_{cl}$ : Maximizes the error (minimizes likelihood) when the model attempts to reconstruct the image using negative slots (mismatched slots sampled from other images or mixed combinations).
Negative Slot Construction: To create "hard negatives," the method takes slots from image $A$ and randomly replaces a subset (e.g., 50%) with slots from image $B$ . This forces the model to refine its representations to distinguish between correct and incorrect slot-image pairings.
Theoretical Connection: The authors prove that minimizing this combined objective serves as a tractable surrogate for maximizing the Mutual Information (MI) between the input image and the slot representations, thereby improving the quality and informativeness of the learned slots.
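
The hard-negative construction described above can be sketched in a few lines. This is a hedged illustration only: the function names, the weight of 0.1, and the exact form of the combined objective (denoising error on aligned slots minus a weighted denoising error on negative slots) are assumptions chosen to match the summary, and the paper's precise formulation may differ.

```python
import random

def make_hard_negative(slots_a, slots_b, replace_fraction=0.5, rng=None):
    """Replace a random subset of image A's slots with image B's slots (same positions)."""
    assert len(slots_a) == len(slots_b)
    rng = rng or random.Random()
    num_slots = len(slots_a)
    num_replace = max(1, round(replace_fraction * num_slots))
    replaced = set(rng.sample(range(num_slots), num_replace))
    negative = [slots_b[i] if i in replaced else slots_a[i] for i in range(num_slots)]
    return negative, sorted(replaced)

def combined_objective(err_aligned, err_negative, weight=0.1):
    """Assumed form: minimize error with aligned slots, maximize it with negative slots."""
    return err_aligned - weight * err_negative

# Slots are stand-in strings here; in practice they would be feature vectors.
slots_a = ["a0", "a1", "a2", "a3"]
slots_b = ["b0", "b1", "b2", "b3"]
negative, replaced = make_hard_negative(slots_a, slots_b, rng=random.Random(0))

print("negative slots:", negative)
print("replaced indices:", replaced)
print("toy objective:", combined_objective(err_aligned=1.0, err_negative=2.0, weight=0.5))
```

Because half of the slots in a negative set still belong to the correct image, the model cannot reject it based on a single obvious mismatch, which is what makes these negatives "hard."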

3. Key Contributions

Register-Augmented Slot Diffusion: The introduction of input-independent register slots to absorb residual attention, mitigating slot entanglement without architectural complexity.
Lightweight Adaptation: A strategy to mitigate text-conditioning bias by finetuning only cross-attention projections, avoiding the need for full model retraining or complex adapter layers.
Contrastive Alignment: A novel training objective that explicitly aligns slots with image content via contrastive learning, theoretically linked to maximizing Mutual Information.
Comprehensive Evaluation: Demonstrated state-of-the-art performance across synthetic (MOVi-C/E) and real-world (VOC, COCO) datasets in object discovery, property prediction, and compositional generation.

4. Experimental Results

CODA was evaluated against strong baselines including Stable-LSD, SlotAdapt, SlotDiffusion, and SLATE.

Object Discovery (Segmentation):
- COCO: Improved Foreground Adjusted Rand Index (FG-ARI) by +6.14% over the best baseline (SlotAdapt).
- VOC: Improved instance-level mBO by +3.88% and mIoU by +3.97%; semantic-level mBO by +5.72% and mIoU by +7.00%.
- Synthetic (MOVi-E): Improved FG-ARI by +2.59% and mIoU by +3.36%.
Property Prediction:
- On MOVi datasets, CODA significantly outperformed baselines in predicting object categories (e.g., 74.12% accuracy on MOVi-C vs. ~46% for baselines) and spatial positions.
Compositional Generation:
- Single-Slot Generation: Unlike baselines which fail to generate coherent images from individual slots (due to entanglement), CODA produces faithful, single-concept images.
- Compositional Editing: CODA achieved the best Fréchet Inception Distance (FID) and Kernel Inception Distance (KID) scores for compositional generation on COCO (FID: 31.03 vs. 40.57 for SlotAdapt), demonstrating superior ability to recombine slots into novel scenes.
Efficiency: The addition of register slots and the lightweight finetuning strategy result in negligible computational overhead, keeping the model scalable.
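
For readers unfamiliar with the mBO (mean best overlap) metric reported in the object-discovery results above, here is a minimal sketch of how it is commonly computed. Masks are represented as sets of pixel indices, and the toy masks and helper names are assumptions for illustration. The paper's evaluation code may handle details such as ignored regions differently.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of pixel indices."""
    union = mask_a | mask_b
    if not union:
        return 0.0
    return len(mask_a & mask_b) / len(union)

def mean_best_overlap(gt_masks, pred_masks):
    """For each ground-truth mask, take its best IoU with any prediction, then average."""
    if not gt_masks:
        return 0.0
    best = [max((iou(gt, pred) for pred in pred_masks), default=0.0) for gt in gt_masks]
    return sum(best) / len(best)

# Toy example: two ground-truth objects and three predicted slot masks.
gt_masks = [{0, 1, 2, 3}, {4, 5}]
pred_masks = [{0, 1, 2}, {3, 4, 5}, {6}]

score = mean_best_overlap(gt_masks, pred_masks)
print("mBO:", score)
```

Here the first object is best covered by the first prediction (IoU 3/4) and the second by the second prediction (IoU 2/3), so the score is the average of those two values.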

5. Significance

This paper represents a significant advancement in unsupervised Object-Centric Learning.

Robustness: It bridges the gap between synthetic and real-world OCL, showing that object-centric representations can be learned effectively on complex, cluttered datasets like COCO without manual annotations.
Compositional Control: By solving slot entanglement, CODA enables faithful compositional generation, a prerequisite for advanced applications like robotic manipulation, video editing, and world modeling where specific objects must be manipulated independently.
Simplicity and Scalability: The approach avoids complex architectural changes or heavy supervision, relying instead on a simple contrastive objective and register slots. This makes it a practical and efficient framework for integrating OCL with powerful pretrained diffusion models.

In summary, CODA provides a robust, scalable, and effective framework for learning disentangled, object-centric representations, paving the way for more controllable and interpretable generative AI systems.
