Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment

The paper introduces CODA, a method that enhances object-centric learning by integrating register slots to mitigate slot entanglement and applying a contrastive alignment loss to strengthen slot-image correspondence, resulting in superior object discovery and generation performance on both synthetic and real-world datasets.

Bac Nguyen, Yuhta Takida, Naoki Murata, Chieh-Hsin Lai, Toshimitsu Uesaka, Stefano Ermon, Yuki Mitsufuji

Published 2026-02-20

Imagine you are looking at a busy street scene through a window. You see a red car, a person walking a dog, a traffic light, and a coffee shop.

The Problem: The "Muddy Bucket" Approach
Current AI models trying to understand this scene often use a method called "Slot Attention." Think of this like having a team of workers (called "slots") who are supposed to pick up specific items from the street and put them in their own buckets.

However, under the older approach, the workers are messy:

  1. The Muddy Bucket (Entanglement): One worker might try to carry the red car, but they also accidentally grab the dog and a piece of the coffee shop sign. Their bucket is a muddy mix of everything. If you ask the AI to "show me just the car," it can't, because the car is mixed with the dog.
  2. The Confused Worker (Weak Alignment): Sometimes, a worker picks up the entire street instead of just the car. Other times, they split the car into three different buckets. They don't know exactly where one object ends and another begins.

This makes it hard for the AI to do cool things like "remove the car but keep the dog" or "swap the red car for a blue truck."

The Solution: CODA (The Organized Warehouse)
The paper introduces a new system called CODA (Contrastive Object-centric Diffusion Alignment). It fixes the mess using two clever tricks:

Trick 1: The "Trash Can" Workers (Register Slots)

Imagine you have a team of workers, but you also hire a few extra workers whose only job is to be Trash Cans.

  • When the main workers are trying to pick up the "Red Car," they might get distracted by the background noise (the sky, the pavement, or the fact that the car is next to a tree).
  • Instead of forcing the main workers to hold onto this confusing background noise, they can just toss it into the Trash Can workers.
  • The Result: The main workers now hold a much cleaner concept of the "Red Car." The Trash Can workers absorb the leftover junk. This keeps the main buckets better organized and more clearly separated.
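For readers who think in code, here is a toy sketch of the Trash Can idea. It assumes that register slots are simply extra slots that compete for attention alongside the object slots, so that each image patch splits its attention across both groups. The function name `attend`, the array sizes, and the random inputs are all made up for illustration; the paper's actual architecture may differ.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(features, object_slots, register_slots):
    """Split each patch's attention between object slots and register slots.

    features:       (N, D) image patch features
    object_slots:   (K, D) the main "workers"
    register_slots: (R, D) the extra "trash can" workers (an assumption here)
    """
    slots = np.concatenate([object_slots, register_slots], axis=0)  # (K+R, D)
    logits = features @ slots.T / np.sqrt(features.shape[1])        # (N, K+R)
    # Softmax over the slot axis: slots compete for every patch.
    attn = softmax(logits, axis=1)
    k = object_slots.shape[0]
    return attn[:, :k], attn[:, k:]

rng = np.random.default_rng(0)
features = rng.normal(size=(16, 8))        # 16 patches, 8-dim features
object_slots = rng.normal(size=(3, 8))     # 3 main workers
register_slots = rng.normal(size=(2, 8))   # 2 trash-can workers

obj_attn, reg_attn = attend(features, object_slots, register_slots)
# Attention that lands on the registers is attention the object slots
# no longer have to carry.
register_share = reg_attn.sum(axis=1)
```

In this sketch, `register_share` is the fraction of each patch's attention that the Trash Can workers soak up, which is the part the main workers would otherwise have had to hold.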

Trick 2: The "Strict Manager" (Contrastive Alignment)

In the old system, the workers were just told, "Try to rebuild the street scene." They didn't get punished for being lazy or confused.

CODA adds a Strict Manager who uses a game of "Spot the Difference":

  • The manager shows a worker a bucket labeled "Red Car" and asks, "Does this look like the red car in the photo?"
  • Then, the manager shows them a bucket labeled "Red Car" but filled with "Dog" or "Coffee Shop" (a mismatch).
  • The manager says, "If you pick the wrong one, you get a penalty!"
  • The Result: The workers learn to be much more precise. They realize, "Oh, I must only grab the red car, or I get in trouble." This pushes each bucket to match one specific object in the image rather than a vague mix.
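The "Spot the Difference" game can be sketched as a generic contrastive loss. This is an InfoNCE-style stand-in chosen for illustration, not necessarily the exact loss used in the paper. The score matrix is hypothetical: `scores[i, j]` is assumed to say how well the buckets from scene `j` explain image `i`, so the diagonal holds the correct matches and everything else is a mismatch.

```python
import numpy as np

def contrastive_alignment_loss(scores):
    """Generic InfoNCE-style loss over a square score matrix.

    scores[i, j]: how well slot set j explains image i (higher is better).
    Matched pairs sit on the diagonal; off-diagonal entries are mismatches.
    """
    # Stable log-softmax over each row.
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # Penalize low probability on the correct (diagonal) match.
    return -np.mean(np.diag(log_probs))

# Precise workers: each image is best explained by its own buckets.
aligned = np.array([[5.0, 0.0, 0.0],
                    [0.0, 5.0, 0.0],
                    [0.0, 0.0, 5.0]])

# Confused workers: every set of buckets explains every image equally well.
confused = np.zeros((3, 3))

loss_aligned = contrastive_alignment_loss(aligned)
loss_confused = contrastive_alignment_loss(confused)
```

The confused case pays the larger penalty, which is the Strict Manager at work: the loss only goes down when the right buckets explain the right image better than the wrong ones do.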

Why This Matters (The Superpower)

Because the workers are now organized and precise, the AI gains a superpower: Compositional Editing.

  • Before: If you asked the AI to "remove the car," it might remove the car and the dog, or leave a weird hole in the sky.
  • With CODA: You can say, "Remove the car," and the AI has a much better idea of which bucket holds the car. It takes that bucket away, and the rest of the scene (the dog, the traffic light) stays in place. You can even swap the "Red Car" bucket for a "Blue Truck" bucket, and the AI generates a new, realistic image with a blue truck in that spot.
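Once each object lives in its own bucket, editing becomes simple bookkeeping on the list of buckets. The sketch below treats a scene as an array with one row per slot. The helper names `remove_slot` and `swap_slot`, the slot values, and the object labels in the comments are hypothetical, and the diffusion decoder that would turn the edited slots back into an image is not shown.

```python
import numpy as np

def remove_slot(slots, index):
    # Return a copy of the slots with one object's slot dropped.
    return np.delete(slots, index, axis=0)

def swap_slot(slots, index, new_slot):
    # Return a copy of the slots with one object's slot replaced.
    edited = slots.copy()
    edited[index] = new_slot
    return edited

# Four toy slots with 3-dim features; the labels are for illustration only.
slots = np.arange(12, dtype=float).reshape(4, 3)
# row 0: red car, row 1: dog, row 2: traffic light, row 3: coffee shop

blue_truck = np.array([9.0, 9.0, 9.0])   # a slot taken from some other scene

without_car = remove_slot(slots, 0)          # "remove the car"
with_truck = swap_slot(slots, 0, blue_truck) # "swap the car for a truck"
# In a full system, the edited slots would be handed to the decoder
# to generate the new image.
```

Both helpers leave the other rows untouched, which mirrors the point above: the dog and the traffic light are not disturbed when the car's bucket changes.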

The Bottom Line

The authors built a system that teaches AI to look at a messy picture, sort each object into its own clean, distinct box, and set aside the background noise. This allows the AI to not just see the world, but to understand it well enough to rearrange it, edit it, and imagine new scenes that still make sense.

It's like going from a child dumping a whole box of LEGOs onto the floor, to a master builder who has sorted every brick by color and shape, ready to build anything they can imagine.
