Imagine you have a magic paintbrush that can turn a blank canvas into a beautiful, detailed painting just by hearing a description like "two cats playing." This is what modern Diffusion Models (the AI behind tools like Midjourney or DALL-E) do. They start with a screen full of static noise (like TV snow) and slowly, step-by-step, refine it until a clear image appears.
For a long time, computer scientists thought these models were just "generators"—they create pictures, but they don't really "see" or understand the individual objects inside them in a way that helps us separate them.
Enter TRACE.
The researchers behind this paper discovered a secret: While the AI is painting the picture, it actually knows exactly where one object ends and another begins, even before the picture is finished. They call this framework TRACE (TRAnsforming diffusion Cues to instance Edges).
Here is the simple breakdown of how it works, using some everyday analogies:
1. The "Magic Moment" (Instance Emergence Point)
Imagine you are watching a time-lapse video of a sculptor carving a statue out of a block of stone.
- At the start: It's just a rough block. You can't tell if it's a horse or a dog.
- At the end: It's a perfect statue, but the "magic" of the carving process is over.
- The Secret: There is a specific moment in the middle where the sculptor makes the first clear cut that separates the horse's leg from the body. Before that, it was just a blob; after that, the shape is locked in.
The TRACE team found that diffusion models have this exact "magic moment." They call it the Instance Emergence Point (IEP). At this specific split-second during the AI's "denoising" process, the internal maps the AI uses to think about the image suddenly shift from "this is a blurry blob" to "this is a distinct object."
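For readers who like code, here is a toy sketch of the "magic moment" idea. The function name and the sharpness metric below are made up for illustration; the paper's actual IEP criterion works on the diffusion model's internal attention maps, not on a hand-rolled entropy score.

```python
import numpy as np

def find_instance_emergence_point(attn_maps):
    """Toy sketch: given per-timestep maps of shape (T, H, W),
    find the denoising step where the maps sharpen most abruptly.
    'Sharpness' here is negative entropy of the normalized map;
    the IEP is taken as the step with the biggest jump in sharpness.
    (Illustrative stand-in, not the paper's metric.)"""
    sharpness = []
    for m in attn_maps:
        p = m.flatten()
        p = p / p.sum()
        entropy = -np.sum(p * np.log(p + 1e-12))
        sharpness.append(-entropy)
    jumps = np.diff(sharpness)
    return int(np.argmax(jumps)) + 1  # step where the shape "locks in"

# Demo: three blurry (uniform) maps, then three sharply peaked ones.
H, W = 8, 8
maps = [np.ones((H, W)) for _ in range(3)]           # "blob" phase
peaked = np.full((H, W), 1e-3)
peaked[2, 2] = 1.0                                   # "object" phase
maps += [peaked.copy() for _ in range(3)]
iep = find_instance_emergence_point(np.array(maps))  # -> 3
```

The toy detector fires at step 3, exactly where the uniform "blob" maps give way to peaked "object" maps.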
2. The "X-Ray Vision" (Attention Boundary Divergence)
Once they found that magic moment, they needed to see the edges.
Think of the AI's "self-attention" as a group of tiny workers inside the computer.
- Inside an object: All the workers are talking to each other, agreeing on the details. They are a tight-knit team.
- Across an edge: The workers on the left side of a cat stop talking to the workers on the right side of a dog. The conversation stops abruptly.
The paper uses a method called ABDiv (Attention Boundary Divergence) to measure this "silence." Where the conversation between the workers stops abruptly, that's the edge of the object. It's like sweeping a stud finder along a wall: the reading changes exactly where the material underneath changes.
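Here is a tiny, hypothetical sketch of that "silence detector": compare each pixel's attention distribution with its neighbor's, and call it an edge wherever the two sharply disagree. The symmetric KL divergence used below is a plausible stand-in, not necessarily the paper's exact ABDiv formula.

```python
import numpy as np

def attention_boundary_map(attn):
    """Toy sketch: attn has shape (H, W, D), where attn[y, x] is
    pixel (y, x)'s attention distribution over D tokens. Edge
    strength between horizontal neighbors is their symmetric KL
    divergence: near zero inside an object, large across a boundary.
    (Illustrative stand-in for ABDiv, not its actual formula.)"""
    eps = 1e-12
    p = attn[:, :-1] + eps
    q = attn[:, 1:] + eps
    kl_pq = np.sum(p * np.log(p / q), axis=-1)
    kl_qp = np.sum(q * np.log(q / p), axis=-1)
    return kl_pq + kl_qp  # shape (H, W-1)

# Demo: left half attends to token 0 ("cat"), right half to token 1 ("dog").
H, W, D = 4, 6, 2
attn = np.zeros((H, W, D))
attn[:, :3, 0] = 1.0   # the "cat" workers all talk to token 0
attn[:, 3:, 1] = 1.0   # the "dog" workers all talk to token 1
edges = attention_boundary_map(attn)
# Divergence spikes between columns 2 and 3 -- the cat/dog boundary.
```

The divergence is essentially zero between like-minded neighbors and spikes exactly at the column where the "conversation" switches topics.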
3. The "Speedy Translator" (One-Step Distillation)
Here's the problem: Finding that "magic moment" and measuring the "silence" for every single photo takes a long time (like watching the whole time-lapse video for every image). It's too slow for real-world use.
So, the researchers built a Speedy Translator.
- They used the slow, careful method to teach a tiny, super-fast AI student.
- The student learned to look at a photo and instantly say, "Ah, I see the edges!" without needing to watch the whole time-lapse video.
- This makes the process 81 times faster. Instead of reading the whole book to learn the plot, you glance at the title and already know it.
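The distillation idea can be sketched in a few lines: run the slow pipeline once to produce "teacher" edge maps, then train a tiny one-step student to mimic them. Everything below (the fake teacher, the one-parameter student) is a toy stand-in chosen so the loop fits on one screen; the paper's teacher and student are, of course, real neural networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def slow_teacher(x):
    """Stand-in for the expensive multi-step pipeline (finding the
    IEP, then measuring ABDiv): here it just returns horizontal
    gradients as fake 'edge maps'. Purely illustrative."""
    return np.abs(x[:, 1:] - x[:, :-1])

# Teacher labels are computed once, offline -- the slow part.
x = rng.random((32, 16))
target = slow_teacher(x)

# One-step "student": a single learnable weight w applied to a cheap
# feature, trained by gradient descent to mimic the teacher.
feature = np.abs(x[:, 1:] - x[:, :-1])
w = 0.0
losses = []
for _ in range(100):
    pred = w * feature
    grad = 2 * np.mean((pred - target) * feature)
    w -= 0.5 * grad
    losses.append(np.mean((pred - target) ** 2))
# The student's error collapses and w converges to 1: it now produces
# the teacher's edge maps in a single cheap step.
```

The pattern is the same at full scale: pay the slow cost once to generate training targets, then amortize it into a model that answers instantly.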
Why is this a Big Deal?
The Old Way: To teach a computer to separate two cats sitting next to each other, humans had to draw outlines around every single cat in thousands of photos. This is expensive, boring, and slow.
The TRACE Way: We don't need humans to draw anything. The AI already knows how to separate the cats because it learned it while learning how to draw them.
The Results:
- No Labels Needed: It works without any human drawing boxes or points.
- Better Separation: It stops the AI from merging two different cats into one giant "cat-monster."
- Faster: It's incredibly quick.
- Versatile: It works on everything from standard photos to complex scenes, helping other AI tools (like the famous "Segment Anything" model) do their jobs much better.
In a Nutshell
The paper reveals that AI image generators are secretly expert edge-detectives. They just needed someone to ask the right question at the right time during the painting process. TRACE is the tool that asks that question, listens to the answer, and gives us a perfect map of where every object begins and ends—all without a single human drawing a line.