CoreEditor: Correspondence-constrained Diffusion for Consistent 3D Editing

CoreEditor is a framework for consistent text-driven 3D editing. It introduces a correspondence-constrained attention mechanism that leverages both geometric and semantic similarity to enforce cross-view consistency, producing sharp, high-quality edits, and it gives users flexible control through a selective editing pipeline.

Zhe Zhu, Honghua Chen, Peng Li, Mingqiang Wei

Published 2026-02-20

Imagine you have a magical, 3D photo album. You can walk around the objects in it, looking at them from every angle. Now, imagine you want to change one of those objects using just a sentence, like "turn the stone horse into a zebra."

This is the dream of 3D Editing. But here's the problem: current tools are like a team of artists who can't talk to each other. If you ask them to paint a zebra, Artist A (looking from the left) paints a zebra with stripes. Artist B (looking from the right) paints a zebra with spots. Artist C (looking from behind) paints a horse. When you stitch these views together, the result is a blurry, glitchy mess that looks nothing like a real 3D object.

The paper introduces CoreEditor, a new system that acts like a super-organized project manager to fix this chaos. Here is how it works, broken down into simple concepts:

1. The Problem: The "Silent Artists"

Current methods try to edit 3D scenes by editing 2D pictures from different angles. But because the computer doesn't know which pixel in the "left view" corresponds to which pixel in the "right view," it gets confused.

  • The Analogy: Imagine trying to build a 3D puzzle where the pieces from different boxes are mixed up. You try to force a piece from the "sky" box into the "ground" slot. The result is a blurry, nonsensical image.

2. The Solution: The "Correspondence-Constrained Attention" (The Magic Glue)

CoreEditor's secret sauce is a mechanism called Correspondence-Constrained Attention (CCA).

  • The Analogy: Think of the 3D scene as a group of people holding hands in a circle. In the old methods, everyone was shouting their own ideas, and no one was listening.
  • How CoreEditor fixes it: It forces the "people" (pixels) who are actually the same object to hold hands and whisper to each other. If the pixel representing the "left eye" of a statue in View A is talking to the "left eye" in View B, they must agree on what color to be. They are constrained to stay consistent.
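The "hand-holding" idea can be sketched as masked cross-view attention: each pixel in one view is only allowed to attend to the pixels it corresponds to in another view. This is a minimal illustration of the constraint, not the paper's actual implementation; the function name, tensor shapes, and the hard boolean mask are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def correspondence_constrained_attention(q, k, v, corr_mask):
    """Cross-view attention where each query pixel may only attend to
    the key pixels its correspondence mask allows.

    q: (N, d) query features from view A
    k, v: (M, d) key/value features from view B
    corr_mask: (N, M) boolean, True where pixels correspond
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Block attention between pixels that do not correspond.
    scores = np.where(corr_mask, scores, -1e9)
    return softmax(scores, axis=-1) @ v
```

With a one-to-one correspondence mask, each pixel in view A simply copies the value of its matching pixel in view B, which is exactly the "they must agree" behavior described above.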

3. The Twist: Geometry isn't Enough (The "Ghost" Problem)

Sometimes, you can't see the "hand-holding" because something is blocking the view (occlusion).

  • The Scenario: Imagine a statue of a bear. From the front, you see both eyes. From the side, the nose blocks the left eye. The computer can't find the "left eye" pixel in the side view because it's hidden.
  • The Old Way: The computer gives up and leaves that spot blank or blurry.
  • CoreEditor's Way: It uses Semantic Similarity. Even if the eye is hidden, the computer knows, "Hey, the right eye in this view looks very similar to the left eye in the front view." It uses semantic clues (meaning) to fill in the gaps, not just geometric clues (position).
  • The Metaphor: It's like a detective who can't see a suspect because they are behind a wall. Instead of giving up, the detective looks at the suspect's shadow or their voice to figure out where they are. CoreEditor uses "semantic shadows" to find the hidden parts.
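The geometric-plus-semantic idea can be sketched as a matching rule with a fallback: use the geometric correspondence when it exists, and otherwise pick the most semantically similar pixel by feature similarity. This is a toy sketch under assumed data shapes; the function and the cosine-similarity fallback are illustrative, not the paper's exact formulation.

```python
import numpy as np

def match_with_semantic_fallback(feats_a, feats_b, geo_match):
    """For each pixel in view A, return an index into view B.

    feats_a: (N, d) pixel features in view A
    feats_b: (M, d) pixel features in view B
    geo_match: (N,) int, geometric correspondence into view B,
               with -1 marking occluded pixels (no geometric match)
    """
    # Cosine similarity between every pixel pair across the two views.
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sem_match = (a @ b.T).argmax(axis=1)  # best semantic neighbour

    # Prefer geometry; fall back to semantics where geometry is blind.
    return np.where(geo_match >= 0, geo_match, sem_match)
```

The `-1` entries are the "hidden left eye" cases: geometry gives up, so the semantic nearest neighbour steps in.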

4. The "Selective Editing" Pipeline (The Editor's Choice)

Sometimes, the computer tries to edit a scene and comes up with five different versions of a "zebra." One is cute, one is scary, one is cartoonish. If you just mash them all together, you get a weird hybrid.

  • The Innovation: CoreEditor lets you (the user) pick the version you like best first.
  • The Analogy: Imagine a chef making five different soups. Instead of blending them all into one gross pot, you taste them, pick the "Spicy Tomato" one, and say, "Make all the bowls taste like this." CoreEditor takes your favorite version and uses it as a Reference to guide the other views, ensuring they all follow the same style.
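The reference-guided selection can be sketched as follows: the user picks one candidate edit in one view, and every other view then keeps whichever of its own candidates is closest to that reference in feature space. This is a simplified stand-in for the paper's selective pipeline; the function name, the (views × candidates × features) layout, and the cosine-similarity criterion are assumptions.

```python
import numpy as np

def select_consistent_candidates(candidates, ref_view, ref_idx):
    """Pick one candidate per view, guided by a user-chosen reference.

    candidates: (V, K, d) feature vectors, K candidate edits per view
    ref_view, ref_idx: the view and candidate the user selected
    Returns: (V,) index of the chosen candidate in each view.
    """
    ref = candidates[ref_view, ref_idx]
    ref = ref / np.linalg.norm(ref)
    c = candidates / np.linalg.norm(candidates, axis=-1, keepdims=True)
    sims = c @ ref                 # (V, K) similarity to the reference
    choice = sims.argmax(axis=1)   # most reference-like candidate per view
    choice[ref_view] = ref_idx     # keep the user's pick verbatim
    return choice
```

In the soup analogy: `ref` is the "Spicy Tomato" bowl, and every other view picks the candidate that tastes most like it.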

5. The Result: A Crystal Clear 3D World

By combining these three things:

  1. Forcing pixels to talk to their counterparts (CCA).
  2. Using "meaning" to find hidden parts (Semantic Support).
  3. Letting the user pick the best style first (Selective Pipeline).

...CoreEditor creates 3D edits that are sharp, consistent, and actually look like the object you asked for. No more blurry textures or glitchy 3D models.

In Summary

If current 3D editing is like a chaotic choir where everyone sings a different song, CoreEditor is the conductor who:

  1. Makes sure the singers are looking at the same sheet music (Consistency).
  2. Helps them hear each other even when they are behind a wall (Semantic Support).
  3. Lets the audience pick the song they want to hear first, then teaches everyone to sing that specific song (Selective Pipeline).

The result? A harmonious, high-quality 3D masterpiece.
