CoreEditor: Correspondence-constrained Diffusion for Consistent 3D Editing

CoreEditor is a framework for consistent text-driven 3D editing. It introduces a correspondence-constrained attention mechanism that leverages both geometric and semantic similarity to enforce cross-view consistency, producing sharp, high-quality edits, and it gives users flexible control through a selective editing pipeline.

Zhe Zhu, Honghua Chen, Peng Li, Mingqiang Wei

Published 2026-02-20

Imagine you have a magical, 3D photo album. You can walk around the objects in it, looking at them from every angle. Now, imagine you want to change one of those objects using just a sentence, like "turn the stone horse into a zebra."

This is the dream of 3D Editing. But here's the problem: current tools are like a team of artists who can't talk to each other. If you ask them to paint a zebra, Artist A (looking from the left) paints a zebra with stripes. Artist B (looking from the right) paints a zebra with spots. Artist C (looking from behind) paints a horse. When you stitch these views together, the result is a blurry, glitchy mess that looks nothing like a real 3D object.

The paper introduces CoreEditor, a new system that acts like a super-organized project manager to fix this chaos. Here is how it works, broken down into simple concepts:

1. The Problem: The "Silent Artists"

Current methods try to edit 3D scenes by editing 2D pictures from different angles. But because the computer doesn't know which pixel in the "left view" corresponds to which pixel in the "right view," it gets confused.

  • The Analogy: Imagine trying to build a 3D puzzle where the pieces from different boxes are mixed up. You try to force a piece from the "sky" box into the "ground" slot. The result is a blurry, nonsensical image.

2. The Solution: The "Correspondence-Constrained Attention" (The Magic Glue)

CoreEditor's secret sauce is a mechanism called Correspondence-Constrained Attention (CCA).

  • The Analogy: Think of the 3D scene as a group of people holding hands in a circle. In the old methods, everyone was shouting their own ideas, and no one was listening.
  • How CoreEditor fixes it: It forces the "people" (pixels) who are actually the same object to hold hands and whisper to each other. If the pixel representing the "left eye" of a statue in View A is talking to the "left eye" in View B, they must agree on what color to be. They are constrained to stay consistent.
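The "hand-holding" idea can be sketched as masked cross-view attention: each pixel in one view is only allowed to attend to the pixels it corresponds to in another view. This is a minimal illustration of the constraint, not the paper's actual implementation; the function name, tensor shapes, and the hard boolean mask are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def correspondence_constrained_attention(q, k, v, corr_mask):
    """Cross-view attention where each query pixel may only attend to
    the key pixels its correspondence mask allows.

    q: (N, d) query features from view A
    k, v: (M, d) key/value features from view B
    corr_mask: (N, M) boolean, True where pixels correspond
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Block attention between pixels that do not correspond.
    scores = np.where(corr_mask, scores, -1e9)
    return softmax(scores, axis=-1) @ v
```

With a one-to-one correspondence mask, each pixel in view A simply copies the value of its matching pixel in view B, which is exactly the "they must agree" behavior described above.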

3. The Twist: Geometry isn't Enough (The "Ghost" Problem)

Sometimes, you can't see the "hand-holding" because something is blocking the view (occlusion).

  • The Scenario: Imagine a statue of a bear. From the front, you see both eyes. From the side, the nose blocks the left eye. The computer can't find the "left eye" pixel in the side view because it's hidden.
  • The Old Way: The computer gives up and leaves that spot blank or blurry.
  • CoreEditor's Way: It uses Semantic Similarity. Even if the eye is hidden, the computer knows, "Hey, the right eye in this view looks very similar to the left eye in the front view." It uses semantic clues (meaning) to fill in the gaps, not just geometric clues (position).
  • The Metaphor: It's like a detective who can't see a suspect because they are behind a wall. Instead of giving up, the detective looks at the suspect's shadow or their voice to figure out where they are. CoreEditor uses "semantic shadows" to find the hidden parts.
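The geometric-plus-semantic idea can be sketched as a matching rule with a fallback: use the geometric correspondence when it exists, and otherwise pick the most semantically similar pixel by feature similarity. This is a toy sketch under assumed data shapes; the function and the cosine-similarity fallback are illustrative, not the paper's exact formulation.

```python
import numpy as np

def match_with_semantic_fallback(feats_a, feats_b, geo_match):
    """For each pixel in view A, return an index into view B.

    feats_a: (N, d) pixel features in view A
    feats_b: (M, d) pixel features in view B
    geo_match: (N,) int, geometric correspondence into view B,
               with -1 marking occluded pixels (no geometric match)
    """
    # Cosine similarity between every pixel pair across the two views.
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sem_match = (a @ b.T).argmax(axis=1)  # best semantic neighbour

    # Prefer geometry; fall back to semantics where geometry is blind.
    return np.where(geo_match >= 0, geo_match, sem_match)
```

The `-1` entries are the "hidden left eye" cases: geometry gives up, so the semantic nearest neighbour steps in.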

4. The "Selective Editing" Pipeline (The Editor's Choice)

Sometimes, the computer tries to edit a scene and comes up with five different versions of a "zebra." One is cute, one is scary, one is cartoonish. If you just mash them all together, you get a weird hybrid.

  • The Innovation: CoreEditor lets you (the user) pick the version you like best first.
  • The Analogy: Imagine a chef making five different soups. Instead of blending them all into one gross pot, you taste them, pick the "Spicy Tomato" one, and say, "Make all the bowls taste like this." CoreEditor takes your favorite version and uses it as a Reference to guide the other views, ensuring they all follow the same style.
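The reference-guided selection can be sketched as follows: the user picks one candidate edit in one view, and every other view then keeps whichever of its own candidates is closest to that reference in feature space. This is a simplified stand-in for the paper's selective pipeline; the function name, the (views × candidates × features) layout, and the cosine-similarity criterion are assumptions.

```python
import numpy as np

def select_consistent_candidates(candidates, ref_view, ref_idx):
    """Pick one candidate per view, guided by a user-chosen reference.

    candidates: (V, K, d) feature vectors, K candidate edits per view
    ref_view, ref_idx: the view and candidate the user selected
    Returns: (V,) index of the chosen candidate in each view.
    """
    ref = candidates[ref_view, ref_idx]
    ref = ref / np.linalg.norm(ref)
    c = candidates / np.linalg.norm(candidates, axis=-1, keepdims=True)
    sims = c @ ref                 # (V, K) similarity to the reference
    choice = sims.argmax(axis=1)   # most reference-like candidate per view
    choice[ref_view] = ref_idx     # keep the user's pick verbatim
    return choice
```

In the soup analogy: `ref` is the "Spicy Tomato" bowl, and every other view picks the candidate that tastes most like it.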

5. The Result: A Crystal Clear 3D World

By combining these three things:

  1. Forcing pixels to talk to their counterparts (CCA).
  2. Using "meaning" to find hidden parts (Semantic Support).
  3. Letting the user pick the best style first (Selective Pipeline).

...CoreEditor creates 3D edits that are sharp, consistent, and actually look like the object you asked for. No more blurry textures or glitchy 3D models.

In Summary

If current 3D editing is like a chaotic choir where everyone sings a different song, CoreEditor is the conductor who:

  1. Makes sure the singers are looking at the same sheet music (Consistency).
  2. Helps them hear each other even when they are behind a wall (Semantic Support).
  3. Lets the audience pick the song they want to hear first, then teaches everyone to sing that specific song (Selective Pipeline).

The result? A harmonious, high-quality 3D masterpiece.
