Copy-Transform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints

This paper presents a zero-shot 3D object alignment framework that optimizes relative pose using CLIP-driven gradients and geometry-aware constraints via a differentiable renderer, achieving semantically faithful and physically plausible results without requiring new model training.

Rotem Gatenyo, Ohad Fried

Published 2026-03-03

Imagine you have two 3D objects on a computer screen, like a bun and a burger patty, but they are floating in empty space, far apart and not touching. You want to put them together to make a burger, but you don't want to manually drag and drop them. Instead, you just want to type a sentence: "Put the patty on the bun."

This paper introduces a clever AI system called COPY-TRANSFORM-PASTE that does exactly that. It acts like a super-smart, invisible robot hand that reads your text and physically moves the 3D objects until they fit together perfectly, just like you imagined.

Here is how it works, broken down into simple concepts:

1. The "Magic Eye" (Vision-Language)

First, the AI needs to understand what you want. It uses a tool called CLIP (think of it as a super-advanced librarian that knows how images and words connect).

  • How it works: The AI takes a picture of the floating objects, shows it to the "librarian," and asks, "Does this look like a burger?"
  • The Feedback Loop: If the answer is "No, that looks like a floating mess," the AI gets a tiny nudge (a gradient) telling it to move the objects slightly. It keeps doing this, taking thousands of tiny steps, until the picture it sees matches your text description perfectly.
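The feedback loop above can be sketched in a few lines. This is a toy stand-in, not the paper's implementation: `toy_score` replaces the real CLIP image-text similarity (which would require rendering the scene and running CLIP), and we use finite-difference gradients instead of backpropagation through a differentiable renderer. The target pose and all numbers are made up for illustration.

```python
import numpy as np

def toy_score(pose, target=np.array([0.0, 1.0, 0.0])):
    """Stand-in for the CLIP 'librarian': higher when the rendered
    scene at this pose would match the text prompt. Here we fake it
    with a score that peaks at a hypothetical target position."""
    return -np.sum((pose - target) ** 2)

def optimize_pose(pose, lr=0.1, steps=200, eps=1e-4):
    """Climb the score via many tiny nudges (gradient ascent).
    The real system gets gradients by backpropagating through a
    differentiable renderer; we approximate with finite differences."""
    pose = pose.astype(float).copy()
    for _ in range(steps):
        grad = np.zeros_like(pose)
        for i in range(pose.size):
            d = np.zeros_like(pose)
            d[i] = eps
            grad[i] = (toy_score(pose + d) - toy_score(pose - d)) / (2 * eps)
        pose += lr * grad  # one tiny nudge per step
    return pose

# Start far away; thousands of nudges pull the pose toward the target.
final = optimize_pose(np.array([3.0, -2.0, 1.5]))
```

The structure is the same as the real system: render, score against the text, nudge, repeat.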

2. The "Ghost Hands" (Geometric Constraints)

Here's the problem: The "Magic Eye" is great at understanding ideas, but it's bad at understanding physics. If you just ask it to "put the patty on the bun," it might slide the patty inside the bun like a ghost, or leave a huge gap.

To fix this, the authors added two "rules" to the AI's brain:

  • The Sticky Tape (Soft-ICP): Imagine the surfaces of the objects are covered in tiny, invisible Velcro dots. The AI is told to stick the closest dots of the two objects together. This ensures they actually touch.
  • The Anti-Ghost Rule (Penetration Loss): The AI is strictly forbidden from letting one object pass through the other (unless you specifically ask for something like a knife cutting an apple). If the patty tries to go inside the bun, the AI feels a "pain" signal and pushes it back out.
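These two rules can be written as simple loss terms over point clouds. A minimal sketch, with several assumptions: the "sticky tape" term pulls each point of object A toward its nearest neighbour on object B (the soft-ICP idea), and the "anti-ghost" term approximates object B as a sphere so that penetration depth is easy to compute; the paper's actual penetration test works on real meshes, and the function names here are invented.

```python
import numpy as np

def soft_icp_loss(pts_a, pts_b):
    """'Sticky tape': average distance from each point of A to its
    closest point on B. Minimizing this pulls the surfaces into contact."""
    dists = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    return dists.min(axis=1).mean()

def penetration_loss(pts_a, center_b, radius_b):
    """'Anti-ghost' rule, with B crudely approximated by a sphere:
    any point of A that ends up inside B contributes its depth of
    penetration as a 'pain' signal; points outside contribute zero."""
    depth = radius_b - np.linalg.norm(pts_a - center_b, axis=-1)
    return np.clip(depth, 0.0, None).sum()
```

Contact is rewarded and interpenetration is punished, so the optimizer settles on surfaces that touch without overlapping.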

3. The "Zoom-In" Strategy (Phased Optimization)

Trying to solve the whole puzzle at once is hard. So, the AI plays a game of "Hot and Cold" in stages:

  • Phase 1 (The Wide Shot): The AI starts with a wide camera view. It's just trying to get the objects roughly near each other. It's allowed to be a bit messy and even let them pass through each other slightly (like sliding a flower into a vase).
  • Phases 2 & 3 (The Zoom-In): As it gets closer, the camera zooms in tight on the interaction area. The "Sticky Tape" and "Anti-Ghost" rules get stronger. Now, the AI is obsessed with making sure the surfaces touch perfectly and nothing is overlapping.
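One way to picture the phases is as a schedule of loss weights: the semantic (text-matching) term is always on, while the geometric terms ramp up as the camera zooms in. The zoom levels and weights below are invented for illustration, not the paper's actual values.

```python
# Hypothetical per-phase schedule: camera zoom plus weights on the
# semantic (CLIP) term, the contact ("sticky tape") term, and the
# penetration ("anti-ghost") term.
PHASES = [
    {"name": "wide shot", "zoom": 1.0, "w_clip": 1.0, "w_contact": 0.1, "w_pen": 0.0},
    {"name": "zoom-in",   "zoom": 2.5, "w_clip": 1.0, "w_contact": 1.0, "w_pen": 1.0},
    {"name": "close-up",  "zoom": 5.0, "w_clip": 1.0, "w_contact": 2.0, "w_pen": 5.0},
]

def total_loss(semantic, contact, penetration, phase):
    """Combine the three terms under the current phase's weights."""
    return (phase["w_clip"] * semantic
            + phase["w_contact"] * contact
            + phase["w_pen"] * penetration)
```

Note how Phase 1 tolerates penetration entirely (weight 0.0, allowing the flower-into-vase slide), while the final phase punishes it heavily.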

4. The "Try Again" Safety Net

Sometimes, the AI might get stuck in a bad spot (like putting the burger patty on the side of the bun instead of the top). To fix this, the system doesn't just try once. It runs the simulation five times starting from different random positions. At the end, it picks the result that looks the best according to your text.
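The safety net is a classic best-of-N restart pattern. A minimal sketch, where `run_alignment` is a placeholder for one full optimization run (phases 1 through 3) that returns a result and its final text-match score; the real system would score the final render with CLIP, while here the "score" is just a seeded random number.

```python
import random

def run_alignment(seed):
    """Stand-in for one full optimization from a random starting pose.
    Returns (result, score); in the real system the score would be
    the CLIP similarity between the final render and the text prompt."""
    rng = random.Random(seed)
    quality = rng.uniform(0, 1)  # placeholder for the final alignment
    return quality, quality       # score == quality in this toy version

def best_of_n(n=5):
    """Run n times from different random starts, keep the best scorer."""
    runs = [run_alignment(seed) for seed in range(n)]
    return max(runs, key=lambda r: r[1])[0]
```

Because each run starts somewhere different, at least one usually avoids the "patty on the side of the bun" trap, and the text-based score picks it out at the end.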

Why is this cool?

  • No Training Needed: Usually, to teach a computer to do this, you need thousands of examples of "burgers" and "pizzas." This method works zero-shot, meaning it has never seen a burger before. It just uses its general knowledge of language and geometry to figure it out on the fly.
  • It's Creative: You can ask for weird things, like "Pinocchio wearing a hat" or "A golden necklace on a stand," and it will figure out the physics of how a hat sits on a head or how a necklace drapes.

The Bottom Line

This paper is about teaching computers to be 3D editors. Instead of you manually moving every piece, you just describe the scene, and the AI uses a mix of "reading your mind" (text) and "feeling the physics" (geometry) to assemble the objects for you, just like a master chef plating a dish.