Follow-Your-Shape: Shape-Aware Image Editing via Trajectory-Guided Region Control

Imagine you have a photo of a parrot sitting on a branch, and you want to turn it into a hat, or a soccer ball into a guitar, without touching the background trees or the sky.

Doing this with current AI tools is like trying to repaint a car while driving it at 100 mph. You either crash the car (ruin the background) or fail to change the color (the shape doesn't change enough).

The paper "Follow-Your-Shape" introduces a new, smarter way to do this. Think of it as a magic sculptor that knows exactly where to cut and paste, leaving everything else perfectly untouched.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Blurry Map"

Older AI editing tools are like a painter with a shaky hand. They try to guess which part of the image to change based on the words you type (e.g., "change the parrot to a hat").

The Issue: They often get confused. They might change the parrot, but accidentally turn the sky purple or the branch into a snake. They lack a precise "map" of where the change should happen.

2. The Solution: The "Trajectory Divergence Map" (TDM)

The authors came up with a clever trick. Instead of guessing, they watch the AI's "thought process" in real-time.

Imagine the AI is walking a path to create an image.

Path A (The Original): The AI walks a path to recreate the original parrot.
Path B (The Edit): The AI tries to walk a path to create a hat.

In the beginning, both paths look very similar (they are both just "noise"). But as the AI gets closer to the final image, the two paths split apart.

Where the parrot's body is, the path to the hat veers sharply away from the original path.
The path for the background (the trees) stays exactly the same for both.

The Trajectory Divergence Map (TDM) is like a heat map that highlights exactly where these two paths split.

Red Hot Spots: "Here is where the parrot becomes a hat!" (Change this).
Cool Blue Areas: "Here is where the path didn't change." (Leave the trees alone).

This map is generated automatically by the AI itself, so you don't need to draw a mask or tell the computer exactly where the object is.

3. The Strategy: "The Three-Act Play"

The authors observe that if you try to change the shape immediately, the AI gets confused because the image is still mostly noise. So, they break the editing process into three stages, like a play:

Act 1: The Anchor (Stabilization)
- Analogy: Imagine you are trying to change a car's color while it's moving. First, you need to park it.
- What happens: The AI spends the first few steps just "rebuilding" the original image. It locks the background in place so nothing drifts away.
Act 2: The Exploration (The Split)
- Analogy: Now that the car is parked, you start painting.
- What happens: The AI starts the transformation. It uses the TDM (the heat map we mentioned earlier) to see exactly where the "parrot" and "hat" paths are diverging. It gathers data on where the change is happening.
Act 3: The Precision Cut (The Final Touch)
- Analogy: You take the paintbrush and apply the new color only to the hot spots on your map, ignoring the rest.
- What happens: The AI mixes the "new hat" features with the "old parrot" features, but only in the areas the map told it to. The background remains 100% untouched.

4. The Result: "Follow-Your-Shape"

Because this method uses the AI's own internal "path splitting" to find the object, it is incredibly good at:

Big Changes: Turning a small bird into a giant dragon, or a cup into a lion.
Clean Backgrounds: The trees, sky, and floor stay exactly as they were.
No Masks Needed: You don't need to draw a circle around the object; the AI figures it out on its own.

Summary Analogy

Think of the old way as trying to swap a tire on a moving car by guessing where the wheel is. It's messy and dangerous.

Follow-Your-Shape is like putting the car on a lift, watching the exact moment the wheel separates from the axle, and then swapping it with surgical precision, ensuring the rest of the car doesn't even shake.

The authors also built a new test called ReShapeBench (like a driving test for shape-changing) and report that their method beats the existing editing methods they compared against.

1. Problem Statement

Recent image editing models based on diffusion and flow-matching (e.g., Rectified Flow) have demonstrated strong general-purpose capabilities. However, they struggle significantly with large-scale shape transformations (e.g., changing a parrot into a hat, or a car into a bike).

Limitations of Current Methods:
- Rigid Masks: Methods relying on external binary masks (e.g., SAM) are too rigid, often failing to capture fine boundary details or struggling with significant geometric changes.
- Unreliable Attention: Methods using cross-attention maps to infer editable regions are often noisy and inconsistent, leading to "ghosting" or incomplete transformations.
- Global Injection: Unconditional Key-Value (KV) injection preserves background structure well but lacks selectivity, often suppressing the intended edits or failing to localize the shape change.
Core Challenge: The inability to precisely localize where the shape change should occur while strictly preserving the non-target background, especially when the object's geometry undergoes a fundamental structural shift.

2. Methodology: Follow-Your-Shape

The authors propose Follow-Your-Shape, a training-free and mask-free framework that achieves precise shape editing by dynamically analyzing the model's behavior during the generation process.

Key Components:

A. Trajectory Divergence Map (TDM)

Concept: The method posits that the semantic difference between a source prompt and a target prompt manifests as a divergence in their denoising trajectories within the latent space.
Computation:
1. Inversion: The source image is inverted to obtain a latent sequence $\{x_t\}$ guided by the source prompt ($c_{src}$).
2. Denoising: A parallel editing trajectory $\{z_t\}$ is generated using the target prompt ($c_{tgt}$).
3. Velocity Difference: At each timestep $t$, the TDM value $\delta^{(i)}_t$ for token $i$ is calculated as the L2 norm of the difference between the velocity fields predicted along the two trajectories:
  $\delta^{(i)}_t = \| v_\theta(z^{(i)}_t, t, c_{tgt}) - v_\theta(x^{(i)}_t, t, c_{src}) \|_2$
4. Normalization: The map is min-max normalized to a $[0, 1]$ scale, where high values indicate regions of significant semantic change (the object to be edited) and low values indicate stable background regions.
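The velocity-difference and normalization steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `trajectory_divergence_map`, the `(num_tokens, dim)` array layout, and the `eps` guard are assumptions, and the two velocity arrays stand in for the outputs of $v_\theta$ on the editing and inversion trajectories.

```python
import numpy as np

def trajectory_divergence_map(v_tgt, v_src, eps=1e-8):
    """Per-token divergence between two velocity predictions.

    v_tgt, v_src: arrays of shape (num_tokens, dim), assumed to hold
    v_theta(z_t, t, c_tgt) and v_theta(x_t, t, c_src) respectively.
    Returns an array of shape (num_tokens,) scaled to [0, 1].
    """
    # L2 norm of the velocity difference, taken over the feature axis.
    delta = np.linalg.norm(v_tgt - v_src, axis=-1)
    # Min-max normalization; eps avoids division by zero for a flat map.
    lo, hi = delta.min(), delta.max()
    return (delta - lo) / (hi - lo + eps)

# Toy data: 16 tokens, 8 features. Only the first four tokens diverge.
rng = np.random.default_rng(0)
v_src = rng.normal(size=(16, 8))
v_tgt = v_src.copy()
v_tgt[:4] += 3.0

tdm = trajectory_divergence_map(v_tgt, v_src)
```

In this toy case the first four entries of `tdm` are close to 1 (the "edit" tokens) and the rest are 0 (the "background" tokens).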

B. Scheduled KV Injection Strategy
Directly applying TDM guidance across all timesteps is suboptimal because early timesteps (high noise) produce unstable TDMs. The authors introduce a three-stage pipeline:

Stage 1: Initial Trajectory Stabilization:
- For the first $k_{front}$ steps, the method performs unconditional KV injection from the source inversion path.
- Goal: Anchor the generation to a faithful reconstruction manifold to prevent semantic drift and stabilize the background before any shape modification begins.
Stage 2: Editing and TDM Aggregation:
- The model enters an editing window where it explores the target prompt.
- TDMs are computed and stored at each step.
- Temporal Fusion: To create a robust edit mask, the TDMs are aggregated using a softmax-weighted temporal fusion. This ensures that tokens changing over time are captured even if they appear static at a single timestep.
- Mask Generation: The aggregated map is smoothed with a Gaussian kernel and binarized using Otsu's thresholding to create a precise spatial mask ( $M_S$ ).
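The aggregation and mask-generation steps of Stage 2 could look roughly like the sketch below. Several details are assumptions made for illustration: the softmax is taken over the time axis separately for each token, with a hypothetical `temperature` parameter; the Gaussian `sigma` and the 256-bin Otsu histogram are arbitrary choices; and all function names are invented here. The summary above does not specify these details.

```python
import numpy as np

def softmax_temporal_fusion(tdms, temperature=1.0):
    """Fuse per-step maps of shape (T, H, W) into one (H, W) map.

    Assumption: weights are a softmax over time for each token, so steps
    where a token diverges strongly contribute more to the fused value.
    """
    logits = tdms / temperature
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)
    return (weights * tdms).sum(axis=0)

def gaussian_smooth(img, sigma=1.0):
    """Separable Gaussian blur with edge padding; output keeps the input shape."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def otsu_threshold(img, bins=256):
    """Otsu's method: pick the cut that maximizes between-class variance."""
    hist, edges = np.histogram(img, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    p = hist / hist.sum()
    w0 = np.cumsum(p)               # weight of the class at or below each bin
    w1 = 1.0 - w0                   # weight of the class above each bin
    mu = np.cumsum(p * centers)     # cumulative mean
    valid = (w0 > 1e-12) & (w1 > 1e-12)
    between = np.zeros_like(w0)
    between[valid] = (mu[-1] * w0[valid] - mu[valid]) ** 2 / (w0[valid] * w1[valid])
    # Use the upper edge of the best bin as the threshold.
    return edges[1:][np.argmax(between)]

# Toy data: 5 steps on a 16x16 token grid; a 6x6 square diverges over time.
rng = np.random.default_rng(0)
tdms = 0.05 * rng.random((5, 16, 16))
tdms[:, 4:10, 4:10] += np.linspace(0.3, 0.9, 5)[:, None, None]

fused = softmax_temporal_fusion(tdms)
smoothed = gaussian_smooth(fused, sigma=1.0)
mask = smoothed > otsu_threshold(smoothed)   # boolean edit mask, plays the role of M_S
```

Here `mask` is `True` inside the square that diverges and `False` far away from it, which is the behavior the spatial mask $M_S$ is meant to have.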
Stage 3: Structural and Semantic Conformance:
- Blended KV Injection: The final edit is performed by blending Key-Value features based on the mask $M_S$:
  $\{K^*, V^*\} \leftarrow M_S \odot \{K_{tgt}, V_{tgt}\} + (1 - M_S) \odot \{K_{inv}, V_{inv}\}$
  - $K_{tgt}, V_{tgt}$ : Features from the target prompt (for the object).
  - $K_{inv}, V_{inv}$ : Features from the source inversion (for the background).
- ControlNet Guidance: Optional ControlNet (Depth/Canny) is applied to enforce structural consistency, particularly for complex geometry.
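The blending rule in Stage 3 is an element-wise mix of two feature sets. The sketch below shows it on toy arrays, assuming one mask value per token and a `(num_tokens, dim)` feature layout. The function name `blend_kv` is hypothetical, and a real attention layer would also carry batch and head dimensions.

```python
import numpy as np

def blend_kv(k_tgt, v_tgt, k_inv, v_inv, mask):
    """Blend target and inversion Key/Value features with a per-token mask.

    mask: shape (num_tokens,), 1 where the edit applies, 0 for background.
    Features: shape (num_tokens, dim).
    """
    m = mask.astype(k_tgt.dtype)[:, None]   # broadcast over the feature axis
    k_star = m * k_tgt + (1.0 - m) * k_inv
    v_star = m * v_tgt + (1.0 - m) * v_inv
    return k_star, v_star

# Toy data: 6 tokens, 4 features. Target features are 1s, inversion features are 0s.
k_tgt = np.ones((6, 4))
v_tgt = np.ones((6, 4))
k_inv = np.zeros((6, 4))
v_inv = np.zeros((6, 4))
mask = np.array([1, 1, 0, 0, 0, 0])

k_star, v_star = blend_kv(k_tgt, v_tgt, k_inv, v_inv, mask)
```

Tokens inside the mask take the target features and all other tokens keep the inversion features, so the background follows the source reconstruction.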

3. Key Contributions

Follow-Your-Shape Framework: A novel, training-free, and mask-free editing framework that enables large-scale shape transformations while strictly preserving background content.
Trajectory Divergence Map (TDM): A mechanism to dynamically localize editable regions by quantifying the velocity divergence between source and target denoising trajectories, eliminating the need for external segmentation masks.
Scheduled KV Injection: A staged approach that stabilizes the initial trajectory and adaptively applies guidance, solving the instability issues of early-stage editing.
ReShapeBench: A new benchmark specifically designed for shape-aware editing, containing 120 new images and 290 editing cases (single and multi-object) with curated prompt pairs to evaluate large-scale structural changes.

4. Experimental Results

The method was evaluated on ReShapeBench and the existing PIE-Bench, comparing against state-of-the-art diffusion (MasaCtrl, PnPInversion) and flow-based (RF-Edit, FlowEdit, KV-Edit) baselines.

Quantitative Performance:
- Background Preservation: Achieved the highest PSNR (35.79 on ReShapeBench) and lowest LPIPS (8.23), significantly outperforming baselines. This confirms superior retention of non-target content.
- Text-Image Alignment: Achieved the highest CLIP Similarity (33.71), indicating strong adherence to the target prompt.
- Image Quality: Achieved the highest Aesthetic Score (6.57).
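For readers unfamiliar with the background-preservation metric, PSNR is $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below computes it only over background pixels. Restricting the metric to a background mask is an assumption about the evaluation protocol, and the function name and toy images are illustrative only.

```python
import numpy as np

def background_psnr(src, edited, background_mask, max_val=255.0):
    """PSNR between two images, restricted to pixels where background_mask is True.

    src, edited: arrays of shape (H, W, 3); background_mask: boolean (H, W).
    """
    diff = src.astype(np.float64) - edited.astype(np.float64)
    mse = np.mean(diff[background_mask] ** 2)
    if mse == 0:
        return float("inf")   # identical background
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy data: an 8x8 image whose centre 4x4 block is "edited";
# the background is off by one intensity level everywhere.
src = np.full((8, 8, 3), 100, dtype=np.uint8)
edited = src + 1
edited[2:6, 2:6] = 200

background_mask = np.ones((8, 8), dtype=bool)
background_mask[2:6, 2:6] = False

psnr_bg = background_psnr(src, edited, background_mask)
psnr_full = background_psnr(src, edited, np.ones((8, 8), dtype=bool))
```

The background-only score is higher than the full-image score because the intended edit is excluded from the error.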
Qualitative Performance:
- The method successfully handles complex transformations (e.g., "A parrot" $\to$ "A hat", "A car" $\to$ "A bike") where baselines often fail to change the shape or degrade the background.
- It effectively handles multi-object scenarios without requiring manual masks.
Ablation Studies:
- $k_{front}$ (Stabilization steps): $k_{front}=2$ provided the optimal trade-off between background stability and editability. Too few steps caused drift; too many suppressed the edit.
- ControlNet: While beneficial for structural guidance, the method remains robust even without ControlNet, which suggests that the TDM-guided KV injection is effective on its own.

5. Significance and Impact

Paradigm Shift: Moves away from reliance on external segmentation tools (masks) or noisy attention maps, instead deriving edit regions directly from the generative model's internal dynamics.
Solving a Hard Problem: Addresses the specific and difficult challenge of large-scale shape replacement, a task where current SOTA models typically fail or produce artifacts.
Benchmarking: The introduction of ReShapeBench fills a critical gap in the literature, providing a standardized way to evaluate shape-aware editing, which was previously conflated with general text-to-image editing tasks.
Practicality: Being training-free and compatible with existing flow-based models (like FLUX.1), the method is immediately applicable to current generative pipelines without requiring model retraining.

In summary, Follow-Your-Shape represents a significant advancement in controllable image generation, offering a robust solution for precise, large-scale structural edits while maintaining high-fidelity background preservation.
