Score-Guided Proximal Projection: A Unified Geometric Framework for Rectified Flow Editing

Imagine you have a very talented artist (the AI model) who can draw beautiful pictures from scratch. However, you want to give them a specific instruction: "Take this photo of a house, but turn it into a castle, while keeping the same roof shape and window placement."

This is the challenge of Image Editing with AI. You want to change the meaning (house to castle) without losing the identity (the specific house structure).

Current methods have two big problems:

The "Rigid Robot" (Inversion-based): This method tries to retrace the exact steps the AI took to create the original house. It's so strict that if you ask for a castle, the robot says, "I can't! I'm locked to the house path!" The result is a weird hybrid that looks like a house with castle textures, but the structure doesn't change.
The "Wobbly Acrobat" (Posterior Sampling): This method tries to guess the perfect castle by calculating millions of possibilities. It's powerful but unstable. It often trips over its own feet, creating blurry, messy images, or it takes so long to calculate that it's impractical.

The Solution: SGPP (Score-Guided Proximal Projection)

The authors propose a new method called SGPP. Think of it as giving the artist a smart, elastic guide rope instead of a rigid metal bar or a chaotic free-for-all.

Here is how it works, using simple analogies:

1. The Elastic Rope (The "Proximal" Part)

Imagine the original house photo is tied to a post. The AI is trying to walk away to draw a castle.

Old Method: The rope is a steel chain. The AI can't move far enough to change the shape of the building. It's "geometrically locked."
SGPP Method: The rope is made of elastic. It pulls the AI back toward the original house (so the roof and windows stay recognizable), but it stretches enough to let the AI walk over to the "Castle" zone.
The Magic Knob: The authors introduce a variable called $\sigma_p$ (proximal variance).
- Turn the knob to 0: The rope becomes a steel chain (Rigid/Strict). You get perfect identity preservation but no real change.
- Turn the knob to 0.5: The rope becomes stretchy (Soft Guidance). The AI can stretch the house into a castle, keeping the spirit of the house but changing the structure.

2. The Invisible Magnet (The "Score" Part)

The AI model has a built-in "magnet" (called the Score Field) that knows where "real" images live. Imagine a landscape where "real" images are valleys and "fake" images are high mountains.

If the AI tries to draw something weird (like a house with a dragon head), the magnet pulls it back down into the valley of "realistic images."
SGPP uses this magnet to ensure that even while stretching the elastic rope, the AI never wanders off into "nonsense land." It guarantees the final image looks like a photo, not a glitch.

3. The "Snap-Back" Safety Net

One of the paper's biggest claims is Geometric Stability.
Imagine you are walking on a narrow, winding mountain path (the "Data Manifold"). If you step off the path, you fall.

Old methods might let you step off the path and then try to guess where you should be, often failing.
SGPP acts like a magnetic safety rail. If you step off the path, the rail gently but firmly pulls you back onto the trail. The paper proves mathematically that, as long as you stay within reach of the rail (close enough to the path), this "snap-back" force always shrinks your distance to the trail, so the image stays realistic.

The Big Picture: Why is this a big deal?

The paper unifies two worlds that were previously fighting each other:

Optimization: Being precise and deterministic (like a calculator).
Sampling: Being creative and random (like a dice roll).

SGPP shows that these are actually the same thing, just viewed through a different lens. By adjusting that "Elasticity Knob" ($\sigma_p$), you can slide smoothly between:

Strict Reconstruction: "Fix this blurry photo exactly as it was."
Creative Editing: "Turn this cat into a lion, but keep its pose."

Summary in a Nutshell

SGPP is a new way to edit AI images that uses a stretchy, magnetic guide.

It prevents the AI from getting stuck (too rigid).
It prevents the AI from falling off the cliff (creating nonsense).
It lets you dial in exactly how much you want to change the image, from "just fix the noise" to "completely transform the object," all without needing to retrain the AI or do complex math on the fly.

It's the difference between shoving a boulder up a hill by brute force (old methods) and using a pulley system that does the heavy lifting for you (SGPP).

1. Problem Statement

The paper addresses the challenge of controlled inverse problems (e.g., semantic image editing, blind image recovery) using Rectified Flow (RF) models. While RF models offer state-of-the-art generation quality with straighter transport trajectories than standard diffusion models, controlling them for precise tasks remains difficult due to the "perception-distortion trade-off": balancing fidelity (preserving the input's identity/structure) with realism (ensuring the output lies on the learned data manifold).

Current approaches suffer from two distinct limitations:

Inversion-Based Guidance (e.g., RF-Inversion): Enforces "hard guidance" by rigidly retracing the noise inversion path of the source image. This leads to "geometric locking," where the model cannot deviate sufficiently from the original path to accommodate significant semantic changes or correct large out-of-distribution (OOD) corruptions.
Posterior Sampling (e.g., DPS, MCG): Steers the sampler with the likelihood gradient $\nabla_{x_t} \log p(x_{ref}|x_t)$. However, methods like Diffusion Posterior Sampling (DPS) require backpropagating through the denoising network's Jacobian, which is computationally expensive and unstable at high noise levels. Manifold Constrained Gradients (MCG) relies on explicit, approximate projections that are often brittle.

2. Methodology: Score-Guided Proximal Projection (SGPP)

The authors propose SGPP, a unified framework that bridges deterministic optimization and stochastic sampling by reformulating the recovery task as a proximal optimization problem on a time-dependent manifold.

Core Formulation

SGPP defines a time-dependent energy potential $J_t(x_t)$ that balances two terms:

Fidelity Potential: Anchors the trajectory to the reference input $x_{ref}$. Modeled as a Gaussian likelihood with a proximal variance hyperparameter $\sigma_p^2(t)$.
Generative Potential: Derived from the pre-trained score field $\nabla \log p_t(x_t)$ of the Rectified Flow model.

The update rule is a gradient descent step on this energy:
$x_{k+1} = x_k + \eta_k \left( s_\psi(x_k, t_k) - \frac{x_k - (1-t_k)x_{ref}}{(1-t_k)^2\sigma_p^2 + t_k^2} \right)$
Crucially, this formulation is Jacobian-free. It leverages the intrinsic geometry of the RF score field and never backpropagates through the network, which is the source of the unstable gradients in DPS.
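The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the names `sgpp_step` and `toy_score` are hypothetical, $\sigma_p$ is held constant, the time $t$ is fixed, and the pre-trained score $s_\psi$ is replaced by the closed-form score of a toy model in which clean data is $N(0, I)$ and is noised as $(1-t)x_0 + t\epsilon$.

```python
import numpy as np

def sgpp_step(x, t, x_ref, sigma_p, eta, score_fn):
    """One update of the rule above at a fixed time t in (0, 1)."""
    # Variance of the Gaussian fidelity term: (1 - t)^2 * sigma_p^2 + t^2
    var = (1.0 - t) ** 2 * sigma_p ** 2 + t ** 2
    # Pull toward the time-scaled reference (1 - t) * x_ref
    fidelity_grad = -(x - (1.0 - t) * x_ref) / var
    return x + eta * (score_fn(x, t) + fidelity_grad)

def toy_score(x, t):
    """ASSUMPTION: stand-in for s_psi. Score of N(0, ((1-t)^2 + t^2) I)."""
    return -x / ((1.0 - t) ** 2 + t ** 2)

x_ref = np.array([2.0, -1.0])
t, sigma_p, eta = 0.5, 0.2, 0.05

x = np.zeros(2)
for _ in range(500):
    x = sgpp_step(x, t, x_ref, sigma_p, eta, toy_score)

# For this Gaussian toy the fixed point is a precision-weighted average
# of the prior mean (0) and the anchor (1 - t) * x_ref.
prior_precision = 1.0 / ((1.0 - t) ** 2 + t ** 2)
fidelity_precision = 1.0 / ((1.0 - t) ** 2 * sigma_p ** 2 + t ** 2)
expected = fidelity_precision * (1.0 - t) * x_ref / (prior_precision + fidelity_precision)
print(x, expected)
```

Each step needs only one evaluation of the score function and no gradient through it, which is the sense in which the rule is Jacobian-free.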

Theoretical Foundations

The paper provides rigorous geometric proofs regarding the behavior of SGPP within a tubular neighborhood of the data manifold $M_t$:

Score Decomposition (Prop 3.2): The RF score is decomposed into a normal restoring force ($-n_t/t^2$), an intrinsic tangential gradient ($\nabla_T \log p_{M_t}$), and a curvature drift term ($H_t/2$).
Normal Contraction (Prop 3.3): The authors prove that the gradient flow is contractive in the normal direction. The restoring force shrinks the distance from an out-of-distribution input to the manifold exponentially, which gives geometric stability without the backpropagation that destabilizes DPS.
Tangential Drift (Prop 3.4): The motion along the manifold surface corresponds to the semantic evolution of the image, perturbed only by a bounded geometric drift error that vanishes as the trajectory approaches the manifold.
MAP Equivalence (Theorem 3.5): The equilibrium state of the SGPP dynamics corresponds exactly to the Manifold-Constrained Maximum A Posteriori (MAP) estimator. Unlike MCG, which requires explicit projection operators, SGPP achieves this constraint implicitly through the pre-trained score field.
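The normal contraction property can be checked by hand on a flat toy manifold. The sketch below is an illustration under stated assumptions, not the paper's two-moons experiment: clean data is assumed to lie on the x-axis in 2D, $x_0 = (u, 0)$ with $u \sim N(0, 1)$, so the noised density is Gaussian, the normal part of the score is exactly $-n_t/t^2$, and the curvature term vanishes. The name `line_score` is hypothetical.

```python
import numpy as np

def line_score(x, t):
    """Score of the noised density when clean data lies on the x-axis."""
    tangent_var = (1.0 - t) ** 2 + t ** 2  # variance along the manifold
    normal_var = t ** 2                    # variance across the manifold
    return np.array([-x[0] / tangent_var, -x[1] / normal_var])

t, eta = 0.3, 0.02
x = np.array([0.5, 1.0])  # starts at normal distance 1.0 from the line

distances = [abs(x[1])]
for _ in range(20):
    x = x + eta * line_score(x, t)  # plain score ascent, no fidelity term
    distances.append(abs(x[1]))

# In this toy every step multiplies the normal distance by 1 - eta / t^2.
ratio = distances[1] / distances[0]
print(ratio, distances[-1])
```

The per-step factor $1 - \eta/t^2$ is below 1 whenever $\eta < t^2$, so the distance decays geometrically. It also shows why the step size has to shrink with the noise level: for $\eta > 2t^2$ the same update would overshoot and diverge.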

From Optimization to Sampling

The deterministic update converges to the posterior mode (MAP), which can lead to over-smoothed results. To address this, the authors introduce a stochastic sampler (SGPP-SDE) derived from the posterior ODE. This allows the model to sample from the full posterior distribution, recovering high-frequency textures and diversity while maintaining geometric safety.
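The difference between converging to a mode and sampling a distribution can be illustrated with a generic unadjusted Langevin iteration: the same drift as the deterministic update, plus Gaussian noise of scale $\sqrt{2\eta}$. This is a textbook construction used here as an assumption. The summary does not give the exact form of SGPP-SDE, so the sketch should not be read as the authors' sampler. It reuses a 1D Gaussian toy score in place of the pre-trained model, and the names `posterior_score` and `langevin_sample` are hypothetical.

```python
import numpy as np

def posterior_score(x, t, x_ref, sigma_p):
    """Toy prior score plus the Gaussian fidelity gradient (1D)."""
    prior = -x / ((1.0 - t) ** 2 + t ** 2)  # ASSUMPTION: clean data ~ N(0, 1)
    var = (1.0 - t) ** 2 * sigma_p ** 2 + t ** 2
    return prior - (x - (1.0 - t) * x_ref) / var

def langevin_sample(x_ref, t, sigma_p, eta, n_steps, n_chains, seed=0):
    """Run n_chains independent unadjusted Langevin chains from x = 0."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_chains)
    for _ in range(n_steps):
        noise = rng.standard_normal(n_chains)
        x = x + eta * posterior_score(x, t, x_ref, sigma_p) + np.sqrt(2.0 * eta) * noise
    return x

samples = langevin_sample(x_ref=2.0, t=0.5, sigma_p=0.2,
                          eta=0.01, n_steps=500, n_chains=5000)

# Mode of the toy posterior, which the deterministic update would return.
a = 1.0 / (0.5 ** 2 + 0.5 ** 2)
b = 1.0 / (0.5 ** 2 * 0.2 ** 2 + 0.5 ** 2)
mode = b * 0.5 * 2.0 / (a + b)
print(mode, samples.mean(), samples.std())
```

The chains scatter around the mode with a nonzero spread instead of collapsing onto it, which is the diversity a posterior sampler is meant to recover.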

3. Key Contributions

Unified Framework: SGPP unifies deterministic optimization (inversion) and stochastic sampling (posterior sampling) under a single geometric framework.
Soft Guidance Mechanism: The proximal variance $\sigma_p$ controls how tightly the trajectory is tied to the reference.
- As $\sigma_p \to 0$, SGPP reduces to RF-Inversion (hard guidance).
- For $\sigma_p > 0$, the trajectory can deviate from the rigid inversion path to satisfy semantic constraints while remaining geometrically safe (soft guidance).
Theoretical Guarantees: The paper proves Normal Contraction, ensuring that OOD inputs are snapped onto the data manifold, and establishes the equivalence of the method to Manifold-Constrained MAP estimation.
Jacobian-Free Efficiency: The method avoids the computationally expensive and unstable Jacobian calculations required by DPS, relying instead on closed-form guidance terms derived from the linear geometry of Rectified Flows.
Zero-Shot Capability: SGPP is training-free and requires no auxiliary networks or prompt tuning; it uses the reference image itself as a proximal constraint.
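
The role of $\sigma_p$ in the soft guidance mechanism can be read directly off the fidelity term of the update rule. Writing $w(t)$ (a symbol introduced here for illustration) for the weight that multiplies the pull toward $(1-t)x_{ref}$, and treating $\sigma_p$ as constant, the two limits are:

$w(t) = \frac{1}{(1-t)^2\sigma_p^2 + t^2}, \qquad \lim_{\sigma_p \to 0} w(t) = \frac{1}{t^2}, \qquad \lim_{\sigma_p \to \infty} w(t) = 0 \quad (t < 1)$

With $\sigma_p = 0$ the pull grows without bound as $t \to 0$, which matches the hard-guidance regime. A very large $\sigma_p$ switches the pull off and leaves only the generative score.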

4. Results

The authors validate SGPP through:

2D Geometric Experiments: On a "two-moons" manifold, SGPP demonstrates robust convergence and "snaps" OOD points to the manifold spine. In contrast, DPS shows instability (overshooting/misguidance) at high noise, and RF-Inversion exhibits geometric locking (collapsing to the reference).
Semantic Editing (FLUX Model):
- Task: Transforming a "cat" image into a "lion" while preserving pose/background.
- Baseline (RF-Inversion): Fails to generate meaningful semantic changes; the output is a texture-swapped hybrid due to geometric locking.
- SGPP: Successfully generates a lion with a mane and broader muzzle while maintaining the original pose. This is achieved by relaxing the proximal constraint ($\sigma_p = 0.2$) to allow tangential deviation.
Fidelity-Realism Trade-off: Experiments show a continuous spectrum controlled by $\sigma_p$. Low $\sigma_p$ yields strict reconstruction (high fidelity), while higher $\sigma_p$ allows for generative freedom (hallucinating realistic details absent in the reference).
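
The continuous spectrum controlled by $\sigma_p$ can be reproduced in a toy setting where the equilibrium of the update has a closed form. The sketch assumes a 1D prior in which clean data is $N(0, 1)$, a fixed time $t$, and a constant $\sigma_p$. The function name `sgpp_equilibrium` and the chosen values are hypothetical, and the numbers say nothing about the FLUX experiments.

```python
def sgpp_equilibrium(x_ref, t, sigma_p):
    """Fixed point of the update for a 1D toy prior (clean data ~ N(0, 1)).

    Setting score + fidelity gradient to zero gives a precision-weighted
    average of the prior mean (0) and the anchor (1 - t) * x_ref.
    """
    prior_precision = 1.0 / ((1.0 - t) ** 2 + t ** 2)
    fidelity_precision = 1.0 / ((1.0 - t) ** 2 * sigma_p ** 2 + t ** 2)
    anchor = (1.0 - t) * x_ref
    return fidelity_precision * anchor / (prior_precision + fidelity_precision)

x_ref, t = 3.0, 0.1
sigmas = [0.0, 0.2, 0.5, 1.0, 5.0]
equilibria = [sgpp_equilibrium(x_ref, t, s) for s in sigmas]

# Distance between each equilibrium and the anchor (1 - t) * x_ref.
gaps = [abs((1.0 - t) * x_ref - e) for e in equilibria]
for s, e, g in zip(sigmas, equilibria, gaps):
    print(f"sigma_p={s:<4} equilibrium={e:.3f} gap_to_anchor={g:.3f}")
```

As `sigma_p` grows, the equilibrium slides away from the reference anchor and toward the prior mean, a one-dimensional version of trading fidelity for generative freedom.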

5. Significance

This work provides a theoretically grounded, training-free solution for controlled generation with Rectified Flow models. By reinterpreting inversion-based methods as a limiting case of a proximal optimization problem, SGPP resolves the tension between strict identity preservation and generative flexibility. It offers a robust alternative to Jacobian-based posterior sampling, eliminating the "exploding gradient" problem while providing rigorous geometric guarantees for stability and manifold adherence. This framework is particularly significant for applications requiring high-fidelity image recovery and flexible semantic editing without the need for retraining or complex auxiliary networks.