BeautyGRPO: Aesthetic Alignment for Face Retouching via Dynamic Path Guidance and Fine-Grained Preference Modeling

Imagine you have a photo of a friend, and you want to make it look perfect for a social media post. You want to get rid of a few pesky pimples or a stray hair, but you don't want to turn them into a plastic doll. You want them to still look like them, just the best version of themselves.

This is the problem the BeautyGRPO paper tackles. Here is the breakdown in simple terms:

The Problem: The "Uncanny Valley" of Photo Editing

Current photo editing tools face a dilemma, like a chef trying to cook a perfect meal:

The "Copy-Paste" Chef (Supervised Learning): These tools are trained by looking at thousands of "before and after" photos. They try to copy the "after" photo pixel-by-pixel.
- The Flaw: They are too rigid. If the "perfect" photo in their training data had a weirdly smooth chin, the tool will make every chin look like plastic. They mimic the data but miss the feeling of what looks good to a human.
The "Wild Experiment" Chef (Standard Reinforcement Learning): These tools try to learn by guessing and checking what humans like. They are creative and can find new, beautiful styles.
- The Flaw: They are too chaotic. Because they are "guessing," they often add weird noise, grain, or distortions. It's like a chef who keeps adding random spices until the dish tastes like a science experiment.

BeautyGRPO is the solution that combines the best of both worlds: the creativity to find new styles, but with the discipline to keep the photo looking real and high-quality.

The Secret Sauce: Three Magic Ingredients

1. The "Taste Tester" (FRPref-10K & The Reward Model)

Before the AI can learn to edit, it needs to know what "good" looks like. The researchers built a dataset of 10,000 preference pairs called FRPref-10K.

The Analogy: Imagine a panel of expert art critics comparing 10,000 pairs of retouched photos. They don't just say "I like this." They break it down: "The skin looks too waxy," "The mole is gone (bad!)," "The pores look natural (good!)."
They trained a specialized AI "Taste Tester" (Reward Model) that can judge these tiny details. It knows the difference between a natural pore and a plastic smudge.

2. The "Safety Net" (Dynamic Path Guidance - DPG)

This is the paper's biggest innovation. When the AI tries to "explore" new ways to edit the photo (to find something better than the original), it usually drifts off course and creates noise.

The Analogy: Imagine you are walking through a foggy forest (the editing process) trying to find a hidden treasure (the perfect photo).
- Old Way: You wander blindly. You might find the treasure, but you might also fall into a swamp (create noise/artifacts).
- BeautyGRPO Way: You have a tether attached to a sturdy tree (the "Anchor"). You are allowed to wander far and wide to explore, but the tether gently pulls you back if you get too close to the swamp.
How it works: The AI uses a high-quality "anchor" image as a reference point. It explores freely, but if it starts to drift into "noise territory," the tether (Dynamic Path Guidance) gently steers it back toward a clear, high-quality path without stopping the exploration.

3. The "Fine-Tuning" (GRPO)

Once the AI has the "Taste Tester" and the "Safety Net," it starts practicing. It generates many versions of a photo, the Taste Tester scores them, and the AI learns to make the next one even better. It's like a student taking practice tests, getting graded, and then studying the questions they got wrong.

Why is this a Big Deal?

If you look at the results in the paper, you can see the difference:

Old Tools: Often make skin look like a smooth, shiny balloon (over-smoothed) or leave acne behind because they are scared to change the image too much.
BeautyGRPO: It removes the acne while keeping the natural texture of the skin. It keeps moles and freckles (which are part of a person's identity) while making the skin look healthy and glowing.

In a nutshell:
BeautyGRPO is like a master portrait artist who has a safety harness. They are brave enough to try new, beautiful ways to enhance a face, but the harness ensures they never accidentally ruin the photo with weird glitches or plastic-looking skin. It learns to edit based on human taste, not just pixel copying.

1. Problem Statement

Face retouching presents a fundamental trade-off between high-fidelity identity preservation and subjective aesthetic enhancement.

Limitations of Supervised Fine-Tuning (SFT): Existing methods rely on pixel-level supervision, i.e., mimicking labeled reference images. This approach fails to capture complex, subjective human aesthetic preferences, often resulting in rigid, unnatural edits or overfitting to specific styles. It struggles to remove imperfections (acne, blemishes) while preserving unique identity features (moles, pores, skin texture).
Limitations of Standard Reinforcement Learning (RL): While online RL (e.g., FlowGRPO) excels at aligning with human preferences through exploration, its stochastic nature introduces noise artifacts and trajectory drift. In face retouching, where visual stability is critical, the accumulated stochastic noise in standard SDE (Stochastic Differential Equation) sampling degrades image fidelity, leading to unnatural textures and identity distortion.
Reward Granularity: Existing reward models for image generation focus on global aesthetics or instruction following, lacking the fine-grained sensitivity required to evaluate specific retouching dimensions such as skin smoothness versus texture naturalness.

2. Methodology: BeautyGRPO

The authors propose BeautyGRPO, a reinforcement learning framework designed to align face retouching with human aesthetic preferences while maintaining high fidelity. The framework consists of three core components:

A. FRPref-10K Dataset & Specialized Reward Model

To address the lack of fine-grained preference data, the authors constructed FRPref-10K, a large-scale dataset containing 10,000 high-resolution preference pairs.

Dimensions: Annotations cover five key dimensions: Skin Smoothing, Blemish Removal, Texture Quality, Clarity, and Identity Preservation.
Annotation Pipeline: A hybrid approach using Vision-Language Models (VLMs) for initial reasoning-based scoring across the five dimensions, followed by human verification and expert adjudication to ensure alignment with aesthetic judgment.
Reward Model Training: A specialized reward model (based on Qwen2.5-VL) is trained using a three-stage strategy:
1. SFT: Structured reasoning initialization.
2. Self-Training: Consistency filtering to reinforce reliable reasoning patterns.
3. GRPO: Robustness enhancement via Group Relative Policy Optimization to handle inconsistent samples.
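
To make the five-dimension annotation concrete, here is a minimal sketch of how one preference pair might be represented. The class name, field names, the numeric score scale, and the unweighted-sum rule for picking the preferred image are all assumptions for illustration; the summary above does not specify the dataset schema or how per-dimension scores are aggregated.

```python
from dataclasses import dataclass
from typing import Dict

# The five annotation dimensions named in the summary.
DIMENSIONS = (
    "skin_smoothing",
    "blemish_removal",
    "texture_quality",
    "clarity",
    "identity_preservation",
)


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical record for one preference pair (not the real schema)."""

    source_id: str
    scores_a: Dict[str, float]  # per-dimension scores for candidate A
    scores_b: Dict[str, float]  # per-dimension scores for candidate B

    def preferred(self) -> str:
        """Return 'a', 'b', or 'tie' using an assumed unweighted sum."""
        total_a = sum(self.scores_a[d] for d in DIMENSIONS)
        total_b = sum(self.scores_b[d] for d in DIMENSIONS)
        if total_a == total_b:
            return "tie"
        return "a" if total_a > total_b else "b"


# Candidate A smooths aggressively; candidate B keeps texture and identity.
pair = PreferencePair(
    source_id="example-0001",
    scores_a={"skin_smoothing": 4, "blemish_removal": 5, "texture_quality": 2,
              "clarity": 4, "identity_preservation": 3},
    scores_b={"skin_smoothing": 3, "blemish_removal": 4, "texture_quality": 5,
              "clarity": 4, "identity_preservation": 5},
)
print(pair.preferred())
```

Keeping the scores per dimension, rather than storing only a single winner label, is what would let a reward model learn why one retouch is preferred over another.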

B. Dynamic Path Guidance (DPG)

To resolve the conflict between exploration (needed for RL) and fidelity (needed for face retouching), the authors introduce Dynamic Path Guidance (DPG).

The Conflict: Standard FlowGRPO converts the deterministic ODE trajectory into an SDE by injecting noise to enable exploration. However, this causes the sampling trajectory to drift away from the high-fidelity manifold, creating artifacts.
The Solution: DPG stabilizes the stochastic trajectory by dynamically computing an anchor-based ODE path.
- Stability Anchors: High-preference exemplars from FRPref-10K serve as anchors ($x_{anchor}$). These are not used as ground-truth supervision targets but as geometric guides.
- Replanning: At each sampling timestep, DPG replans a guided trajectory toward the anchor. It calculates a correction vector ($z_{anchor}$) that steers the trajectory back toward the high-fidelity manifold.
- Controlled Stochasticity: The final update blends the correction vector with standard Gaussian noise using a time-dependent coefficient $\lambda(t)$. Early timesteps use strong anchor guidance to correct structural deviations, while later timesteps rely more on noise for fine-grained exploration.
- Result: This allows the model to explore solutions that surpass the anchor's quality (driven by the reward signal) while preventing the trajectory from drifting into low-fidelity regions.
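
The controlled-stochasticity step can be sketched as a blend between the anchor correction vector and fresh Gaussian noise. This is a toy illustration, not the paper's formula: the linear schedule in `guidance_weight`, the variance-preserving `sqrt(1 - w^2)` scaling, and the use of a sampling-progress value in [0, 1] (0 = first step, 1 = last step) in place of the paper's timestep convention are all assumptions, and `z_anchor` is taken as a given input rather than computed from a replanned trajectory.

```python
import math
import random
from typing import List


def guidance_weight(progress: float) -> float:
    """Hypothetical stand-in for lambda(t).

    progress = 0.0 is the first sampling step, 1.0 is the last.
    The weight is high early (strong anchor guidance) and decays to zero.
    """
    return max(0.0, min(1.0, 1.0 - progress))


def blended_noise(z_anchor: List[float], progress: float,
                  rng: random.Random) -> List[float]:
    """Mix the anchor correction with standard Gaussian noise.

    Assumed form: w * z_anchor + sqrt(1 - w^2) * eps, with eps ~ N(0, 1).
    """
    w = guidance_weight(progress)
    scale = math.sqrt(1.0 - w * w)
    return [w * z + scale * rng.gauss(0.0, 1.0) for z in z_anchor]


# Early steps follow the anchor correction; late steps are mostly random.
rng = random.Random(0)
z_anchor = [1.0, -2.0, 0.5]
for progress in (0.0, 0.5, 1.0):
    print(progress, blended_noise(z_anchor, progress, rng))
```

At `progress = 0.0` the injected term equals the anchor correction exactly, and at `progress = 1.0` it is pure Gaussian noise, which mirrors the described shift from structural correction to fine-grained exploration.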

C. Training Framework

The system uses a FluxKontext backbone with LoRA fine-tuning. The policy is optimized using the GRPO objective, where the reward is provided by the specialized retouching reward model, and the sampling process is guided by DPG.
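
As a rough illustration of the group-relative part of GRPO, the sketch below normalizes the rewards of a group of candidate retouches generated from the same input. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions; the reward values are made-up placeholders standing in for scores from the retouching reward model, and the full objective (policy ratios, clipping, any KL term) is omitted.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own group's mean and spread.

    Candidates scoring above the group mean get a positive advantage,
    those below get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an implementation choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Placeholder rewards for four candidates sampled from one input photo.
rewards = [0.2, 0.5, 0.9, 0.4]
for reward, advantage in zip(rewards, group_relative_advantages(rewards)):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Because advantages are computed relative to the group, the policy update depends on which candidates the reward model ranks higher within a group, not on the absolute scale of the reward.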

3. Key Contributions

FRPref-10K Dataset: A large-scale, fine-grained preference dataset covering five critical retouching dimensions, enabling the training of a reward model capable of detecting subtle perceptual differences.
Specialized Reward Model: A vision-language model trained to provide dense, interpretable feedback on skin smoothing, blemish removal, and identity preservation, surpassing general image editing reward models.
Dynamic Path Guidance (DPG): A novel algorithm that reconciles the stochastic exploration of online RL with the high-fidelity constraints of face retouching. It stabilizes sampling trajectories without sacrificing the ability to discover aesthetic improvements beyond the training data.
BeautyGRPO Framework: An end-to-end RL framework that outperforms both specialized face retouching models and general image editing models.

4. Experimental Results

Extensive experiments on the FFHQR and In-the-wild datasets demonstrate the superiority of BeautyGRPO:

Quantitative Performance: BeautyGRPO achieves state-of-the-art scores on no-reference perceptual metrics (NIQE, NIMA, MUSIQ, MANIQA, TOPIQ) and maintains high identity preservation (ArcFace scores). It significantly outperforms specialized models (e.g., RetouchFormer, VRetouchEr) and general editing models (e.g., NanoBanana, SeedDream).
Qualitative Improvements: Visual comparisons show that BeautyGRPO effectively removes blemishes and smooths skin while preserving natural texture, pores, and distinctive features (e.g., moles). In contrast, baselines often suffer from over-smoothing (plastic look), incomplete blemish removal, or identity shifts.
User Study: In a study with 100 participants, BeautyGRPO achieved a 63.25% win rate, significantly higher than the next best method (12.00%), indicating strong alignment with human aesthetic preferences.
Ablation Studies:
- The specialized reward model consistently outperforms existing general reward models (EditReward, UnifiedReward).
- DPG is crucial; removing it (using standard FlowGRPO) leads to excessive noise and lower quality.
- The method generalizes well across different backbone models (e.g., Qwen-Image-Edit).

5. Significance

This paper addresses a critical gap in generative AI for portrait editing: how to automate subjective aesthetic enhancement without sacrificing identity or realism.

Paradigm Shift: It moves face retouching from pixel-level mimicry (SFT) to preference-based alignment (RL), allowing models to generate results that are aesthetically superior to the training labels.
Technical Innovation: The Dynamic Path Guidance mechanism offers a novel solution to the "fidelity-exploration" conflict in diffusion/flow-matching models, making online RL viable for high-stakes, high-fidelity tasks like portrait retouching where standard stochastic exploration usually fails.
Practical Impact: The proposed framework provides a robust foundation for next-generation mobile and professional portrait editing tools that deliver natural, high-quality, and personalized beauty enhancements.