PhotoAgent: Agentic Photo Editing with Exploratory Visual Aesthetic Planning

PhotoAgent is an autonomous image editing system that uses explicit aesthetic planning, tree search, and closed-loop visual feedback to carry out multi-step edits without detailed user prompts. It is evaluated with the newly introduced UGC-Edit benchmark.

Mingde Yao, Zhiyuan You, King-Man Tam, Menglu Wang, Tianfan Xue

Published 2026-03-03

Imagine you have a beautiful but slightly imperfect photo. Maybe the sky is a bit gray, the colors are dull, or there's a stranger in the background you want to remove.

In the past, fixing this was like trying to fix a car engine with a manual written in a language you don't speak. You had to know exactly which "sliders" to move (brightness, contrast, saturation) and in what order. If you asked a computer, "Make this picture better," it would often just guess, sometimes making it look weird or over-the-top.

PhotoAgent is like hiring a super-intelligent, patient, and artistic personal editor who doesn't just follow orders but actually thinks about how to make your photo look amazing.

Here is how it works, broken down into simple steps:

1. The Problem: The "Human Loop" is Tiring

Currently, if you want to edit a photo, you have to be the boss. You have to tell the computer: "First, make the sky blue. Then, remove that person. Then, make the colors pop."

  • The issue: Most of us aren't professional photographers. We don't know the right steps. We get tired after trying four different commands, and the result is still messy.

2. The Solution: The "PhotoAgent" Team

PhotoAgent acts like a tiny, autonomous film crew working on your photo. It has four main characters, each with a specific job:

  • 👀 The Perceiver (The Art Critic):
    This is the team member who looks at your photo and says, "Hmm, the sky looks flat, and the car is too red." Instead of just waiting for you to tell it what to do, it comes up with a list of ideas: "Let's add clouds," "Let's brighten the grass," or "Let's change the car to blue."

  • 🧠 The Planner (The Chess Master):
    This is the smartest part. Instead of just picking the first idea the Perceiver had, the Planner plays out a game of "What if?"

    • Analogy: Imagine you are playing chess. You don't just move a piece; you think, "If I move here, my opponent might move there, and then I can win."
    • PhotoAgent uses a technique called MCTS (Monte Carlo Tree Search). It simulates dozens of different editing paths in its head. It asks, "If I add clouds first, then change the sky color, will it look good? Or is it better to change the color first?" It picks the path that leads to the best possible result, avoiding "short-sighted" mistakes.

  • 🛠️ The Executor (The Handyman):
    Once the Planner picks the best plan, the Executor actually does the work. It has a toolbox full of different "apps." Sometimes it uses simple tools (like a digital pair of scissors to crop the image), and sometimes it uses powerful AI generators (like a magic wand to create a sunset that didn't exist before).

  • 📏 The Evaluator (The Judge):
    After the Executor makes a change, the Judge steps in. It doesn't just look at the picture; it has been trained on thousands of real photos taken by regular people (not just AI art) to know what looks "good."

    • The Rule: If the new photo scores higher on beauty than the old one, it keeps the change. If the new photo looks worse, it says, "Nope, undo that," and tries a different plan.
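For readers who like to see the "What if?" game in code, here is a minimal sketch of Monte Carlo Tree Search over editing plans. It is a toy under stated assumptions, not PhotoAgent's implementation: the edit names, the two-step plan length, and the `toy_score` function (a stand-in for the Judge that happens to reward changing the sky color before adding clouds) are all hypothetical.

```python
import math
import random

# Hypothetical candidate edits, as a Perceiver might propose them.
ACTIONS = ["add_clouds", "shift_sky_color", "brighten_grass"]
MAX_DEPTH = 2  # assumed plan length for this toy example


def toy_score(plan):
    """Hypothetical stand-in for an aesthetic evaluator; edit order matters."""
    score = 0.1 * len(set(plan))
    if plan[:2] == ("shift_sky_color", "add_clouds"):
        score += 1.0
    return score


class Node:
    def __init__(self, plan, parent=None):
        self.plan, self.parent = plan, parent
        self.children = {}
        self.visits, self.total = 0, 0.0

    def is_terminal(self):
        return len(self.plan) >= MAX_DEPTH

    def untried(self):
        if self.is_terminal():
            return []
        return [a for a in ACTIONS
                if a not in self.plan and a not in self.children]


def uct_child(node, c=1.4):
    """Pick the child balancing average reward against exploration."""
    return max(
        node.children.values(),
        key=lambda ch: ch.total / ch.visits
        + c * math.sqrt(math.log(node.visits) / ch.visits),
    )


def rollout(plan, rng):
    """Finish the plan with random edits and score the result."""
    plan = list(plan)
    while len(plan) < MAX_DEPTH:
        plan.append(rng.choice([a for a in ACTIONS if a not in plan]))
    return toy_score(tuple(plan))


def plan_edits(iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node(())
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the node is fully expanded.
        while not node.is_terminal() and not node.untried():
            node = uct_child(node)
        # 2. Expansion: try one edit not yet explored from this node.
        untried = node.untried()
        if untried:
            action = rng.choice(untried)
            node.children[action] = Node(node.plan + (action,), parent=node)
            node = node.children[action]
        # 3. Simulation: imagine the rest of the plan.
        reward = rollout(node.plan, rng)
        # 4. Backpropagation: credit every step on the path.
        while node is not None:
            node.visits += 1
            node.total += reward
            node = node.parent
    # Read off the plan by following the most-visited children.
    node = root
    while node.children:
        node = max(node.children.values(), key=lambda ch: ch.visits)
    return node.plan
```

Calling `plan_edits()` settles on the plan that changes the sky color first and adds clouds second, because that ordering is the only one the toy scorer rewards. A real system would replace `toy_score` with a learned evaluator and the edit names with actual tool calls.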

3. The Secret Sauce: "UGC-Edit"

One of the biggest problems with AI is that it often thinks "pretty" means "bright, saturated, and fake-looking."
To fix this, the creators of PhotoAgent built a special training library called UGC-Edit (User-Generated Content Edit).

  • Analogy: Instead of teaching the AI to judge photos based on Hollywood movie posters (which are often fake), they taught it to judge photos based on real people's vacation photos, family dinners, and street snaps.
  • This ensures that when PhotoAgent edits your photo, it makes it look like a better version of your real life, not a cartoon.

4. The Result: Closed-Loop Magic

The whole system runs in a loop: Look → Plan → Do → Judge → Repeat.
It keeps doing this until the photo is perfect or it decides, "Okay, this is as good as it gets." You don't have to type a single command. You just upload the photo, and the agent does the rest.
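The Look → Plan → Do → Judge → Repeat cycle, together with the Judge's keep-or-undo rule, can be sketched as follows. This is a deliberately simplified illustration: the "photo" is just a dictionary of two numbers, `toy_score` is a hypothetical stand-in for the trained Judge, and the planning step is reduced to trying proposals in order rather than running a tree search.

```python
def toy_score(photo):
    """Hypothetical aesthetic score: highest when every attribute is 0.5."""
    return -sum(abs(value - 0.5) for value in photo.values())


def perceive(photo):
    """Look: propose candidate edits as (name, attribute, delta) tuples."""
    proposals = []
    for attr in photo:
        proposals.append((f"raise_{attr}", attr, 0.1))
        proposals.append((f"lower_{attr}", attr, -0.1))
    return proposals


def execute(photo, edit):
    """Do: apply one edit to a copy, leaving the original untouched."""
    _, attr, delta = edit
    edited = dict(photo)
    edited[attr] = round(min(1.0, max(0.0, edited[attr] + delta)), 2)
    return edited


def edit_loop(photo, max_steps=10):
    """Repeat until no proposed edit improves the score."""
    history = []
    best = toy_score(photo)
    for _ in range(max_steps):
        improved = False
        for edit in perceive(photo):
            candidate = execute(photo, edit)
            score = toy_score(candidate)
            # Judge: keep the change only if the score goes up.
            if score > best:
                photo, best = candidate, score
                history.append(edit[0])
                improved = True
                break
            # Otherwise the candidate is discarded ("Nope, undo that").
        if not improved:
            break  # "This is as good as it gets."
    return photo, history
```

Starting from `{"brightness": 0.3, "saturation": 0.7}`, the loop accepts two brightness increases and two saturation decreases, then stops on its own once every remaining proposal would lower the score.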

In summary:
PhotoAgent takes the hard work of "figuring out how to edit" away from you. It acts like a professional photo editor who sits down, thinks about the best strategy, tries out different ideas in their head, and only shows you the final, polished masterpiece. It turns the complex task of photo editing into a simple "one-click" experience that actually understands what makes a photo beautiful.