Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping

This paper introduces Venus, a two-stage framework built on the new AesGuide dataset. It enables multimodal large language models to give actionable aesthetic guidance and achieves state-of-the-art performance in aesthetic cropping, narrowing the gap between ordinary users and professional photographers.

Tianxiang Du, Hulingxiao He, Yuxin Peng

Published 2026-03-02

Imagine you have a brand-new smartphone camera. You take a photo of a beautiful sunset, but it looks a bit flat. You ask your friend, "How's this photo?"

  • Old AI (like GPT-4o or AesExpert): "Wow, great job! The colors are vibrant, and the composition is lovely! Keep it up!" (They are too nice to be helpful.)
  • The Problem: They don't tell you what is wrong or how to fix it. They are like a cheerleader who never gives you a playbook.

Enter "Venus."

Think of Venus not just as a camera app, but as a wise, experienced photography mentor who is always in your pocket. This new system, created by researchers at Peking University, is designed to bridge the gap between taking a "good enough" snapshot and capturing a "masterpiece."

Here is how Venus works, broken down into simple steps:

1. The Missing Piece: "Aesthetic Guidance"

Before Venus, AI could describe a photo or give it a score, but it couldn't act like a pro photographer.

  • The Analogy: Imagine a cooking app that tells you, "This soup tastes good!" but refuses to tell you, "You forgot the salt, and the heat is too high."
  • The Venus Solution: Venus is the first system that can say, "Hey, the sky is too bright and washing out the building. Try lowering your camera angle to frame the columns better, or wait for the clouds to move." It gives actionable advice while you are taking the picture.

2. The Secret Sauce: The "AesGuide" Cookbook

To teach Venus how to be a mentor, the researchers couldn't just use random internet photos. They needed a special training manual.

  • The Analogy: They created a massive library called AesGuide. Imagine a library with 10,000 photos, but instead of just captions, each photo has a detailed critique written by a team of 20 professional photographers.
  • The Process: They didn't just ask the AI to guess. They used a "Refinement Factory":
    1. An AI drafts a critique.
    2. Real human experts (the "Head Chefs") review it, fix the mistakes, and ensure the advice is actually useful.
    3. This creates a high-quality dataset where the AI learns to spot flaws (like a messy background) and suggest fixes (like "blur the background").
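The draft-then-review loop above can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' pipeline: the names `Critique`, `refine`, `toy_draft`, and `toy_review` are hypothetical, and the two toy functions stand in for the AI drafter and the human experts.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Critique:
    photo_id: str
    text: str
    reviewed: bool = False  # True once an expert has checked the text


def refine(
    photo_ids: List[str],
    draft_fn: Callable[[str], str],
    review_fn: Callable[[str], str],
) -> List[Critique]:
    """Draft a critique for each photo, then pass it through expert review."""
    dataset = []
    for pid in photo_ids:
        draft = draft_fn(pid)        # step 1: an AI writes a first draft
        corrected = review_fn(draft)  # step 2: an expert fixes and completes it
        dataset.append(Critique(pid, corrected, reviewed=True))  # step 3: keep it
    return dataset


# Toy stand-ins (assumptions): a flattering drafter and a stricter reviewer.
def toy_draft(pid: str) -> str:
    return f"{pid}: nice photo, background is messy"


def toy_review(text: str) -> str:
    # Drop the empty praise and add a concrete fix.
    return text.replace("nice photo, ", "") + "; blur the background"


records = refine(["img_001"], toy_draft, toy_review)
print(records[0].text)  # img_001: background is messy; blur the background
```

The point of the structure is that nothing enters the dataset until a reviewer has touched it, which is what separates this kind of data from raw AI output.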

3. The Two-Stage Training: "Learn to Critique, Then Learn to Crop"

Venus learns in two distinct phases, like a student who first learns the theory and then practices in the field.

Stage 1: The Critic (Aesthetic Guidance)

  • What happens: Venus is trained to look at a photo and answer three questions: "How good is this?" (Score), "What's wrong?" (Analysis), and "How do I fix it?" (Guidance).
  • The Metaphor: It's like a film director watching a rough cut of a movie and saying, "The lighting is too harsh on the actor's face, and the camera is too shaky. Let's fix the light and steady the shot."
  • The Result: The AI stops being a "yes-man" and starts being a constructive critic.
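To make the three-part answer concrete, here is a minimal sketch of how such a response could be held in code. The `Score:` / `Analysis:` / `Guidance:` text layout, the score value, and the names `AestheticFeedback` and `parse_feedback` are assumptions made for illustration; the paper's actual output format may differ.

```python
from dataclasses import dataclass


@dataclass
class AestheticFeedback:
    score: float    # "How good is this?"
    analysis: str   # "What's wrong?"
    guidance: str   # "How do I fix it?"


def parse_feedback(raw: str) -> AestheticFeedback:
    """Parse a 'Key: value' response (a hypothetical layout) into three fields."""
    fields = {}
    for line in raw.strip().splitlines():
        key, _, value = line.partition(":")  # split on the first colon only
        fields[key.strip().lower()] = value.strip()
    return AestheticFeedback(
        score=float(fields["score"]),
        analysis=fields["analysis"],
        guidance=fields["guidance"],
    )


raw = """Score: 6.5
Analysis: The sky is overexposed and washes out the building.
Guidance: Lower the camera angle to frame the columns."""

feedback = parse_feedback(raw)
print(feedback.guidance)
```

Keeping the analysis and the guidance as separate fields mirrors the split between spotting a flaw and suggesting a fix.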

Stage 2: The Editor (Aesthetic Cropping)

  • What happens: Now that Venus knows why a photo is bad, it learns to fix it by cropping (cutting out the bad parts).
  • The Twist: Most AI systems just guess where to cut. Venus reasons step by step first, a technique called Chain-of-Thought (CoT).
  • The Metaphor: Instead of just handing you a cropped photo, Venus explains its logic like a detective: "I am cutting off the top of the building because it looks cluttered. I am moving the person to the center because the empty space on the left makes them look lonely. Here is the new frame, and here is exactly why it looks better."
  • Why this matters: It's not just a black box cutting the image; it's an interactive conversation. You can say, "I don't like those boats," and Venus will re-crop to focus on the mountains, explaining why that new composition works.
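One way to picture "a crop plus its reasons" is as a small record that carries both. The sketch below is an assumption-laden illustration, not Venus itself: the box format `(left, top, right, bottom)`, the names `CropProposal`, `clamp_box`, and `apply_crop`, and the tiny grid standing in for an image are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]


@dataclass
class CropProposal:
    reasoning: List[str]  # the step-by-step explanation
    box: Box              # the crop that the reasoning leads to


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Force a proposed box inside the image and keep it at least 1x1."""
    left, top, right, bottom = box
    left = max(0, min(left, width - 1))
    top = max(0, min(top, height - 1))
    right = max(left + 1, min(right, width))
    bottom = max(top + 1, min(bottom, height))
    return (left, top, right, bottom)


def apply_crop(image: List[list], proposal: CropProposal) -> List[list]:
    """Crop a row-major grid of pixels to the (clamped) proposed box."""
    height, width = len(image), len(image[0])
    left, top, right, bottom = clamp_box(proposal.box, width, height)
    return [row[left:right] for row in image[top:bottom]]


# A 6x4 toy "image" whose pixels record their own (x, y) position.
image = [[(x, y) for x in range(6)] for y in range(4)]

proposal = CropProposal(
    reasoning=[
        "The top row is cluttered, so remove it.",
        "Empty space on the left unbalances the subject, so trim it.",
    ],
    box=(2, 1, 9, 4),  # right edge is deliberately out of range
)

cropped = apply_crop(image, proposal)
for step in proposal.reasoning:
    print("-", step)
print(len(cropped[0]), "x", len(cropped))  # 4 x 3
```

Because the reasoning travels with the box, a follow-up request such as "I don't like those boats" could produce a new proposal with a new explanation rather than a silent re-crop.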

4. Why This is a Big Deal

  • For Regular People: You don't need to be a photography expert to take amazing photos. You just need to talk to Venus, get advice, and let it help you frame the shot.
  • For the Tech World: It moves AI from being a "passive judge" (giving a score) to an "active partner" (helping you create).
  • The Proof: In tests, Venus didn't just give better advice than other AIs; it also cropped photos better than specialized tools, all while explaining its reasoning in plain English.

Summary

Venus is like giving every smartphone user a personal photography coach. It doesn't just say "Good job!"; it says, "Here's what's missing, here's how to fix it, and here is the perfect crop to make your photo shine." It turns the art of photography from a guessing game into a guided, interactive experience.
