Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation

This paper exposes a critical evaluation pitfall: common human preference models are biased toward large guidance scales, handing out inflated scores even when image quality has degraded. The authors propose a guidance-aware evaluation framework (GA-Eval), plus a deliberately contrived method (TDG) as a proof of concept, to show that many recent diffusion guidance improvements are illusory and that simply increasing the CFG scale often outperforms them in practice.

Dian Xie, Shitong Shao, Lichen Bai, Zikai Zhou, Bojun Cheng, Shuo Yang, Jun Wu, Zeke Xie

Published 2026-02-27

Imagine you are a judge at a cooking competition. The goal is to create the most delicious dish based on a specific recipe (the text prompt).

For a long time, the judges (evaluation metrics) have been using a very specific rule: "The more colorful and intense the dish looks, the better it tastes."

In this paper, titled "Guidance Matters," the researchers are essentially shouting, "Wait a minute! That rule is broken!" They discovered that many new cooking techniques (diffusion guidance methods) aren't actually making the food taste better; they are just turning up the spice level (increasing the "guidance scale") until the dish is so bright and saturated that the judges love it, even if the food is now burnt, oversalted, or fake-looking.

Here is the breakdown of their discovery using simple analogies:

1. The Problem: The "Volume Knob" Trap

In the world of AI image generation, there is a setting called Classifier-Free Guidance (CFG). Think of this as a Volume Knob for how strictly the AI follows your instructions.

  • Low Volume: The AI is relaxed and creative but might ignore your prompt.
  • High Volume: The AI screams the instructions, making the image match the text perfectly, but often at the cost of quality. It makes colors super bright, adds weird artifacts, and makes things look "overcooked."

The Pitfall: The researchers found that the "Judges" (AI models like HPS v2 and ImageReward that score images) are biased: they love high-volume images. If you just crank the volume knob to 20, the image gets a high score, even if it looks like a neon nightmare.
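For readers who want to see the knob itself: classifier-free guidance boils down to one line of arithmetic on two model predictions. The sketch below is illustrative only (the `model`, embedding, and argument names are placeholders, not the paper's code), but the update rule is the standard CFG formula.

```python
def cfg_noise_prediction(model, x_t, t, cond_emb, uncond_emb, guidance_scale=7.5):
    """Classifier-free guidance: push the denoiser's prediction toward the prompt.

    `guidance_scale` is the "volume knob": 1.0 adds no extra push, while large
    values (e.g. 15-20) follow the prompt harder but tend to over-saturate
    colors and introduce artifacts.
    """
    eps_uncond = model(x_t, t, uncond_emb)  # prediction with an empty prompt
    eps_cond = model(x_t, t, cond_emb)      # prediction with the real prompt
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```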

2. The "Fake" Innovators

Recently, many scientists proposed fancy new "cooking techniques" (new guidance methods) claiming they produce better images.

  • The Reality Check: The researchers tested these new methods. They found that most of them were just hiding behind the Volume Knob.
  • They would use their fancy new technique and turn the volume knob way up. The high score wasn't because the technique was good; it was because the volume was loud.
  • The Analogy: It's like a magician claiming to have invented a brand-new trick for making a rabbit appear, when really they are just blowing a louder trumpet to distract the audience while pulling the rabbit out of a hat they already had.

3. The Solution: The "GA-Eval" Framework

To fix this, the authors created a new way to judge the competition called GA-Eval (Guidance-Aware Evaluation).

  • How it works: Instead of just looking at the final score, GA-Eval asks: "If we took away the extra loudness (the high volume knob) and only kept the 'effective' part of your new technique, would you still win?" (A rough code sketch of this idea follows after this list.)
  • The Result: When they turned down the volume to a fair level, most of the "fancy new techniques" lost. They couldn't compete with the standard method anymore. Their "magic" was just the volume knob all along.
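The paper's actual GA-Eval protocol is more detailed than this, but one rough way to picture the "fair volume" comparison is: a new method shouldn't get to claim victory unless it also beats plain CFG when CFG is allowed to pick its own best guidance scale. The sketch below is only a reading of that idea, not the paper's implementation; `method_fn`, `cfg_fn`, and `score_fn` are hypothetical stand-ins for an image generator and a preference judge.

```python
def fair_guidance_comparison(method_fn, cfg_fn, score_fn, prompts,
                             scales=(3.0, 5.0, 7.5, 10.0, 15.0)):
    """Compare a 'new' guidance method against plain CFG across a sweep of
    guidance scales, instead of trusting a single high-scale setting.

    method_fn, cfg_fn: hypothetical callables (prompt, scale) -> image
    score_fn: hypothetical judge, image -> float (e.g. a preference model)
    """
    def best_mean_score(generate_fn):
        # Best average score the generator can reach anywhere in the sweep.
        return max(
            sum(score_fn(generate_fn(p, s)) for p in prompts) / len(prompts)
            for s in scales
        )

    return {"new_method": best_mean_score(method_fn),
            "plain_cfg": best_mean_score(cfg_fn)}
```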

4. The "TDG" Trick: A Proof of Concept

To prove their point, the authors invented a fake method called Transcendent Diffusion Guidance (TDG).

  • The Trick: They took the text prompt, randomly deleted half the words (making it a "weak" prompt), and used that to confuse the AI slightly (see the sketch after this list).
  • The Result: In the old, broken judging system, TDG got amazing scores! It looked like a breakthrough. But in the new, fair GA-Eval system, it was revealed to be useless. It was just another way to game the "loudness" bias.
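Based on the description above, the "weak prompt" construction might look roughly like the sketch below; the exact procedure in the paper may differ, and how the weak prompt is plugged back into guidance is not shown here.

```python
import random

def make_weak_prompt(prompt, drop_ratio=0.5, seed=0):
    """Randomly delete roughly half the words to produce a deliberately 'weak' prompt."""
    rng = random.Random(seed)
    words = prompt.split()
    kept = [w for w in words if rng.random() >= drop_ratio]
    # Fall back to the full prompt if every word happened to be dropped.
    return " ".join(kept) if kept else prompt

# Hypothetical example:
# make_weak_prompt("a photo of a corgi wearing a red hat on the beach")
# -> something like "a photo corgi red the beach"
```

Nothing in a trick like this adds real information about the image; if it still scores well under a judge, the judge is rewarding something other than quality.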

5. The One Exception: Z-Sampling

Out of all the methods tested, only one (called Z-Sampling) actually did something new. Even when the volume knob was turned down to a fair level, Z-Sampling still won. This suggests it actually has a real "secret ingredient" that isn't just about turning up the volume.

The Big Takeaway

The paper is a wake-up call for the AI community.

  • Don't be fooled by the noise: Just because an image gets a high score from current AI judges doesn't mean it's actually better. It might just be "louder" and more saturated.
  • We need new judges: We need evaluation tools that can tell the difference between a genuinely creative improvement and just a trick that makes the colors pop.

In short: The field has been celebrating "loud" images as "good" images. This paper says, "Let's turn down the volume, see who is actually singing well, and stop rewarding people just for shouting."
