VQQA: An Agentic Approach for Video Evaluation and Quality Improvement

VQQA is a unified multi-agent framework that enhances video generation quality by leveraging dynamically generated visual questions and VLM critiques as semantic gradients to enable efficient, black-box prompt optimization without requiring white-box model access.

Yiwen Song, Tomas Pfister, Yale Song

Published 2026-03-16

Imagine you are a director trying to get a perfect scene from a very talented, but slightly stubborn, AI actor. You give the actor a script (a text prompt) like, "A cat wearing a red hat chases a laser pointer across a room."

The AI tries its best, but the result is a bit weird: the cat has six legs, the hat is blue, and the laser pointer is stuck in the wall.

In the past, fixing this was like playing a game of "Guess the Mistake." You'd have to guess why the AI failed, rewrite the script, try again, and hope for the best. It was a slow, frustrating process of trial and error.

VQQA (Video Quality Question Answering) is like hiring a team of three expert film critics who work together to fix the movie for you, automatically and quickly. Here is how it works, broken down into simple steps:

The Three-Agent Team

Instead of just looking at the video and giving it a generic "thumbs up" or "thumbs down," VQQA uses a team of three specialized AI agents that talk to each other:

  1. The Question Generator (The Detective):
    This agent looks at your original script and the messy video. Instead of just saying "It's bad," it asks specific, targeted questions.

    • Analogy: Imagine a detective looking at a crime scene photo. Instead of saying "This is a mess," they ask: "Is the hat red?" "Does the cat have four legs?" "Is the laser moving?"
    • It creates a checklist of questions specifically designed to find the exact errors in your video.
  2. The Question Answerer (The Judge):
    This agent watches the video and answers the Detective's questions with a score from 0 to 100.

    • Analogy: The Judge looks at the video and says, "The hat is blue, so that's a 0/100. The cat has six legs, so that's a 10/100. But the laser is moving perfectly, so that's a 100/100."
    • This creates a detailed "diagnostic report" of exactly what went wrong.
  3. The Prompt Refiner (The Script Doctor):
    This agent takes the Judge's low scores and the specific questions that failed. It then rewrites your original script to fix those specific problems.

    • Analogy: The Script Doctor sees the Judge's report and says, "Okay, the AI got confused about the hat color. Let's change the script to: 'A cat wearing a bright, unmistakable RED hat.' And for the legs, let's add: 'Ensure the cat has exactly four legs.'"
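The three roles above can be sketched as plain functions. This is a minimal, illustrative stand-in, not the paper's actual implementation: the function names are made up, the "VLM" calls are replaced with hard-coded stubs, and the 0–100 scoring scale simply mirrors the analogy in the text.

```python
def generate_questions(script: str, video: str) -> list[str]:
    """Detective: propose targeted checks about the scene.

    A real system would ask a VLM to derive these from the script
    and the generated video; here they are hard-coded for illustration.
    """
    return ["Is the hat red?", "Does the cat have four legs?", "Is the laser moving?"]

def answer_questions(video: str, questions: list[str]) -> dict[str, int]:
    """Judge: score each check 0-100 against the video (stubbed here)."""
    fake_scores = {"Is the hat red?": 0,
                   "Does the cat have four legs?": 10,
                   "Is the laser moving?": 100}
    return {q: fake_scores.get(q, 50) for q in questions}

def refine_prompt(script: str, scores: dict[str, int], threshold: int = 50) -> str:
    """Script Doctor: rewrite the prompt to address every failed check."""
    failed = [q for q, s in scores.items() if s < threshold]
    if not failed:
        return script
    return script + " Fix the following: " + " ".join(failed)
```

The key idea is that the failed questions act like a gradient signal in plain English: they tell the refiner exactly which direction to push the prompt.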

The Magic Loop

This team doesn't just do this once. They work in a closed loop:

  1. They generate a video.
  2. They ask questions and grade it.
  3. They rewrite the script based on the grades.
  4. They generate a new video with the new script.
  5. They repeat until the video is perfect.
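The five steps above form a simple optimization loop. Here is a hedged sketch of that loop, with hypothetical `generate_video`, `critique`, and `refine` helpers stubbed out so it runs standalone; the stopping criterion, round budget, and score scale are illustrative choices, not values from the paper.

```python
def generate_video(script: str) -> str:
    # Stand-in for the text-to-video model.
    return f"video({script})"

def critique(script: str, video: str) -> dict[str, int]:
    # Stand-in for the question-generation + answering agents:
    # the score improves once the script states the fix explicitly.
    return {"Is the hat red?": 100 if "RED" in script else 0}

def refine(script: str, scores: dict[str, int]) -> str:
    # Stand-in for the prompt refiner.
    return script + " Make the hat bright RED."

def optimize(script: str, max_rounds: int = 5, target: float = 90):
    best_video, best_score = None, -1.0
    for _ in range(max_rounds):
        video = generate_video(script)            # 1. generate a video
        scores = critique(script, video)          # 2. ask questions and grade
        avg = sum(scores.values()) / len(scores)
        if avg > best_score:
            best_video, best_score = video, avg
        if avg >= target:                         # 5. stop when good enough
            break
        script = refine(script, scores)           # 3. rewrite the script
    return best_video, best_score                 # 4. loop with the new script
```

Because each round is driven by specific failed questions rather than random retries, the loop typically needs only a handful of iterations rather than a large batch of samples.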

The "Global Rater" (The Safety Net)

There is one more important part. Sometimes, as the Script Doctor tries to fix one problem (like the hat color), they might accidentally change the whole story (e.g., the cat stops chasing the laser).

To prevent this, VQQA has a Global Rater. This is like a final producer who looks at all the versions of the video and asks: "Does this still match the original idea the user wanted?" If the new video is technically perfect but has lost the original spirit, the Global Rater picks the version that best balances quality with the user's original intent.
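The Global Rater's selection step amounts to scoring every candidate on two axes and picking the best trade-off. A minimal sketch, assuming each candidate carries a per-question quality score and a prompt-fidelity (intent) score on a 0–100 scale; the linear weighting `alpha` is an illustrative assumption, not the paper's formula.

```python
def pick_best(candidates: list[tuple[str, float, float]], alpha: float = 0.5):
    """Global Rater: choose the candidate balancing quality and intent.

    candidates: (video, quality_score, intent_score) triples, each 0-100.
    alpha: hypothetical weight trading quality against faithfulness
    to the user's original prompt.
    """
    def combined(c):
        _, quality, intent = c
        return alpha * quality + (1 - alpha) * intent
    return max(candidates, key=combined)
```

So a video that scores 95 on the checklist but has drifted from the original story can still lose to an 80-point video that stays true to what the user asked for.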

Why is this a big deal?

  • No white-box access required: VQQA treats the video model as a black box. You don't need to see its internal code, weights, or gradients — you just talk to it in plain English.
  • It's efficient: Instead of generating 100 random videos and hoping one is good (which wastes a lot of computer power), VQQA learns from its mistakes and fixes them in just a few steps.
  • It understands nuance: It can fix complex things like "the cat's tail is flickering" or "the physics of the water splash is wrong," which older methods couldn't do.

In summary: VQQA turns video creation from a game of "guess and check" into a smart, self-correcting conversation. It's like having a personal editor who watches your video, tells you exactly what's wrong, rewrites your instructions, and keeps doing it until the movie is perfect.
