VQQA: An Agentic Approach for Video Evaluation and Quality Improvement

VQQA is a unified multi-agent framework that enhances video generation quality by leveraging dynamically generated visual questions and VLM critiques as semantic gradients to enable efficient, black-box prompt optimization without requiring white-box model access.

Yiwen Song, Tomas Pfister, Yale Song

Published 2026-03-16

Imagine you are a director trying to get a perfect scene from a very talented, but slightly stubborn, AI actor. You give the actor a script (a text prompt) like, "A cat wearing a red hat chases a laser pointer across a room."

The AI tries its best, but the result is a bit weird: the cat has six legs, the hat is blue, and the laser pointer is stuck in the wall.

In the past, fixing this was like playing a game of "Guess the Mistake." You'd have to guess why the AI failed, rewrite the script, try again, and hope for the best. It was a slow, frustrating process of trial and error.

VQQA (Video Quality Question Answering) is like hiring a team of three expert film critics who work together to fix the movie for you, automatically and quickly. Here is how it works, broken down into simple steps:

The Three-Agent Team

Instead of just looking at the video and giving it a generic "thumbs up" or "thumbs down," VQQA uses a team of three specialized AI agents that talk to each other:

  1. The Question Generator (The Detective):
    This agent looks at your original script and the messy video. Instead of just saying "It's bad," it asks specific, targeted questions.

    • Analogy: Imagine a detective looking at a crime scene photo. Instead of saying "This is a mess," they ask: "Is the hat red?" "Does the cat have four legs?" "Is the laser moving?"
    • It creates a checklist of questions specifically designed to find the exact errors in your video.
  2. The Question Answerer (The Judge):
    This agent watches the video and answers the Detective's questions with a score from 0 to 100.

    • Analogy: The Judge looks at the video and says, "The hat is blue, so that's a 0/100. The cat has six legs, so that's a 10/100. But the laser is moving perfectly, so that's a 100/100."
    • This creates a detailed "diagnostic report" of exactly what went wrong.
  3. The Prompt Refiner (The Script Doctor):
    This agent takes the Judge's low scores and the specific questions that failed. It then rewrites your original script to fix those specific problems.

    • Analogy: The Script Doctor sees the Judge's report and says, "Okay, the AI got confused about the hat color. Let's change the script to: 'A cat wearing a bright, unmistakable RED hat.' And for the legs, let's add: 'Ensure the cat has exactly four legs.'"
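The three roles above can be sketched as plain functions. This is a minimal, illustrative stand-in, not the paper's actual implementation: the function names are made up, the "VLM" calls are replaced with hard-coded stubs, and the 0–100 scoring scale simply mirrors the analogy in the text.

```python
def generate_questions(script: str, video: str) -> list[str]:
    """Detective: propose targeted checks about the scene.

    A real system would ask a VLM to derive these from the script
    and the generated video; here they are hard-coded for illustration.
    """
    return ["Is the hat red?", "Does the cat have four legs?", "Is the laser moving?"]

def answer_questions(video: str, questions: list[str]) -> dict[str, int]:
    """Judge: score each check 0-100 against the video (stubbed here)."""
    fake_scores = {"Is the hat red?": 0,
                   "Does the cat have four legs?": 10,
                   "Is the laser moving?": 100}
    return {q: fake_scores.get(q, 50) for q in questions}

def refine_prompt(script: str, scores: dict[str, int], threshold: int = 50) -> str:
    """Script Doctor: rewrite the prompt to address every failed check."""
    failed = [q for q, s in scores.items() if s < threshold]
    if not failed:
        return script
    return script + " Fix the following: " + " ".join(failed)
```

The key idea is that the failed questions act like a gradient signal in plain English: they tell the refiner exactly which direction to push the prompt.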

The Magic Loop

This team doesn't just do this once. They work in a closed loop:

  1. They generate a video.
  2. They ask questions and grade it.
  3. They rewrite the script based on the grades.
  4. They generate a new video with the new script.
  5. They repeat until the video is perfect.
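The five steps above form a simple optimization loop. Here is a hedged sketch of that loop, with hypothetical `generate_video`, `critique`, and `refine` helpers stubbed out so it runs standalone; the stopping criterion, round budget, and score scale are illustrative choices, not values from the paper.

```python
def generate_video(script: str) -> str:
    # Stand-in for the text-to-video model.
    return f"video({script})"

def critique(script: str, video: str) -> dict[str, int]:
    # Stand-in for the question-generation + answering agents:
    # the score improves once the script states the fix explicitly.
    return {"Is the hat red?": 100 if "RED" in script else 0}

def refine(script: str, scores: dict[str, int]) -> str:
    # Stand-in for the prompt refiner.
    return script + " Make the hat bright RED."

def optimize(script: str, max_rounds: int = 5, target: float = 90):
    best_video, best_score = None, -1.0
    for _ in range(max_rounds):
        video = generate_video(script)            # 1. generate a video
        scores = critique(script, video)          # 2. ask questions and grade
        avg = sum(scores.values()) / len(scores)
        if avg > best_score:
            best_video, best_score = video, avg
        if avg >= target:                         # 5. stop when good enough
            break
        script = refine(script, scores)           # 3. rewrite the script
    return best_video, best_score                 # 4. loop with the new script
```

Because each round is driven by specific failed questions rather than random retries, the loop typically needs only a handful of iterations rather than a large batch of samples.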

The "Global Rater" (The Safety Net)

There is one more important part. Sometimes, as the Script Doctor tries to fix one problem (like the hat color), they might accidentally change the whole story (e.g., the cat stops chasing the laser).

To prevent this, VQQA has a Global Rater. This is like a final producer who looks at all the versions of the video and asks: "Does this still match the original idea the user wanted?" If the new video is technically perfect but has lost the original spirit, the Global Rater picks the version that best balances quality with the user's original intent.
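The Global Rater's selection step amounts to scoring every candidate on two axes and picking the best trade-off. A minimal sketch, assuming each candidate carries a per-question quality score and a prompt-fidelity (intent) score on a 0–100 scale; the linear weighting `alpha` is an illustrative assumption, not the paper's formula.

```python
def pick_best(candidates: list[tuple[str, float, float]], alpha: float = 0.5):
    """Global Rater: choose the candidate balancing quality and intent.

    candidates: (video, quality_score, intent_score) triples, each 0-100.
    alpha: hypothetical weight trading quality against faithfulness
    to the user's original prompt.
    """
    def combined(c):
        _, quality, intent = c
        return alpha * quality + (1 - alpha) * intent
    return max(candidates, key=combined)
```

So a video that scores 95 on the checklist but has drifted from the original story can still lose to an 80-point video that stays true to what the user asked for.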

Why is this a big deal?

  • No white-box access required: VQQA treats the video model as a black box. You don't need to see its internal code, weights, or gradients — you just talk to it in plain English.
  • It's efficient: Instead of generating 100 random videos and hoping one is good (which wastes a lot of computer power), VQQA learns from its mistakes and fixes them in just a few steps.
  • It understands nuance: It can fix complex things like "the cat's tail is flickering" or "the physics of the water splash is wrong," which older methods couldn't do.

In summary: VQQA turns video creation from a game of "guess and check" into a smart, self-correcting conversation. It's like having a personal editor who watches your video, tells you exactly what's wrong, rewrites your instructions, and keeps doing it until the movie is perfect.
