Imagine you are trying to teach a robot artist how to paint pictures based on your descriptions. You tell the robot, "Paint a brave knight in a blue suit holding a red sword." If the robot paints a wizard in a green robe holding a green wand, the picture is wrong.
In the world of AI, this "robot artist" is a Text-to-Image model. But before it can learn, someone has to write the instructions (the captions) for the thousands of pictures it studies. This paper, VIVECaption, is about fixing the writers of those instructions (the AI models that caption the images), because they currently make a lot of mistakes.
Here is the breakdown of the problem and their solution, using some everyday analogies.
The Problem: The "Hallucinating" Librarian
Currently, companies use powerful AI models (called Vision-Language Models, or VLMs) to look at a picture and write a description for it. Think of these AIs as librarians who are very smart but have a bad habit: they hallucinate.
- The Mistake: If you show a librarian a picture of a dog named "Spot," they might confidently write, "This is a cat named Whiskers," because in their training data, cats are more common than dogs.
- The Consequence: If you feed these wrong descriptions to your robot artist, the artist learns the wrong lessons. It might start painting cats when you ask for dogs, or it might mix up the colors and shapes. The paper calls this "misalignment."
The Solution: The "Two-Step" Detective Team
The authors propose a new way to write these descriptions. Instead of asking one robot to do everything, they split the job into two specialized steps, like a detective team.
Step 1: The "Character Spotter" (The Detective)
First, they use a specialized AI just to look at the picture and answer one simple question: "Who is actually in this picture?"
- The Trick: They don't just let the AI guess. They give it a "Gold Standard" cheat sheet. They show the AI pictures of every character that could appear (like a lineup of suspects) and ask it to pick the ones present.
- The Training: They teach this "Spotter" AI using a small, perfectly labeled set of pictures (the Gold Standard). It's like a student taking a practice test with an answer key until they get a perfect score.
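The paper's actual Spotter is a trained AI model, but the "lineup" idea can be sketched in a few lines: compare the picture against a reference for each known character and keep the ones that match closely enough. In this toy version (all names, vectors, and the threshold are hypothetical stand-ins, not the paper's implementation), each character is a small embedding vector and matching is cosine similarity:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical "lineup": one reference embedding per known character.
lineup = {
    "Ellie": [0.9, 0.1, 0.0],
    "Victoria": [0.1, 0.9, 0.0],
}

def spot_characters(frame_embedding, lineup, threshold=0.8):
    """Return the lineup characters whose reference embedding is
    similar enough to the frame to count as "present"."""
    return [name for name, ref in lineup.items()
            if cosine(frame_embedding, ref) >= threshold]

# A frame that looks like Ellie matches only Ellie:
print(spot_characters([0.85, 0.15, 0.0], lineup))  # → ['Ellie']
```

The real system would get its embeddings from a vision model, but the key design choice is the same: the Spotter can only answer with names from the lineup, so it cannot invent a character that was never a "suspect."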
Step 2: The "Storyteller" (The Writer)
Once the "Spotter" has confirmed, "Yes, this is Ellie, and she is holding a knife," this information is passed to a second AI, the "Storyteller."
- The Job: The Storyteller doesn't have to guess who is in the picture. It just has to describe the scene, the background, and the mood, using the names the Spotter gave it.
- The Result: Because the Storyteller isn't guessing the names, it doesn't make up fake characters. The description becomes accurate and structured.
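One simple way to wire the two steps together is to bake the Spotter's verified names directly into the instructions given to the Storyteller, so the writer model is told rather than asked who is in the frame. This is a minimal sketch of that hand-off (the prompt wording and helper name are illustrative assumptions, not the paper's exact prompt):

```python
def build_caption_prompt(verified_characters, extra_facts=()):
    """Compose a captioning prompt that pins down the character
    names, so the writer model only describes scene and mood."""
    names = ", ".join(verified_characters) or "no known characters"
    lines = [
        f"The characters in this frame are: {names}.",
        "Use exactly these names; do not invent other characters.",
        "Describe the scene, the background, and the mood.",
    ]
    lines.extend(extra_facts)
    return "\n".join(lines)

prompt = build_caption_prompt(["Ellie"], ["Ellie is holding a knife."])
print(prompt)
```

Because the names arrive as fixed facts in the prompt, the Storyteller's freedom is confined to the parts it is good at: scenery, atmosphere, and composition.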
The "Gold Standard" Dataset: The Recipe Book
How do they teach the "Spotter" to be perfect? They created a Gold Standard Dataset.
Imagine you want to teach a chef to make the perfect burger. You can't just throw random ingredients at them. You need a perfect recipe with exact measurements.
- The authors took 310 images from an open-source movie.
- They carefully labeled every single character in those 310 images by hand (or with high-quality tools).
- They used this small, perfect "recipe book" to train their Spotter AI. Even though it's a small book, it taught the AI the right habits so it could handle the rest of the movie (thousands of frames) correctly.
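The paper fine-tunes its Spotter model on the gold set; as a much simpler stand-in for that idea, here is how even a tiny hand-labeled set can be used to tune a single decision knob. This sketch picks the match threshold that best separates correct from incorrect character matches on labeled examples (the scores and candidate cutoffs are made up for illustration):

```python
def calibrate_threshold(gold_examples, candidates):
    """Pick the score cutoff that best separates correct from
    incorrect matches on a small hand-labeled gold set.
    gold_examples: list of (similarity_score, is_correct_match)."""
    def accuracy(t):
        hits = sum((score >= t) == label for score, label in gold_examples)
        return hits / len(gold_examples)
    return max(candidates, key=accuracy)

# Four hand-labeled examples: two true matches, two false ones.
gold = [(0.95, True), (0.88, True), (0.40, False), (0.55, False)]
best = calibrate_threshold(gold, [0.3, 0.5, 0.7, 0.9])
print(best)  # → 0.7
```

The same principle scales up: a small but perfectly labeled set is enough to set the model's habits, which then generalize to the thousands of unlabeled frames.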
Why This Matters: The "Vegan" Data Approach
The paper emphasizes using "Vegan" data. In AI terms, this means using data that is 100% original and open-source, without scraping the internet where you might accidentally steal copyrighted art or stories.
- The Benefit: By using this two-step method with open-source models, companies can build high-quality AI artists without worrying about legal trouble or using "poisoned" (copyrighted) data.
The Results: Small Changes, Big Impact
The paper shows that when they used this "Spotter + Storyteller" team:
- Accuracy Skyrocketed: The AI stopped calling "Ellie" "Victoria."
- Better Descriptions: The final descriptions were not just factually correct about the characters, but also better at describing the mood and background.
- Small Models Work Great: They proved you don't need a massive, expensive supercomputer. A smaller, cheaper AI model, once "trained" on this Gold Standard, performed just as well as the giant models.
The Takeaway
VIVECaption is like hiring a specialized fact-checker before you let a writer publish a story.
- Old Way: Ask one person to guess the facts and write the story. (Result: Lots of lies and mistakes).
- New Way: Ask a fact-checker to verify the names, then hand those verified names to a writer. (Result: A story that is both accurate and well-written).
This approach ensures that the AI artists we build in the future will actually paint what we ask them to, rather than painting whatever they think we asked for.