SynPO: Synergizing Descriptiveness and Preference Optimization for Video Detailed Captioning

This paper introduces SynPO, a preference optimization framework that synergizes descriptiveness and preference learning to enhance fine-grained video captioning. It constructs cost-effective preference pairs and eliminates the reference model, outperforming DPO variants with improved training efficiency while preserving general language capabilities.

Jisheng Dang, Yizhou Zhang, Hao Ye, Teng Wang, Siming Chen, Huicheng Zheng, Yulan Guo, Jianhuang Lai, Bin Hu

Published 2026-03-24

Imagine you have a very smart robot friend who loves watching videos and describing what happens in them. You want this robot to be a master storyteller, not just a robot that says, "A dog runs." You want it to say, "A golden retriever joyfully sprints across the grass, its tail wagging like a metronome, chasing a red ball."

This paper, SynPO, is about teaching that robot how to become a master storyteller without losing its mind in the process.

Here is the story of how they did it, broken down into three simple parts:

1. The Problem: The Robot is Getting "Stuck"

The researchers tried using a popular training method called DPO (Direct Preference Optimization). Think of DPO like a strict teacher who shows the robot two stories: one good one and one bad one. The teacher says, "Pick the good one! Don't pick the bad one!"

The problem? The robot started acting weird.

  • The "Negative" Trap: The robot became so obsessed with avoiding the bad stories that it started writing terrible stories just to make sure they were different from the bad ones. It was like a student so afraid of getting an "F" that they stopped trying to get an "A" and just wrote nonsense to be safe.
  • The "Reference" Glitch: The old method required the robot to constantly compare itself to a "perfect" version of itself from the past (a reference model). This was like trying to run a race while looking in a rearview mirror. It slowed everything down and made the robot forget how to speak naturally.
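To make the "rearview mirror" concrete, here is a minimal sketch of the standard DPO objective (not this paper's method). Notice that the reference model's log-probabilities appear in every term, which is exactly the extra comparison SynPO later removes. The function names and inputs are illustrative, not taken from the paper's code.

```python
import math

def dpo_loss(logp_good, logp_bad, ref_logp_good, ref_logp_bad, beta=0.1):
    """Standard DPO loss on one preference pair.

    The model is rewarded for widening the gap between its good-story
    and bad-story log-probs *relative to a frozen reference model* --
    the "rearview mirror" described above.
    """
    margin = beta * ((logp_good - ref_logp_good) - (logp_bad - ref_logp_bad))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

When the model matches the reference exactly, the margin is zero and the loss is log 2; the loss only drops by moving away from the reference, which is why the reference model must be kept in memory and queried on every step.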

2. The Solution: The "SynPO" Recipe

The authors created a new method called SynPO (Synergistic Preference Optimization). Think of SynPO as a Gourmet Chef who knows exactly how to balance flavors.

Step A: Cooking the Ingredients (Data Construction)

Before the robot can learn, you need good recipes.

  • The Old Way: You'd hire a super-expert human (or a super-expert AI) to grade every story the robot wrote. This is expensive and slow.
  • The SynPO Way: They built a clever kitchen pipeline.
    1. The robot writes 10 different stories about the same video.
    2. They use a "Self-Check" system: The robot looks at its own stories and asks, "Does this match the facts? Is the grammar good? Do these stories agree with each other?"
    3. A simple AI helper gives a quick score.
    4. The best story becomes the "Gold Star" (Positive), and the worst becomes the "Red X" (Negative).
    • Analogy: Instead of hiring a famous food critic to taste every dish, the chef lets the kitchen staff taste-test their own creations and picks the best and worst ones. It's cheap, fast, and surprisingly accurate.
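The kitchen pipeline above can be sketched in a few lines. This is a hypothetical simplification: `generate` and `score` stand in for the robot's caption sampler and the cheap self-check scorer, which the paper implements with its own models and checks.

```python
def build_preference_pair(video, generate, score, n=10):
    """SynPO-style data construction, sketched:
    sample n candidate captions for one video, score each with a
    cheap self-check, then keep the best as the positive example
    ("Gold Star") and the worst as the negative ("Red X")."""
    candidates = [generate(video) for _ in range(n)]
    ranked = sorted(candidates, key=score)  # ascending by quality score
    return ranked[-1], ranked[0]            # (positive, negative)
```

The point of the design is that no expensive external judge is needed: the positive and negative come from the same batch of self-generated captions, ranked by an inexpensive scorer.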

Step B: The New Training Method (The Optimization)

This is the magic sauce. SynPO changes how the robot learns from the Gold Star and the Red X.

  • Fixing the "Negative Trap": In the old method, the robot was punished too hard for the "Red X" stories, which made it panic. SynPO balances the scale. It says, "Don't just avoid the bad stuff; actively chase the good stuff." It stops the robot from getting obsessed with what not to do.
  • The "Language Guardian": The researchers added a special rule: "No matter what, you must still sound like a human." They added a "Language Capability" bonus. If the robot starts writing gibberish just to win the game, it loses points. This keeps the robot's vocabulary and grammar sharp.
  • Ditching the Rearview Mirror: SynPO doesn't need that "reference model" (the past version of the robot). It learns directly.
    • Analogy: Imagine learning to ride a bike. The old way was having someone hold the back of your seat (the reference model) while you pedaled. SynPO is like training with a coach who gives you immediate feedback but lets you ride freely. It's 20% faster because you aren't dragging that extra weight.
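The three fixes above can be combined into one loss. The sketch below is a rough, hypothetical composition of the ideas as described in this summary (reference-free margin, an explicit "chase the good stuff" term, and a language-capability bonus); the paper's exact objective and weightings may differ, and all parameter names here are illustrative.

```python
import math

def synpo_style_loss(logp_good, logp_bad, lm_logp_good,
                     beta=0.1, alpha=1.0, lam=0.05):
    """Hypothetical sketch of the three ideas in the text:
      1. reference-free preference margin (no rearview-mirror model),
      2. a direct reward for the good caption (alpha term),
      3. a language-capability bonus that keeps text fluent (lam term).
    """
    margin = beta * (logp_good - logp_bad)           # no reference model needed
    pref = -math.log(1.0 / (1.0 + math.exp(-margin)))
    chase = -alpha * logp_good                        # actively push up the positive
    guard = -lam * lm_logp_good                       # penalize gibberish
    return pref + chase + guard
```

Because no reference model is queried, each training step does roughly one forward pass less, which is the source of the speedup the text mentions.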

3. The Result: A Better Storyteller

When they tested SynPO, the results were amazing:

  • Better Stories: The robot started writing much more detailed, accurate, and lively descriptions of videos. It could capture small details like "the way the light hit the water" or "the child's nervous fidgeting."
  • Faster Learning: Because it didn't need the "rearview mirror" (reference model), it trained 20% faster.
  • Smarter Robot: Even when they tested it on tasks that had nothing to do with videos (like answering math questions or writing essays), the robot was better at those too. It proved that SynPO didn't just make the robot a better video describer; it made it a smarter, more reliable AI overall.

In a Nutshell

SynPO is like giving a student a better study guide. Instead of just telling them "Don't get the answer wrong" (which makes them anxious and prone to mistakes), it teaches them "Here is exactly what a great answer looks like, and here is how to keep your grammar perfect while you find it." The result is a student who learns faster, writes better, and doesn't panic.
