SynPO: Synergizing Descriptiveness and Preference Optimization for Video Detailed Captioning

This paper introduces SynPO, a preference optimization framework that synergizes descriptiveness and preference learning to enhance fine-grained video captioning. It constructs cost-effective preference pairs and eliminates the reference model, outperforming DPO variants with improved training efficiency while preserving general language capabilities.

Jisheng Dang, Yizhou Zhang, Hao Ye, Teng Wang, Siming Chen, Huicheng Zheng, Yulan Guo, Jianhuang Lai, Bin Hu

Published 2026-03-24

Imagine you have a very smart robot friend who loves watching videos and describing what happens in them. You want this robot to be a master storyteller, not just a robot that says, "A dog runs." You want it to say, "A golden retriever joyfully sprints across the grass, its tail wagging like a metronome, chasing a red ball."

This paper, SynPO, is about teaching that robot how to become a master storyteller without losing its mind in the process.

Here is the story of how they did it, broken down into three simple parts:

1. The Problem: The Robot is Getting "Stuck"

The researchers tried using a popular training method called DPO (Direct Preference Optimization). Think of DPO like a strict teacher who shows the robot two stories: one good one and one bad one. The teacher says, "Pick the good one! Don't pick the bad one!"

The problem? The robot started acting weird.

  • The "Negative" Trap: The robot became so obsessed with avoiding the bad stories that it started writing terrible stories just to make sure they were different from the bad ones. It was like a student so afraid of getting an "F" that they stopped trying to get an "A" and just wrote nonsense to be safe.
  • The "Reference" Glitch: The old method required the robot to constantly compare itself to a "perfect" version of itself from the past (a reference model). This was like trying to run a race while looking in a rearview mirror. It slowed everything down and made the robot forget how to speak naturally.
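To make the "rearview mirror" concrete, here is a minimal sketch of the standard DPO objective (not this paper's method). Notice that the reference model's log-probabilities appear in every term, which is exactly the extra comparison SynPO later removes. The function names and inputs are illustrative, not taken from the paper's code.

```python
import math

def dpo_loss(logp_good, logp_bad, ref_logp_good, ref_logp_bad, beta=0.1):
    """Standard DPO loss on one preference pair.

    The model is rewarded for widening the gap between its good-story
    and bad-story log-probs *relative to a frozen reference model* --
    the "rearview mirror" described above.
    """
    margin = beta * ((logp_good - ref_logp_good) - (logp_bad - ref_logp_bad))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

When the model matches the reference exactly, the margin is zero and the loss is log 2; the loss only drops by moving away from the reference, which is why the reference model must be kept in memory and queried on every step.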

2. The Solution: The "SynPO" Recipe

The authors created a new method called SynPO (Synergistic Preference Optimization). Think of SynPO as a Gourmet Chef who knows exactly how to balance flavors.

Step A: Cooking the Ingredients (Data Construction)

Before the robot can learn, you need good recipes.

  • The Old Way: You'd hire a super-expert human (or a super-expert AI) to grade every story the robot wrote. This is expensive and slow.
  • The SynPO Way: They built a clever kitchen pipeline.
    1. The robot writes 10 different stories about the same video.
    2. They use a "Self-Check" system: The robot looks at its own stories and asks, "Does this match the facts? Is the grammar good? Do these stories agree with each other?"
    3. A simple AI helper gives a quick score.
    4. The best story becomes the "Gold Star" (Positive), and the worst becomes the "Red X" (Negative).
    • Analogy: Instead of hiring a famous food critic to taste every dish, the chef lets the kitchen staff taste-test their own creations and picks the best and worst ones. It's cheap, fast, and surprisingly accurate.
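The kitchen pipeline above can be sketched in a few lines. This is a hypothetical simplification: `generate` and `score` stand in for the robot's caption sampler and the cheap self-check scorer, which the paper implements with its own models and checks.

```python
def build_preference_pair(video, generate, score, n=10):
    """SynPO-style data construction, sketched:
    sample n candidate captions for one video, score each with a
    cheap self-check, then keep the best as the positive example
    ("Gold Star") and the worst as the negative ("Red X")."""
    candidates = [generate(video) for _ in range(n)]
    ranked = sorted(candidates, key=score)  # ascending by quality score
    return ranked[-1], ranked[0]            # (positive, negative)
```

The point of the design is that no expensive external judge is needed: the positive and negative come from the same batch of self-generated captions, ranked by an inexpensive scorer.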

Step B: The New Training Method (The Optimization)

This is the magic sauce. SynPO changes how the robot learns from the Gold Star and the Red X.

  • Fixing the "Negative Trap": In the old method, the robot was punished too hard for the "Red X" stories, which made it panic. SynPO balances the scale. It says, "Don't just avoid the bad stuff; actively chase the good stuff." It stops the robot from getting obsessed with what not to do.
  • The "Language Guardian": The researchers added a special rule: "No matter what, you must still sound like a human." They added a "Language Capability" bonus. If the robot starts writing gibberish just to win the game, it loses points. This keeps the robot's vocabulary and grammar sharp.
  • Ditching the Rearview Mirror: SynPO doesn't need that "reference model" (the past version of the robot). It learns directly.
    • Analogy: Imagine learning to ride a bike. The old way was having someone hold the back of your seat (the reference model) while you pedaled. SynPO is like training with a coach who gives you immediate feedback but lets you ride freely. It's 20% faster because you aren't dragging that extra weight.
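The three fixes above can be combined into one loss. The sketch below is a rough, hypothetical composition of the ideas as described in this summary (reference-free margin, an explicit "chase the good stuff" term, and a language-capability bonus); the paper's exact objective and weightings may differ, and all parameter names here are illustrative.

```python
import math

def synpo_style_loss(logp_good, logp_bad, lm_logp_good,
                     beta=0.1, alpha=1.0, lam=0.05):
    """Hypothetical sketch of the three ideas in the text:
      1. reference-free preference margin (no rearview-mirror model),
      2. a direct reward for the good caption (alpha term),
      3. a language-capability bonus that keeps text fluent (lam term).
    """
    margin = beta * (logp_good - logp_bad)           # no reference model needed
    pref = -math.log(1.0 / (1.0 + math.exp(-margin)))
    chase = -alpha * logp_good                        # actively push up the positive
    guard = -lam * lm_logp_good                       # penalize gibberish
    return pref + chase + guard
```

Because no reference model is queried, each training step does roughly one forward pass less, which is the source of the speedup the text mentions.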

3. The Result: A Better Storyteller

When they tested SynPO, the results were amazing:

  • Better Stories: The robot started writing much more detailed, accurate, and lively descriptions of videos. It could capture small details like "the way the light hit the water" or "the child's nervous fidgeting."
  • Faster Learning: Because it didn't need the "rearview mirror" (reference model), it trained 20% faster.
  • Smarter Robot: Even when they tested it on tasks that had nothing to do with videos (like answering math questions or writing essays), the robot was better at those too. It proved that SynPO didn't just make the robot a better video describer; it made it a smarter, more reliable AI overall.

In a Nutshell

SynPO is like giving a student a better study guide. Instead of just telling them "Don't get the answer wrong" (which makes them anxious and prone to mistakes), it teaches them "Here is exactly what a great answer looks like, and here is how to keep your grammar perfect while you find it." The result is a student who learns faster, writes better, and doesn't panic.
