Imagine How To Change: Explicit Procedure Modeling for Change Captioning

The paper introduces ProCap, a framework that recasts change captioning from static image comparison into dynamic procedure modeling. Its two-stage design first learns latent change dynamics from sparse keyframes, then uses learnable procedure queries to generate temporally coherent descriptions of how changes occur.

Jiayang Sun, Zixin Guo, Min Cao, Guibo Zhu, Jorma Laaksonen

Published 2026-03-09

Imagine you are trying to explain to a friend how a magic trick happened. You have two photos: one of a magician holding a red ball, and one of the magician holding a blue ball.

The Old Way (Static Comparison):
Most change captioning systems today just look at those two photos side by side. They squint and say, "Okay, the ball changed color." But they miss the story. Did the magician swap it? Did he paint it? Did he pull a rabbit out of a hat that turned into a ball? Because they only see the "Before" and "After," they are often confused by distractions, such as the magician moving his hand or the lighting changing. They miss the how.

The New Way (ProCap):
This paper introduces a new system called ProCap. Instead of just staring at the two end photos, ProCap acts like a movie director. It imagines the entire movie scene that happened between the two photos.

Here is how it works, broken down into simple steps:

1. The "Imagination" Phase (Filling in the Blanks)

Imagine you have a flipbook. You have the first page and the last page, but all the pages in the middle are blank.

  • The Problem: If you try to draw every single frame between the start and finish, you'd draw thousands of pages. Most of them would look almost identical (like a ball moving one millimeter to the left). That's a waste of time and energy.
  • ProCap's Solution: It uses a smart "AI artist" to quickly sketch the missing frames. Then, it acts like a film editor. It looks at all those sketches and says, "Okay, we don't need 100 frames of the ball just sitting there. Let's keep only the 3 or 4 most important moments where the action actually happens."
    • Analogy: It's like summarizing a 2-hour movie into a 30-second highlight reel that captures the essence of the plot.
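
The "film editor" step can be sketched in a few lines of Python. This summary does not say how ProCap actually scores or picks its keyframes, so the rule below (keep the in-between frames that differ most from the frame before them) is only an assumed stand-in, and the names frame_distance and select_keyframes are hypothetical.

```python
def frame_distance(a, b):
    """Mean absolute difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(frames, k):
    """Indices of the k in-between frames that differ most from the frame before them."""
    # Score only the imagined middle frames; the real "before" and "after"
    # photos are the endpoints and are always available anyway.
    scores = [
        (frame_distance(frames[i - 1], frames[i]), i)
        for i in range(1, len(frames) - 1)
    ]
    # Highest change first; ties go to the earlier frame.
    top = sorted(scores, key=lambda s: (-s[0], s[1]))[:k]
    # Return the chosen moments in time order.
    return sorted(i for _, i in top)


# Toy "flipbook": each frame is a flattened list of pixel values.
frames = [[0, 0], [0, 0], [0, 0], [5, 0], [5, 0], [5, 5], [5, 5]]
keyframes = select_keyframes(frames, k=2)
print(keyframes)  # prints [3, 5]
```

In the toy flipbook, frames 3 and 5 are the only ones where something actually changes, so they are the ones the editor keeps.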

2. The "Study" Phase (Learning the Rules)

Now that ProCap has its "highlight reel" of key moments, it studies them intensely.

  • It plays a game of "Blind Reconstruction." It covers up parts of the highlight reel (like hiding the ball or the background) and asks itself, "Based on the text description I have, what is hidden here?"
  • By doing this over and over, it learns the rules of change. It learns that if a ball moves, it usually follows a smooth path. It learns to ignore distractions (like a cloud passing by) and focus on the actual change (the ball moving).
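
The "Blind Reconstruction" game resembles masked reconstruction training, and a rough sketch of its bookkeeping looks like this. It is a simplification under stated assumptions: the tokens are plain numbers rather than image features, MASK stands in for a learned mask token, the masking ratio is arbitrary, and no model or text conditioning is included. The names mask_tokens and reconstruction_loss are hypothetical.

```python
import random

MASK = None  # placeholder standing in for a learned mask token (assumption)


def mask_tokens(tokens, ratio, rng):
    """Hide a fraction of the tokens; return the masked sequence and hidden positions."""
    n_hidden = max(1, round(len(tokens) * ratio))
    hidden = sorted(rng.sample(range(len(tokens)), n_hidden))
    masked = [MASK if i in hidden else t for i, t in enumerate(tokens)]
    return masked, hidden


def reconstruction_loss(predicted, original, hidden):
    """Mean squared error, computed only over the positions that were hidden."""
    return sum((predicted[i] - original[i]) ** 2 for i in hidden) / len(hidden)


# Toy sequence standing in for keyframe features.
tokens = [0.1, 0.4, 0.9, 0.3, 0.7, 0.2]
masked, hidden = mask_tokens(tokens, ratio=0.5, rng=random.Random(0))

# A perfect guess at the hidden tokens scores zero loss.
perfect_loss = reconstruction_loss(tokens, tokens, hidden)
```

Scoring only the hidden positions is what forces a model to infer the covered content from context instead of copying what it can already see.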

3. The "Storytelling" Phase (The Magic Trick)

This is the clever part. When ProCap is ready to describe a new pair of photos for a user, it doesn't actually generate the video frames again. That would be too slow and heavy.

  • Instead, it uses invisible "magic slots" (called learnable queries). Think of these as empty placeholders in a sentence.
  • The system asks these slots: "Based on what I learned in the study phase, what would happen between these two photos?"
  • The slots fill in the "ghost" of the movement, and the system writes the caption based on that invisible story.
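
One common way to let such slots "ask" the photos a question is cross-attention, sketched below in plain Python. The summary does not spell out ProCap's exact mechanism, so this is an assumption: the image features serve as both keys and values with no learned projections, and the query values are fixed by hand, whereas in a real model they would be learned during training. The function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, features):
    """Each query slot returns a softmax-weighted average of the image features."""
    dim = len(features[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in queries:
        # How strongly this slot matches each feature vector.
        weights = softmax([dot(q, f) / scale for f in features])
        # Blend the features according to those weights.
        outputs.append(
            [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        )
    return outputs


# Two toy feature vectors from the "before" and "after" photos.
features = [[1.0, 0.0], [0.0, 1.0]]
# Two toy slots: an uninformative one and one tuned to the first feature.
queries = [[0.0, 0.0], [4.0, 0.0]]
slots = cross_attend(queries, features)
```

The all-zero slot simply averages the two features, while the second slot leans toward the feature it matches. No intermediate frames are generated at this stage; each slot only reads from the two photos.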

Why is this a big deal?

  • It's not just a detective; it's a historian. Old methods just say "The ball is blue now." ProCap says, "The ball rolled from the left, changed color, and stopped here."
  • It ignores the noise. If the camera shook or the sun moved, old methods get confused. ProCap knows that the story is about the ball, not the sun, so it filters out the noise.
  • It's fast. Because it learned the "rules" of how things change during its study phase, it doesn't need to re-draw the movie every time it writes a caption. It just recalls the pattern.

In a nutshell:
ProCap changes the game from "Spot the difference between two still photos" to "Imagine the movie that connects them." By understanding the journey of the change, not just the destination, it writes more accurate and more human-like descriptions.