Imagine How To Change: Explicit Procedure Modeling for Change Captioning

Imagine you are trying to explain to a friend how a magic trick happened. You have two photos: one of a magician holding a red ball, and one of the magician holding a blue ball.

The Old Way (Static Comparison):
Most computer programs today just look at those two photos side by side. They squint and say, "Okay, the ball changed color." But they miss the story. Did the magician swap it? Did he paint it? Did he pull a rabbit out of a hat that turned into a ball? Because they only see the "Before" and "After," they are easily confused by distractions, such as the magician moving his hand or the lighting changing. They miss the how.

The New Way (ProCap):
This paper introduces a new system called ProCap. Instead of just staring at the two end photos, ProCap acts like a movie director. It imagines the entire movie scene that happened between the two photos.

Here is how it works, broken down into simple steps:

1. The "Imagination" Phase (Filling in the Blanks)

Imagine you have a flipbook. You have the first page and the last page, but all the pages in the middle are blank.

The Problem: If you try to draw every single frame between the start and finish, you'd draw thousands of pages. Most of them would look almost identical (like a ball moving one millimeter to the left). That's a waste of time and energy.
ProCap's Solution: It uses a smart "AI artist" to quickly sketch the missing frames. Then, it acts like a film editor. It looks at all those sketches and says, "Okay, we don't need 100 frames of the ball just sitting there. Let's keep only the 3 or 4 most important moments where the action actually happens."
- Analogy: It's like summarizing a 2-hour movie into a 30-second highlight reel that captures the essence of the plot.

2. The "Study" Phase (Learning the Rules)

Now that ProCap has its "highlight reel" of key moments, it studies them intensely.

It tries to play a game of "Blind Reconstruction." It covers up parts of the video (like hiding the ball or the background) and asks itself, "Based on the text description I have, what should be hidden here?"
By doing this over and over, it learns the rules of change. It learns that if a ball moves, it usually follows a smooth path. It learns to ignore distractions (like a cloud passing by) and focus on the actual change (the ball moving).

3. The "Storytelling" Phase (The Magic Trick)

This is the clever part. When ProCap is ready to describe a new pair of photos for a user, it doesn't actually generate the video frames again. That would be too slow and heavy.

Instead, it uses invisible "magic slots" (called learnable queries). Think of these as empty placeholders in a sentence.
The system asks these slots: "Based on what I learned in the study phase, what would happen between these two photos?"
The slots fill in the "ghost" of the movement, and the system writes the caption based on that invisible story.

Why is this a big deal?

It's not just a detective; it's a historian. Old methods just say "The ball is blue now." ProCap says, "The ball rolled from the left, changed color, and stopped here."
It ignores the noise. If the camera shook or the sun moved, old methods get confused. ProCap knows that the story is about the ball, not the sun, so it filters out the noise.
It's fast. Because it learned the "rules" of how things change during its study phase, it doesn't need to re-draw the movie every time it answers a question. It just recalls the pattern.

In a nutshell:
ProCap changes the game from "Spot the difference between two still photos" to "Imagine the movie that connects them." By understanding the journey of the change, not just the destination, it writes much better, more accurate, and more human-like descriptions.

Here is a detailed technical summary of the paper "IMAGINE HOW TO CHANGE: EXPLICIT PROCEDURE MODELING FOR CHANGE CAPTIONING" (ProCap).

1. Problem Statement

Change Captioning is the task of generating textual descriptions that explicitly detail the differences between two visually similar images (a "before" and "after" pair). While existing methods have achieved success, they suffer from a fundamental limitation: they treat the task as a static image comparison.

The Gap: Current approaches ignore the rich temporal dynamics of the change process. In reality, the transition between two states involves intermediate steps (motion, occlusion, transformation) that are implicitly encoded in the static pair but explicitly revealed in a temporal sequence.
Consequences: Static models struggle to distinguish between actual semantic changes and distractors (e.g., viewpoint shifts, illumination changes, background clutter) because they lack the "how" of the change, focusing only on the "what."

2. Methodology: ProCap Framework

The authors propose ProCap, a novel two-stage framework that shifts the paradigm from static comparison to dynamic procedure modeling.

Stage 1: Explicit Procedure Modeling

This stage aims to learn the latent dynamics of the change by reconstructing the transition process. It consists of three modules:

Procedure Generation Module:
- Uses a pre-trained Frame Interpolation (FI) model (based on optical flow) to synthesize a dense sequence of intermediate frames ($P_{FI}$) between the input image pair ($I_{bef}, I_{aft}$).
- This transforms the implicit change into an explicit, observable temporal sequence.
Confidence-Based Frame Sampling Module:
- Addresses the redundancy and noise in the dense generated sequence.
- Assigns a confidence score to each frame based on its semantic distance from the start and end states (frames closest to the "midpoint" of change are most informative).
- Samples a sparse set of keyframes ($P_s$) to form a compact procedure sequence $P = \{I_{bef}, I_{s1}, \dots, I_{aft}\}$.
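
The sampling step can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the cosine distance on per-frame feature vectors, the `confidence` formula (which peaks when a frame is equally far from both endpoints, matching the "midpoint" intuition), and all function names are assumptions. The sketch also uses visual features only, whereas the summary reports that a Visual+Text similarity strategy worked best.

```python
import math

def cosine_distance(u, v):
    # Assumes non-zero feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def confidence(feat, feat_bef, feat_aft):
    # Hypothetical score in [0, 1]: 1.0 when the frame is equally distant
    # from the "before" and "after" states, lower as it nears either end.
    d_bef = cosine_distance(feat, feat_bef)
    d_aft = cosine_distance(feat, feat_aft)
    total = d_bef + d_aft
    if total == 0.0:
        return 0.0
    return 1.0 - abs(d_bef - d_aft) / total

def sample_keyframes(frame_feats, feat_bef, feat_aft, k):
    # Keep the k highest-confidence frames, returned in temporal order.
    scores = [confidence(f, feat_bef, feat_aft) for f in frame_feats]
    ranked = sorted(range(len(frame_feats)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Calling `sample_keyframes(feats, feat_bef, feat_aft, k=3)` returns the indices of the kept intermediate frames; the endpoints $I_{bef}$ and $I_{aft}$ would then be added back to form the procedure sequence.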
Procedure Modeling Module:
- A Transformer-based Procedure Encoder learns the spatio-temporal dynamics of the sampled keyframes.
- Training Objective: A caption-conditioned masked reconstruction task. The model is trained to reconstruct masked visual patches (using multi-granularity masking: entire frames, random patches, in-block, and out-of-block) given the change caption.
- Loss Functions:
  - $L_{msm}$: Masked sequence modeling (reconstruction).
  - $L_{align}$: Cross-modal alignment between visual procedure and text.
  - $L_{csy}$: Temporal consistency (ensuring the sequence order is preserved).
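
To make the four masking granularities concrete, here is a small sketch that builds boolean masks over a sequence of `t` frames, each split into an `h` x `w` grid of patches (`True` means the patch is hidden and must be reconstructed). The grid layout, mask ratios, block size, and the choice to reuse one block position across all frames are illustrative assumptions; the summary does not specify them.

```python
import random

def empty_mask(t, h, w):
    return [[[False] * w for _ in range(h)] for _ in range(t)]

def frame_mask(t, h, w, rng, n_frames=1):
    # Frame-level masking: hide every patch of n_frames randomly chosen frames.
    mask = empty_mask(t, h, w)
    for f in rng.sample(range(t), n_frames):
        mask[f] = [[True] * w for _ in range(h)]
    return mask

def random_patch_mask(t, h, w, rng, ratio=0.5):
    # Random masking: hide a fixed fraction of patches anywhere in the sequence.
    cells = [(f, r, c) for f in range(t) for r in range(h) for c in range(w)]
    mask = empty_mask(t, h, w)
    for f, r, c in rng.sample(cells, int(ratio * len(cells))):
        mask[f][r][c] = True
    return mask

def block_mask(t, h, w, rng, block=2, inside=True):
    # In-block masking hides a block x block region in every frame;
    # out-of-block masking (inside=False) hides everything except that region.
    top = rng.randrange(h - block + 1)
    left = rng.randrange(w - block + 1)
    mask = empty_mask(t, h, w)
    for f in range(t):
        for r in range(h):
            for c in range(w):
                in_block = top <= r < top + block and left <= c < left + block
                mask[f][r][c] = in_block if inside else not in_block
    return mask

def count_masked(mask):
    return sum(cell for frame in mask for row in frame for cell in row)
```

During training, one of these strategies would be drawn per sample, and the reconstruction loss $L_{msm}$ would be computed only on the patches marked `True`.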

Stage 2: Implicit Procedure Captioning

This stage generates the final caption without the computational cost of synthesizing frames during inference.

Learnable Procedure Queries: Instead of feeding explicit intermediate frames (which are noisy and expensive), the model inserts a set of $k \cdot n_I$ learnable query embeddings between the "before" and "after" image features.
Mechanism: These queries act as "slots" that prompt the frozen Procedure Encoder to implicitly infer the latent change dynamics.
Decoder: A Transformer decoder generates the text caption based on the encoded representation.
Optimization: The entire model is trained end-to-end using an autoregressive captioning loss ($L_{CAP}$).
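
The query mechanism amounts to assembling the encoder input without any synthesized frames. The sketch below shows one plausible layout. It assumes that $n_I$ is the number of tokens per image, so that $k \cdot n_I$ queries stand in for $k$ imagined intermediate frames; the small Gaussian initialization and the function names are likewise hypothetical.

```python
import random

def init_queries(k, n_i, dim, rng):
    # k * n_i query vectors of width dim; in a real model these would be
    # trainable parameters updated by the captioning loss.
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(k * n_i)]

def build_encoder_input(bef_tokens, aft_tokens, queries):
    # The queries occupy the positions where explicit intermediate frames
    # sat during Stage 1: [before tokens | queries | after tokens].
    return bef_tokens + queries + aft_tokens
```

With `k` slots the encoder sees `(k + 2) * n_i` tokens, the same length as a procedure sequence containing `k` sampled keyframes, but nothing has to be generated at inference time.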

3. Key Contributions

Paradigm Shift: ProCap is the first work to reformulate change captioning from static image comparison to dynamic procedure modeling, explicitly modeling the transition process to understand "how" changes occur.
Explicit Procedure Modeling: Introduces a pipeline to synthesize, sample, and model intermediate frames using a caption-conditioned masked reconstruction task, effectively capturing latent spatio-temporal dynamics.
Implicit Procedure Captioning: Proposes learnable procedure queries to replace explicit frame synthesis at inference time. This bypasses the computational overhead and sensitivity to synthesis noise while maintaining the benefits of temporal reasoning.
State-of-the-Art Performance: Demonstrates superior robustness to viewpoint changes and complex scene dynamics compared to existing non-LLM and LLM-based baselines.

4. Experimental Results

The model was evaluated on three benchmark datasets: CLEVR-Change (synthetic), Spot-the-Diff (real-world surveillance), and Image-Editing-Request (open-ended editing).

Performance:
- CLEVR-Change: ProCap achieved a CIDEr score of 135.6, outperforming all non-LLM baselines and competitive LLM-based methods (e.g., FINER, LLaVA-1.5) while using a much lighter architecture.
- Spot-the-Diff: Achieved a CIDEr of 42.7, demonstrating strong ability to disentangle concurrent changes in cluttered scenes.
- Image-Editing-Request: Outperformed all non-LLM baselines, showing strong generalization to open-ended vocabulary.
Efficiency:
- ProCap is significantly faster than methods relying on LLMs or explicit frame synthesis.
- Inference speed, measured in tokens per second (TPS), is high (e.g., ~699 TPS on CLEVR-Change with $k=2$), whereas explicit frame synthesis approaches are much slower.
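
For reference, a tokens-per-second figure like the one above is simply the number of generated tokens divided by wall-clock generation time. The helper below is a generic sketch of that measurement; `generate` is a hypothetical callable returning a list of tokens for one input, and the paper's exact timing protocol (hardware, batch size, warm-up) is not given in this summary.

```python
import time

def tokens_per_second(generate, inputs):
    # Total generated tokens divided by total elapsed wall-clock time.
    start = time.perf_counter()
    total_tokens = sum(len(generate(x)) for x in inputs)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed
```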
Ablation Studies:
- Removing the explicit procedure modeling stage or the learnable queries significantly degrades performance, confirming the necessity of both stages.
- The Visual+Text similarity strategy for frame sampling proved most effective.
- Multi-granularity masking was crucial for learning both global context and fine-grained details.

5. Significance and Impact

Robustness to Distractors: By modeling the full transition path, ProCap can distinguish between genuine object changes and irrelevant variations (like camera movement), a common failure point for static models.
Efficiency vs. Effectiveness: The "Implicit Procedure Captioning" design offers a unique solution to the trade-off between the rich temporal information provided by video/frames and the computational cost of processing them. It achieves video-like reasoning capabilities with image-like inference costs.
Generalization: The framework shows that learning the "procedure" of change provides a more robust representation than static feature comparison, leading to better generalization in open-ended scenarios.
Future Direction: The paper suggests that while 2D generative models work well for moderate changes, future work could integrate 3D scene modeling to handle extreme geometric discontinuities and viewpoint shifts.

In summary, ProCap introduces a principled approach to change captioning by treating the difference between images not as a static gap, but as a dynamic, learnable procedure, resulting in more accurate, coherent, and efficient descriptions.
