ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos

ViterbiPlanNet introduces a principled framework that injects procedural knowledge into instructional video planning via a Differentiable Viterbi Layer, achieving state-of-the-art performance with significantly fewer parameters and improved sample efficiency compared to existing large-scale models.

Luigi Seminara, Davide Moltisanti, Antonino Furnari

Published 2026-03-05

Imagine you are trying to teach a robot how to make a sandwich. You show it a picture of two slices of bread and a picture of a finished turkey sandwich. Your goal is for the robot to figure out the steps in between: put bread down, add turkey, add lettuce, put top bread on.

The problem is, if you just ask a super-smart robot (like a giant AI) to "figure it out," it might get confused. It might try to put the turkey on the table before the bread, or it might forget that you need to put the bottom slice down first. To fix this, current AI models try to "memorize" millions of sandwich recipes by reading huge amounts of data. They become massive, expensive, and slow, like a librarian who has read every book in the world but still gets confused when asked to make a simple sandwich.

Enter ViterbiPlanNet.

The authors of this paper came up with a smarter, lighter, and more efficient way to teach the robot. Instead of forcing the AI to memorize every single rule, they gave it a map.

The Core Idea: The "Recipe Map" (Procedural Knowledge Graph)

Imagine you have a physical map of a city.

  • The Intersections are the actions (e.g., "Add Turkey").
  • The Roads are the valid moves between actions.
  • The Traffic Signs tell you which roads are one-way (you can't add lettuce before the bread is down).

This map is called a Procedural Knowledge Graph (PKG). In the past, AI researchers would use this map only after the AI made a guess, just to fix mistakes (like a GPS recalculating a route after you took a wrong turn).

ViterbiPlanNet does something different: it builds the map into the AI's brain while it is learning.
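
To make the map concrete, here is a minimal sketch of what a PKG encodes, in plain Python. Everything here (the action names, the edges) is invented for the sandwich analogy rather than taken from the paper's actual graph:

```python
# Toy "recipe map" (Procedural Knowledge Graph) for the sandwich example.
# Illustrative only: action names and edges are invented for this analogy,
# not taken from the paper's actual graph.

actions = ["put_bread", "add_turkey", "add_lettuce", "top_bread"]

# Edges = the one-way roads: which action may directly follow which.
allowed = {
    "put_bread":   {"add_turkey", "add_lettuce"},
    "add_turkey":  {"add_lettuce", "top_bread"},
    "add_lettuce": {"add_turkey", "top_bread"},
    "top_bread":   set(),  # final step: no roads lead out
}

def is_valid_plan(plan):
    """A plan is valid if every consecutive pair of steps follows an edge."""
    return all(b in allowed[a] for a, b in zip(plan, plan[1:]))

print(is_valid_plan(["put_bread", "add_turkey", "add_lettuce", "top_bread"]))  # True
print(is_valid_plan(["add_turkey", "put_bread"]))  # False: turkey before bread
```

A post-hoc approach would use this graph only to validate or repair a finished plan; the point of ViterbiPlanNet is to wire these constraints into training itself.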

The Secret Sauce: The "Differentiable Viterbi Layer"

This is the technical magic, but here is the simple version:

Usually, computers are bad at learning from maps because the map is rigid. If the AI makes a tiny mistake, the map simply says "NO," and the AI can't learn from that error: a hard "no" is all-or-nothing, carrying no hint about how wrong the guess was or which way to adjust. (In math terms, a hard constraint isn't differentiable, so no learning signal flows back through it.)

The authors invented a "Soft Map" (the Differentiable Viterbi Layer).

  • Imagine the map isn't made of concrete walls, but of soft, stretchy rubber bands.
  • If the AI tries to put the turkey on the table, the rubber band stretches and gently pulls it back, saying, "Hey, that's not quite right, try the bread first."
  • Because the map is "soft" (mathematically smooth), the AI can feel the pull and learn why it was wrong. It learns to predict the right steps by feeling the shape of the map, rather than just memorizing the destination.

Why This is a Big Deal

  1. It's a Lightweight Backpack, Not a Heavy Suit:
    Current AI models trying to do this are like wearing a 300-pound suit of armor (billions of parameters). They need massive computers to run. ViterbiPlanNet is like wearing a lightweight hiking backpack. It uses 1,000 times fewer resources but still walks the path faster and more accurately.

  2. It Learns Faster (Sample Efficiency):
    Because the AI has the map, it doesn't need to see a million sandwich videos to learn. It only needs to see a few, because the map tells it what usually happens. It's like learning to drive: if you have a map of the rules of the road, you don't need to crash a million times to learn how to stop at a red light.

  3. It Doesn't Get Lost in the Dark:
    The paper tested the AI on "shorter" tasks than it was trained on. Imagine training the AI to make a 6-step sandwich, then asking it to make a 3-step one.

    • Old AI: "I only know how to make a 6-step sandwich! I'm confused!"
    • ViterbiPlanNet: "Oh, I just need to follow the map for the first three stops. Easy!"

    It understands the structure of the process, not just the specific length of the task.
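
That length flexibility falls naturally out of Viterbi decoding: the same transition graph can score a plan of any length, you simply run the recurrence for fewer steps. Here is a toy hard-Viterbi decoder illustrating the idea (all action names, scores, and the graph itself are invented for the sandwich analogy, not the paper's):

```python
# One transition graph, plans of any length. Illustrative toy example.

actions = ["put_bread", "add_turkey", "add_lettuce", "top_bread"]
NEG = -1e9  # very negative score stands in for a forbidden edge

# log_trans[i][j]: score of moving from action i to action j.
log_trans = [
    # bread  turkey  lettuce  top
    [NEG,    0.0,    0.0,     NEG],  # after put_bread
    [NEG,    NEG,    0.0,     0.0],  # after add_turkey
    [NEG,    0.0,    NEG,     0.0],  # after add_lettuce
    [NEG,    NEG,    NEG,     NEG],  # after top_bread (terminal)
]

def best_plan(step_scores, log_trans, actions):
    """Hard Viterbi decode: the highest-scoring action sequence of exactly
    len(step_scores) steps. A shorter task just means fewer steps."""
    n = len(actions)
    alpha = list(step_scores[0])
    back = []
    for t in range(1, len(step_scores)):
        new_alpha, ptrs = [], []
        for j in range(n):
            i = max(range(n), key=lambda a: alpha[a] + log_trans[a][j])
            new_alpha.append(alpha[i] + log_trans[i][j] + step_scores[t][j])
            ptrs.append(i)
        alpha = new_alpha
        back.append(ptrs)
    # Trace back the best path from the best final action.
    j = max(range(n), key=lambda k: alpha[k])
    path = [j]
    for ptrs in reversed(back):
        j = ptrs[j]
        path.append(j)
    return [actions[i] for i in reversed(path)]

# A 4-step task and a 2-step task, decoded with the *same* graph.
long_scores = [[1, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 0, 0], [0, 0, 0, 2]]
short_scores = [[0, 1, 0, 0], [0, 0, 0, 2]]
print(best_plan(long_scores, log_trans, actions))
# -> ['put_bread', 'add_turkey', 'add_lettuce', 'top_bread']
print(best_plan(short_scores, log_trans, actions))
# -> ['add_turkey', 'top_bread']
```

Nothing in the graph hard-codes a plan length, which is the structural reason the model can be trained on 6-step tasks and still handle 3-step ones.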

The "Unified Test" (The Fair Play Rule)

The authors also noticed that previous researchers weren't playing fair. Different papers tested their AI on different data splits, under different rules, or with different "rulers" (metrics) to measure success. It was like comparing a sprinter's time on a muddy track to a runner's time on a track made of ice.

They created a Unified Testing Protocol. They made sure everyone ran the same race, on the same track, with the same stopwatch. When they did this, ViterbiPlanNet didn't just win; it dominated, proving that the "map" approach is genuinely superior, not just a lucky fluke.

Summary

ViterbiPlanNet is like giving a robot a GPS and a rulebook while it learns to cook, rather than forcing it to memorize every single recipe in the world.

  • The Map (PKG): Tells it what steps are possible.
  • The Soft Pull (Differentiable Viterbi): Gently guides it to learn the right path without getting stuck.
  • The Result: A robot that is smarter, faster, cheaper to run, and doesn't get confused when the task changes slightly.

It's a move away from "brute force" AI (throwing more data and money at the problem) toward "smart" AI (using logic and structure to learn efficiently).