ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos

ViterbiPlanNet introduces a principled framework that injects procedural knowledge into instructional video planning via a Differentiable Viterbi Layer, achieving state-of-the-art performance with significantly fewer parameters and improved sample efficiency compared to existing large-scale models.

Luigi Seminara, Davide Moltisanti, Antonino Furnari

Published 2026-03-05

Imagine you are trying to teach a robot how to make a sandwich. You show it a picture of two slices of bread and a picture of a finished turkey sandwich. Your goal is for the robot to figure out the steps in between: put bread down, add turkey, add lettuce, put top bread on.

The problem is, if you just ask a super-smart robot (like a giant AI) to "figure it out," it might get confused. It might try to put the turkey on the table before the bread, or it might forget that you need to put the bottom slice down first. To fix this, current AI models try to "memorize" millions of sandwich recipes by reading huge amounts of data. They become massive, expensive, and slow, like a librarian who has read every book in the world but still gets confused when asked to make a simple sandwich.

Enter ViterbiPlanNet.

The authors of this paper came up with a smarter, lighter, and more efficient way to teach the robot. Instead of forcing the AI to memorize every single rule, they gave it a map.

The Core Idea: The "Recipe Map" (Procedural Knowledge Graph)

Imagine you have a physical map of a city.

  • The Intersections are the actions (e.g., "Add Turkey").
  • The Roads are the valid moves between actions.
  • The Traffic Signs tell you which roads are one-way (you can't add lettuce before the bread is down).

This map is called a Procedural Knowledge Graph (PKG). In the past, AI researchers would use this map only after the AI made a guess, just to fix mistakes (like a GPS recalculating a route after you took a wrong turn).

ViterbiPlanNet does something different: it builds the map into the AI's brain while it is learning.
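
To make the map concrete, here is a minimal sketch of what a PKG encodes, in plain Python. Everything here (the action names, the edges) is invented for the sandwich analogy rather than taken from the paper's actual graph:

```python
# Toy "recipe map" (Procedural Knowledge Graph) for the sandwich example.
# Illustrative only: action names and edges are invented for this analogy,
# not taken from the paper's actual graph.

actions = ["put_bread", "add_turkey", "add_lettuce", "top_bread"]

# Edges = the one-way roads: which action may directly follow which.
allowed = {
    "put_bread":   {"add_turkey", "add_lettuce"},
    "add_turkey":  {"add_lettuce", "top_bread"},
    "add_lettuce": {"add_turkey", "top_bread"},
    "top_bread":   set(),  # final step: no roads lead out
}

def is_valid_plan(plan):
    """A plan is valid if every consecutive pair of steps follows an edge."""
    return all(b in allowed[a] for a, b in zip(plan, plan[1:]))

print(is_valid_plan(["put_bread", "add_turkey", "add_lettuce", "top_bread"]))  # True
print(is_valid_plan(["add_turkey", "put_bread"]))  # False: turkey before bread
```

A post-hoc approach would use this graph only to validate or repair a finished plan; the point of ViterbiPlanNet is to wire these constraints into training itself.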

The Secret Sauce: The "Differentiable Viterbi Layer"

This is the technical magic, but here is the simple version:

Usually, computers are bad at learning from maps because the map is rigid. If the AI makes a tiny mistake, the map simply says "NO," and the AI can't learn from that error: a hard "no" is all-or-nothing, carrying no hint about how wrong the guess was or which way to adjust. (In math terms, a hard constraint isn't differentiable, so no learning signal flows back through it.)

The authors invented a "Soft Map" (the Differentiable Viterbi Layer).

  • Imagine the map isn't made of concrete walls, but of soft, stretchy rubber bands.
  • If the AI tries to put the turkey on the table, the rubber band stretches and gently pulls it back, saying, "Hey, that's not quite right, try the bread first."
  • Because the map is "soft" (mathematically smooth), the AI can feel the pull and learn why it was wrong. It learns to predict the right steps by feeling the shape of the map, rather than just memorizing the destination.

Why This is a Big Deal

  1. It's a Lightweight Backpack, Not a Heavy Suit:
    Current AI models trying to do this are like wearing a 300-pound suit of armor (billions of parameters). They need massive computers to run. ViterbiPlanNet is like wearing a lightweight hiking backpack. It uses 1,000 times fewer resources but still walks the path faster and more accurately.

  2. It Learns Faster (Sample Efficiency):
    Because the AI has the map, it doesn't need to see a million sandwich videos to learn. It only needs to see a few, because the map tells it what usually happens. It's like learning to drive: if you have a map of the rules of the road, you don't need to crash a million times to learn how to stop at a red light.

  3. It Doesn't Get Lost in the Dark:
    The paper tested the AI on "shorter" tasks than it was trained on. Imagine training the AI to make a 6-step sandwich, then asking it to make a 3-step one.

    • Old AI: "I only know how to make a 6-step sandwich! I'm confused!"
    • ViterbiPlanNet: "Oh, I just need to follow the map for the first three stops. Easy!"

    It understands the structure of the process, not just the specific length of the task.
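
That length flexibility falls naturally out of Viterbi decoding: the same transition graph can score a plan of any length, you simply run the recurrence for fewer steps. Here is a toy hard-Viterbi decoder illustrating the idea (all action names, scores, and the graph itself are invented for the sandwich analogy, not the paper's):

```python
# One transition graph, plans of any length. Illustrative toy example.

actions = ["put_bread", "add_turkey", "add_lettuce", "top_bread"]
NEG = -1e9  # very negative score stands in for a forbidden edge

# log_trans[i][j]: score of moving from action i to action j.
log_trans = [
    # bread  turkey  lettuce  top
    [NEG,    0.0,    0.0,     NEG],  # after put_bread
    [NEG,    NEG,    0.0,     0.0],  # after add_turkey
    [NEG,    0.0,    NEG,     0.0],  # after add_lettuce
    [NEG,    NEG,    NEG,     NEG],  # after top_bread (terminal)
]

def best_plan(step_scores, log_trans, actions):
    """Hard Viterbi decode: the highest-scoring action sequence of exactly
    len(step_scores) steps. A shorter task just means fewer steps."""
    n = len(actions)
    alpha = list(step_scores[0])
    back = []
    for t in range(1, len(step_scores)):
        new_alpha, ptrs = [], []
        for j in range(n):
            i = max(range(n), key=lambda a: alpha[a] + log_trans[a][j])
            new_alpha.append(alpha[i] + log_trans[i][j] + step_scores[t][j])
            ptrs.append(i)
        alpha = new_alpha
        back.append(ptrs)
    # Trace back the best path from the best final action.
    j = max(range(n), key=lambda k: alpha[k])
    path = [j]
    for ptrs in reversed(back):
        j = ptrs[j]
        path.append(j)
    return [actions[i] for i in reversed(path)]

# A 4-step task and a 2-step task, decoded with the *same* graph.
long_scores = [[1, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 0, 0], [0, 0, 0, 2]]
short_scores = [[0, 1, 0, 0], [0, 0, 0, 2]]
print(best_plan(long_scores, log_trans, actions))
# -> ['put_bread', 'add_turkey', 'add_lettuce', 'top_bread']
print(best_plan(short_scores, log_trans, actions))
# -> ['add_turkey', 'top_bread']
```

Nothing in the graph hard-codes a plan length, which is the structural reason the model can be trained on 6-step tasks and still handle 3-step ones.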

The "Unified Test" (The Fair Play Rule)

The authors also noticed that previous researchers weren't playing fair. Different papers tested their AI on different data splits, under different rules, or with different "rulers" (metrics) to measure success. It was like comparing a sprinter's time on a muddy track to a runner's time on a track made of ice.

They created a Unified Testing Protocol. They made sure everyone ran the same race, on the same track, with the same stopwatch. When they did this, ViterbiPlanNet didn't just win; it dominated, proving that the "map" approach is genuinely superior, not just a lucky fluke.

Summary

ViterbiPlanNet is like giving a robot a GPS and a rulebook while it learns to cook, rather than forcing it to memorize every single recipe in the world.

  • The Map (PKG): Tells it what steps are possible.
  • The Soft Pull (Differentiable Viterbi): Gently guides it to learn the right path without getting stuck.
  • The Result: A robot that is smarter, faster, cheaper to run, and doesn't get confused when the task changes slightly.

It's a move away from "brute force" AI (throwing more data and money at the problem) toward "smart" AI (using logic and structure to learn efficiently).