LAP: A Language-Aware Planning Model For Procedure Planning In Instructional Videos

This paper introduces LAP, a novel procedure planning model that uses a fine-tuned Vision-Language Model to convert visual observations into distinctive text embeddings for a diffusion-based planner. By resolving visual ambiguities through language, it achieves state-of-the-art performance on multiple benchmarks.

Lei Shi, Victor Aregbede, Andreas Persson, Martin Längkvist, Amy Loutfi, Stephanie Lowry

Published Wed, 11 Ma

Imagine you are trying to teach a robot how to make a perfect cup of coffee. You show the robot a video of someone doing it. The robot needs to figure out the steps: grind beans, fill the filter, tamp the coffee, pour water.

This is the challenge of Procedure Planning: looking at a start point (an empty cup) and a goal point (a full cup) and figuring out the invisible steps in between.

The paper introduces a new AI model called LAP (Language-Aware Planning) that solves a major problem robots face: Visual Confusion.

The Problem: "They All Look the Same"

Imagine you are looking at two different cooking videos.

  1. Video A: Someone is "Adding Coffee" to a filter.
  2. Video B: Someone is "Leveling the Surface" of that coffee.

If you just look at the pictures (the visual data), these two steps look almost identical! You see a hand, a coffee filter, and brown powder. To a computer, these two very different actions look like the same thing. It's like trying to tell the difference between a "Stop" sign and a "Do Not Enter" sign just by looking at the color red, without reading the words.

Most old AI models try to solve this by looking harder at the pictures, but they keep getting confused.
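To make "they all look the same" concrete, here is a tiny self-contained sketch. All the numbers are made up for illustration: two nearly identical "visual" feature vectors for the two actions, versus two clearly separated "text" embedding vectors, compared with cosine similarity.

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity: 1.0 means "pointing the same way" (confusable).
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Hypothetical 4-d features, invented for this example only.
# Both frames show a hand, a filter, and brown powder, so the
# "visual" features nearly coincide...
visual_add   = [0.90, 0.80, 0.10, 0.05]  # "Adding Coffee" frame
visual_level = [0.88, 0.82, 0.12, 0.06]  # "Leveling the Surface" frame

# ...while embeddings of the action *names* differ sharply.
text_add   = [0.95, 0.05, 0.10, 0.02]
text_level = [0.08, 0.92, 0.05, 0.30]

print(cosine(visual_add, visual_level))  # close to 1.0: hard to tell apart
print(cosine(text_add, text_level))      # far below 1.0: easy to tell apart
```

The gap between the two similarity scores is exactly the "ID cards are easy to tell apart" intuition from the next section.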

The Solution: "Read the Recipe, Don't Just Look at the Pot"

The authors of LAP realized that while the pictures are confusing, the words describing the actions are very clear. "Adding Coffee" and "Leveling the Surface" sound completely different.

LAP works like a translator:

  1. The Translator (VLM): Instead of just staring at the video, LAP uses a smart "translator" (a Vision-Language Model) to look at the video and say, "Ah, I see a hand putting ground coffee in a filter. That means the action is 'Add Coffee'."
  2. The Dictionary (Text Embeddings): It turns that sentence into a mathematical code (a "text embedding"). Think of this like a unique ID card for that specific action. Because the words are different, the ID cards are very distinct and easy to tell apart.
  3. The Planner (Diffusion Model): Now, instead of trying to guess the steps by looking at blurry pictures, the planner uses these clear ID cards. It asks: "Okay, I have the ID card for 'Start' and the ID card for 'Goal'. What steps fit in between?"
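The planning step above can be sketched schematically. The real planner is a trained diffusion network over text embeddings; the toy below only mimics its shape: the start and goal "ID cards" are clamped as known conditions, the unknown intermediate steps start as pure noise, and a denoiser (here a hypothetical placeholder, not the paper's network) iteratively pulls them toward a coherent path.

```python
import random

def toy_denoiser(seq, start, goal):
    # Hypothetical stand-in for the learned network: nudges each
    # intermediate step halfway toward a smooth path from start to goal.
    T = len(seq)
    out = []
    for t, x in enumerate(seq):
        alpha = t / (T - 1)
        target = [(1 - a) * s + a * g for a, s, g in
                  (((alpha), s, g) for s, g in zip(start, goal))]
        out.append([xi + 0.5 * (ti - xi) for xi, ti in zip(x, target)])
    return out

def plan(start, goal, horizon=4, steps=50, seed=0):
    rng = random.Random(seed)
    # Initialise the unknown intermediate actions with pure noise.
    seq = [[rng.gauss(0, 1) for _ in start] for _ in range(horizon)]
    seq[0], seq[-1] = list(start), list(goal)   # endpoints are known
    for _ in range(steps):
        seq = toy_denoiser(seq, start, goal)
        seq[0], seq[-1] = list(start), list(goal)  # re-clamp the conditions
    return seq

seq = plan([0.0, 0.0], [3.0, 3.0], horizon=4)
print(seq)  # noise has settled into an ordered path from start to goal
```

The re-clamping of the first and last steps each iteration is the key conditioning trick: the planner never invents the start or goal, only the steps in between.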

The "Professor Forcing" Trick

The paper mentions a clever training trick called Professor Forcing.

Imagine a student learning to write a story.

  • Old Way: The teacher gives the student the first sentence. The student writes the second. Then the teacher gives the real third sentence, and the student writes the fourth. The student gets used to relying on the teacher's perfect hints. When the teacher stops helping (during the test), the student panics and writes nonsense.
  • Professor Forcing (LAP's Way): The teacher lets the student write the second sentence, then has the student continue from their own sentence rather than the teacher's, while checking that the solo writing looks just as good as the guided writing. This forces the student to learn how to keep the story going on their own, even after a small mistake.
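The two bullets above correspond to teacher-forced versus free-running decoding. The toy below shows the failure mode: with one bad transition learned by the model, teacher forcing hides the compounding error, while free running exposes it. (The actual Professor Forcing method goes further and trains the network so its free-running behaviour matches its teacher-forced behaviour; that adversarial part is omitted here.)

```python
def next_token(model, prev):
    # Hypothetical one-step decoder: look up the most likely next action.
    return model.get(prev, "<unk>")

def teacher_forced(model, gold):
    # "Old way": every prediction is conditioned on the GOLD previous
    # token, so one mistake never contaminates the following steps.
    return [next_token(model, prev) for prev in gold[:-1]]

def free_running(model, first, length):
    # Free running: each prediction is conditioned on the model's OWN
    # previous output, exactly as at test time.
    out, prev = [], first
    for _ in range(length):
        prev = next_token(model, prev)
        out.append(prev)
    return out

# A model that learned one transition wrong ("fill" -> "pour", skipping "tamp").
model = {"grind": "fill", "fill": "pour", "tamp": "pour"}
gold = ["grind", "fill", "tamp", "pour"]

print(teacher_forced(model, gold))      # ['fill', 'pour', 'pour']: one isolated error
print(free_running(model, "grind", 3))  # ['fill', 'pour', '<unk>']: the error compounds
```

Training only in the teacher-forced mode is what makes the "student panic" at test time; exposing the model to its own outputs during training closes that gap.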

This makes the AI much better at translating videos into words without getting confused.

The Results: Why It Matters

The researchers tested LAP on three instructional-video datasets: CrossTask, COIN, and NIV.

  • The Result: LAP didn't just win; it outperformed all previous models by a significant margin on every benchmark.
  • The Analogy: If previous models were like a tourist trying to navigate a city using only a blurry photo of a street sign, LAP is like a local who can read the sign perfectly and knows exactly which turn to take.

Summary

LAP is a new way for AI to plan tasks. Instead of getting lost in the visual fog where different actions look the same, it converts the visual scene into clear, distinct language. By "thinking in words" rather than just "seeing in pictures," it can figure out complex sequences of actions (like making coffee or assembling furniture) much more accurately than ever before.

In short: When pictures are blurry and confusing, LAP reads the subtitles.