LAP: A Language-Aware Planning Model For Procedure Planning In Instructional Videos

This paper introduces LAP, a novel procedure planning model that uses a fine-tuned Vision-Language Model to convert visual observations into distinctive text embeddings for a diffusion-based planner. By resolving visual ambiguities through language, it achieves state-of-the-art performance on multiple benchmarks.

Lei Shi, Victor Aregbede, Andreas Persson, Martin Längkvist, Amy Loutfi, Stephanie Lowry

Published Wed, 11 Ma

Imagine you are trying to teach a robot how to make a perfect cup of coffee. You show the robot a video of someone doing it. The robot needs to figure out the steps: grind beans, fill the filter, tamp the coffee, pour water.

This is the challenge of Procedure Planning: looking at a start point (an empty cup) and a goal point (a full cup) and figuring out the invisible steps in between.

The paper introduces a new AI model called LAP (Language-Aware Planning) that solves a major problem robots face: Visual Confusion.

The Problem: "They All Look the Same"

Imagine you are looking at two different cooking videos.

  1. Video A: Someone is "Adding Coffee" to a filter.
  2. Video B: Someone is "Leveling the Surface" of that coffee.

If you just look at the pictures (the visual data), these two steps look almost identical! You see a hand, a coffee filter, and brown powder. To a computer, these two very different actions look like the same thing. It's like trying to tell the difference between a "Stop" sign and a "Do Not Enter" sign just by looking at the color red, without reading the words.

Most old AI models try to solve this by looking harder at the pictures, but they keep getting confused.
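To make "they all look the same" concrete, here is a tiny self-contained sketch. All the numbers are made up for illustration: two nearly identical "visual" feature vectors for the two actions, versus two clearly separated "text" embedding vectors, compared with cosine similarity.

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity: 1.0 means "pointing the same way" (confusable).
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Hypothetical 4-d features, invented for this example only.
# Both frames show a hand, a filter, and brown powder, so the
# "visual" features nearly coincide...
visual_add   = [0.90, 0.80, 0.10, 0.05]  # "Adding Coffee" frame
visual_level = [0.88, 0.82, 0.12, 0.06]  # "Leveling the Surface" frame

# ...while embeddings of the action *names* differ sharply.
text_add   = [0.95, 0.05, 0.10, 0.02]
text_level = [0.08, 0.92, 0.05, 0.30]

print(cosine(visual_add, visual_level))  # close to 1.0: hard to tell apart
print(cosine(text_add, text_level))      # far below 1.0: easy to tell apart
```

The gap between the two similarity scores is exactly the "ID cards are easy to tell apart" intuition from the next section.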

The Solution: "Read the Recipe, Don't Just Look at the Pot"

The authors of LAP realized that while the pictures are confusing, the words describing the actions are very clear. "Adding Coffee" and "Leveling the Surface" sound completely different.

LAP works like a translator:

  1. The Translator (VLM): Instead of just staring at the video, LAP uses a smart "translator" (a Vision-Language Model) to look at the video and say, "Ah, I see a hand putting ground coffee in a filter. That means the action is 'Add Coffee'."
  2. The Dictionary (Text Embeddings): It turns that sentence into a mathematical code (a "text embedding"). Think of this like a unique ID card for that specific action. Because the words are different, the ID cards are very distinct and easy to tell apart.
  3. The Planner (Diffusion Model): Now, instead of trying to guess the steps by looking at blurry pictures, the planner uses these clear ID cards. It asks: "Okay, I have the ID card for 'Start' and the ID card for 'Goal'. What steps fit in between?"
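The planning step above can be sketched schematically. The real planner is a trained diffusion network over text embeddings; the toy below only mimics its shape: the start and goal "ID cards" are clamped as known conditions, the unknown intermediate steps start as pure noise, and a denoiser (here a hypothetical placeholder, not the paper's network) iteratively pulls them toward a coherent path.

```python
import random

def toy_denoiser(seq, start, goal):
    # Hypothetical stand-in for the learned network: nudges each
    # intermediate step halfway toward a smooth path from start to goal.
    T = len(seq)
    out = []
    for t, x in enumerate(seq):
        alpha = t / (T - 1)
        target = [(1 - a) * s + a * g for a, s, g in
                  (((alpha), s, g) for s, g in zip(start, goal))]
        out.append([xi + 0.5 * (ti - xi) for xi, ti in zip(x, target)])
    return out

def plan(start, goal, horizon=4, steps=50, seed=0):
    rng = random.Random(seed)
    # Initialise the unknown intermediate actions with pure noise.
    seq = [[rng.gauss(0, 1) for _ in start] for _ in range(horizon)]
    seq[0], seq[-1] = list(start), list(goal)   # endpoints are known
    for _ in range(steps):
        seq = toy_denoiser(seq, start, goal)
        seq[0], seq[-1] = list(start), list(goal)  # re-clamp the conditions
    return seq

seq = plan([0.0, 0.0], [3.0, 3.0], horizon=4)
print(seq)  # noise has settled into an ordered path from start to goal
```

The re-clamping of the first and last steps each iteration is the key conditioning trick: the planner never invents the start or goal, only the steps in between.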

The "Professor Forcing" Trick

The paper mentions a clever training trick called Professor Forcing.

Imagine a student learning to write a story.

  • Old Way: The teacher gives the student the first sentence. The student writes the second. Then the teacher gives the real third sentence, and the student writes the fourth. The student gets used to relying on the teacher's perfect hints. When the teacher stops helping (during the test), the student panics and writes nonsense.
  • Professor Forcing (LAP's Way): The teacher lets the student write the second sentence, then has the student continue from their own sentence rather than the teacher's, while checking that the solo writing looks just as good as the guided writing. This forces the student to learn how to keep the story going on their own, even after a small mistake.
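The two bullets above correspond to teacher-forced versus free-running decoding. The toy below shows the failure mode: with one bad transition learned by the model, teacher forcing hides the compounding error, while free running exposes it. (The actual Professor Forcing method goes further and trains the network so its free-running behaviour matches its teacher-forced behaviour; that adversarial part is omitted here.)

```python
def next_token(model, prev):
    # Hypothetical one-step decoder: look up the most likely next action.
    return model.get(prev, "<unk>")

def teacher_forced(model, gold):
    # "Old way": every prediction is conditioned on the GOLD previous
    # token, so one mistake never contaminates the following steps.
    return [next_token(model, prev) for prev in gold[:-1]]

def free_running(model, first, length):
    # Free running: each prediction is conditioned on the model's OWN
    # previous output, exactly as at test time.
    out, prev = [], first
    for _ in range(length):
        prev = next_token(model, prev)
        out.append(prev)
    return out

# A model that learned one transition wrong ("fill" -> "pour", skipping "tamp").
model = {"grind": "fill", "fill": "pour", "tamp": "pour"}
gold = ["grind", "fill", "tamp", "pour"]

print(teacher_forced(model, gold))      # ['fill', 'pour', 'pour']: one isolated error
print(free_running(model, "grind", 3))  # ['fill', 'pour', '<unk>']: the error compounds
```

Training only in the teacher-forced mode is what makes the "student panic" at test time; exposing the model to its own outputs during training closes that gap.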

This makes the AI much better at translating videos into words without getting confused.

The Results: Why It Matters

The researchers tested LAP on three instructional-video datasets: CrossTask, COIN, and NIV.

  • The Result: LAP didn't just win; it outperformed all previous models by a significant margin on every benchmark.
  • The Analogy: If previous models were like a tourist trying to navigate a city using only a blurry photo of a street sign, LAP is like a local who can read the sign perfectly and knows exactly which turn to take.

Summary

LAP is a new way for AI to plan tasks. Instead of getting lost in the visual fog where different actions look the same, it converts the visual scene into clear, distinct language. By "thinking in words" rather than just "seeing in pictures," it can figure out complex sequences of actions (like making coffee or assembling furniture) much more accurately than ever before.

In short: When pictures are blurry and confusing, LAP reads the subtitles.