Imagine you are trying to teach a robot how to cook a complex meal, like a lasagna.
The Problem with Current Robots
Right now, most robots learn by watching videos of humans cooking. But there's a catch: to learn effectively, the robot usually needs a "script" telling it exactly what to do at every single second (e.g., "move hand left 2cm," "turn knob 15 degrees"). This is like trying to learn a language by memorizing every single letter of every word. Collecting that scripted data is expensive and tedious, and there is never enough of it.
Some newer methods try to learn just by watching the video without the script. They look at two frames of the video and guess, "What tiny movement happened between these two pictures?" This is great for learning small, quick movements (like "grab the spoon"). But it fails at the big picture. It sees the robot grabbing the spoon, then stirring, then pouring, but it doesn't understand that these three tiny movements are actually one big concept: "Make the sauce." It misses the "story" of the action.
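The "guess the movement between two frames" idea can be sketched in a few lines. This is a toy stand-in, not the paper's actual model: a real system learns the encoder from data, while here a fixed random projection just compresses the pixel difference into a small vector (the "ghost action").

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_latent_action(frame_a, frame_b, proj):
    """Toy 'inverse dynamics' encoder: compress the pixel
    difference between two frames into a small latent vector.
    (Illustrative only; the real encoder is a trained network.)"""
    diff = (frame_b - frame_a).ravel()
    return np.tanh(proj @ diff)  # 8-dim "ghost action"

# Two consecutive 16x16 grayscale frames (random stand-ins).
frame_a = rng.random((16, 16))
frame_b = frame_a + 0.05 * rng.random((16, 16))  # a tiny movement

proj = rng.normal(size=(8, 16 * 16)) / 16.0
latent = encode_latent_action(frame_a, frame_b, proj)
print(latent.shape)  # (8,)
```

Each video step gets one such vector, which is exactly the "small, quick movement" level the paragraph describes — and exactly the level that misses the bigger story.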
The Solution: HiLAM (The "Movie Editor" Robot)
The authors of this paper built a new system, HiLAM, that acts like a smart movie editor. Instead of just looking at individual frames, it watches the whole video and figures out the "scenes."
Here is how HiLAM works, using a simple analogy:
1. The Two-Level Brain (Hierarchical)
Think of HiLAM as having two brains working together:
- The "Micro" Brain (Low-Level): This part is like a fast-forward button. It watches the video and breaks it down into tiny, split-second movements. It's very good at seeing how a hand moves, but it doesn't know why.
- The "Macro" Brain (High-Level): This is the new magic. It takes all those tiny movements and groups them into Skills. It realizes, "Oh, all those tiny hand movements from second 5 to second 15 are actually just one big skill: Pick up the bowl."
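The two brains can be pictured as a pipeline: the micro brain emits one latent action per frame pair, and the macro brain summarizes a whole run of them into a single skill vector. A minimal sketch, using a simple mean as a hypothetical stand-in for the learned summarizer:

```python
import numpy as np

def skill_from_chunk(latent_actions):
    """Hypothetical 'macro brain': summarize a run of per-frame
    latent actions into one skill vector (here, just their mean;
    the real model learns this summary)."""
    return np.mean(latent_actions, axis=0)

# Ten tiny hand movements that are all part of "pick up the bowl".
micro = np.array([[0.9, 0.1]] * 10)
skill = skill_from_chunk(micro)
print(skill.shape)  # (2,)
```

The point is the shape of the computation: many small vectors in, one skill vector out, so downstream reasoning happens at the level of "pick up the bowl" rather than individual hand twitches.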
2. The "Dynamic Chunking" (The Smart Cut)
Most old systems tried to force every action to be the same length, like cutting a movie into 5-second clips. But real life isn't like that. Sometimes "picking up a bowl" takes 2 seconds; sometimes it takes 10 seconds if the bowl is slippery.
HiLAM uses a special mechanism called Dynamic Chunking. Imagine a movie editor who doesn't use a timer, but instead looks at the action.
- If the robot is just sitting still, the editor keeps the clip going.
- The moment the robot starts a new, distinct action (like moving the bowl to a new spot), the editor hits "CUT" and starts a new scene.
- This means the robot learns that "Skills" can be short or long, depending on what's actually happening.
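The editor's "CUT" decision can be mimicked with a simple rule: keep the clip going while consecutive latent actions look alike, and start a new chunk when they change sharply. This cosine-similarity threshold is an illustrative assumption, not the paper's learned chunking mechanism — but it shows how chunks end up variable-length:

```python
import numpy as np

def chunk_boundaries(latents, threshold=0.5):
    """Toy 'dynamic chunking': start a new chunk whenever the
    latent action changes direction sharply (low cosine
    similarity to the previous step)."""
    boundaries = [0]
    for t in range(1, len(latents)):
        a, b = latents[t - 1], latents[t]
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        if cos < threshold:          # distinct new action -> "CUT"
            boundaries.append(t)
    return boundaries

# Six steps of "reach" followed by four steps of "lift":
# two skills of different lengths, found without any timer.
reach = np.tile([1.0, 0.0], (6, 1))
lift = np.tile([0.0, 1.0], (4, 1))
latents = np.vstack([reach, lift])
print(chunk_boundaries(latents))  # [0, 6]
```

Note that nothing here fixes the chunk length in advance: a slippery bowl that takes ten steps simply produces ten similar latents and one long chunk.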
3. Learning from "Silent" Movies
The best part? HiLAM doesn't need a script. It learns from "actionless" videos—just raw footage of humans or robots doing things.
- Step 1: It watches a video and invents its own "ghost actions" (latent actions) to explain the movement between frames.
- Step 2: It uses its "Macro Brain" to group those ghost actions into meaningful skills (like "Pouring," "Stirring," "Serving").
- Step 3: It practices predicting what happens next. If it sees the robot start to "Pour," it predicts the future frames of the liquid coming out. If it can predict the future, it truly understands the skill.
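Step 3's "predict the future to prove you understand" idea looks like this in miniature. Everything here is a toy assumption — a random linear map instead of a trained network, and made-up one-hot skill codes — but the training signal is the same: the prediction error between the guessed and the real next frame.

```python
import numpy as np

rng = np.random.default_rng(1)

def predict_next(frame, skill_code, W):
    """Toy forward model: predict the next frame from the current
    frame plus a one-hot skill code (e.g. 'pour' vs 'stir').
    A real model would be a trained network, not a random W."""
    x = np.concatenate([frame.ravel(), skill_code])
    return (W @ x).reshape(frame.shape)

frame = rng.random((4, 4))
pour = np.array([1.0, 0.0])              # hypothetical skill code
W = rng.normal(size=(16, 16 + 2)) * 0.1

pred = predict_next(frame, pour, W)
# Training would shrink the gap between pred and the real next
# frame; a low error means the skill code "explains" the motion.
error = np.mean((pred - frame) ** 2)
print(pred.shape)  # (4, 4)
```

If swapping "pour" for "stir" makes the prediction worse, the skill codes are carrying real information — that is the sense in which good prediction demonstrates understanding.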
4. The Result: A Smarter, Faster Learner
When the researchers tested this on a robot learning to do tasks (like moving objects around a table), the results were impressive:
- Less Data Needed: Because HiLAM understood the "big picture" skills, it needed far fewer examples to learn a new task. It was like learning a recipe by understanding the steps rather than memorizing every ingredient measurement.
- Better at Long Tasks: It was much better at complex, multi-step tasks (like "clean the whole kitchen") because it could break them down into logical chunks (Wash dishes -> Dry dishes -> Put away).
- Interpretability: You can actually see what the robot thinks the "skills" are. If you ask it to "Pick up the bowl," it knows exactly which part of the video corresponds to that skill.
The Bottom Line
HiLAM is like teaching a robot to watch a movie and understand the plot, rather than just memorizing the frame-by-frame animation. By figuring out how to group small movements into big, meaningful skills on its own, it can learn new tasks much faster and more efficiently than before, even without a human teacher giving it a step-by-step manual.