HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents

Imagine you are trying to teach a very smart, but slightly scattered, robot assistant how to clean your entire house, buy groceries online, and solve a complex puzzle, all in one go.

If you just tell the robot, "Go clean the house," and let it figure out every single step (pick up the sock, open the closet, find the dustpan) in one long, continuous stream of thoughts, it will likely get overwhelmed. It might pick up a sock, forget why it picked it up, wander into the kitchen, and then forget the whole mission. This is what current AI agents struggle with: they get lost in the details and lose track of the big picture.

This paper introduces HiMAC, a new way to train these AI agents. Think of HiMAC not as a single worker, but as a perfectly organized construction team with two distinct roles: The Architect and The Builder.

The Problem: The "One-Brain" Approach

Current AI agents try to do everything with one brain. They think and act at the same time.

The Analogy: Imagine a chef trying to cook a 10-course banquet while simultaneously chopping vegetables, seasoning the soup, and washing dishes, all in one continuous motion without stopping to plan. One small mistake (like burning the garlic) ruins the whole meal, and the chef has no way to recover because they never stopped to look at the recipe.
The Result: The AI gets confused, makes small errors that snowball into big failures, and gives up.

The Solution: HiMAC (The Architect & The Builder)

HiMAC splits the job into two clear layers:

The Macro-Policy (The Architect):
- Role: This is the strategic planner. It doesn't touch the tools. Instead, it looks at the big goal ("Clean the house") and writes a Blueprint.
- The Blueprint: This isn't just a to-do list; it's a structured map of "Milestones." For example: Step 1: Clear the living room. Step 2: Dust the shelves. Step 3: Vacuum the floor.
- Why it helps: The Architect only worries about the order of things, not the messy details of how to do them. If the Architect makes a mistake, it's just a wrong plan, not a broken action.
The Micro-Policy (The Builder):
- Role: This is the focused worker. It receives one milestone from the Blueprint (e.g., "Clear the living room") and focuses only on that.
- The Focus: It picks up socks, moves chairs, and throws away trash until that specific task is done. Once the task is finished, it signals "Done!" and waits for the next instruction.
- Why it helps: The Builder doesn't need to worry about the whole house. It only cares about the current room. If it drops a sock, it can just pick it up again without forgetting the whole mission.

How They Learn: The "Coach and Player" Drill

Training a team like this is tricky. If the Architect changes the plan while the Builder is trying to learn, the Builder gets confused. If the Builder gets better, the Architect might think the old plans were bad when they were actually fine.

To fix this, the authors created a special training method called Iterative Co-Evolution.

The Analogy: Imagine a coach (Architect) and a player (Builder) practicing for a game.
- Phase 1 (Coach's Turn): The coach draws up 5 different game plans. The player runs through them without changing their skills (just following orders). The coach sees which plan worked best and learns to write better plans.
- Phase 2 (Player's Turn): The coach picks the best plan from Phase 1 and says, "Okay, we are sticking to this plan. Now, you practice executing it perfectly." The player gets better at following instructions.
- Repeat: They switch back and forth. The coach gets better at planning, and the player gets better at executing, but they never confuse the two roles.

The "No-Critic" Secret Sauce

Usually, to train AI, you need a "Critic" (a judge) that scores every move. But in complex tasks, this judge is often wrong or hard to train.

HiMAC's Trick: Instead of a judge, they use Group Comparison.
The Analogy: Imagine a talent show. Instead of one judge giving a score out of 10, you have 10 contestants perform. You don't need to know exactly how "good" a song is; you just know that Song A was clearly better than Song B and Song C.
HiMAC does this by generating many plans (for the Architect) or many attempts (for the Builder) and simply asking: "Which one worked better than the others?" This is much easier and faster to learn.

The Results: Why It Matters

The paper tested this on three tough challenges:

ALFWorld: A virtual house where you have to find and move objects.
WebShop: Buying specific items on a noisy, confusing website.
Sokoban: A visual puzzle where you push boxes into targets.

The Outcome:
HiMAC beat every baseline it was compared against on all three benchmarks.

On WebShop (buying things online), its success rate was 16 percentage points higher than the next best method (83.4% vs. 67.4%).
It learned faster, needing fewer tries to get good.
It even developed self-checking habits on its own (like the Architect realizing, "Wait, I should double-check I found the candle before moving on"), something other AIs didn't do.

The Big Takeaway

The most important lesson from this paper is that structure beats size.
You don't need a giant, super-expensive brain to solve hard problems. You just need to organize the brain correctly. By splitting the job into "Planning" and "Doing," and training them separately but together, even a smaller AI can become a master at long, complex tasks.

In short: HiMAC teaches AI to stop trying to do everything at once. It teaches them to plan first, then act, and to practice those two skills separately until they become a perfect team.

1. Problem Statement

Large Language Model (LLM) agents have shown promise in short-horizon tasks but struggle significantly with long-horizon tasks requiring structured planning and reliable execution. Current state-of-the-art approaches rely on flat autoregressive policies, where high-level reasoning (planning) and low-level actions (execution) are generated within a single token sequence. This architecture suffers from three critical failure modes:

Exponential Exploration Complexity: Navigating a vast combinatorial search space using myopic next-token prediction.
Delayed Credit Assignment: Difficulty in attributing success or failure to specific decisions over long trajectories.
Semantic Drift (Context Drift): Minor errors in early steps cascade into irreversible failure states, causing the agent to lose track of the global goal.

Existing Reinforcement Learning (RL) methods (e.g., PPO, GRPO) attempt to optimize these flat policies but often fail to decouple global intent from local control, leading to sample inefficiency and divergence in high-dimensional semantic spaces.

2. Methodology: HiMAC Framework

The authors propose HiMAC (Hierarchical Macro-Micro Agentic Control), a framework that explicitly decomposes decision-making into two distinct levels: a Macro-Policy (Planner) and a Micro-Policy (Executor).

A. Hierarchical Architecture

The task is formulated as a Goal-Conditioned Partially Observable Markov Decision Process (POMDP) where the trajectory generation is factorized:

Macro-Level (Planning): Given a natural language instruction $x$, the Macro-Policy generates a Structured Blueprint $z$. This blueprint is a sequence of natural language sub-goals $\{g_1, \dots, g_K\}$ that decomposes the long-horizon objective into tractable milestones.
Micro-Level (Execution): Conditioned on a specific blueprint $z^*$, the Micro-Policy generates atomic actions $a_t$ to achieve the current sub-goal.
- Sub-goal Transition: The agent autonomously transitions to the next sub-goal only when the Micro-Policy generates a special <sub_done> token, acting as a temporal attention mask to prevent semantic drift.
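
The control flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `run_blueprint`, the callables `micro_policy` and `env_step`, and the `max_steps` cap are all hypothetical. The only detail taken from the summary is that the executor works on one sub-goal at a time and moves on when it emits `<sub_done>`.

```python
from typing import Callable, List, Tuple

SUB_DONE = "<sub_done>"  # special token named in the text


def run_blueprint(
    blueprint: List[str],
    micro_policy: Callable[[str, str], str],  # hypothetical: (sub_goal, observation) -> action
    env_step: Callable[[str], str],           # hypothetical: action -> next observation
    first_observation: str,
    max_steps: int = 50,                      # assumed safety cap, not from the paper
) -> Tuple[List[Tuple[str, str]], bool]:
    """Execute sub-goals in order, advancing only when the micro-policy emits SUB_DONE."""
    trace: List[Tuple[str, str]] = []
    observation = first_observation
    goal_index = 0
    for _ in range(max_steps):
        if goal_index == len(blueprint):
            return trace, True  # every milestone was signalled as done
        sub_goal = blueprint[goal_index]
        # The executor sees only the current sub-goal, not the whole blueprint.
        action = micro_policy(sub_goal, observation)
        if action == SUB_DONE:
            goal_index += 1  # move on to the next milestone
            continue
        trace.append((sub_goal, action))
        observation = env_step(action)
    return trace, goal_index == len(blueprint)
```

Passing only the current sub-goal to `micro_policy` is one simple way to mimic the "focus on one milestone" behavior; the actual conditioning used in the paper may differ.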

B. Critic-Free Hierarchical Policy Optimization

To train this hierarchy without unstable parametric value networks (critics), HiMAC extends Group Relative Policy Optimization (GRPO) to a bi-level structure:

Macro-Objective: Samples a group of candidate blueprints. Each blueprint is evaluated by rolling out the current Micro-Policy (in inference mode) to get a return $R(z)$. The advantage is calculated relative to the group mean, isolating the quality of the plan from execution noise.
Micro-Objective: Samples a group of execution trajectories conditioned on a fixed high-confidence blueprint. The advantage is calculated relative to the group mean, isolating the quality of execution from planning variance.
Key Insight: By using group-relative advantages at each level, the method achieves precise credit assignment without needing a separate critic network.
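
To make the "relative to the group mean" idea concrete, here is a minimal sketch of a group-relative advantage, $A_i = R_i - \frac{1}{G}\sum_{j=1}^{G} R_j$, where $G$ (the group size) is notation introduced here. The function name and the optional standard-deviation scaling are assumptions: standard GRPO divides by the group's standard deviation, but the summary does not say whether HiMAC does. The same function would apply at either level, with returns from a group of blueprints or from a group of execution trajectories.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(
    returns: List[float],
    normalize: bool = False,  # assumed option: scale by group std, as standard GRPO does
    eps: float = 1e-8,
) -> List[float]:
    """Score each sample against its own group instead of a learned critic."""
    baseline = mean(returns)  # the group mean replaces the value network
    centered = [r - baseline for r in returns]
    if not normalize:
        return centered
    scale = pstdev(returns) + eps  # eps avoids division by zero for uniform groups
    return [c / scale for c in centered]


# Four rollouts in one group: two succeeded (return 1.0), two failed (return 0.0).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # [0.5, -0.5, -0.5, 0.5]
```

Samples that beat their group's average get a positive advantage and the rest get a negative one, so no separate judge has to estimate how good a state is in absolute terms.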

C. Iterative Co-Evolution Training

Simultaneously optimizing both levels creates a non-stationary problem (the planner chases a shifting executor, and vice versa). HiMAC resolves this via an Iterative Co-Evolution strategy with alternating phases:

Phase A (Macro-Exploration): The Micro-Policy is frozen. The Macro-Policy is updated to generate better blueprints based on the deterministic feedback from the frozen executor.
Phase B (Micro-Adaptation): The Macro-Policy is frozen (specifically, the best blueprint $z^*$ from Phase A is selected and fixed). The Micro-Policy is updated to execute this specific blueprint more effectively.

This alternation converts the unstable bi-level optimization into a sequence of stationary single-level updates, creating a natural curriculum where the planner proposes increasingly complex strategies as the executor improves.
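
The alternation can be sketched as a simple loop. Everything named below (`co_evolve`, `sample_blueprints`, `score_blueprint`, `update_macro`, `update_micro`) is a hypothetical placeholder for the real sampling, rollout, and gradient-update code, and picking the highest-scoring candidate is just one reading of "the best blueprint from Phase A".

```python
from typing import Callable, List


def co_evolve(
    num_rounds: int,
    sample_blueprints: Callable[[], List[str]],             # macro-policy proposes plans
    score_blueprint: Callable[[str], float],                # return from the frozen executor
    update_macro: Callable[[List[str], List[float]], None],
    update_micro: Callable[[str], None],
) -> List[str]:
    """Alternate planner and executor updates so only one level changes at a time."""
    chosen: List[str] = []
    for _ in range(num_rounds):
        # Phase A: the executor is frozen; only the planner learns.
        candidates = sample_blueprints()
        scores = [score_blueprint(z) for z in candidates]
        update_macro(candidates, scores)

        # Phase B: the plan is fixed; only the executor learns.
        best = candidates[scores.index(max(scores))]  # assumed selection rule
        update_micro(best)
        chosen.append(best)
    return chosen
```

Because each phase holds the other level fixed, every update sees a stationary learning problem, which is the stabilizing effect the paragraph above describes.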

3. Key Contributions

HiMAC Framework: A novel hierarchical architecture that decouples long-horizon reasoning into blueprint generation and goal-conditioned execution, fundamentally reducing exploration complexity and error propagation.
Critic-Free Hierarchical Optimization: An extension of GRPO to bi-level structures using level-specific comparison groups, enabling precise credit assignment without parametric value networks.
Iterative Co-Evolution Strategy: A training paradigm that alternates between planner exploration and executor adaptation to stabilize non-stationary dynamics and induce a self-organizing curriculum.
Emergent Self-Verification: The framework enables the Macro-Policy to spontaneously develop self-verification behaviors (e.g., checking inventory) absent in flat baselines.

4. Experimental Results

The authors evaluated HiMAC on three challenging benchmarks: ALFWorld (embodied reasoning), WebShop (web navigation), and Sokoban (visual spatial planning).

Performance: HiMAC achieved State-of-the-Art (SOTA) results across all benchmarks.
- WebShop: Achieved an 83.4% success rate with a 1.5B model, surpassing the strongest RL baseline (GiGPO, 67.4%) by 16.0 percentage points.
- ALFWorld: Achieved 89.9% success rate (1.5B) and 92.1% (7B), outperforming GiGPO by 3.8 and 1.3 percentage points respectively.
- Sokoban: Achieved 87.5% success rate with a 7B VLM, outperforming GiGPO by 4.7 percentage points.
Sample Efficiency: HiMAC converged to target success thresholds significantly faster than flat RL baselines (e.g., reaching 65% success on WebShop in ~220 iterations vs. ~380 for GRPO).
Ablation Studies:
- Removing the hierarchy (Flat GRPO) caused a large performance drop (e.g., -18% on WebShop), which supports the need for structural decoupling.
- Removing the Iterative Co-Evolution (training both simultaneously) degraded performance by ~6-9%, confirming the strategy's role in stabilizing non-stationarity.
- Using random blueprints instead of high-confidence ones reduced performance, highlighting the importance of training the executor on feasible plans.

5. Significance

The paper's results suggest that structured hierarchy is a more decisive factor for robust long-horizon agentic intelligence than simply increasing model scale.

Scalability: HiMAC scales gracefully with model size (1.5B to 7B) and generalizes from text-based to visually-grounded environments.
Efficiency: It offers a path to high-performance agents using smaller, open-source models by leveraging architectural inductive biases rather than massive parameter counts.
Paradigm Shift: It challenges the prevailing "flat" autoregressive paradigm, suggesting that future LLM agents require explicit structural decomposition to handle complex, real-world tasks effectively.
