HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents

This paper introduces HiMAC, a hierarchical reinforcement learning framework that splits long-horizon decision-making into macro-level planning and micro-level execution. It trains both levels with critic-free optimization and an iterative co-evolution strategy, and reports state-of-the-art performance and improved sample efficiency across diverse text-based and visual environments.

Hongbo Jin, Rongpeng Zhu, Jiayu Ding, Wenhao Zhang, Ge Li

Published 2026-03-03

Imagine you are trying to teach a very smart, but slightly scattered, robot assistant how to clean your entire house, buy groceries online, and solve a complex puzzle, all in one go.

If you just tell the robot, "Go clean the house," and let it figure out every single step (pick up the sock, open the closet, find the dustpan) in one long, continuous stream of thoughts, it will likely get overwhelmed. It might pick up a sock, forget why it picked it up, wander into the kitchen, and then forget the whole mission. This is what current AI agents struggle with: they get lost in the details and lose track of the big picture.

This paper introduces HiMAC, a new way to train these AI agents. Think of HiMAC not as a single worker, but as a perfectly organized construction team with two distinct roles: The Architect and The Builder.

The Problem: The "One-Brain" Approach

Current AI agents try to do everything with one brain. They think and act at the same time.

  • The Analogy: Imagine a chef trying to cook a 10-course banquet while simultaneously chopping vegetables, seasoning the soup, and washing dishes, all in one continuous motion without stopping to plan. One small mistake (like burning the garlic) ruins the whole meal, and the chef has no way to recover because they never stopped to look at the recipe.
  • The Result: The AI gets confused, makes small errors that snowball into big failures, and gives up.

The Solution: HiMAC (The Architect & The Builder)

HiMAC splits the job into two clear layers:

  1. The Macro-Policy (The Architect):

    • Role: This is the strategic planner. It doesn't touch the tools. Instead, it looks at the big goal ("Clean the house") and writes a Blueprint.
    • The Blueprint: This isn't just a to-do list; it's a structured map of "Milestones." For example: Step 1: Clear the living room. Step 2: Dust the shelves. Step 3: Vacuum the floor.
    • Why it helps: The Architect only worries about the order of things, not the messy details of how to do them. If the Architect makes a mistake, it's just a wrong plan, not a broken action.
  2. The Micro-Policy (The Builder):

    • Role: This is the focused worker. It receives one milestone from the Blueprint (e.g., "Clear the living room") and focuses only on that.
    • The Focus: It picks up socks, moves chairs, and throws away trash until that specific task is done. Once the task is finished, it signals "Done!" and waits for the next instruction.
    • Why it helps: The Builder doesn't need to worry about the whole house. It only cares about the current room. If it drops a sock, it can just pick it up again without forgetting the whole mission.
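To make the division of labor concrete, here is a minimal sketch of the Architect/Builder control loop. Everything in it is an assumption for illustration: the `run_episode` function, the `DONE` signal, and the two toy policies are hypothetical stand-ins, not the paper's implementation. In a real system both policies would be LLM calls.

```python
from typing import Callable, List, Tuple

DONE = "<done>"  # hypothetical signal the Builder emits when a milestone is finished

History = List[Tuple[str, str]]          # (milestone, action) pairs
MacroPolicy = Callable[[str], List[str]]  # goal -> blueprint of milestones
MicroPolicy = Callable[[str, History], str]  # (milestone, history) -> next action


def run_episode(
    goal: str,
    macro_policy: MacroPolicy,
    micro_policy: MicroPolicy,
    max_steps_per_milestone: int = 5,
) -> Tuple[List[str], History]:
    """Plan once at the macro level, then execute one milestone at a time."""
    blueprint = macro_policy(goal)  # the Architect writes the whole plan up front
    history: History = []
    for milestone in blueprint:
        # The Builder only ever sees the current milestone, never the full goal.
        for _ in range(max_steps_per_milestone):
            action = micro_policy(milestone, history)
            if action == DONE:
                break
            history.append((milestone, action))
    return blueprint, history


def toy_macro(goal: str) -> List[str]:
    # Stand-in for an LLM planner: a fixed blueprint for the house example.
    return ["clear the living room", "dust the shelves", "vacuum the floor"]


def toy_micro(milestone: str, history: History) -> str:
    # Stand-in for an LLM executor: do two steps of work, then report done.
    steps_taken = sum(1 for m, _ in history if m == milestone)
    return f"work on: {milestone}" if steps_taken < 2 else DONE


blueprint, history = run_episode("clean the house", toy_macro, toy_micro)
```

The step cap per milestone is a design choice of this sketch: it keeps a stuck Builder from stalling the whole episode.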

How They Learn: The "Coach and Player" Drill

Training a team like this is tricky. If the Architect keeps changing the plan while the Builder is trying to learn, the Builder is chasing a moving target. And while the Builder is still improving, the Architect can't tell whether a plan failed because the plan was bad or because the Builder executed it poorly.

To fix this, the authors created a special training method called Iterative Co-Evolution.

  • The Analogy: Imagine a coach (Architect) and a player (Builder) practicing for a game.
    • Phase 1 (Coach's Turn): The coach draws up 5 different game plans. The player runs through them without changing their skills (just following orders). The coach sees which plan worked best and learns to write better plans.
    • Phase 2 (Player's Turn): The coach picks the best plan from Phase 1 and says, "Okay, we are sticking to this plan. Now, you practice executing it perfectly." The player gets better at following instructions.
    • Repeat: They switch back and forth. The coach gets better at planning, and the player gets better at executing, but they never confuse the two roles.
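The alternating drill above can be sketched as a toy loop in which only one side is updated per phase. This is a deliberately simplified, deterministic illustration rather than the paper's training procedure: the plan pool, `PLAN_QUALITY`, the per-plan `skill` numbers, and the update rules are all made-up assumptions.

```python
from typing import Dict, Tuple

# Hypothetical toy world: five candidate plans with a hidden intrinsic quality.
PLAN_QUALITY: Dict[str, float] = {"p0": 0.1, "p1": 0.3, "p2": 0.9, "p3": 0.5, "p4": 0.2}


def rollout_reward(plan: str, skill: Dict[str, float]) -> float:
    # Reward depends on both the plan's quality and how well it is executed.
    return PLAN_QUALITY[plan] * skill[plan]


def macro_phase(scores: Dict[str, float], skill: Dict[str, float], lr: float = 1.0) -> Dict[str, float]:
    """Coach's turn: the Builder (skill) is frozen, only plan scores change."""
    rewards = {p: rollout_reward(p, skill) for p in scores}
    baseline = sum(rewards.values()) / len(rewards)  # group mean, no learned critic
    return {p: scores[p] + lr * (rewards[p] - baseline) for p in scores}


def micro_phase(scores: Dict[str, float], skill: Dict[str, float], lr: float = 0.5) -> Dict[str, float]:
    """Player's turn: the Architect (scores) is frozen, only execution skill changes."""
    best = max(scores, key=scores.get)  # stick with the currently best plan
    new_skill = dict(skill)
    new_skill[best] += lr * (1.0 - new_skill[best])  # practice closes half the gap to 1.0
    return new_skill


def co_evolve(rounds: int = 3) -> Tuple[Dict[str, float], Dict[str, float]]:
    scores = {p: 0.0 for p in PLAN_QUALITY}
    skill = {p: 0.5 for p in PLAN_QUALITY}
    for _ in range(rounds):
        scores = macro_phase(scores, skill)  # update planner, executor held fixed
        skill = micro_phase(scores, skill)   # update executor, planner held fixed
    return scores, skill


scores, skill = co_evolve(rounds=3)
```

Because each phase returns a new dictionary for one side and leaves the other untouched, neither learner ever trains against a partner that is shifting mid-update.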

The "No-Critic" Secret Sauce

Usually, to train AI, you need a "Critic" (a judge) that scores every move. But in complex tasks, this judge is often wrong or hard to train.

  • HiMAC's Trick: Instead of a judge, they use Group Comparison.
  • The Analogy: Imagine a talent show. Instead of one judge giving a score out of 10, you have 10 contestants perform. You don't need to know exactly how "good" a song is; you just know that Song A was clearly better than Song B and Song C.
  • HiMAC does this by generating many plans (for the Architect) or many attempts (for the Builder) and simply asking: "Which one worked better than the others?" This is much easier and faster to learn.
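As a rough illustration of group comparison, the sketch below scores each attempt relative to the other attempts in its group instead of asking a learned critic. Normalizing by the group mean and standard deviation follows the common group-relative recipe; whether HiMAC normalizes in exactly this way is an assumption, and the function name `group_relative_advantages` is hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Turn raw rewards for one group of attempts into relative advantages.

    Attempts above the group average get a positive advantage, attempts
    below it get a negative one. No value network (critic) is involved.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the group
    # eps avoids division by zero when every attempt earned the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four attempts at the same task: two succeeded (1.0), two failed (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every attempt in a group earns the same reward, all advantages come out as zero, so such a group provides no learning signal in this scheme.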

The Results: Why It Matters

The paper tested this on three tough challenges:

  1. ALFWorld: A virtual house where you have to find and move objects.
  2. WebShop: Buying specific items on a noisy, confusing website.
  3. Sokoban: A visual puzzle where you push boxes into targets.

The Outcome:
HiMAC didn't just work; it reached state-of-the-art results across these benchmarks.

  • On WebShop (buying things online), it beat the next best method by 16%.
  • It learned faster, needing fewer tries to get good.
  • It even developed self-checking habits on its own (like the Architect realizing, "Wait, I should double-check I found the candle before moving on"), something other AIs didn't do.

The Big Takeaway

The most important lesson from this paper is that structure beats size.
You don't need a giant, super-expensive brain to solve hard problems. You just need to organize the brain correctly. By splitting the job into "Planning" and "Doing," and training the two in alternation so that each improves alongside the other, even a smaller AI can become a master at long, complex tasks.

In short: HiMAC teaches AI to stop trying to do everything at once. It teaches them to plan first, then act, and to practice those two skills separately until they become a perfect team.
