Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning

This paper introduces HILA, a Human-In-the-Loop Multi-Agent Collaboration framework that employs Dual-Loop Policy Optimization to train agents with metacognitive policies for dynamically deferring to human experts and continuously improving their reasoning capabilities, thereby overcoming the static knowledge limitations of purely autonomous systems.

Wei Yang, Defu Cao, Jiacheng Pang, Muyan Weng, Yan Liu

Published 2026-03-10

Imagine you have a team of brilliant, hard-working students (the AI Agents) trying to solve a series of incredibly difficult math and logic puzzles. They are smart, but they have a major limitation: they only know what they learned in school (their training data). If a puzzle requires a brand-new trick or a piece of information they've never seen, the whole team might get stuck, argue in circles, and eventually give up.

This paper introduces a new way to run this team called HILA (Human-In-the-Loop Multi-Agent Collaboration). Think of HILA not just as adding a teacher to the room, but as teaching the students how to think about their own thinking.

Here is the breakdown using simple analogies:

1. The Problem: The "Closed-World" Trap

Currently, most AI teams operate like a closed room. They can talk to each other, debate, and combine their existing knowledge to solve problems. But if the problem requires knowledge outside that room (like a new scientific discovery or a specific expert trick), they hit a glass ceiling. They can't invent new knowledge; they can only rearrange what they already have.

2. The Solution: The "Metacognitive" Student

The authors gave these AI agents a superpower: Metacognition.

  • What is it? It's the ability to say, "Wait, I don't know this," or "This is too hard for us right now."
  • The Analogy: Imagine a student who, instead of blindly guessing on a test, pauses and asks themselves: "Do I actually know the answer, or am I just making it up?"
  • The Strategy: The HILA framework teaches the agents to make three specific moves:
    1. Evaluate (EVAL): "Hey team, let's pick the best answer we already have." (Using collective wisdom).
    2. Create (CREATE): "None of our current ideas work. Let's try to invent a brand-new solution from scratch." (Creative exploration).
    3. Defer (DEFER): "This is too hard. We are stuck. Let's call the expert." (Strategic surrender).
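The three moves above amount to a small decision rule over the team's own confidence. Here is a minimal sketch of that idea in Python; the thresholds, score scale, and function names are illustrative assumptions for the analogy, not details from the paper (HILA learns this policy rather than hard-coding it):

```python
from enum import Enum

class Action(Enum):
    EVAL = "evaluate"   # pick the best answer the team already has
    CREATE = "create"   # try to invent a brand-new solution
    DEFER = "defer"     # call the human expert

def choose_action(candidate_scores, confidence_floor=0.6, stuck_floor=0.2):
    """Toy metacognitive rule (thresholds are made up for illustration).

    candidate_scores: the agents' self-estimated quality of each candidate
    answer, each in [0, 1].
    """
    if not candidate_scores:
        return Action.CREATE          # nothing on the table yet: explore
    best = max(candidate_scores)
    if best >= confidence_floor:
        return Action.EVAL            # trust the collective wisdom
    if best >= stuck_floor:
        return Action.CREATE          # ideas exist, but all are weak
    return Action.DEFER               # truly stuck: strategic surrender
```

The point of the framework is that the agents learn *when* each branch pays off, instead of following fixed thresholds like these.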

3. The Secret Sauce: The "Dual-Loop" Training

How do you teach an AI when to ask for help? You can't just tell it; it has to learn from experience. The authors use a training process with two nested loops, called Dual-Loop Policy Optimization (DLPO).

Think of this like training an apprentice chef:

  • The Inner Loop (The "When to Ask" Coach):

    • This part uses a game-like system (Reinforcement Learning).
    • The Rule: If you solve it yourself, you get a gold star. If you ask the chef (the human/expert), you get a gold star but you lose a few points because asking takes time and effort.
    • The Goal: The AI learns to balance the risk of failing alone vs. the cost of asking for help. It learns to ask only when it's truly necessary, not out of laziness.
  • The Outer Loop (The "What to Learn" Mentor):

    • This is the magic part. When the AI does ask for help, the human expert doesn't just give the answer and leave. They show the AI how to solve it.
    • The Analogy: The apprentice watches the master chef cook the dish. The next time a similar dish comes up, the apprentice remembers the technique.
    • The Result: The AI doesn't just get the answer for that one question; it actually gets smarter. It absorbs the expert's knowledge into its own brain, so it might not need to ask for help next time.
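The apprentice-chef analogy can be sketched as two tiny pieces of code: a shaped reward for the inner loop, and a growing memory for the outer loop. The reward values, class names, and the topic-keyed memory are illustrative assumptions, not the paper's actual reward function or learning mechanism:

```python
def inner_loop_reward(solved, deferred, defer_cost=0.2):
    """Inner loop (toy version): solving earns full reward; deferring to
    the expert still earns it, minus a fixed cost for the expert's time.
    The 1.0 / 0.2 values are made up for illustration."""
    if not solved:
        return 0.0
    return 1.0 - defer_cost if deferred else 1.0

class OuterLoopMemory:
    """Outer loop (toy version): store the expert's worked solution so a
    similar problem can be handled without deferring next time."""
    def __init__(self):
        self.lessons = {}  # topic -> expert demonstration

    def absorb(self, topic, demonstration):
        self.lessons[topic] = demonstration

    def can_solve_alone(self, topic):
        return topic in self.lessons
```

The two loops reinforce each other: every DEFER pays the inner-loop cost but feeds the outer-loop memory, so over time the team defers less and earns the full reward more often.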

4. The Results: Smarter Teams

The paper tested this on tough math competitions and coding challenges.

  • Old Way: AI teams argued until they got it wrong or gave up.
  • HILA Way: The team works hard, realizes when they are stuck, calls the expert, learns the trick, and then solves the next problem on their own.

The Outcome: The HILA system consistently beat other advanced AI teams. It didn't just get better at asking for help; it actually became a better problem-solver over time because it kept learning from every interaction.

Summary

In short, this paper teaches AI agents to be humble and strategic. Instead of pretending to know everything, they learn to recognize their limits, ask for help at the right moment, and then use that help to grow stronger. It turns a static group of computers into a team that can learn and evolve, just like a human team with a great mentor.