Heterogeneous Agent Collaborative Reinforcement Learning

Imagine you are trying to teach a group of students how to solve incredibly difficult math puzzles. In the past, you would put each student in a separate room. They would try to solve the puzzles, get a score, and then study only their own mistakes and successes. They would never talk to each other. This is how most AI models are trained today: they work in isolation, which is slow and wasteful.

This paper introduces a new way of teaching called HACRL (Heterogeneous Agent Collaborative Reinforcement Learning), and a specific method to make it work called HACPO.

Here is the simple breakdown of how it works, using some everyday analogies.

1. The Problem: The "Silent Study Hall"

Imagine a library where you have three students:

Student A: A genius who solves puzzles quickly but sometimes misses the "trick" questions.
Student B: A hard worker who is slower but very careful and catches details A misses.
Student C: A beginner who makes lots of mistakes but sometimes stumbles upon a unique, creative solution no one else thought of.

In the old way (Isolated Training), Student A studies only A's papers. Student B studies only B's papers. They never share. Student A never learns that Student B found a clever trick, and Student B never learns from Student A's speed. It's like everyone is reinventing the wheel.

2. The Solution: The "Collaborative Study Group"

The authors propose a new rule: During training, everyone shares their work, but during the final exam, everyone works alone.

This is the core idea of HACRL.

Training Phase (The Study Group): All three students sit together. They generate solutions to the same math problems. They share their "rollouts" (their attempts and answers).
Inference Phase (The Exam): When it's time to use the AI in the real world, Student A goes into a room alone. Student B goes into another room alone. They don't need to talk to each other to function; they just use the knowledge they gained from the study group.

3. The Challenge: "Apples and Oranges"

There's a catch. These students are very different (Heterogeneous).

Student A is fast but arrogant.
Student B is slow but humble.
Student C is a beginner.

If you just average their scores, the beginner's mistakes might confuse the genius, or the genius might ignore the beginner's unique insights. You can't just treat them all the same.

4. The Magic Sauce: How HACPO Fixes It

To make this study group work, the authors created HACPO, which uses four clever tricks (mechanisms) to handle the differences between the students:

A. The "Fair Baseline" (Agent-Capability-Aware Advantage)

The Analogy: If you are a genius, getting a 90% on a test is bad. If you are a beginner, getting a 90% is amazing.
The Fix: HACPO doesn't judge everyone against the same standard. It calculates a "personalized baseline." It asks, "How good is this answer for this specific student compared to what they usually do?" This ensures the genius isn't over-praised for a score that is routine for them, and the beginner gets full credit for a score that is exceptional for them.

B. The "Respectful Teacher" (Capability Discrepancy Coefficient)

The Analogy: When Student A (the genius) learns from Student C (the beginner), they should be careful. But when Student C learns from Student A, they should listen very closely.
The Fix: The system automatically adjusts how much weight to give to each student's advice. If a "stronger" agent shares a solution, the "weaker" agent learns from it aggressively. If a "weaker" agent shares a solution, the "stronger" agent looks at it cautiously, just in case it's a fluke.

C. The "Safety Filter" (Exponential Importance Sampling)

The Analogy: Imagine Student A and Student B speak different dialects. If Student A tries to copy Student B's handwriting exactly, it might look weird and cause confusion.
The Fix: The system checks how similar the students' "styles" are. If the styles are too different, it dampens the learning signal so the student doesn't get confused by a completely foreign way of thinking. It keeps the learning stable.

D. The "Step-by-Step Guardrails" (Stepwise Clipping)

The Analogy: In a group study, if one student starts shouting out crazy answers, it can derail the whole session.
The Fix: As the study session goes on, the system gets stricter. It puts "guardrails" on the learning. If a shared answer is too weird or risky compared to what the student usually does, the system clips it (cuts it off) so it doesn't ruin the student's progress.

5. The Result: Everyone Wins

The paper tested this on many different combinations of AI models (some big, some small, some from different companies).

The Outcome: Every single student got better. The genius got slightly smarter, and the beginner got much smarter.
Efficiency: They achieved these results at half the rollout cost (the expensive step of generating practice attempts) compared to training two models separately. It's like getting two years of study done in one year by sharing notes.

Summary

HACPO is like a super-efficient study group where:

Different types of students (AI models) share their homework.
Smart rules ensure the genius doesn't get confused by the beginner, and the beginner learns from the genius without getting overwhelmed.
Everyone gets better faster and uses less energy.
In the real world, they still work alone, but they are now much smarter because of their time together.

It turns the "Silent Study Hall" into a "Collaborative Workshop," proving that even when AI models are different, they can learn from each other to become stronger.

1. Problem Definition

The paper addresses the inefficiencies inherent in current Reinforcement Learning with Verifiable Rewards (RLVR) paradigms, such as GRPO and GSPO. In standard RLVR, multiple agents (LLMs) optimize independently on the same task. This leads to:

Wasteful Sampling: Costly rollouts (trajectories) generated by one agent are discarded and not utilized by others.
Isolated Optimization: Agents fail to leverage the complementary strengths or diverse error patterns of other models.
Heterogeneity Challenges: Modern LLM ecosystems are inherently heterogeneous (differing in size, architecture, tokenizer, or training state). Naively sharing data between such agents causes policy distribution shifts and capability discrepancies, leading to unstable training or biased advantage estimation.

The authors formalize Heterogeneous Agent Collaborative Reinforcement Learning (HACRL): a paradigm where heterogeneous agents share rollouts during training to mutually improve, while maintaining independent execution at inference time. Unlike Multi-Agent RL (which requires coordinated execution) or Knowledge Distillation (which is typically one-way teacher-to-student), HACRL aims for bidirectional, peer-to-peer mutual learning.

2. Methodology: HACPO Algorithm

To solve the HACRL problem, the authors propose HACPO (Heterogeneous Agent Collaborative Policy Optimization). The algorithm introduces four tailored mechanisms to mitigate capability gaps and distribution shifts while ensuring theoretical correctness.

A. Agent-Capability-Aware Advantage Estimation

Standard group-relative advantage estimation fails in heterogeneous settings because it assumes uniform capability. HACPO introduces a capability-adjusted baseline:

It computes a baseline $\hat{\mu}^{(k)}_t$ for agent $k$ by aggregating rewards from all agents, but weighted by a capability ratio $\omega^{(k,j)}_t$.
$\omega^{(k,j)}_t$ is a smoothed estimate of the relative performance of agent $j$ compared to agent $k$.
This ensures that rewards from stronger agents are weighted appropriately when establishing the baseline for weaker agents, preventing systematic bias.

B. Model Capabilities Discrepancy Coefficient

To handle the gradient updates, HACPO applies the capability ratio $\omega^{(k,j)}_t$ directly as a modulation factor on the advantages of cross-agent samples:

Learning from a stronger agent: When a weaker agent learns from a stronger one, the gradient is amplified (encouraging aggressive learning).
Learning from a weaker agent: When a stronger agent learns from a weaker one, the gradient is attenuated (preventing noise injection).
This creates a bidirectional learning dynamic where even weaker agents contribute unique exploration signals (e.g., alternative reasoning paths or informative errors).
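The modulation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the exponential-moving-average smoothing, the `momentum` and `eps` values, and all function names are assumptions made for illustration. The ratio is taken as the source agent's smoothed reward over the learner's, a direction chosen only because it reproduces the amplify/attenuate behavior stated above.

```python
def update_smoothed_reward(prev, batch_mean, momentum=0.9):
    """Exponential moving average of an agent's mean reward.

    The EMA form and the momentum value are assumptions; the text only
    says the capability estimate is smoothed.
    """
    return momentum * prev + (1.0 - momentum) * batch_mean


def capability_ratio(perf_learner, perf_source, eps=1e-8):
    """omega^(k,j): smoothed performance of source agent j relative to learner k."""
    return perf_source / (perf_learner + eps)


def modulate_cross_agent_advantages(advantages, omega):
    """Scale the advantages of rollouts borrowed from another agent by omega."""
    return [omega * a for a in advantages]


# Hypothetical smoothed accuracies for a strong and a weak agent.
perf = {"strong": 0.8, "weak": 0.4}
borrowed_advantages = [1.0, -0.5]

# Weak agent learning from the strong agent's rollouts: omega is about 2 (amplified).
weak_from_strong = modulate_cross_agent_advantages(
    borrowed_advantages, capability_ratio(perf["weak"], perf["strong"])
)

# Strong agent learning from the weak agent's rollouts: omega is about 0.5 (attenuated).
strong_from_weak = modulate_cross_agent_advantages(
    borrowed_advantages, capability_ratio(perf["strong"], perf["weak"])
)
```

Because the ratio is built from smoothed history rather than the current batch, a single lucky or unlucky batch does not swing the weighting.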

C. Exponential Importance Sampling (EIS)

To correct for the large distributional mismatch between heterogeneous policies (e.g., different architectures or tokenizers), HACPO uses sequence-level importance sampling.

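It employs an exponential reweighting term $s^{(k,j)}_{t,i} \cdot (sg[s^{(k,j)}_{t,i}])^\alpha$, where $sg[\cdot]$ denotes the stop-gradient operator, so the exponentiated factor rescales the update without contributing gradients of its own.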
The parameter $\alpha$ controls conservativeness: higher $\alpha$ suppresses large distribution shifts, ensuring the agent primarily learns from cross-agent samples that are distributionally aligned with its own.
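To make the effect of $\alpha$ concrete, here is a small numeric sketch. It assumes a length-normalized sequence-level ratio in the style of GSPO and assumes both policies score the same token sequence, which sidesteps the tokenizer mismatch mentioned above. Function names and log-probability values are hypothetical.

```python
import math


def sequence_ratio(logp_learner, logp_source):
    """Length-normalized sequence-level ratio: exp(mean token log-prob difference).

    Assumes both lists score the same tokens, one log-probability per token.
    """
    diffs = [a - b for a, b in zip(logp_learner, logp_source)]
    return math.exp(sum(diffs) / len(diffs))


def eis_weight(s, alpha):
    """Numeric value of s * sg[s]**alpha.

    Plain Python has no gradients, so sg[.] is only notation here. In an
    autodiff framework the second factor would be detached from the graph.
    """
    return s * (s ** alpha)


# Hypothetical per-token log-probs of one borrowed response.
logp_learner = [-1.2, -0.9, -1.5]   # under the learning agent's policy
logp_source = [-0.7, -0.4, -1.0]    # under the agent that generated it

s = sequence_ratio(logp_learner, logp_source)   # exp(-0.5), below 1
weights = {alpha: eis_weight(s, alpha) for alpha in (0.0, 1.0, 2.0)}
```

For a ratio below 1, each increase in `alpha` multiplies the weight by another factor of `s`, so poorly aligned samples are suppressed more strongly.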

D. Stepwise Clipping

Standard symmetric clipping (e.g., $[1-\epsilon, 1+\epsilon]$) is insufficient for cross-agent data because cross-agent importance ratios fluctuate irregularly and can exceed 1.0, causing the current agent to overfit to another agent's distribution.

Asymmetric Clipping: The upper bound is strictly capped at 1.0. This ensures cross-agent responses can only downweight the learning signal relative to on-policy responses, never upweight them to dominate the gradient.
Stepwise Tightening: As training progresses within a batch (accumulating policy drift), the lower bound of the clipping range tightens ($1.0 - \delta + k \cdot \delta_{step}$, where $k$ here counts the update steps taken within the batch rather than indexing an agent). This prevents late-stage updates in a batch from being dominated by noisy cross-agent rollouts.
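The two clipping rules can be sketched as a simple clamp on the cross-agent importance ratio. The default values of `delta` and `delta_step` are hypothetical, and capping the lower bound at 1.0 is an added assumption to keep the range valid for large step counts. The sketch shows only the bounds, not the full surrogate objective.

```python
def cross_agent_clip_bounds(step, delta=0.2, delta_step=0.05):
    """Clipping range for cross-agent ratios at a given update step in the batch.

    Lower bound: 1.0 - delta + step * delta_step (tightens as step grows).
    Upper bound: fixed at 1.0 (asymmetric clipping).
    """
    lower = min(1.0 - delta + step * delta_step, 1.0)  # cap is an assumption
    return lower, 1.0


def clip_cross_agent_ratio(ratio, step, delta=0.2, delta_step=0.05):
    """Clamp an importance ratio into the stepwise range."""
    lower, upper = cross_agent_clip_bounds(step, delta, delta_step)
    return max(lower, min(ratio, upper))


# The same raw ratios, clipped early and later in the batch.
early = [clip_cross_agent_ratio(r, step=0) for r in (0.5, 0.85, 1.3)]
later = [clip_cross_agent_ratio(r, step=2) for r in (0.5, 0.85, 1.3)]
```

With these hypothetical settings the allowed range shrinks from roughly [0.8, 1.0] at step 0 to roughly [0.9, 1.0] at step 2, and a ratio above 1.0 is always pulled back to 1.0.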

3. Theoretical Guarantees

The paper provides rigorous theoretical analysis to validate HACPO:

Unbiased Advantage Estimation: Theorem 4.1 proves that the mixed-response baseline used in HACPO remains an unbiased estimator of the on-policy expected reward, provided the capability ratio is statistically independent of the current batch's reward realization.
Gradient Alignment: Theorem 4.3 demonstrates that the gradient of the heterogeneous objective ($J_{hete}$) is positively aligned with the gradient of the homogeneous objective ($J_{homo}$). This ensures that learning from cross-agent rollouts provides a valid optimization direction consistent with standard RL, avoiding adverse bias.
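One common way to write the two claims above symbolically is given below. This is a paraphrase, not the paper's exact theorem statements: the symbols $\pi^{(k)}$ (agent $k$'s policy), $r(y)$ (the verifiable reward of response $y$), and $\theta$ (agent $k$'s parameters) are notation assumed here for illustration.

$$\mathbb{E}\big[\hat{\mu}^{(k)}_t\big] = \mathbb{E}_{y \sim \pi^{(k)}}\big[r(y)\big], \qquad \big\langle \nabla_\theta J_{hete},\ \nabla_\theta J_{homo} \big\rangle > 0$$

The first relation says the mixed-response baseline matches agent $k$'s own expected reward on average. The second says a step along the heterogeneous gradient also increases the homogeneous objective to first order.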

4. Experimental Results

The authors evaluated HACPO across three types of heterogeneity (State, Size, and Model) on seven mathematical reasoning benchmarks (including MATH, GSM8K, AIME2025, and Olympiad).

Performance Gains: HACPO consistently outperformed all baselines (Standard GRPO, GSPO, Resource-Equivalent GSPO $\times 2$, and a Naive sharing baseline).
- On average, HACPO improved performance by 3.3% over GSPO.
- It achieved these gains using only half the rollout cost compared to running two independent GSPO agents (due to sample sharing).
Ablation Studies:
- Removing the Capability-Aware Advantage module caused significant performance drops due to bias.
- Removing the Discrepancy Coefficient degraded performance, confirming the need for gradient modulation.
- Stepwise Clipping was shown to be critical for training stability; without it, training became unstable or suboptimal.
Heterogeneity Robustness: The method worked effectively even when agents had different architectures (e.g., Qwen vs. Llama) and tokenizers, proving its ability to extract transferable knowledge across diverse model families.

5. Key Contributions & Significance

New Paradigm (HACRL): Defines a new learning framework that decouples collaborative optimization from coordinated execution, allowing agents to learn from each other while remaining independent at inference.
Algorithmic Innovation (HACPO): Proposes a robust algorithm with four specific mechanisms (Capability-Aware Baseline, Discrepancy Coefficient, EIS, Stepwise Clipping) to handle the unique challenges of heterogeneous agents.
Efficiency: Solves the "expensive sampling" bottleneck of RLVR by enabling $N$-fold sample utilization in an $N$-agent system.
Bidirectional Learning: Unlike distillation, HACPO enables mutual benefit, where even weaker agents contribute valuable exploration data that helps stronger agents break performance bottlenecks.
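The efficiency claim can be checked with back-of-the-envelope arithmetic. The sketch below assumes every agent generates the same number of rollouts per prompt and that the comparison is made at an equal number of training samples per agent. The function name and the group size of 8 are hypothetical.

```python
def rollouts_generated(num_agents, per_agent, shared):
    """Total rollouts generated per prompt so that every agent trains on
    num_agents * per_agent samples.

    shared=True:  each agent generates per_agent rollouts and reuses the rest.
    shared=False: each agent must generate all of its training samples itself.
    """
    if shared:
        return num_agents * per_agent
    return num_agents * (num_agents * per_agent)


# Two agents, hypothetical group size of 8 rollouts per prompt.
with_sharing = rollouts_generated(num_agents=2, per_agent=8, shared=True)
without_sharing = rollouts_generated(num_agents=2, per_agent=8, shared=False)
cost_ratio = with_sharing / without_sharing
```

Under these assumptions two agents need 16 generated rollouts with sharing versus 32 without, which matches the half-cost figure reported for the two-agent experiments.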

Conclusion: HACPO represents a significant step forward in scaling RLVR for large language models. By enabling heterogeneous agents to collaboratively optimize their policies, it maximizes data efficiency and improves reasoning capabilities across diverse model architectures without requiring complex coordination during deployment.