Learning to Negotiate: Multi-Agent Deliberation for Collective Value Alignment in LLMs

This paper introduces a multi-agent negotiation framework that trains large language models to align with Collective Agency and resolve value conflicts through self-play deliberation, optimized via RLAIF with GRPO. The authors demonstrate improved conflict-resolution capabilities without compromising general language performance.

Panatchakorn Anantaprayoon, Nataliia Babina, Nima Asgharbeygi, Jad Tarifi

Published Thu, 12 Ma

Imagine you have a very smart, very polite robot assistant. Right now, we've taught this robot to be "good" by showing it examples of helpful, honest, and harmless behavior. It's like teaching a child to say "please" and "thank you." This works great when the robot is just answering questions for one person.

But what happens when the robot has to help a whole group of people who disagree with each other?

Imagine a family dinner where:

  • Dad wants to save money and eat cheap pizza.
  • Mom wants a healthy, organic feast.
  • Grandma wants to cook her traditional recipe, even if it takes all day.
  • The Teenager just wants to order sushi.

If you ask a standard robot, it might just pick one person's idea (usually the first one it heard) or give a vague answer like, "Let's all be happy." It doesn't know how to actually negotiate a solution where everyone feels heard and gets something they want.

This paper introduces a new way to train robots to handle these messy, conflicting situations. They call it "Learning to Negotiate."

The Big Idea: The Robot Debate Club

Instead of teaching the robot to just give a single "correct" answer, the researchers taught it how to hold a conversation with itself to find a middle ground.

Here is how they did it, using a simple analogy:

1. The "Self-Play" Game (The Mirror Match)

Imagine the robot is a chess player. Usually, it plays against a human. But here, the researchers made the robot play against a frozen copy of itself.

  • Robot A is assigned a persona: "The Strict Budget Manager."
  • Robot B is assigned a persona: "The Luxury Lover."

They are forced to sit at a table and talk. They have to argue their points, listen to each other, and try to find a plan that satisfies both the budget and the desire for luxury. They aren't just shouting; they are trying to reach an agreement.
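The mirror match above can be sketched as a short loop: a trainable "policy" agent and a frozen copy of itself alternate turns under their assigned personas until they agree or run out of turns. All names here (`generate_reply`, the `"AGREED"` stop token, the turn limit) are illustrative assumptions, not the paper's actual implementation.

```python
def generate_reply(model, persona, transcript):
    """Stand-in for an LLM call: produce this persona's next message."""
    return f"[{persona}] responds to {len(transcript)} prior messages"

def self_play_dialogue(policy_model, frozen_model, persona_a, persona_b, max_turns=6):
    """Alternate turns between the trainable model and its frozen copy."""
    transcript = []
    agents = [(policy_model, persona_a), (frozen_model, persona_b)]
    for turn in range(max_turns):
        model, persona = agents[turn % 2]   # alternate speakers each turn
        message = generate_reply(model, persona, transcript)
        transcript.append((persona, message))
        if "AGREED" in message:             # a simple stop condition
            break
    return transcript

dialogue = self_play_dialogue("policy", "frozen",
                              "Strict Budget Manager", "Luxury Lover")
```

Only the first agent's weights would be updated during training; the frozen copy exists to give it a consistent, adversarial conversation partner.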

2. The "Scorecard" (The Judge)

After they talk, a "Judge" (another AI) looks at their final plan.

  • Did they agree? If they just kept arguing and never solved the problem, they get a zero.
  • Did they find a "Collective Agency" solution? This is a fancy term the authors use. Think of it as "The Win-Win Score."
    • A bad solution is: "We eat cheap pizza, and Mom is sad." (One person wins, one loses).
    • A good "Collective Agency" solution is: "We order a few slices of pizza for the budget, but we make a fancy homemade salad for Mom, and we all eat together." (Everyone's agency—everyone's ability to get what they need—is expanded).
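A minimal sketch of this scorecard logic, assuming a per-persona satisfaction score in [0, 1]. Rewarding the *worst-off* participant (a maximin rule) is one simple way to encode "everyone's agency is expanded"; it is my design choice for illustration, not the paper's exact reward specification.

```python
def judge_plan(reached_agreement, satisfaction_scores):
    """Score a negotiation outcome.

    reached_agreement: did the agents converge on a plan at all?
    satisfaction_scores: per-persona scores in [0, 1] for the final plan.
    """
    if not reached_agreement:
        return 0.0                    # endless arguing earns a zero
    # A win-win ("Collective Agency") plan lifts everyone, so reward
    # the worst-off participant rather than the average.
    return min(satisfaction_scores)

# One side wins outright: cheap pizza, Mom is sad -> low reward.
print(judge_plan(True, [0.9, 0.1]))   # 0.1
# Compromise: some pizza plus a homemade salad -> higher reward.
print(judge_plan(True, [0.7, 0.7]))   # 0.7
```

Note how averaging instead of taking the minimum would rate both plans similarly, hiding the fact that the first one leaves someone behind.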

3. The Training Loop (Learning from Mistakes)

The robot plays this game thousands of times.

  • If it argues too much and fails to agree, it gets a "bad grade" (negative reward).
  • If it finds a clever, creative compromise that makes everyone happy, it gets a "gold star" (positive reward).

Over time, the robot learns that arguing forever is bad, but finding a creative compromise is the best way to win.
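The "gold star / bad grade" loop maps onto GRPO's core idea: sample a *group* of negotiation rollouts for the same scenario, then score each one relative to the group's average reward. The group size, reward values, and omission of standard-deviation normalization below are simplifying assumptions for illustration.

```python
def group_relative_advantages(rewards):
    """GRPO's core idea: score each rollout relative to its group.

    (Full GRPO also divides by the group's standard deviation;
    that normalization is omitted here for simplicity.)
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Four negotiations of the same dinner scenario, as scored by the judge:
rewards = [0.0, 0.1, 0.7, 0.4]   # no deal, one-sided, win-win, partial
advantages = group_relative_advantages(rewards)
# Rollouts above the group average get a positive advantage (a "gold star");
# those below get a negative advantage (a "bad grade"). The policy is then
# nudged toward the behavior that produced the above-average rollouts.
```

This is why the model gradually stops "arguing forever": within any sampled group, the rollout that ends in a creative compromise scores above average and gets reinforced.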

Why Is This a Big Deal?

1. It's not just about being "nice."
Old training methods taught robots to be "helpful, honest, and harmless." But in the real world, being "harmless" isn't enough. Sometimes you have to choose between two "good" things that conflict (e.g., Privacy vs. Safety). This new method teaches the robot to navigate the tension between two good values, rather than just picking one.

2. It's like a "Group Therapy" session for AI.
The paper shows that when the robot learns to negotiate, it doesn't just get better at arguing; it gets better at thinking. It learns to synthesize different viewpoints into a single, stronger idea. It's like a group of friends brainstorming: the final idea is often better than anything one person could have come up with alone.

3. It doesn't make the robot "dumb."
A common fear is that if you teach a robot to argue, it might forget how to do math or follow instructions. The researchers tested this, and the robot remained just as good at math and logic as before. It just gained a new superpower: Diplomacy.

The Real-World Example from the Paper

The paper gives a great example: A therapist has a client who confesses to a crime, but another innocent person is currently in jail for it.

  • Rule A: Keep patient confidentiality (don't tell anyone).
  • Rule B: Do justice for the innocent person (tell the truth).

A standard robot might say, "I can't break confidentiality," or "I must tell the police," picking one side and ignoring the other.

The Negotiation-Trained Robot figures out a third path: "Let's encourage the client to voluntarily confess to the authorities. This way, the client takes responsibility (upholding their agency), the innocent person is freed (justice), and the therapist didn't force the breach of trust (maintaining the relationship)."

The Takeaway

This paper suggests that to make AI truly useful in our complex, divided world, we shouldn't just teach them to be obedient servants. We should teach them to be diplomats.

By training them to negotiate with themselves, we are building AI that can help us solve our own human conflicts, finding solutions where everyone feels a little more heard and a little more free. It's a step toward AI that doesn't just answer our questions, but helps us figure out what we should do together.