Imagine you are the director of a play where the actors are not humans, but super-smart AI computers (called Large Language Models, or LLMs). Your goal is to get them to have a realistic, engaging debate about a tricky topic, like "Should we build more parks on farmland?"
In the past, directors of these AI plays had to do one of two things:
- Train the actors for months: Feed them thousands of examples of how to argue until they learned the "right" way to behave. This is slow, expensive, and hard to change.
- Just shout instructions: Slip the actors a note saying, "Be nice!" or "Argue harder!" and hope for the best. This is hit-or-miss and often leads to boring or repetitive conversations.
This paper introduces a third, smarter way: The "Smart Script" Method.
The Core Idea: Prompts as Actions
The authors propose that instead of just telling the AI what to say, we should treat the instruction itself (the prompt) as an adjustable control surface, like a remote control.
Think of the AI agent as a car.
- Old Way: You try to train the car to drive itself perfectly by letting it crash and learn for years (Reinforcement Learning).
- New Way (This Paper): You keep the car's engine exactly the same, but you install a customizable dashboard. You can turn knobs to adjust the "steering," the "speed," and the "fuel mix" while the car is driving, without ever opening the hood.
How the "Smart Script" Works
The researchers broke down the instruction (the prompt) into five adjustable ingredients, like a recipe for a debate:
- The Character (Task & Persona): Who is the AI playing? (e.g., A grumpy farmer, a worried parent, or an environmentalist).
- The Memory (Dialogue History): What has been said so far?
- The Library (External Knowledge): What facts does the AI have access to?
- The Rulebook (Structure): How should the AI format its answer? (e.g., "Start with a yes/no," or "List three facts first").
- The Volume Knobs (Weights): This is the secret sauce. You can turn up or down how much the AI listens to its Character, its Memory, or its Library.
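To make the recipe concrete, here is a minimal sketch of how the five ingredients might be assembled into one prompt. All names (`SmartScript`, the `priority` markers, the example facts) are hypothetical illustrations, not the paper's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class SmartScript:
    """Hypothetical prompt template built from the five ingredients."""
    persona: str                 # The Character
    history: list[str]           # The Memory
    knowledge: list[str]         # The Library
    structure: str               # The Rulebook
    weights: dict = field(default_factory=lambda: {
        "persona": 1.0, "history": 1.0, "knowledge": 1.0
    })                           # The Volume Knobs

    def render(self) -> str:
        # The "volume knobs" are surfaced to the model as explicit
        # priority scores attached to each ingredient.
        parts = [
            f"[ROLE | priority={self.weights['persona']:.1f}] {self.persona}",
            f"[HISTORY | priority={self.weights['history']:.1f}] "
            + " / ".join(self.history[-3:]),  # keep only recent turns
            f"[FACTS | priority={self.weights['knowledge']:.1f}] "
            + "; ".join(self.knowledge),
            f"[FORMAT] {self.structure}",
        ]
        return "\n".join(parts)

script = SmartScript(
    persona="A grumpy farmer who opposes converting farmland to parks",
    history=["Conservationist: Parks improve air quality."],
    knowledge=["Illustrative fact: local farmland acreage is shrinking."],
    structure="Start with a clear yes/no, then give one supporting fact.",
)
prompt = script.render()
```

Turning a knob is then just a matter of changing a number in `weights` and re-rendering, with no retraining involved.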
The Experiment: Tuning the Knobs
The team set up two "stages" (scenarios):
- Land Use: Farmers vs. Conservationists vs. Community Reps.
- Education: Rural Teachers vs. Urban Parents vs. Policy Makers.
They ran these debates 10 times, but each time they tweaked the Rulebook and the Volume Knobs.
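The sweep over settings can be pictured as a small grid search. The scenario names and knob presets below are illustrative stand-ins, not the paper's exact experimental grid:

```python
import itertools

# Hypothetical sweep: two scenarios, three rulebook settings,
# and a few "volume knob" presets, each combination run as a debate.
scenarios = ["land_use", "education"]
rulebooks = ["none", "light", "struct"]
knob_presets = [
    {"persona": 1.5, "history": 1.0, "knowledge": 1.0},  # character-heavy
    {"persona": 1.0, "history": 1.5, "knowledge": 1.0},  # memory-heavy
    {"persona": 1.0, "history": 1.0, "knowledge": 1.5},  # evidence-heavy
]

runs = list(itertools.product(scenarios, rulebooks, knob_presets))
# 2 scenarios x 3 rulebooks x 3 presets = 18 configurations to debate and score
```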
Here is what they discovered:
The "Rulebook" Effect:
- If you give the AI no rules (None), it talks naturally but might repeat itself or forget to use facts.
- If you give it light rules (Light), it starts using more facts from its library, like a student citing a textbook.
- If you give it strict rules (Struct), it becomes very organized and stops repeating itself, but it might get a bit rigid and stop using outside facts as much.
The "Volume Knob" Effect:
- If you turn up the Character knob, the AI gets more passionate and argues harder (more "rebuttals").
- If you turn up the Memory knob, the AI remembers what was said earlier, but might start repeating itself if you aren't careful.
- If you turn up the Knowledge knob, the AI brings in more evidence.
The "Auto-Pilot" (Adaptive Weights):
They even built a system that automatically adjusts these knobs as the conversation goes on.
- Early in the debate: The AI focuses on its Character and Facts to set the stage.
- Later in the debate: The AI shifts focus to Memory to respond to what the other person just said.

It's like a DJ mixing music, automatically fading one track out and fading another in to keep the party going smoothly.
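The DJ crossfade can be sketched as a simple schedule over the debate's turns. The specific numbers and the linear fade are assumptions for illustration, not the paper's actual formula:

```python
def adaptive_weights(turn: int, total_turns: int) -> dict[str, float]:
    """Hypothetical schedule: Character and Facts dominate early,
    Memory dominates late, like a DJ crossfading two tracks."""
    progress = turn / max(total_turns - 1, 1)  # 0.0 at start, 1.0 at end
    return {
        "persona":   1.0 - 0.6 * progress,   # fade the Character out
        "knowledge": 1.0 - 0.4 * progress,   # ease off the Library
        "history":   0.4 + 0.6 * progress,   # fade the Memory in
    }

# Watch the knobs drift over a 10-turn debate
for t in (0, 4, 9):
    w = adaptive_weights(t, total_turns=10)
    print(t, {k: round(v, 2) for k, v in w.items()})
```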
Why This Matters
This is a big deal for Social Simulation.
Imagine you want to study how a town reacts to a new law. Instead of hiring 1,000 actors and training them for weeks, you can just write a "Smart Script" with adjustable knobs.
- Want a more aggressive town? Turn up the "Conflict" knob.
- Want a town that relies on data? Turn up the "Evidence" knob.
- Want to see how opinions change over time? Let the "Auto-Pilot" adjust the knobs as the conversation evolves.
The Bottom Line
This paper shows that we don't need to retrain AI models to change how they behave in a group. We just need to tune the instructions we give them. By treating instructions as adjustable parameters (like a mixing board), we can create diverse, realistic, and controllable social simulations that help us understand human behavior, all without touching the model's internals or retraining it at all.
It turns the AI from a static text-generator into a flexible social actor that we can direct in real-time.