Imagine you are trying to teach a robot how to play a strategic game like Poker or Rock-Paper-Scissors against a room full of other robots.
In the past, the standard way to do this was to use a "Black Box" trainer (Deep Reinforcement Learning). You'd throw millions of games at the robot, and it would slowly learn by trial and error. Eventually, it would become a master player. But here's the catch: you have no idea how it thinks. It's like a wizard casting a spell; the result is magic, but if you ask, "Why did you cast that spell?" the wizard just shrugs. It's a "black box" policy. If the robot makes a weird mistake, you can't fix it because you can't read its mind.
This paper introduces a new method called Code-Space Response Oracles (CSRO). Instead of a black box, they use a Large Language Model (LLM)—the same kind of AI that writes essays and code—to act as the trainer.
Here is the simple breakdown of how it works, using some everyday analogies:
1. The Core Idea: "Write the Code, Don't Just Learn the Weights"
Instead of training a robot to adjust invisible numbers (neural network weights) until it wins, CSRO asks the AI: "Write me a Python program that plays this game well."
- Old Way: Like training a dog by giving it treats for good behavior. The dog learns, but you don't know why it barked at the mailman.
- CSRO Way: Like asking a human chess coach to write a rulebook for a new student. The coach says, "If the opponent moves their knight here, you should move your bishop there because..."
- The Result: The "policy" isn't a mysterious neural network; it's actual, readable computer code with comments explaining the strategy. You can open the file, read it, and say, "Ah, I see! It's bluffing because the opponent always folds to aggression."
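To make the contrast concrete, here is a hypothetical sketch (invented for this explainer, not taken from the paper) of the kind of readable code policy CSRO aims for, using Rock-Paper-Scissors:

```python
import random

def rps_policy(opponent_history):
    """A hypothetical human-readable policy of the kind CSRO produces.

    Strategy: counter the opponent's most frequent past move;
    play uniformly at random until we have any data.
    """
    moves = ["rock", "paper", "scissors"]
    if not opponent_history:
        return random.choice(moves)
    # Count what the opponent has played so far.
    counts = {m: opponent_history.count(m) for m in moves}
    most_common = max(counts, key=counts.get)
    # Each move is beaten by the next one in this mapping.
    beats = {"rock": "paper", "paper": "scissors", "scissors": "rock"}
    return beats[most_common]
```

Every decision here is a named, commented rule you can read, question, and edit, which is exactly what a weight matrix does not give you.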
2. The Process: The "Evolutionary Workshop"
The paper describes a few ways the AI generates these code strategies, but the most powerful one is called AlphaEvolve. Think of this as a high-tech workshop for inventors:
- The Prompt (The Brief): The AI is given the rules of the game and a description of the current "meta-game" (what the other robots are doing).
- The Draft (Zero-Shot): The AI writes a first draft of a strategy code.
- The Test (Evaluation): This new code is pitted against the other robots in the simulation.
- The Critique (Refinement): If the code loses, the AI looks at the score and says, "Okay, that didn't work. Let's try again."
- The Evolution (AlphaEvolve): This is the cool part. Imagine a room full of 100 different AI "inventors." They all write slightly different versions of the code. The ones that win get to "reproduce" (their code is copied and slightly mutated), while the losers are thrown out. Over many rounds, the "survival of the fittest" creates a super-strategy that is both smart and readable.
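The loop above can be sketched in a few lines. The names, parameters, and selection scheme below are assumptions made for clarity; this is a generic survival-of-the-fittest skeleton, not the paper's actual AlphaEvolve system (which mutates LLM-written program text rather than numbers):

```python
import random

def evolve(initial_pop, fitness, mutate, generations=20, survivors=10):
    """Minimal sketch of the evolutionary workshop described above.

    `fitness` scores a candidate strategy; `mutate` returns a
    slightly changed copy of one. Winners survive unchanged, and
    their mutated copies refill the population each round.
    """
    population = list(initial_pop)
    for _ in range(generations):
        # Keep the best candidates ("survival of the fittest")...
        population.sort(key=fitness, reverse=True)
        winners = population[:survivors]
        # ...and refill the population with mutated copies of them.
        offspring = [mutate(random.choice(winners))
                     for _ in range(len(initial_pop) - survivors)]
        population = winners + offspring
    return max(population, key=fitness)
```

In AlphaEvolve the "candidates" are whole programs and the "mutations" are LLM edits to the code, but the selection pressure works the same way.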
3. Why This Matters: The "Glass Box" Advantage
The biggest win here is Interpretability.
- The Black Box Problem: In traditional AI, if a self-driving car crashes, we often can't explain why. The neural network just "felt" like turning left.
- The CSRO Solution: Because CSRO outputs code, the strategy is transparent.
- Example from the paper: In the Poker game, the AI wrote code that explicitly calculated: "If the opponent always calls, I should stop bluffing and only bet with strong hands."
- You can read that line of code and instantly understand the logic. It's not magic; it's a clear, logical algorithm.
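A policy fragment in that spirit might look like the following sketch; the function name, parameters, and thresholds are invented for illustration, not quoted from the paper's generated code:

```python
def bet_decision(hand_strength, opponent_fold_rate):
    """Hypothetical readable policy fragment, echoing the paper's
    Leduc Poker example. All names and numbers are illustrative.
    """
    # If the opponent almost never folds, bluffing is wasted money:
    # only bet when our hand is genuinely strong.
    if opponent_fold_rate < 0.1:
        return "bet" if hand_strength > 0.7 else "check"
    # Against an opponent who does fold to pressure, betting more
    # marginal hands (semi-bluffing) becomes profitable.
    return "bet" if hand_strength > 0.4 else "check"
```

The point is not the specific thresholds; it is that the reasoning ("stop bluffing against a calling station") sits right there in the source, where a human can audit or override it.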
4. The Results: Smart and Efficient
The researchers tested this on Rock-Paper-Scissors and Leduc Poker (a simplified poker game).
- Performance: The AI-generated code performed just as well as, or better than, the traditional "black box" robots.
- Efficiency: Traditional robots need to play millions of games to learn. CSRO generates a strategy by "thinking" and writing code, which is much faster and requires fewer game simulations.
- Creativity: The AI didn't just copy human strategies. It invented new ones, like a Rock-Paper-Scissors bot that uses a "Theory of Mind" (it thinks: "My opponent thinks I will play Rock, so they will play Paper, so I should play Scissors...").
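That nested "they think that I think..." reasoning can be sketched as a tiny function (an illustration of the idea, not the paper's generated code):

```python
def theory_of_mind_move(expected_of_me, depth):
    """Sketch of iterated 'theory of mind' reasoning in
    Rock-Paper-Scissors (illustrative names and structure).

    Start from the move the opponent expects me to play, then apply
    `depth` rounds of "they'll counter that, so I counter the counter".
    """
    beats = {"rock": "paper", "paper": "scissors", "scissors": "rock"}
    move = expected_of_me
    for _ in range(depth):
        move = beats[move]  # one more level of "so they'll play..."
    return move
```

With `depth=2`: "They expect rock, so they'll play paper, so I play scissors", matching the chain of reasoning quoted above.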
Summary Analogy
Imagine you are hiring a team to design a new car engine.
- Traditional RL (Deep Learning): You hire a team that builds a prototype, breaks it, fixes it, breaks it again, and eventually, after 10 years, they have a working engine. But the engine is made of a solid block of metal; you can't see the gears inside, so you can't explain how it works.
- CSRO (This Paper): You hire a team of engineers who are also great writers. They design the engine, but instead of building a solid block, they write a blueprint (the code) that explains every gear, spring, and piston. You can read the blueprint, understand the logic, and even tweak a gear if you want to. And surprisingly, their engine runs just as fast as the solid block one.
In short: CSRO turns the "magic" of AI strategy into a "manual" that humans can read, understand, and trust.