TRINITY: An Evolved LLM Coordinator

Imagine you are trying to solve an incredibly difficult puzzle, like a complex math problem or writing a piece of software. You have a team of experts available to help you: a genius mathematician, a master coder, a creative writer, and a meticulous editor.

The problem is, you don't know which expert to ask, when to ask them, or how to ask them. If you just ask the mathematician to write code, they might struggle. If you ask the coder to solve a calculus problem, they might get stuck.

TRINITY is a new system designed to be the ultimate project manager for these AI experts.

Here is how it works, broken down into simple concepts:

1. The "Tiny Brain" Manager

Most AI systems try to make one giant brain that knows everything. But that's expensive and hard to build. TRINITY takes a different approach. It uses a very small, lightweight AI (called a "coordinator") that acts as the manager.

Think of this manager like a conductor in an orchestra. The conductor doesn't play the violin or the trumpet; they just listen to the music and tell the right musician when to play.

The Team: TRINITY connects to a pool of different large AI models (some open-source, some from big companies like OpenAI or Google).
The Manager: The coordinator is tiny (about 0.6 billion parameters, which is small for AI). It doesn't try to solve the problem itself. Instead, it looks at the question and decides: "Who is best for this?"

2. The Three Magic Hats

Instead of just picking an AI, TRINITY gives the chosen AI a specific "hat" to wear for that turn. There are three roles:

The Thinker (The Strategist): This AI doesn't do the work yet. It looks at the problem and says, "Okay, here is the plan. First, we need to break this down into three steps." It creates the roadmap.
The Worker (The Doer): This AI follows the plan. It does the actual math, writes the code, or drafts the essay.
The Verifier (The Inspector): This AI checks the work. It asks, "Is this correct? Did we miss anything? Is there a better way?" If the answer is good, it says "Done!" If not, it sends it back to the Thinker or Worker to fix.

This cycle repeats until the Verifier is happy with the answer.

3. How Does the Manager Learn? (The Evolutionary Trick)

Usually, to teach a manager, you show it thousands of examples of "Right Answer" vs. "Wrong Answer." But in this case, the "Right Answer" is hard to define because the problem is complex, and asking the AI to generate answers is expensive.

The researchers used a method called Evolutionary Strategy.

The Analogy: Imagine you are trying to find the highest peak in a foggy mountain range. You can't see the top.
- Old Way (RL/Gradient Descent): You try to feel the slope under your feet to decide which way to walk. But the ground is slippery and foggy, so you often walk in circles or fall off a cliff.
- TRINITY's Way (Evolution): You send out 32 different "explorers" (variations of the manager) in random directions. You see which one got the highest score. Then, you take the best traits of the successful explorers, mix them together, and send out a new, slightly better group. You repeat this.
Why it works: Because the manager is so small and the problem has a specific structure, this "trial and error" method finds the best strategy much faster and cheaper than traditional teaching methods.

4. The Results: Why It's a Big Deal

The paper tested TRINITY on hard tasks like coding, math, and general knowledge.

Beating the Giants: TRINITY, using a tiny manager, consistently beat the individual "super-expert" AIs on their own. It even beat the best existing methods that try to combine AI models.
New Records: On the LiveCodeBench coding benchmark, TRINITY scored 86.2%, which the paper reports as a new state of the art. That is higher than any single AI model it was compared against.
Generalization: Even when given a brand new type of problem it had never seen before (like a specific math competition), TRINITY figured out how to use its team effectively without needing to be retrained.

Summary

TRINITY proves that you don't need to build a single, massive, god-like AI to solve hard problems. Instead, you can build a smart, tiny manager that knows how to orchestrate a team of diverse experts, assigning them the right roles (Thinker, Worker, Verifier) to solve problems together. It's like realizing that a well-led team of specialists is often better than one overworked genius.

1. Problem Statement

The paper addresses the limitations of current strategies for combining Large Language Models (LLMs) to achieve superior performance:

Weight Merging (Micro-level): Techniques like model soups or evolutionary merging require access to model weights and compatible architectures. This excludes high-performing closed-source models (e.g., GPT-5, Claude) and is often impractical due to architectural mismatches.
Scaling Laws: Simply increasing model size and training tokens is resource-intensive and yields diminishing returns.
Existing Coordination (Macro-level): Current multi-agent routing methods often rely on static heuristics, expensive inference (e.g., Mixture of Agents), or fail to effectively leverage the complementary strengths of diverse models in a dynamic, multi-turn setting.

Core Challenge: How to orchestrate a pool of diverse, heterogeneous LLMs (both open and closed-source) at test-time without modifying their weights, using a lightweight mechanism that learns to assign roles and select agents dynamically based on the input context.

2. Methodology: TRINITY

TRINITY introduces a lightweight, adaptive coordination framework that operates as a "macro-level" fuser. It does not generate text itself but acts as a router and role-assigner.

A. Architecture

The system consists of two main components with a total of fewer than 20,000 learnable parameters:

Compact Coordinator (SLM): A pre-trained Small Language Model (0.6B parameters, specifically Qwen3-0.6B) used solely to extract contextual representations.
- Input: The full conversation transcript (query + history).
- Feature Extraction: The coordinator uses the penultimate token's hidden state ($h$) from the SLM. The authors hypothesize that this state contains rich contextual signals sufficient for decision-making.
- Fine-tuning: A small subset of the SLM's layers is adapted using Singular Value Fine-tuning (SVF), where only singular value scales are learned while orthogonal matrices remain fixed.
Lightweight Head: A small neural module (approx. 10K parameters) that takes the hidden state $h$ as input.
- Output: It produces logits for two simultaneous decisions:
  1. Agent Selection: Choosing one LLM from a pool of $L$ candidates.
  2. Role Assignment: Assigning one of three distinct roles: Thinker (T), Worker (W), or Verifier (V).
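The two decisions above can be illustrated with a toy head. This is a minimal sketch, not the authors' implementation: it assumes a bias-free linear layer, a hidden width of 1024, and a pool of 7 agents, all of which are illustrative choices that happen to land near the quoted ~10K parameters. The names `make_head` and `decide` are hypothetical, and the hidden states are random stand-ins for real SLM activations.

```python
import math
import random

HIDDEN_SIZE = 1024  # assumed hidden width of the SLM (illustrative)
NUM_AGENTS = 7      # assumed pool size L (illustrative)
ROLES = ("Thinker", "Worker", "Verifier")


def make_head(rng):
    """One weight row per logit: NUM_AGENTS agent logits, then 3 role logits."""
    rows = NUM_AGENTS + len(ROLES)
    return [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in range(rows)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(head, token_states):
    """Pick (agent index, role) from the penultimate token's hidden state."""
    h = token_states[-2]  # penultimate token, not the final one
    logits = [sum(w * x for w, x in zip(row, h)) for row in head]
    agent_probs = softmax(logits[:NUM_AGENTS])
    role_probs = softmax(logits[NUM_AGENTS:])
    agent = max(range(NUM_AGENTS), key=agent_probs.__getitem__)
    role = ROLES[max(range(len(ROLES)), key=role_probs.__getitem__)]
    return agent, role


rng = random.Random(0)
head = make_head(rng)
num_params = sum(len(row) for row in head)  # (7 + 3) * 1024 = 10240

# Random stand-in for the per-token hidden states of a 5-token transcript.
fake_states = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN_SIZE)] for _ in range(5)]
agent, role = decide(head, fake_states)
```

Under these assumptions the head has 10,240 weights. A head this small is what makes gradient-free search over its parameters plausible.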

B. The Tri-Role Protocol

The coordination happens over a multi-turn process (up to a fixed budget $B_{\text{turn}}$):

Thinker: Analyzes the query and decomposes the task into a high-level plan or strategy.
Worker: Executes the plan, performing concrete steps (e.g., coding, calculation, derivation).
Verifier: Evaluates the accumulated solution for correctness and completeness. If it accepts the solution, the process terminates; otherwise, it triggers a revision.
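The loop implied by these three roles might look like the sketch below. It is a hedged illustration only: `run_protocol`, `scripted_coordinator`, `stub_agent`, and the "ACCEPT" prefix used to signal acceptance are hypothetical, and the stub agents return canned strings where a real system would call an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

ROLES = ("Thinker", "Worker", "Verifier")


@dataclass
class Transcript:
    query: str
    # Each turn is (agent name, role, reply text).
    turns: List[Tuple[str, str, str]] = field(default_factory=list)


def run_protocol(
    query: str,
    coordinator: Callable[[Transcript], Tuple[str, str]],
    agents: Dict[str, Callable[[str, Transcript], str]],
    max_turns: int,
) -> Transcript:
    """Run turns until a Verifier accepts or the turn budget is spent."""
    transcript = Transcript(query)
    for _ in range(max_turns):
        agent_name, role = coordinator(transcript)
        reply = agents[agent_name](role, transcript)
        transcript.turns.append((agent_name, role, reply))
        # Assumed convention: a Verifier reply starting with "ACCEPT" ends the run.
        if role == "Verifier" and reply.startswith("ACCEPT"):
            break
    return transcript


def scripted_coordinator(transcript: Transcript) -> Tuple[str, str]:
    """Stand-in for the learned coordinator: cycles Thinker, Worker, Verifier."""
    role = ROLES[len(transcript.turns) % len(ROLES)]
    agent_name = "model_b" if role == "Worker" else "model_a"
    return agent_name, role


def stub_agent(role: str, transcript: Transcript) -> str:
    """Stand-in for an LLM call; returns a canned reply per role."""
    if role == "Thinker":
        return "Plan: add the two numbers."
    if role == "Worker":
        return "2 + 2 = 4"
    return "ACCEPT: the arithmetic checks out."


agents = {"model_a": stub_agent, "model_b": stub_agent}
result = run_protocol("What is 2 + 2?", scripted_coordinator, agents, max_turns=6)
role_sequence = [turn[1] for turn in result.turns]
```

Here the run stops after three turns because the stub Verifier accepts immediately. In the real system the coordinator chooses the agent and role each turn from the transcript rather than following a fixed cycle.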

C. Optimization via Evolutionary Strategy

Training the coordinator is challenging due to:

High Cost: Each training step requires running multiple LLM inferences (the "atomic evaluation").
Weak Signal: The reward (success/failure) is binary and sparse, making gradient-based methods like REINFORCE ineffective due to high variance and low signal-to-noise ratio.

Solution: The authors employ Separable Covariance Matrix Adaptation Evolution Strategy (sep-CMA-ES).

Why sep-CMA-ES? The optimization landscape exhibits block-$\epsilon$ separability, meaning parameters can be optimized in independent blocks with negligible inter-block interference.
Efficiency: Unlike full CMA-ES, the separable version maintains only a diagonal covariance matrix, making it computationally feasible for the ~10K parameter head under strict budget constraints (1.5k–40k evaluations).
Theoretical Advantage: The paper proves that under these conditions, sep-CMA-ES converges linearly with iterations, whereas Random Search (RS) converges only logarithmically, and Reinforcement Learning (RL) fails due to gradient instability.
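To make the diagonal-covariance idea concrete, here is a deliberately simplified evolution strategy. It is not sep-CMA-ES itself: it keeps one variance per coordinate and uses rank-weighted recombination, but it omits the evolution paths and step-size control of the real algorithm. The function name `diagonal_es`, the learning rate `var_lr`, and the smooth `toy_reward` are assumptions for illustration; the paper's reward is binary and comes from expensive LLM rollouts.

```python
import math
import random


def diagonal_es(objective, dim, pop_size=32, iterations=60, sigma0=0.5, seed=0):
    """Maximise `objective` with a simplified diagonal-covariance ES."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    var = [sigma0 ** 2] * dim  # one variance per coordinate (diagonal covariance)
    mu = pop_size // 4         # number of elite samples kept each generation
    raw = [math.log(mu + 0.5) - math.log(rank + 1) for rank in range(mu)]
    weights = [w / sum(raw) for w in raw]  # better ranks get larger weights
    var_lr = 0.3               # hypothetical learning rate for the variances

    for _ in range(iterations):
        # Sample a population around the current mean.
        population = [
            [rng.gauss(m, math.sqrt(v)) for m, v in zip(mean, var)]
            for _ in range(pop_size)
        ]
        population.sort(key=objective, reverse=True)
        elite = population[:mu]

        # Recombine the elite into a new mean.
        new_mean = [sum(w * x[d] for w, x in zip(weights, elite)) for d in range(dim)]

        # Update each coordinate's variance independently, measured from the old mean.
        for d in range(dim):
            spread = sum(w * (x[d] - mean[d]) ** 2 for w, x in zip(weights, elite))
            var[d] = (1.0 - var_lr) * var[d] + var_lr * spread
        mean = new_mean
    return mean


TARGET = [1.0, -2.0, 0.5, 3.0, -1.0]


def toy_reward(x):
    """Smooth stand-in for the coordinator's reward (higher is better)."""
    return -sum((xi - ti) ** 2 for xi, ti in zip(x, TARGET))


start_score = toy_reward([0.0] * len(TARGET))
best = diagonal_es(toy_reward, dim=len(TARGET))
final_score = toy_reward(best)
```

Storing a single variance per parameter keeps memory and update cost linear in the number of parameters, which is the property the separable variant relies on. A full covariance matrix would grow quadratically.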

3. Key Contributions

Lightweight Coordination Mechanism: Demonstrates that rich contextual signals from a 0.6B SLM's hidden states are sufficient for a tiny (10K parameter) head to effectively orchestrate diverse, state-of-the-art LLMs.
Efficient Training Methodology: Establishes theoretically and empirically that sep-CMA-ES is superior to RL, Imitation Learning (SFT), and Random Search for training coordinators under tight budget and binary reward constraints.
State-of-the-Art Performance: Achieves new records on major benchmarks, particularly in coding and reasoning, by dynamically synthesizing the strengths of multiple models.
Generalization: Shows robust zero-shot transfer to unseen tasks and domains without retraining.

4. Experimental Results

The authors evaluated TRINITY on four in-distribution benchmarks (Math500, MMLU, RLPR, LiveCodeBench) and four out-of-distribution tasks.

Performance:
- LiveCodeBench: Set a new state-of-the-art with a pass@1 of 86.2% (Jan–Apr 2025 questions), significantly outperforming GPT-5 (83.8%), Gemini 2.5-Pro (67.2%), and Claude-4-Sonnet (46.5%).
- General Benchmarks: Consistently outperformed all single-model baselines (even when those baselines were given 5x the token budget) and existing multi-agent routing methods (MoA, RouterDC, Smoothie).
- Error Reduction: Achieved a 21.9% relative error reduction over the second-best method on average across benchmarks.
Zero-Shot Transfer: TRINITY outperformed every individual model in the pool on held-out tasks (AIME, BigCodeBench, MT-Bench, GPQA-D), demonstrating emergent task-aware strategies.
Ablation Studies:
- Removing Singular Value Fine-tuning degraded performance, confirming the need to adapt the SLM's internal representation.
- Removing the Tri-Role structure (specifically the Thinker role) caused significant drops in reasoning tasks (Math500, RLPR).
- Using the last token (EOS) instead of the penultimate token for the hidden state caused a severe performance collapse, validating the hypothesis that the penultimate token carries the necessary context.
Separability Analysis:
- Hidden states were found to be linearly separable by task type (100% accuracy with Linear SVM), confirming that the SLM encodes task semantics effectively.
- The optimization objective was shown to possess block-$\epsilon$ separability, justifying the use of diagonal covariance in CMA-ES.
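For readers unfamiliar with the "relative error reduction" metric quoted under Performance, the arithmetic is sketched below using the two LiveCodeBench scores reported above. This only illustrates the formula on one benchmark against one model; it does not reproduce the 21.9% figure, which is an average across benchmarks against the second-best method. The helper name `relative_error_reduction` is hypothetical.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline's errors removed; accuracies are in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err


# GPT-5 at 83.8% vs. TRINITY at 86.2% on LiveCodeBench (figures from the text).
reduction = relative_error_reduction(83.8, 86.2)
print(f"{reduction:.1%}")  # prints 14.8%
```

A 2.4-point gain in accuracy therefore corresponds to removing roughly one in seven of the baseline's remaining errors on that benchmark.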

5. Significance and Future Work

Paradigm Shift: TRINITY suggests that the future of AI scaling may lie less in training monolithic models and more in engineering collaborative ecosystems where lightweight coordinators orchestrate specialized agents.
Cost Efficiency: By using a tiny coordinator and evolutionary training, it avoids the prohibitive costs of generating labels for Supervised Fine-Tuning (SFT) in multi-turn settings.
Limitations & Future: The current system can devise plans involving tools but cannot yet execute them (e.g., run code or call APIs). Future work aims to integrate heterogeneous agents, including code interpreters and external APIs, to bridge the gap between reasoning and grounded execution.

In summary, TRINITY proves that a minimal, evolutionarily trained coordinator can unlock the collective intelligence of diverse LLMs, achieving performance that surpasses even the strongest individual models in the pool.
