Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures

Imagine you are trying to organize a massive, chaotic kitchen to cook a complex banquet. You have 25,000 different recipes to test, and you have a team of 256 chefs. But here's the twist: these aren't human chefs. They are AI chefs (Large Language Models) who can read the entire menu in a second, change their specialty from "sushi" to "baking" instantly, and work for free.

The big question this paper asks is: How do you organize these AI chefs to get the best meal?

Do you put a strict head chef in charge who tells everyone exactly what to do? Do you let them all shout out ideas at once and hope they figure it out? Or is there a better way?

The researchers ran a massive experiment (25,000 tasks!) and found a surprising answer that flips traditional management on its head.

The Three Ingredients for Success

The paper argues that for a team of AI to work well, you need three things, but none of them is a pre-assigned job title.

A Mission: A clear goal (e.g., "Cook a 5-course meal").
A Protocol: A set of rules for how they talk to each other.
A Capable Model: A smart AI chef.

If you have a great chef but no rules, they get confused. If you have great rules but a dumb chef, they fail. You need both.

The "Endogeneity Paradox": The Goldilocks Solution

The researchers tested different ways to organize the team:

The Dictator (Centralized): One AI acts as the boss, assigns roles, and tells everyone what to do.
The Free-for-All (Fully Autonomous): Everyone shouts out what they want to do at the same time, with no order.
The Hybrid (Sequential): This is the winner.

The Winning Strategy (The Hybrid):
Imagine a sports draft.

The order of the chefs is fixed (Chef 1 goes first, then Chef 2, etc.). This is the only rule.
However, Chef 2 doesn't know what Chef 1 was told to do. Instead, Chef 2 sees exactly what Chef 1 actually cooked.
Based on that, Chef 2 decides: "Oh, Chef 1 already made the soup. I'm not good at soup, so I'll skip it and make the dessert." Or, "Chef 1 made a great soup, but I can make it even better, so I'll tweak it."

Why this wins:

The Dictator fails because the boss can't see everything and might give bad orders.
The Free-for-All fails because everyone tries to do the same thing (like 10 people making soup) while ignoring other tasks.
The Hybrid wins because everyone sees the actual results of the previous person. They can adapt instantly. They invent new roles on the fly (like "Sauce Specialist" or "Plating Artist") that no one told them to be.

The "Musician" vs. The "Sheet Music"

The paper uses a beautiful analogy:

The AI Model is the Musician.
The Protocol is the Sheet Music.

If you have a world-class orchestra (a smart AI) but no sheet music (no protocol), they play a mess. If you have perfect sheet music but a band of beginners (a weak AI), they still sound bad.

The Finding: The "Sheet Music" (the protocol) matters just as much as the "Musician." In the experiments, the Hybrid scored 44% higher in quality than the Free-for-All and 14% higher than the Dictator.

Surprising Behaviors (The Magic Happens Here)

When the researchers let the AIs use this "Hybrid" method, several behaviors emerged that nobody had explicitly programmed:

Voluntary Quitting: If an AI chef realizes, "I'm not good at this specific dish," they will voluntarily step aside and let someone else do it. This saves time and money.
Role Invention: The AIs didn't stick to "Chef" or "Waiter." They invented 5,000+ unique roles like "Flavor Balancer" or "Safety Inspector" just for that specific task.
Self-Healing: If you randomly remove a chef from the team, the remaining chefs instantly reorganize and fix the problem without panicking.

The "Too Big" Problem

The researchers tried scaling up from 4 chefs to 256.

Good News: The quality of the food didn't drop. The system stayed stable.
Bad News: Adding more chefs beyond 64 didn't make the food better. It just cost more.
Lesson: A bigger kitchen isn't automatically a better kitchen. By around 64 chefs you already get the full benefit, and the extra chefs mostly add cost.

Cheap vs. Expensive

They tested expensive, closed-source AIs (like Claude) against cheaper, open-source ones (like DeepSeek).

The cheap AI achieved 95% of the quality of the expensive one.
But it cost 24 times less.
Takeaway: You don't need the most expensive "musician" if you have the right "sheet music."

The Bottom Line for Humans

If you are building a team of capable AI agents, stop assigning them job titles.

Don't say: "You are the Researcher, you are the Writer."
Do say: "Here is the goal. Here is the rule: Watch what the person before you did, then decide what you should do next."

Give them a mission, a smart brain, and a simple rulebook, and they will organize themselves into a self-healing, efficient team. The paper calls this the "Endogeneity Paradox": the best structure is neither fully imposed from the outside nor left entirely to chance. It grows from the inside once you supply a minimal scaffold and the right conditions.

1. Problem Statement

Current multi-agent Large Language Model (LLM) systems typically rely on exogenous coordination, where humans pre-assign fixed roles, rigid hierarchies, and centralized task allocation (e.g., ChatDev, MetaGPT). This approach assumes agents function like human workers with fixed specializations. However, LLM agents possess fundamentally different capabilities: they can instantaneously change specialization, process full organizational context, and operate at zero marginal cost when idle.

The paper addresses a critical gap: What coordination architecture best balances solution quality, cost, scalability, and resilience? Specifically, it investigates whether maximal external control (centralized) or maximal agent autonomy (fully decentralized) yields optimal results, or if a hybrid approach is superior.

2. Methodology

The authors present the study as the largest systematic computational experiment on multi-agent LLM coordination to date, comprising 25,000+ task runs across 20,810 unique configurations.

Scope:
- Models: 8 LLMs (4 closed-source: Claude Sonnet 4.6, GPT-5.4, GPT-4o, GPT-4.1-mini; 4 open-source: DeepSeek v3.2, GLM-5, etc.).
- Scale: Systems ranging from 4 to 256 agents.
- Tasks: 4 complexity levels (L1: Single-domain to L4: Adversarial multi-stakeholder).

Coordination Protocols: The authors evaluated a spectrum from exogenous to endogenous:
1. Coordinator (Centralized): An external agent assigns roles and phases; others execute in parallel.
2. Sequential (Hybrid): Agents process tasks in a fixed order. Each agent observes completed outputs of predecessors and autonomously selects its role, decides to participate or abstain, and contributes.
3. Broadcast (Signal-based): Agents broadcast intentions simultaneously, then decide.
4. Shared (Fully Autonomous): Agents access shared memory and decide simultaneously without fixed ordering.
- Note: Four bio-inspired protocols were also tested but are detailed in a companion paper.

Evaluation:
- Quality ($Q_t$): Measured via an independent LLM-as-a-judge (GPT-4o/GPT-5.4) across 5 dimensions: Accuracy, Completeness, Coherence, Actionability, and Mission Relevance.
- Metrics: Solution quality, execution time, token cost, risk, and a composite Balance Index.
- Resilience: Tested via random agent removal, hub removal, and model substitution.
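
The Sequential protocol described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Contribution`, `Policy`, `run_sequential`, and `make_stub_policy` are hypothetical, and a simple rule stands in for the LLM call that would normally decide each agent's role. The sketch only shows the control flow: a fixed order, visibility of completed predecessor outputs, self-selected roles, and voluntary abstention.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Contribution:
    agent_id: int
    role: str    # role the agent chose for itself
    output: str  # completed work, visible to all later agents


# A policy maps (mission, completed contributions so far) to a
# (role, output) pair, or to None when the agent abstains.
# In a real system this would wrap an LLM call (assumption).
Policy = Callable[[str, list[Contribution]], Optional[tuple[str, str]]]


def run_sequential(mission: str, policies: list[Policy]) -> list[Contribution]:
    """Run agents in a fixed order; each sees only finished predecessor outputs."""
    completed: list[Contribution] = []
    for agent_id, policy in enumerate(policies):
        # Pass a copy so an agent cannot rewrite the shared history.
        decision = policy(mission, list(completed))
        if decision is None:
            continue  # voluntary self-abstention
        role, output = decision
        completed.append(Contribution(agent_id, role, output))
    return completed


def make_stub_policy(skill: str) -> Policy:
    """Stand-in for an LLM: contribute `skill` unless a predecessor already did."""
    def policy(mission: str, completed: list[Contribution]):
        if any(c.role == skill for c in completed):
            return None
        return skill, f"{skill} work for: {mission}"
    return policy


policies = [make_stub_policy(s) for s in ["analyst", "analyst", "reviewer"]]
result = run_sequential("draft a market entry plan", policies)
print([(c.agent_id, c.role) for c in result])  # [(0, 'analyst'), (2, 'reviewer')]
```

In this toy run the second agent abstains because the first has already covered its role, which mirrors the self-abstention behavior the paper reports.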

3. Key Contributions

The Endogeneity Paradox: The discovery that neither maximal control nor maximal autonomy is optimal. A hybrid protocol (Sequential) that imposes minimal structural scaffolding (fixed ordering) while allowing maximal role autonomy (self-selected specialization) significantly outperforms both centralized and fully autonomous systems.
Capability Threshold: Identification of a "capability threshold" for models. Self-organization only benefits systems with strong models (e.g., Claude, DeepSeek). Weaker models (e.g., GLM-5) perform better under rigid, fixed structures; for them, autonomy leads to performance degradation.
Emergent Phenomena: Documentation of spontaneous behaviors in self-organizing systems, including:
- Dynamic Role Invention: Agents reinvent roles for every task (Role Stability Index $\to$ 0), generating thousands of unique roles from a small agent pool.
- Voluntary Self-Abstention: Agents voluntarily opt out of tasks they deem outside their competence, optimizing system cost and quality.
- Shallow Hierarchy: Systems spontaneously form flat structures (max depth ~2.0) regardless of scale.
Constitutional Framework: Proposal of a "Three-Ring" governance model for autonomous organizations, separating immutable human values (Ring 1), human-system standards (Ring 2), and fully autonomous protocols (Ring 3).

4. Key Results

A. Protocol Performance (The Endogeneity Paradox)

Sequential vs. Shared: The hybrid Sequential protocol outperformed the fully autonomous Shared protocol by 44% in quality (Cohen's $d = 1.86$, $p < 0.0001$).
Sequential vs. Coordinator: At scale (16 agents, L3 tasks), Sequential outperformed the centralized Coordinator by 14% ($p < 0.001$).
Reasoning: Sequential succeeds because agents receive factual, completed outputs from predecessors, rather than changing intentions (Broadcast), stale history (Shared), or a single agent's plan (Coordinator).
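
For readers unfamiliar with the effect-size statistic quoted above, the following sketch computes Cohen's $d$ with a pooled standard deviation, alongside a relative quality gain. The pooled-variance form is the common textbook definition; whether the paper uses exactly this variant is an assumption. The score lists are made-up illustrative values, not data from the study.

```python
import math
from statistics import mean, variance


def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d using the pooled sample standard deviation of both groups."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def relative_gain(a: list[float], b: list[float]) -> float:
    """Fractional improvement of mean(a) over mean(b)."""
    return (mean(a) - mean(b)) / mean(b)


# Hypothetical quality scores for two protocols (NOT from the paper).
sequential_scores = [0.80, 0.85, 0.90]
shared_scores = [0.55, 0.60, 0.65]

d = cohens_d(sequential_scores, shared_scores)
gain = relative_gain(sequential_scores, shared_scores)
print(f"d = {d:.2f}, relative gain = {gain:.1%}")
```

By the usual rule of thumb, values of $d$ above 0.8 are read as large effects, so the reported $d = 1.86$ indicates that the two protocols' score distributions overlap very little.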

B. Scaling and Cost Efficiency

Sub-linear Scaling: Increasing agents from 64 to 256 resulted in no statistically significant quality degradation ($p = 0.61$).
Cost Optimization: At $N=256$, approximately 45% of agents self-abstained, demonstrating an endogenous cost-optimization mechanism.
Open-Source vs. Closed-Source: Open-source models (DeepSeek v3.2) achieved 95% of the quality of top-tier closed-source models (Claude Sonnet 4.6) at 24x lower cost.
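
The cost figures above can be turned into two quick back-of-the-envelope numbers. This is plain arithmetic on the reported percentages, and the function names are hypothetical. It assumes "24x lower cost" means cost divided by 24, and that abstaining agents contribute no work.

```python
def quality_per_cost_ratio(relative_quality: float, cost_reduction_factor: float) -> float:
    """Quality-per-unit-cost of the cheaper model relative to the expensive one."""
    return relative_quality * cost_reduction_factor


def active_agents(total_agents: int, abstention_rate: float) -> float:
    """Expected number of agents that actually contribute."""
    return total_agents * (1.0 - abstention_rate)


# 95% of the quality at 1/24 of the cost.
print(round(quality_per_cost_ratio(0.95, 24), 1))  # 22.8

# About 45% of 256 agents abstain.
print(round(active_agents(256, 0.45), 1))  # 140.8
```

Under these assumptions, the open-source model delivers roughly 23 times more quality per unit of cost, and only about 141 of the 256 agents do any work at the largest scale.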

C. Model Capability Threshold

Strong Models: For capable models (Claude, DeepSeek), free-form self-organization improved quality by ~3.5% over fixed roles.
Weak Models: For less capable models (GLM-5), self-organization reduced quality by ~9.6% compared to fixed roles, confirming that rigid structure is necessary below a certain capability threshold.
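
One practical reading of the threshold is to measure both modes for a given model and keep whichever scores higher. The sketch below applies that rule to the reported relative changes, using a normalized baseline of 1.0 for fixed roles. The function name and the normalization are assumptions made for illustration; the paper does not prescribe this selection procedure.

```python
def choose_role_mode(quality_fixed: float, quality_self_organized: float) -> str:
    """Pick the role-assignment mode with the higher measured quality."""
    if quality_self_organized > quality_fixed:
        return "self-organized"
    return "fixed-roles"


baseline = 1.0  # normalized quality under fixed roles (assumption)

# Strong model: self-organization is about 3.5% better than fixed roles.
strong_choice = choose_role_mode(baseline, baseline * (1 + 0.035))

# Weak model: self-organization is about 9.6% worse than fixed roles.
weak_choice = choose_role_mode(baseline, baseline * (1 - 0.096))

print(strong_choice, weak_choice)  # self-organized fixed-roles
```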

D. Emergent Properties

Role Fluidity: In a 16-agent system, 5,006 unique roles were generated across 10 tasks. Agents did not settle into fixed positions but adapted dynamically.
Resilience: Self-organizing systems recovered from perturbations (agent removal, model substitution) within 1 iteration, with larger systems healing faster.
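
A toy model can show why systems without pinned roles recover quickly from agent removal. In this sketch, which is an illustration and not the paper's test harness, any agent can take any subtask: each agent in order claims the first uncovered subtask or abstains. The names `run_pipeline` and `remove_random_agent` are hypothetical.

```python
import random


def run_pipeline(subtasks: list[str], agent_ids: list[int]) -> dict[str, int]:
    """Fixed-order pass: each agent claims the first uncovered subtask, else abstains."""
    assignment: dict[str, int] = {}
    for agent_id in agent_ids:
        uncovered = [s for s in subtasks if s not in assignment]
        if not uncovered:
            break  # everything is covered; remaining agents abstain
        assignment[uncovered[0]] = agent_id
    return assignment


def remove_random_agent(agent_ids: list[int], rng: random.Random) -> tuple[list[int], int]:
    """Return the surviving agents and the id of the removed one."""
    victim = rng.choice(agent_ids)
    return [a for a in agent_ids if a != victim], victim


subtasks = ["research", "draft", "review"]
agents = list(range(5))

before = run_pipeline(subtasks, agents)

rng = random.Random(0)  # seeded for reproducibility
survivors, victim = remove_random_agent(agents, rng)
after = run_pipeline(subtasks, survivors)

# One re-run is enough to cover every subtask again.
print(set(after) == set(subtasks))  # True
```

Because no subtask is tied to a specific agent, a single pass after the removal restores full coverage as long as enough agents remain. A fixed-role design would instead lose whichever subtask the removed agent owned.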

5. Significance and Implications

Paradigm Shift: The paper challenges the prevailing "fixed role" architecture in multi-agent systems. It argues that pre-assigning roles replicates human limitations onto AI agents, which are capable of zero-cost specialization switching.
Design Recipe: Practitioners should stop assigning roles and instead provide:
1. A clear Mission.
2. A capable Model (above the threshold).
3. The right Protocol (specifically, a hybrid like Sequential).
Economic Viability: The findings suggest that high-quality autonomous organizations can be built using cost-effective open-source models if paired with the correct coordination protocol, rather than relying solely on expensive proprietary models.
Future Governance: The proposed constitutional framework offers a theoretical basis for governing autonomous organizations, balancing human oversight with system autonomy.

In conclusion, the paper demonstrates that effective self-organization is multiplicative, requiring both a capable foundation model and a coordination protocol that provides minimal structural constraints while maximizing informational flow. The "Sequential" hybrid approach emerges as the superior architecture for scalable, resilient, and high-quality multi-agent systems.