Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets

Imagine you are trying to solve a very tricky puzzle, like a complex riddle that requires connecting three different clues to find the answer. You have two ways to tackle this:

The Lone Genius (Single-Agent): One smart person sits at a desk, thinks hard, writes down their entire thought process, and solves it alone.
The Committee (Multi-Agent): A group of people sits around a table. They pass notes back and forth, debate each other, split the work, and try to solve it together.

For a long time, it was widely assumed that the Committee was better because it had more "brainpower" and could discuss ideas. But this new paper asks a crucial question: Is the Committee actually smarter, or is it just using more paper and ink?

The Big Discovery: It's About the Budget

The researchers realized that in previous studies, the Committee was almost always allowed to use way more paper (tokens) than the Lone Genius. The Committee could write 10 pages of notes while the Genius was only allowed 1 page. Of course, the Committee looked better! They had more space to think.

This paper forced them to use the exact same amount of paper (a fixed "thinking budget"). When they did this, the results were surprising:

The Lone Genius almost always won or tied with the Committee.

Why? The "Telephone Game" Analogy

The authors use a concept from information theory to explain this. Imagine you are playing the Telephone Game (where a message is whispered from person to person).

The Lone Genius hears the whole story once, thinks about it, and writes the answer. The information stays fresh and complete in their head.
The Committee has to whisper the story from Person A to Person B, then to Person C. Every time they pass a note, a little bit of the message gets lost, distorted, or forgotten. Even if they are all very smart, the act of passing the message around introduces "noise."

The paper argues that unless the Committee has a massive advantage (like a much bigger budget to spend on talking), the Lone Genius is more efficient because no information is lost in the middle of the conversation.

When Does the Committee Win?

The paper found one specific situation where the Committee shines: When the puzzle is messy.

Imagine the Lone Genius is trying to read a book, but someone has spilled coffee on the pages, torn out half the text, or written random nonsense words over the clues. The Genius gets confused and can't find the answer.

In this case, the Committee can help. Because they are splitting the work, one person can focus on cleaning up the coffee stains, another can ignore the nonsense words, and a third can double-check the facts. They can "filter out" the mess better than one person trying to do it all at once.

The Lesson: If the information is clear, a single smart brain is best. If the information is messy or broken, a team can sometimes fix it.

The "Hidden Trick" in the Results

The researchers also found a funny glitch in how some AI systems (specifically Google's Gemini) report their work.

When asked to "think for 10,000 steps," the Lone Genius would often stop writing after 300 steps, even though the computer said it used 10,000.
The Committee, however, would actually write out all 10,000 steps because they were passing notes between different "people."

This made the Committee look like they were doing much more work than they actually were. The paper suggests that many previous studies claiming "Teams are better" were actually just measuring "Teams who got to use more paper."

The Takeaway

If you want to solve a reasoning problem efficiently:

Don't assume more people = better results. Often, a single, focused mind is more efficient if given the same resources.
Watch out for "fake" effort. Just because an AI says it "thought" a lot doesn't mean it actually processed more information.
Use a team only when things are messy. If the data is noisy or confusing, a team structure can help filter out the bad information.

In short: For clear thinking, one smart brain is usually better than a committee of brains passing notes, unless the puzzle itself is so messy that dividing up the cleanup is what keeps the solver from getting lost.

1. Problem Statement

Recent advancements in Large Language Models (LLMs) have popularized Multi-Agent Systems (MAS), where multiple agents collaborate (via planning, debate, or role-playing) to solve complex tasks. However, empirical comparisons between MAS and Single-Agent Systems (SAS) are often confounded by unequal test-time computation. MAS architectures typically consume significantly more tokens due to multiple interaction rounds, longer reasoning traces, and inter-agent communication.

The core question addressed by this paper is: When computational resources (specifically "thinking tokens") are strictly normalized, do multi-agent architectures offer inherent advantages over single-agent systems for multi-hop reasoning, or are their reported gains simply artifacts of increased compute?

2. Methodology

A. Theoretical Framework (Information Theory)

The authors ground their hypothesis in the Data Processing Inequality (DPI).

Premise: Let $Y$ be the correct answer, $C$ be the full context, and $M$ be the messages passed between agents in a MAS. Because $M$ is computed from $C$, the three variables form a Markov chain $Y \to C \to M$, and DPI gives $I(Y; C) \ge I(Y; M)$.
Implication: A single agent with access to the full context $C$ has at least as much information about $Y$ available as any agent that sees only the compressed or summarized messages $M$. This is a bound on available information, not a guarantee about any particular model's accuracy.
Context Degradation: The authors introduce a degradation parameter $\alpha$. They predict that while SAS dominates under perfect context utilization, MAS may become competitive if the single agent's effective context utilization is degraded (e.g., due to noise, long-context confusion, or information loss), allowing structured multi-step pipelines to filter or recover information better than a single degraded pass.
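
The inequality in the premise can be checked numerically on a toy chain. The sketch below is illustrative only: the distributions `P_Y`, `P_C_GIVEN_Y`, and `P_M_GIVEN_C` are made-up values, not taken from the paper, and `mutual_information` is a hypothetical helper written for this example. It builds a small chain $Y \to C \to M$ in which the message depends on the answer only through the context, then compares $I(Y; C)$ with $I(Y; M)$.

```python
import math


def mutual_information(joint):
    """Return I(A;B) in bits for a joint distribution given as {(a, b): prob}."""
    p_a, p_b = {}, {}
    for (a, b), p in joint.items():
        p_a[a] = p_a.get(a, 0.0) + p
        p_b[b] = p_b.get(b, 0.0) + p
    return sum(
        p * math.log2(p / (p_a[a] * p_b[b]))
        for (a, b), p in joint.items()
        if p > 0
    )


# Hypothetical toy distributions (assumptions for illustration only).
P_Y = {0: 0.5, 1: 0.5}  # the answer is one of two values
P_C_GIVEN_Y = {  # the context is a noisy view of the answer
    0: {0: 0.4, 1: 0.4, 2: 0.1, 3: 0.1},
    1: {0: 0.1, 1: 0.1, 2: 0.4, 3: 0.4},
}
P_M_GIVEN_C = {  # the message is a lossy summary that sees only the context
    0: {0: 0.9, 1: 0.1},
    1: {0: 0.9, 1: 0.1},
    2: {0: 0.1, 1: 0.9},
    3: {0: 0.1, 1: 0.9},
}

# Joint distribution of (Y, C).
joint_yc = {
    (y, c): P_Y[y] * pc
    for y, row in P_C_GIVEN_Y.items()
    for c, pc in row.items()
}

# Joint distribution of (Y, M), obtained by summing over the context C.
joint_ym = {}
for y, row in P_C_GIVEN_Y.items():
    for c, pc in row.items():
        for m, pm in P_M_GIVEN_C[c].items():
            joint_ym[(y, m)] = joint_ym.get((y, m), 0.0) + P_Y[y] * pc * pm

i_yc = mutual_information(joint_yc)
i_ym = mutual_information(joint_ym)
print(f"I(Y;C) = {i_yc:.3f} bits, I(Y;M) = {i_ym:.3f} bits")
```

Whatever values are chosen for `P_M_GIVEN_C`, the second quantity cannot exceed the first, because the message never sees the answer directly.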

B. Experimental Setup

Datasets: FRAMES and MuSiQue (specifically 4-hop questions), which require complex multi-step world knowledge reasoning.
Models: Three distinct model families:
- Qwen3-30B-A3B (Open-source)
- DeepSeek-R1-Distill-Llama-70B (Open-source)
- Gemini 2.5 (Flash and Pro versions)
Architectures Compared:
- SAS: A single pass with a "think step-by-step" prompt.
- SAS-L: A variant encouraging longer internal reasoning without changing the budget.
- MAS Variants:
  1. Sequential: Planner decomposes tasks; workers solve sequentially; aggregator synthesizes.
  2. Subtask-parallel: Independent subtasks solved in parallel.
  3. Parallel-roles: Specialized roles (Solver, Fact Extractor, Skeptic, etc.).
  4. Debate: Two agents answer and critique each other.
  5. Ensemble: Multiple independent answers with a judge.
Control Mechanism: All systems were evaluated under matched thinking-token budgets. The total number of tokens spent on intermediate reasoning, $B$, was held equal across architectures; prompts and final answers were excluded from the count.
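
One way to picture the matched-budget control is as an allocation problem: a single agent receives the whole budget $B$, while a multi-agent pipeline must divide the same $B$ among its calls. The summary does not say how the paper divides the budget among agents, so the even split below is an assumption, and `Call` and `allocate_budget` are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Call:
    role: str
    max_thinking_tokens: int


def allocate_budget(total_budget: int, roles: list[str]) -> list[Call]:
    """Split total_budget evenly across roles (an assumed policy).

    Any remainder is handed out one token at a time to the earliest roles,
    so the per-call limits always sum to exactly total_budget.
    """
    if not roles:
        raise ValueError("at least one role is required")
    if total_budget < 0:
        raise ValueError("total_budget must be non-negative")
    base, extra = divmod(total_budget, len(roles))
    return [
        Call(role, base + (1 if i < extra else 0))
        for i, role in enumerate(roles)
    ]


B = 2000  # example thinking-token budget
sas_calls = allocate_budget(B, ["solver"])
mas_calls = allocate_budget(B, ["planner", "worker_1", "worker_2", "aggregator"])

for call in sas_calls + mas_calls:
    print(call.role, call.max_thinking_tokens)
```

Under this policy the single agent may reason for 2000 tokens in one pass, while each of the four pipeline stages is limited to 500, which is the kind of comparison the matched-budget setup is meant to enforce.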

C. Evaluation Metrics

Accuracy: Evaluated using an LLM-as-a-judge with a fixed rubric to check for semantic equivalence to the ground truth.
Diagnostic Analysis: Detailed error analysis (e.g., "gold in thoughts" vs. final answer) and context degradation experiments (deletion, masking, substitution, distractors).
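
The four context degradation modes named above can be sketched as simple token-level corruptions. This is a minimal illustration under stated assumptions: `alpha` is treated as an independent per-token corruption probability, the mask string and filler vocabulary are invented, and `degrade_context` is a hypothetical function. The paper's actual procedure may differ.

```python
import random

MODES = ("deletion", "masking", "substitution", "distractors")


def degrade_context(tokens, alpha, mode, seed=0, mask="[MASK]", vocab=None):
    """Corrupt each token independently with probability alpha (assumed scheme).

    deletion     -> drop the token
    masking      -> replace the token with a mask string
    substitution -> replace the token with a random filler word
    distractors  -> keep the token and insert a random filler word after it
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    rng = random.Random(seed)  # seeded for reproducible corruption
    vocab = vocab or ["lorem", "ipsum", "dolor"]  # hypothetical filler words
    out = []
    for tok in tokens:
        hit = rng.random() < alpha
        if mode == "distractors":
            out.append(tok)
            if hit:
                out.append(rng.choice(vocab))
        elif not hit:
            out.append(tok)
        elif mode == "masking":
            out.append(mask)
        elif mode == "substitution":
            out.append(rng.choice(vocab))
        # mode == "deletion": a hit token is simply dropped
    return out


context = "the bridge was designed by an engineer born in Lyon".split()
for mode in MODES:
    print(mode, " ".join(degrade_context(context, alpha=0.3, mode=mode)))
```

Masking and substitution preserve the context length, deletion shortens it, and distractors lengthen it, so the modes stress a reader in different ways even at the same `alpha`.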

3. Key Contributions

Information-Theoretic Justification: Provided a formal argument using DPI suggesting that, under fixed budgets, multi-agent decompositions introduce communication bottlenecks: messages between agents can preserve or lose information about the answer, but never add to what a single agent with full context access already has.
Controlled Empirical Benchmark: Conducted a rigorous comparison across three model families and five MAS architectures, strictly normalizing for thinking tokens.
Identification of Evaluation Artifacts:
- Revealed significant discrepancies in API-based token accounting (specifically for Gemini), where reported thinking-token counts often far exceeded the visible reasoning text, which distorts budget-matched comparisons between SAS and MAS.
- Showed, through paraphrasing ablations, that performance on standard benchmarks may partly reflect memorization or overfitting.
Context Degradation Boundary: Identified the specific regime where MAS becomes competitive: when the single agent's ability to utilize long or noisy contexts is degraded.

4. Key Results

SAS Dominance under Fixed Budgets: Across the tested models and datasets, SAS matched or outperformed the MAS variants in nearly all settings when reasoning tokens were held constant and the context was left intact.
- Example: On MuSiQue with Qwen3, SAS achieved ~26-27% accuracy at 1k-2k tokens, while Sequential MAS hovered around 22-23%.
- Gemini Results: Even with Gemini 2.5-Pro (the strongest model), SAS generally outperformed or tied Sequential MAS at matched budgets.
Diminishing Returns: Increasing the thinking token budget beyond a certain point (e.g., 2k-5k tokens) yielded diminishing returns for both SAS and MAS, with some models exhibiting "over-thinking" or drift.
The "Context Degradation" Exception:
- When the context was artificially degraded (via masking or substitution of tokens), MAS (specifically Sequential) began to outperform SAS at high degradation levels ($\alpha = 0.7$).
- This is consistent with the theoretical prediction: MAS helps when a single reasoning trajectory is too noisy or corrupted to maintain coherence, allowing the multi-step pipeline to filter errors.
Gemini Accounting Artifacts: The study found that for Gemini models, the API-reported "thinking tokens" were often 4-5x higher than the actual visible text tokens. MAS pipelines, which make multiple API calls, accumulated more "visible" text than SAS, creating an illusion of deeper reasoning even when the actual compute was matched.
Paraphrasing Robustness: Deep paraphrasing of benchmark questions reduced performance for both SAS and MAS, suggesting that some baseline performance relies on memorization rather than pure reasoning.
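
The Gemini accounting artifact listed above comes down to a ratio between what the API reports and what is visible in the returned reasoning text. The sketch below shows that arithmetic only. Counting visible tokens by whitespace splitting is a crude stand-in for a real tokenizer, and `inflation_ratio` is a hypothetical helper, not part of any Gemini SDK.

```python
def inflation_ratio(reported_thinking_tokens: int, visible_text: str) -> float:
    """Ratio of API-reported thinking tokens to visible reasoning tokens.

    Visible tokens are approximated by whitespace-separated words, which is
    an assumption; a real audit would use the model's own tokenizer.
    """
    visible_tokens = len(visible_text.split())
    if visible_tokens == 0:
        raise ValueError("visible_text contains no tokens")
    return reported_thinking_tokens / visible_tokens


# Hypothetical example: 200 visible words against 900 reported tokens.
trace = " ".join(["step"] * 200)
print(f"{inflation_ratio(900, trace):.1f}x")
```

A ratio well above 1 means the reported budget overstates the visible reasoning, so two systems with equal reported counts may not have produced comparable amounts of inspectable text.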

5. Significance and Implications

Re-evaluating MAS Hype: The paper challenges the narrative that multi-agent systems are inherently superior for reasoning. It suggests that many gains reported in the literature are actually due to unaccounted compute (more tokens) rather than architectural benefits.
Efficiency: For multi-hop reasoning tasks, Single-Agent systems are the more efficient default when compute is constrained. They avoid the overhead of inter-agent communication and context fragmentation.
Best Practices for Evaluation: The authors emphasize the critical need for strict budget control in future research. Comparing systems without normalizing for "thinking tokens" leads to misleading conclusions.
When to Use MAS: The study clarifies that MAS is not universally inferior; it becomes a viable strategy specifically in degraded context regimes (e.g., extremely long contexts with noise) where a single agent struggles to maintain focus, or when additional compute is explicitly available and not budget-constrained.

In conclusion, the paper argues that for multi-hop reasoning, simplicity (SAS) often beats complexity (MAS) when resources are equal, and that the perceived superiority of multi-agent systems is frequently an artifact of increased computational expenditure rather than architectural innovation.