Imagine you are trying to solve a mystery, but you don't have a detective to check the clues. Instead, you decide to ask 100 different people for their opinions. You figure that if 90 of them say the same thing, they must be right. This is the idea behind "Crowd Wisdom": the belief that a group of imperfect people, when combined, can cancel out individual mistakes and reveal the truth.
This paper argues that this strategy breaks down when applied to Large Language Models (LLMs), like the AI assistants you might use today.
Here is the breakdown of why, using some simple analogies:
1. The "Echo Chamber" Problem
In a real crowd of humans, people have different life experiences. If you ask 100 people about the capital of France, some might guess, but they won't all guess the same wrong answer by accident. Their errors are random, so the right answer usually wins out.
But AI models are different. They are all trained on the same massive piles of internet data (like Wikipedia, Reddit, and news sites). They are taught by similar methods and optimized to do similar things.
- The Analogy: Imagine asking 100 students who all studied from the exact same textbook, which happened to have a typo on page 50. If you ask them a question based on that page, they won't give you a variety of wrong answers. They will all confidently give you the same wrong answer.
- The Result: When you ask an AI for 100 different answers, you aren't getting 100 different opinions. You are getting the same opinion repeated 100 times, just with slightly different wording. The "crowd" is just an echo chamber.
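The difference between scattered errors and shared errors can be sketched in a few lines of Python. (The voter functions and probabilities here are illustrative assumptions, not figures from the paper.)

```python
import random
from collections import Counter

random.seed(42)

def majority(votes):
    """Return the most common answer in a list of votes."""
    return Counter(votes).most_common(1)[0][0]

def independent_crowd(n=100, p_correct=0.4):
    # Human-like crowd: each wrong voter picks B, C, or D at random.
    # Errors are scattered, so "A" (correct) is still the plurality winner
    # even though only ~40% of voters are right.
    return ["A" if random.random() < p_correct
            else random.choice("BCD") for _ in range(n)]

def echo_chamber_crowd(n=100, p_correct=0.4):
    # LLM-like crowd: every wrong voter repeats the SAME answer "B"
    # (the shared misconception), which now outvotes the truth.
    return ["A" if random.random() < p_correct else "B" for _ in range(n)]

print(majority(independent_crowd()))   # usually "A" (scattered errors cancel)
print(majority(echo_chamber_crowd()))  # usually "B" (shared errors pile up)
```

Both crowds are right only 40% of the time per voter; the only thing that changes is whether the mistakes are independent or shared.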
2. The "Confident Fool"
The paper tested a common trick: asking the AI, "How sure are you?" The hope was that if the AI is very confident, it's probably right.
- The Analogy: Think of a student who memorized the wrong answer but is very loud and confident about it. In a classroom, their confidence might convince the teacher they are right.
- The Result: The paper found that AI models are often very confident when they are wrong. Because they are trained to sound helpful and agreeable, they will confidently repeat a shared misconception. Asking for "confidence" doesn't help filter out the truth; it just amplifies the loudest (and potentially wrong) voice.
3. The "Random String" Test
To prove that the models were just "thinking alike" because of their training, the researchers did a crazy experiment. They gave the models a string of random, nonsense characters (like "gP%!mdq4k'") and asked them to pick an answer (A, B, C, or D).
- The Logic: There is no "truth" here. It's pure nonsense. If the models were truly independent, their answers should be random and scattered.
- The Result: Even with nonsense input, the models still agreed with each other more often than chance would predict. This shows that the models carry shared "biases" baked into how they were built (their architecture and weights). They have a shared "gut feeling" that isn't based on facts, but on their common training.
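Here is a rough simulation of what "agreeing more than chance" means, under the assumption that every model leans toward option "A" on meaningless input. (The 55% bias figure is made up for illustration; with 4 options, pure chance agreement between two guessers is 25%.)

```python
import random
from itertools import combinations

random.seed(7)

def biased_answer(p_bias=0.55):
    # Hypothetical shared bias: every model leans toward "A" on nonsense
    # input instead of guessing uniformly over A/B/C/D.
    return "A" if random.random() < p_bias else random.choice("BCD")

def agreement_rate(n_models=10, n_prompts=500):
    # Fraction of model pairs giving the same answer to the same
    # nonsense prompt, averaged over many prompts.
    agree = total = 0
    for _ in range(n_prompts):
        answers = [biased_answer() for _ in range(n_models)]
        for a, b in combinations(answers, 2):
            total += 1
            agree += (a == b)
    return agree / total

rate = agreement_rate()
print(f"observed agreement: {rate:.2f} (chance for 4 options: 0.25)")
```

If the agreement rate lands well above 0.25, the "voters" are not independent, which is the signature the researchers were looking for.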
4. The "Math vs. Opinion" Difference
The paper acknowledges that "asking the crowd" does work in math or coding.
- The Analogy: If you ask 100 people to solve 2 + 2, and you have a calculator to check the answers, you can easily throw out the 99 people who said "5" and keep the one who said "4". The calculator is the Verifier.
- The Problem: In real life, many questions (like "What will the economy look like in 2030?" or "Is this news story true?") don't come with a calculator. You can't run a code check to see whether an opinion is right. Without that external check, the AI's "crowd" just reinforces its own mistakes.
The Big Takeaway
The authors conclude that more computing power does not equal more truth if you don't have a way to verify the answer.
- If you have a verifier (like a math checker): Asking the AI to try 1,000 times is great. It gives you 1,000 chances to find the one right answer.
- If you have NO verifier (like asking about facts or opinions): Asking the AI to try 1,000 times is useless. It just gives you the same wrong answer 1,000 times, but with 1,000 times more confidence.
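The two regimes above can be sketched with a toy solver that is right only 5% of the time and otherwise repeats the same shared wrong answer. (All numbers and function names here are illustrative, not from the paper.)

```python
import random
from collections import Counter

random.seed(1)

def noisy_solver():
    # Toy model: right 5% of the time on a hard problem; the other 95%
    # it repeats the crowd's shared wrong answer. Pretend task: 2 + 2.
    return 4 if random.random() < 0.05 else 5

def verifier(answer):
    # External check, e.g. a calculator or a unit test.
    return answer == 4

samples = [noisy_solver() for _ in range(1000)]

# WITH a verifier: keep the first sample that passes the check.
# 1,000 tries give 1,000 chances to hit the one right answer.
with_verifier = next((a for a in samples if verifier(a)), None)

# WITHOUT a verifier: majority vote just amplifies the shared mistake.
without_verifier = Counter(samples).most_common(1)[0][0]

print(with_verifier)     # 4: one verified sample is enough
print(without_verifier)  # 5: the crowd confidently agrees on the wrong answer
```

Same model, same 1,000 samples; the only difference between finding the truth and amplifying the error is the external check.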
In short: You cannot fix a broken compass by asking 100 broken compasses to point in the same direction. If they are all broken in the same way, they will all point to the wrong North, and the group will be wrong together. To find the truth, you need an external map (a verifier), not just a bigger crowd.