Imagine you are trying to find the absolute best route to a hidden treasure in a massive, foggy maze. You have a guide (the Large Language Model, or LLM) who can suggest paths, but you don't have a map, a GPS, or a compass. In fact, you don't even know exactly where the treasure is until you stumble upon it.
This is the problem researchers faced when trying to improve AI answers for complex tasks like math problems or coding. Usually, you need a "scorecard" to tell the AI, "This answer is an 8 out of 10, try to get to a 9." But for many tricky tasks, creating a perfect scorecard is impossible, too expensive, or just doesn't exist.
Enter DUEL-EVOLVE. Think of it as a revolutionary way to navigate that foggy maze without a scorecard.
The Old Way: The Lonely Critic
Previous methods asked the AI directly, "On a scale of 1 to 10, how good is this answer?"
- The Problem: LLMs are unreliable absolute graders. The same model might say "8" today and "6" tomorrow for the very same answer. It's like asking a friend to rate a movie on a 1-to-10 scale: they're inconsistent, and they may not even agree with themselves on what the numbers mean.
The New Way: The Tournament (DUEL-EVOLVE)
Instead of asking for a number, DUEL-EVOLVE asks a much simpler question: "Between Answer A and Answer B, which one is better?"
This is like a boxing match or a tennis tournament. It's much easier for a human (or an AI) to say "Player A hit the ball better than Player B" than it is to say "Player A hit the ball with exactly 85% power."
Here is how DUEL-EVOLVE works, step-by-step, using a creative analogy:
1. The Arena (The Population)
Imagine a giant arena filled with hundreds of different "contestants" (candidate answers) generated by the AI. Some are terrible, some are okay, and a few are brilliant.
2. The Referee (The Self-Judge)
The AI acts as its own referee. It doesn't need an outside judge. It looks at two contestants, say "Candidate A" and "Candidate B," and declares a winner.
- Note: The referee isn't perfect. Sometimes it gets it wrong (noise). But if you watch enough matches, the truth starts to emerge.
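The "noisy but still useful referee" point is easy to see in a tiny simulation (the skill values and the 75% accuracy figure below are made up for illustration, not taken from the paper):

```python
import random

def noisy_referee(skill_a, skill_b, accuracy=0.75):
    """Declare a duel winner: right `accuracy` of the time, wrong otherwise.

    Returns True if the referee says A wins.
    """
    truly_better_a = skill_a > skill_b
    correct = random.random() < accuracy
    return truly_better_a if correct else not truly_better_a

# One verdict is unreliable, but the majority over many duels
# recovers the truth: A (skill 0.9) clearly beats B (skill 0.4).
random.seed(0)
verdicts = [noisy_referee(0.9, 0.4) for _ in range(1000)]
a_win_rate = sum(verdicts) / len(verdicts)  # ~0.75, well above 0.5
```

Any single verdict can be wrong, but as long as the referee is right more often than not, watching enough matches lets the true ranking emerge.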
3. The Scoreboard (The Bayesian Model)
This is the magic part. The system keeps a running tally of every match. It uses a special math trick (called a Bradley-Terry model) to turn all these messy "A beat B" and "B beat C" results into a global ranking.
- It's like a sports league table. Even if Team A hasn't played Team B yet, the table can estimate who is likely better based on who they've beaten and who they've lost to.
- Crucially, the system also tracks uncertainty. If it hasn't seen enough matches between two candidates, it knows it's not sure who is better yet.
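The league-table idea can be sketched with a minimal Bradley-Terry fit (a plain maximum-likelihood version via the standard iterative update; the paper's Bayesian model additionally tracks uncertainty, which this sketch omits):

```python
def bradley_terry(n_items, wins, iters=200):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[i][j] = number of times candidate i beat candidate j.
    Returns strengths normalized to sum to 1 (higher = better),
    even for pairs that never played each other directly.
    """
    p = [1.0] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            w_i = sum(wins[i])  # total wins for candidate i
            # Each matchup contributes n_games / (p_i + p_j) to the denominator.
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n_items) if j != i)
            new_p.append(w_i / denom if denom > 0 else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]
    return p

# Three candidates: 0 usually beats 1 and 2, 1 usually beats 2.
wins = [[0, 8, 9],
        [2, 0, 7],
        [1, 3, 0]]
strengths = bradley_terry(3, wins)  # strengths[0] > strengths[1] > strengths[2]
```

The fitted strengths give a single global ranking from messy, incomplete head-to-head records, exactly like a league table.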
4. The Strategy (Double Thompson Sampling)
The system has a limited budget of "matches" it can watch before time runs out. It needs to be smart about who it pits against whom.
- The Smart Move: It doesn't waste time watching the worst contestants fight each other. Instead, it uses the "uncertainty" data to pick the most promising matches. It asks, "Who are the top contenders that we aren't 100% sure about yet?"
- It also picks the best "parents" (the current winners) to generate new, improved contestants for the next round.
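The match-picking strategy can be sketched with Thompson sampling over per-candidate win records. (Note: this is a deliberately simplified per-candidate version; the actual double Thompson sampling algorithm from the dueling-bandits literature samples from posteriors over the pairwise preference matrix.)

```python
import random

def pick_duel(wins, losses):
    """Pick two candidates to duel via Thompson-style sampling.

    wins[i] / losses[i] count candidate i's past duel outcomes. Each
    Beta draw is a plausible guess at a candidate's skill, so candidates
    that are promising OR still uncertain get picked often, while
    clearly-bad candidates are rarely wasted a match on.
    """
    n = len(wins)
    # First draw selects one contender.
    draw1 = [random.betavariate(wins[i] + 1, losses[i] + 1) for i in range(n)]
    a = max(range(n), key=lambda i: draw1[i])
    # A second, independent draw selects a different opponent.
    draw2 = [random.betavariate(wins[i] + 1, losses[i] + 1) for i in range(n)]
    b = max((i for i in range(n) if i != a), key=lambda i: draw2[i])
    return a, b
```

A proven winner (many wins) and a total unknown (no record, wide posterior) both get matches; a proven loser almost never does, which is exactly how a limited duel budget should be spent.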
5. Evolution (The Loop)
The process repeats:
- Fight: The AI pits candidates against each other.
- Rank: The system updates the league table.
- Breed: The AI takes the best candidates from the table and asks, "Based on what made these winners win, can you create something even better?"
- Repeat: The new, improved candidates enter the arena.
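The whole fight-rank-breed-repeat loop can be sketched as toy Python. (Every `llm_*` function here is a hypothetical stand-in invented for this sketch: real candidates would be text answers judged and mutated by the model, not numbers; quality is modeled as a hidden score purely so the sketch runs.)

```python
import random

# Hypothetical stand-ins for LLM calls (names invented for this sketch).
def llm_generate(task):
    return random.random()                    # propose a fresh candidate

def llm_mutate(parent):
    return min(1.0, parent + random.gauss(0.05, 0.05))  # "breed" a refinement

def llm_judge(a, b):
    return (a > b) == (random.random() < 0.8)  # noisy self-judge: True if a wins

def duel_evolve(task, rounds=10, pop_size=8, duels_per_round=20):
    pop = [llm_generate(task) for _ in range(pop_size)]
    for _ in range(rounds):
        wins = [0] * pop_size
        for _ in range(duels_per_round):       # Fight
            a, b = random.sample(range(pop_size), 2)
            wins[a if llm_judge(pop[a], pop[b]) else b] += 1
        ranked = sorted(range(pop_size), key=lambda i: wins[i],
                        reverse=True)          # Rank
        parents = [pop[i] for i in ranked[:pop_size // 2]]
        pop = parents + [llm_mutate(random.choice(parents))  # Breed
                         for _ in range(pop_size - len(parents))]
    return max(pop)                            # best surviving candidate
```

Even with a referee that is wrong 20% of the time, the loop steadily concentrates the population around better and better candidates, with no external scorecard anywhere in sight.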
Why This is a Big Deal
The paper tested this on two very hard challenges:
- MathBench: Solving complex math word problems.
- LiveCodeBench: Writing code that passes hidden tests.
The Results:
- Math: DUEL-EVOLVE got 94% accuracy, beating the next best method by a huge margin (20 points!).
- Coding: It improved accuracy by over 12% compared to other advanced methods.
The Takeaway
The most amazing part is that DUEL-EVOLVE didn't need a teacher.
- It didn't need a human to grade the answers.
- It didn't need a pre-programmed "scorecard" for math or code.
- It didn't even need to know the correct answer during the search.
It just needed the AI to compare its own ideas against each other. By turning the AI into a tournament organizer, a referee, and a coach all at once, it managed to evolve its own intelligence to solve problems it couldn't solve on the first try.
In short: Instead of asking the AI "How good is this?", DUEL-EVOLVE asks "Which is better?" and lets the AI fight its way to the perfect answer.