Imagine a giant, noisy talent show where thousands of AI models (like digital chefs or writers) are constantly competing against each other. The audience votes on which model did a better job in head-to-head matchups. Based on these votes, a leaderboard is created, telling everyone which AI is the "best."
This is how platforms like Chatbot Arena work. They are popular because they feel organic—real people voting on real outputs. But, as this paper explains, there's a sneaky loophole in the system that allows companies to cheat, and the authors have designed a new rulebook to stop them.
Here is the breakdown of the problem and the solution, using simple analogies.
The Problem: The "Clone Army" Loophole
The Setup:
Imagine you are a baker entering a competition to be named the "Best Baker." You have one amazing cake recipe. The judges vote on your cake against others. Because there are only so many judges, the results can be a little random. Maybe your cake gets a few lucky votes and jumps up the leaderboard.
The Cheat:
Now, imagine you can submit five identical copies of that same cake, but you give them slightly different names (e.g., "Cake A," "Cake B," "Cake C").
- The Lottery Ticket Effect: Instead of one chance to get lucky, you now have five. Even if the judges vote fairly on average, statistically one of your clones is likely to catch a few extra lucky votes and shoot up the leaderboard (the simulation sketch after this list shows the effect in numbers).
- The New Competitor Effect: By adding more cakes, you change the math for everyone else too. The other bakers now face your entries in more of their head-to-head matchups, which can drag their scores down even though their cakes haven't changed.
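A minimal Monte Carlo sketch makes the lottery ticket effect concrete. Everything here is an illustrative assumption (the true rating, the noise level, and the idea that each clone's leaderboard score is its true skill plus independent vote noise), not numbers from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_SKILL = 1200.0   # assumed "true" rating of the one real model
NOISE_STD = 30.0      # assumed rating noise from a finite number of votes
TRIALS = 100_000

for k in [1, 2, 5, 20]:
    # Each trial: submit k identical clones; each gets an independent
    # noisy rating estimate. The leaderboard sees the best of them.
    estimates = TRUE_SKILL + NOISE_STD * rng.standard_normal((TRIALS, k))
    best = estimates.max(axis=1)
    print(f"k={k:2d} clones -> average best rating: {best.mean():7.1f}")
```

With these assumed numbers, the average top clone climbs dozens of points above the true 1200 as k grows, even though every copy is the identical model: the maximum of several noisy estimates is biased upward, and that bias is the whole cheat.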
The Result:
The paper proves that in the current system, companies can flood the arena with "clones" (slightly tweaked versions of the same model) to artificially boost their ranking. It's like a baker entering the contest 50 times to guarantee they win, even if their cake isn't actually the best. This makes the leaderboard useless because the top spot is just a reflection of who has the most clones, not who has the best AI.
The Solution: "You Rank, We Rank" (YRWR)
The authors propose a new mechanism called "You Rank, We Rank" (YRWR). It's a clever way to stop the cheating without needing to know exactly which models are clones, something that is technically very hard to detect.
How it works:
- The Old Way: You submit your models. The system runs the votes and gives you a score. If you submit 5 clones, the system treats them as 5 separate competitors.
- The New Way (YRWR):
  - You still submit your models.
  - But now you also have to submit a ranking of your own models, telling the system, "This is my best model, this is my second best," and so on.
  - The Magic Rule: The system looks at your list. If you submitted 5 clones of the same model, it forces all of them to share the score of the one you yourself ranked lowest (the code sketch after this list shows one way such a rule could work).
  - The Analogy: Imagine you bring 5 identical cakes to the contest and tell the judges, "Cake A is my best, and Cake E is my worst." The judges then say, "Okay, since you admitted Cake E is your worst, all your cakes are now scored as if they were Cake E."
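Here is a minimal sketch of how such a rule could be implemented. The function name and the exact scoring rule are assumptions for illustration, not the paper's actual code: each model's final score is capped at the lowest raw score among itself and everything its provider declared to be worse, which collapses a set of clones onto the score of the one ranked last.

```python
def yrwr_scores(raw_scores, declared_order):
    """Apply a YRWR-style cap (illustrative sketch, not the paper's code).

    raw_scores:     dict mapping model name -> raw leaderboard score
    declared_order: the provider's own ranking, best first, worst last

    Each model's final score is the minimum raw score among itself and
    every model the provider ranked below it (a suffix minimum), so a
    set of clones all collapse onto the score of the one ranked last.
    """
    final = {}
    running_min = float("inf")
    # Walk the declared ranking from worst to best, carrying the minimum up.
    for name in reversed(declared_order):
        running_min = min(running_min, raw_scores[name])
        final[name] = running_min
    return final


# Usage: five clones of the same model, with noisy raw scores.
raw = {"Cake-A": 1235.0, "Cake-B": 1198.0, "Cake-C": 1221.0,
       "Cake-D": 1204.0, "Cake-E": 1189.0}
declared = ["Cake-A", "Cake-B", "Cake-C", "Cake-D", "Cake-E"]  # best -> worst

print(yrwr_scores(raw, declared))
# Every clone now scores 1189.0 -- the lucky 1235.0 outlier is gone.
```

Walking the declared ranking from worst to best with a running minimum is what removes the lottery ticket: a lucky outlier can no longer rise above the scores its own provider vouched were beneath it.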
Why this stops the cheating:
If you try to submit 5 clones to get lucky, the system forces you to "pick a winner" in advance. If you pick the wrong one (or if the random votes go against your top pick), your score gets dragged down by the "worst" one you admitted to.
- The Result: Submitting 5 clones gives you zero advantage over submitting just 1 (the simulation below makes this concrete). You can't game the system anymore, because it forces you to take responsibility for your own lineup.
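To make the "zero advantage" point concrete, here is an illustrative simulation (same assumed noise model as the earlier sketch, not the paper's experiment). Under the old system the cheater's headline score is the best of their clones; under a YRWR-style cap it collapses to the worst, so extra clones stop paying off:

```python
import numpy as np

rng = np.random.default_rng(1)
TRUE_SKILL, NOISE_STD, TRIALS = 1200.0, 30.0, 100_000

for k in [1, 5]:
    est = TRUE_SKILL + NOISE_STD * rng.standard_normal((TRIALS, k))
    old_headline = est.max(axis=1).mean()   # old system: best clone counts
    new_headline = est.min(axis=1).mean()   # YRWR-style cap: worst clone counts
    print(f"k={k}: old leaderboard ~{old_headline:6.1f}, "
          f"YRWR ~{new_headline:6.1f}")
```

With these assumed numbers, five clones inflate the old headline by roughly 35 rating points, while under the cap they lose roughly the same amount: cloning goes from a free boost to a liability.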
Does it ruin the rankings?
You might worry: "If companies lie about their own rankings, won't the whole system break?"
The paper shows that:
- It's Robust: Even if companies lie a little bit about their own rankings, the system still works much better than the old one.
- It's More Accurate: Surprisingly, forcing companies to rank their own models makes the system better at finding the true best models. A company knows its own lineup from internal testing, so its declared ranking works like a second opinion that helps filter the noise out of the public votes.
- The "Uncertainty" Fix: The authors even added a safety feature. If a company is truly unsure which of their models is better, the system can ignore their ranking if there is enough data to prove who is actually better. This prevents companies from accidentally messing up the results if they are just guessing.
The Big Picture
Think of the current AI ranking system as a lottery where you can buy as many tickets as you want (clones) to guarantee a win.
The new system (YRWR) is like a tournament where you have to put your name on the team sheet and say who your captain is. If you try to sneak in 50 players who are all the same person, the rules force you to treat them all as your weakest player.
Why does this matter?
Right now, investors, companies, and researchers are making billion-dollar decisions based on these leaderboards. If the leaderboards are rigged by "clone armies," we might be investing in the wrong AI or ignoring the truly brilliant ones. This new mechanism ensures that the "Best AI" title actually goes to the best AI, not the one with the most marketing tricks.