Imagine you are running a massive cooking competition. You have thousands of chefs (AI models) trying to create the perfect dish, but "perfect" is subjective. One judge might care about the salt, another about the presentation, and a third about the cooking time.

In the past, trying to grade these dishes was messy. Sometimes judges just wrote a vague note like "This tastes good," or they argued endlessly about why one dish was better than another. This paper introduces a new system called AsymmetryZero to fix that mess, and then tests two different ways to hire the judges.

Here is the breakdown in simple terms:

1. The Problem: The "Vague Judge" Trap

Currently, when we test AI, we often ask a super-smart AI to grade another AI's work. But if you just say, "Grade this essay," the grader might use its own hidden rules. It might like long answers, or it might get confused by the topic. It's like hiring a food critic who doesn't have a checklist; you never know if they're judging the food or just their mood.

2. The Solution: The "Evaluation Contract"

The authors created AsymmetryZero, which is basically a strict recipe for grading.

Instead of a vague prompt, every task comes with a "Contract." This contract is like a detailed scorecard that says:

What are we grading? (e.g., "Did the chef use salt?")
How do we check it? (e.g., "If the word 'salt' appears, give 10 points.")
Who decides? (A single judge or a group?)
What is the passing score?

This contract works for both simple AI (models that just write text) and complex AI agents (programs that use tools and take multiple steps). Because both are graded against the same contract, their scores are directly comparable.

3. The Experiment: The "Big Judges" vs. The "Small Judges"

The authors wanted to know: Do we need expensive, super-smart judges to grade these contracts, or can we use cheaper, smaller judges?

They set up a test with 75 complex tasks (like solving advanced math or coding problems). They used four different "contestant" AI models to solve the tasks. Then, they graded those solutions using two different groups of "Judge" AIs:

The Frontier Jury (The Big Judges): A panel of 5 of the most powerful, expensive, and smart AI models available.
The Compact Jury (The Small Judges): A panel of 5 smaller, cheaper, and faster AI models.

4. The Results: The "Cheaper Judges" Are Noisier

Here is what they found:

The Final Score is Similar: When you add up all the points, the "Big Judges" and the "Small Judges" usually agreed on who won the competition. If a task passed for the Big Judges, it usually passed for the Small Judges too.
The Details Are Messy: However, when you look at the individual steps (the specific criteria on the scorecard), the Small Judges' majority verdict differed from the Big Judges' roughly 10% to 24% of the time.
The "Finger-Pointing" Problem: The biggest issue was that the Small Judges couldn't even agree with each other.
- The Big Judges were like a calm committee; they rarely ended up in a close vote (splitting 3 vs. 2 only about 6% to 12% of the time).
- The Small Judges were like a chaotic room; they argued with each other constantly (splitting 3 vs. 2 about 30% of the time).

The Analogy: Imagine grading a math test.

Big Judges: All five professors look at the answer and say, "Yes, that's correct."
Small Judges: Three professors say "Correct," but two say "Incorrect because the handwriting is messy," even though the math is right. They are arguing among themselves.

5. The Trade-Off: Cost vs. Consistency

The Small Judges were incredibly cheap and fast.

Cost: They cost about 97% less than the Big Judges.
Speed: They returned their verdicts in about 82% less time.
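To get a feel for these percentages, it helps to convert a "percent reduction" into a "times cheaper" or "times faster" factor. The sketch below assumes the 82% figure is a reduction in latency, as the technical summary further down states; the function name is made up for illustration.

```python
def reduction_to_factor(reduction: float) -> float:
    """Convert a fractional reduction (e.g. 0.82) into a ratio of old to new.

    A reduction of r means the new value is (1 - r) of the old one,
    so the old value is 1 / (1 - r) times the new one.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


print(round(reduction_to_factor(0.82), 1))  # 5.6  (about 5.6x less time)
print(round(reduction_to_factor(0.97), 1))  # 33.3 (about 33x lower cost)
```

Read this way, an 82% reduction in time is a much larger gain than "82% faster" would suggest.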

The Verdict:
If you just want a quick, cheap check to see if a system is generally working (like a "sanity check"), the Small Judges are great. They save a fortune.

But, if you need to know exactly why something failed, or if you need a perfect audit trail for high-stakes decisions, the Small Judges are too "noisy." They argue too much among themselves to be trusted for the fine details.

Summary

The paper argues that how you write the grading rules (the contract) is just as important as who you hire to grade.

You can save a lot of money by using smaller, cheaper AI judges, but you have to accept that they will argue with each other more often. If you need a calm, consistent verdict, you still need the expensive, "Frontier" judges. If you just need a rough estimate, the cheap ones will do the job.

Technical Summary: AsymmetryZero

Problem Statement

The paper identifies a critical gap in current Reinforcement Learning (RL) and AI evaluation pipelines: the difficulty of operationalizing subjective, procedural, and domain-specific human expert requirements into scalable evaluation signals. While exact-match metrics suffice for deterministic tasks, they fail for semantic, multi-factor, or open-ended tasks. Conversely, open-ended LLM judging often leaves grading policies implicit within prompts, leading to instability and lack of auditability. The authors argue that the central challenge in post-training is not merely scoring models, but the faithful encoding of expert requirements into the evaluation itself.

Methodology: The AsymmetryZero Framework

To address this, the authors introduce AsymmetryZero, a framework that operationalizes human expert preferences as semantic evals via a stable evaluation contract.

Core Components

Evaluation Contracts: Instead of a single prompt or answer key, a task is defined as a portable contract separating execution inputs (prompts, references) from grading inputs (criteria, weights, thresholds).
- Structure: Each criterion explicitly declares its weight, prompt, and grader type (either ExactMatch or llm-judge).
- Aggregation: Criterion-level decisions are aggregated into a task score, $S = \sum_i w_i \hat{v}_i$. A task passes if $S \ge \tau$.
- Jury Consensus: For llm-judge criteria, a panel of judges $J_i$ votes. Consensus is determined by strict majority ($\hat{v}_i = 1$ if $\sum_{j \in J_i} v_{ij} > |J_i|/2$, otherwise $\hat{v}_i = 0$); ties result in failure.
Dual Execution Harnesses: The framework decouples evaluation semantics from execution:
- Inspect: Used for model-only evaluations.
- Harbor: Used for agentic evaluations (specifically using a terminus2 agent).
- Both harnesses consume the same contract, ensuring comparable scores and shared audit artifacts across model and agent outputs.
Auditability: The system generates detailed traces, recording per-criterion results, judge votes, rationales, and weight contributions, enabling the analysis of failure modes and dissent.
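The aggregation and consensus rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the framework's actual schema or code: the class and field names, the criterion names, the weights, and the threshold are all hypothetical. Only the strict-majority rule, the weighted sum, and the pass condition $S \ge \tau$ come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Criterion:
    # Hypothetical field names; the real contract schema is not shown here.
    name: str
    weight: float                        # w_i
    grader: str                          # "ExactMatch" or "llm-judge"
    votes: Optional[List[int]] = None    # v_ij in {0, 1}, one per judge
    exact_result: Optional[int] = None   # 0 or 1, for ExactMatch criteria


def criterion_verdict(c: Criterion) -> int:
    """Return v_hat_i: 1 if the criterion passes, else 0."""
    if c.grader == "llm-judge":
        votes = c.votes or []
        # Strict majority: a tie (or an empty panel) counts as failure.
        return 1 if sum(votes) > len(votes) / 2 else 0
    return 1 if c.exact_result == 1 else 0


def task_score(criteria: List[Criterion]) -> float:
    """S = sum_i w_i * v_hat_i."""
    return sum(c.weight * criterion_verdict(c) for c in criteria)


def task_passes(criteria: List[Criterion], tau: float) -> bool:
    """A task passes when S >= tau."""
    return task_score(criteria) >= tau


# Toy contract with invented criteria and weights.
criteria = [
    Criterion("final_answer", 0.5, "ExactMatch", exact_result=1),
    Criterion("derivation_sound", 0.25, "llm-judge", votes=[1, 1, 1, 0, 0]),
    Criterion("units_stated", 0.25, "llm-judge", votes=[1, 0, 1, 0]),  # 2-2 tie fails
]

print(task_score(criteria))         # 0.75
print(task_passes(criteria, 0.7))   # True
```

Keeping the per-judge votes on each criterion, rather than only the final verdict, is what would make the dissent analysis described under Auditability possible.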

Empirical Study: Judge Capacity and Substitution

The paper presents an empirical study using the PORTEX-COMPOSITE benchmark to answer whether smaller, cheaper "compact" juries can substitute for expensive "frontier" juries without compromising evaluation integrity.

Experimental Setup

Tasks: 75 frontier-class tasks, each attempted by four solver models (Claude Opus 4.6, GPT-5.4, Grok-4.20, Gemini-3.1-Pro).
Jury Conditions:
- Frontier Jury: 5 large, state-of-the-art open-weight models.
- Compact Jury: 5 smaller open-weight models.
Metrics: Criterion-level agreement, within-pool disagreement (dissent rates), task-level score stability, and economic efficiency (cost, latency, tokens).
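As a rough illustration of the first two metrics, the sketch below computes majority agreement between two juries and the rate of narrowest-margin (3–2) splits within one jury. It assumes binary votes stored per criterion; the function names and the toy vote data are invented for the example and are not taken from the paper.

```python
from typing import Dict, List

Votes = Dict[str, List[int]]  # criterion id -> binary votes from one jury


def majority(votes: List[int]) -> int:
    """Strict-majority verdict; ties count as failure."""
    return 1 if sum(votes) > len(votes) / 2 else 0


def majority_agreement(frontier: Votes, compact: Votes) -> float:
    """Share of shared criteria where both juries reach the same verdict."""
    shared = sorted(set(frontier) & set(compact))
    if not shared:
        raise ValueError("the two juries share no criteria")
    same = sum(majority(frontier[c]) == majority(compact[c]) for c in shared)
    return same / len(shared)


def split_rate(jury: Votes, panel_size: int = 5) -> float:
    """Share of fully voted criteria decided by a one-vote margin (3-2 for five judges)."""
    full = [v for v in jury.values() if len(v) == panel_size]
    if not full:
        raise ValueError("no criteria with a full panel")
    narrow = sum(abs(2 * sum(v) - panel_size) == 1 for v in full)
    return narrow / len(full)


# Invented toy votes for four criteria.
frontier = {
    "c1": [1, 1, 1, 1, 1],
    "c2": [0, 0, 0, 0, 0],
    "c3": [1, 1, 1, 1, 0],
    "c4": [1, 1, 1, 0, 0],
}
compact = {
    "c1": [1, 1, 1, 0, 0],
    "c2": [0, 0, 1, 1, 1],
    "c3": [1, 1, 1, 1, 1],
    "c4": [1, 1, 0, 0, 1],
}

print(majority_agreement(frontier, compact))  # 0.75
print(split_rate(frontier))                   # 0.25
print(split_rate(compact))                    # 0.75
```

In this toy data the two juries disagree on only one criterion, yet the compact jury reaches three of its four verdicts by a single vote, which is the kind of internal dissent the split-rate metric is meant to expose.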

Key Results

Criterion-Level Divergence: Compact and frontier juries do not agree perfectly.
- Majority Agreement: Ranges from 75.9% to 89.6% across runs (strict common-subset: 77.8%–92.1%).
- Implication: Substituting compact judges changes a non-trivial share of semantic criterion decisions.
Internal Dissent (Stability): Compact juries exhibit significantly higher internal instability.
- 3–2 Splits: Frontier juries averaged 6.1%–11.5% split rates, whereas compact juries averaged 28.7%–32.4%.
- Conclusion: Compact juries disagree more with frontier juries and more with themselves.
Task-Level Stability: Despite criterion-level divergence, aggregated task outcomes are often similar.
- Correlation: Pearson correlation between frontier and compact task scores is 0.88 (range 0.81–0.93).
- Score Change: 70%–87% of graded tasks showed no score change between pools.
- Nuance: The stability appears "brittle," relying on the cancellation of errors in weighted sums rather than consistent criterion-level judgment.
Economic Efficiency: Compact juries offer massive efficiency gains.
- Cost: Reduced by ~97% per criterion.
- Latency: Reduced by ~82%.
- Tokens: Output tokens reduced by ~75%.
Analysis of Disagreement:
- Response Length: While longer responses correlate with higher disagreement, statistical modeling (ordinal mixed model) did not find strong evidence that compact juries are more sensitive to length than frontier juries. The primary driver of disagreement is the pool type itself (compact pools are inherently noisier).
- Failure Modes: Qualitative review suggests compact juries fail for the same reasons as frontier juries (e.g., literalism vs. substance) but apply standards less uniformly.
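The "brittle" task-level stability noted above can be made concrete with a toy case in which two juries disagree on half of the criteria yet produce the same weighted task score, because the flips cancel. The weights, verdicts, and helper names below are invented for illustration; the Pearson function is a plain textbook implementation, not the paper's analysis code.

```python
import math
from typing import List


def weighted_score(weights: List[float], verdicts: List[int]) -> float:
    """Task score as the weighted sum of binary criterion verdicts."""
    return sum(w * v for w, v in zip(weights, verdicts))


def criterion_agreement(a: List[int], b: List[int]) -> float:
    """Share of criteria on which two juries reach the same verdict."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# One task, four equally weighted criteria (invented values).
weights = [0.25, 0.25, 0.25, 0.25]
frontier_verdicts = [1, 1, 0, 1]
compact_verdicts = [1, 0, 1, 1]   # criteria 2 and 3 are flipped

print(weighted_score(weights, frontier_verdicts))                # 0.75
print(weighted_score(weights, compact_verdicts))                 # 0.75
print(criterion_agreement(frontier_verdicts, compact_verdicts))  # 0.5
```

Here the task score is unchanged even though the juries agree on only half of the criteria. Applying `pearson` to per-task score lists from the two juries would likewise reward this kind of cancellation, so a high task-level correlation does not by itself show that criterion-level judgments match.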

Key Contributions

Operational Framework: AsymmetryZero provides a concrete system for turning expert knowledge into auditable, executable evaluation contracts that work for both models and agents.
Rubric-Based Semantic Grading: It moves beyond open-ended prompting to structured, criterion-centric grading with explicit aggregation rules.
Empirical Evidence on Judge Capacity: The study provides data-driven evidence that while compact juries are economically viable for high-throughput monitoring, they are not yet decision-equivalent to frontier juries for criterion-auditable evaluation due to higher variance and internal dissent.

Significance and Claims

The paper claims that evaluation reliability depends as much on the contract as on the judge.

For Practitioners: The framework lets organizations separate the definition of what matters (the contract) from the question of how much grading costs (the choice of judges).
Strategic Insight: Compact juries are suitable for low-cost outcome monitoring where final task scores matter more than specific criterion traces. However, for high-stakes decisions requiring criterion-level auditability, frontier juries remain the default due to their superior internal consensus.
Future Direction: The authors suggest that the gap between compact and frontier behavior could be narrowed via on-policy distillation (training compact evaluators to mimic frontier jury decisions), but this is identified as future work, not a current capability.

The authors remain modest, noting that their study evaluates comparability between juries, not absolute correctness against human ground truth, and that the results are specific to the STEM-oriented tasks and Harbor agent configuration tested.

AsymmetryZero: A Framework for Operationalizing Human Expert Preferences as Semantic Evals