This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.
Imagine you are a judge on a reality TV show like MasterChef. You are tasting a dish and giving it a score from 1 to 10.
Now, imagine there are three different judges: one is a professional chef who loves spicy food, one is a nutritionist who cares only about health, and one is a food critic who loves fancy presentation. If you ask them all to rate the same dish, they won't give the same score. The chef might give it a 9, the nutritionist a 4, and the critic a 7.
The big question is: If we wanted to build a robot to replace these judges, should we train the robot to give the "average" score of all three? Or should we build three different robots—one that "thinks" like the chef, one like the nutritionist, and one like the critic?
This paper, "Aggregate vs. Personalized Judges," explores exactly this problem, but instead of cooking, it’s about business ideas.
The Problem: The "Expert Disagreement" Mess
When companies use AI to generate new product ideas (like a new type of eco-friendly smartphone), they need to know which ideas are actually good. They ask human experts to rate these ideas on things like "Is this technically possible?" or "Is there a market for this?"
The researchers found something crucial: Experts rarely agree on the exact numbers.
One expert might be a "tough grader" (giving everything a 2 or 3), while another is an "easy grader" (giving everything a 4 or 5). Even if they are looking at the same idea, their "rulers" are calibrated differently. The researchers call this "structured heterogeneity." It’s not that the experts are being random or messy; it’s that they have different, consistent professional standards.
The Experiment: The Three Types of Robot Judges
The researchers tested three different ways to train an AI (the "Robot Judge") to score these business ideas:
- The "Clueless" Judge (Zero-Shot): This robot is given the rulebook but has never seen a human score anything before. It’s like a judge who has read the manual but has never actually tasted food.
- The "Average" Judge (Aggregate): This robot is trained on a big pile of scores from all the different human experts mixed together. It tries to find the "middle ground."
- The "Chameleon" Judge (Personalized): This robot is shown the specific history of one particular human. If it’s trying to mimic the "Chef," it only looks at how the Chef scored things in the past.
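The three setups above boil down to what examples, if any, go into the judge's prompt. Here is a minimal sketch of that idea; the rubric wording, the `(idea, score)` history format, and the function names are all illustrative assumptions, not the paper's actual prompts or data:

```python
# Illustrative sketch of the three judge setups. The rubric text and
# history format are hypothetical -- the paper's real prompts differ.

RUBRIC = "Rate this business idea's feasibility on a scale from 1 to 5."

def zero_shot_prompt(idea: str) -> str:
    # "Clueless" judge: rubric only, no human examples.
    return f"{RUBRIC}\n\nIdea: {idea}\nScore:"

def aggregate_prompt(idea: str, pooled_history: list[tuple[str, int]]) -> str:
    # "Average" judge: past ratings pooled from ALL experts, identities mixed.
    examples = "\n".join(f"Idea: {i}\nScore: {s}" for i, s in pooled_history)
    return f"{RUBRIC}\n\n{examples}\n\nIdea: {idea}\nScore:"

def personalized_prompt(idea: str, expert_history: list[tuple[str, int]]) -> str:
    # "Chameleon" judge: past ratings from ONE specific expert only.
    examples = "\n".join(f"Idea: {i}\nScore: {s}" for i, s in expert_history)
    return (f"{RUBRIC}\n\nHere is how this expert rated past ideas:\n"
            f"{examples}\n\nIdea: {idea}\nScore:")
```

The only structural difference between the aggregate and personalized judges is whose history fills the prompt: a mixed pool versus a single expert's track record.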
The Results: Why "Average" is Often Wrong
The study found that the Chameleon (Personalized) Judge was the clear winner.
- The Average Judge fails to please anyone: Because it tries to find a middle ground between a "tough grader" and an "easy grader," its scores don't actually match any real human. It’s a "jack of all trades, master of none."
- The Chameleon Judge is a mimic: It was much better at predicting exactly how a specific person would react. If the human expert was picky, the AI became picky. If the human was optimistic, the AI became optimistic.
- It even mimics the "Why": Not only did the Chameleon Judge get the numbers right, but it also wrote explanations that sounded like the human expert's reasoning.
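A toy numeric example (invented scores, not the paper's data) makes the failure mode concrete: if a tough grader and an easy grader differ by a consistent offset, a judge that predicts their pooled average is off by half that offset for both of them, while a per-rater judge can match each one exactly.

```python
# Invented ratings for illustration only: three ideas, scored by a
# tough grader and an easy grader whose standards differ by a fixed offset.
tough = [2, 3, 2]
easy  = [4, 5, 4]

# Aggregate judge: predicts the mean of the pooled ratings for each idea.
aggregate_pred = [(t + e) / 2 for t, e in zip(tough, easy)]  # [3.0, 4.0, 3.0]

def mae(pred, truth):
    # Mean absolute error between predicted and actual scores.
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

print(mae(aggregate_pred, tough))  # 1.0 -- misses the tough grader
print(mae(aggregate_pred, easy))   # 1.0 -- misses the easy grader
print(mae(tough, tough))           # 0.0 -- a personalized judge can be exact
```

The aggregate predictions land halfway between the two experts, so they match neither: the "middle ground" is a score no real rater would give.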
The Big Takeaway: Don't Force a Consensus
In the business world, we often try to force everyone to agree on a single "score" for a project. This paper suggests that forcing agreement might actually hide the truth.
If a technical expert says an idea is "risky" and a marketing expert says it's "brilliant," those two different opinions are both valuable. Instead of building an AI that tries to mash those two opinions into one boring average, we should build AI that can speak the language of each specific stakeholder.
In short: When evaluating big ideas, don't look for one "correct" answer. Look for the different perspectives, and build AI that can understand them all.