This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.
Imagine you are a judge on a reality TV show like MasterChef. You are tasting a dish and giving it a score from 1 to 10.
Now, imagine there are three different judges: one is a professional chef who loves spicy food, one is a nutritionist who cares only about health, and one is a food critic who loves fancy presentation. If you ask them all to rate the same dish, they won't give the same score. The chef might give it a 9, the nutritionist a 4, and the critic a 7.
The big question is: If we wanted to build a robot to replace these judges, should we train the robot to give the "average" score of all three? Or should we build three different robots—one that "thinks" like the chef, one like the nutritionist, and one like the critic?
This paper, "Aggregate vs. Personalized Judges," explores exactly this problem, but instead of cooking, it’s about business ideas.
The Problem: The "Expert Disagreement" Mess
When companies use AI to generate new product ideas (like a new type of eco-friendly smartphone), they need to know which ideas are actually good. They ask human experts to rate these ideas on things like "Is this technically possible?" or "Is there a market for this?"
The researchers found something crucial: Experts rarely agree on the exact numbers.
One expert might be a "tough grader" (giving everything a 2 or 3), while another is an "easy grader" (giving everything a 4 or 5). Even if they are looking at the same idea, their "rulers" are calibrated differently. The researchers call this "structured heterogeneity." It’s not that the experts are being random or messy; it’s that they have different, consistent professional standards.
The Experiment: The Three Types of Robot Judges
The researchers tested three different ways to train an AI (the "Robot Judge") to score these business ideas:
- The "Clueless" Judge (Zero-Shot): This robot is given the rulebook but has never seen a human score anything before. It’s like a judge who has read the manual but has never actually tasted food.
- The "Average" Judge (Aggregate): This robot is trained on a big pile of scores from all the different human experts mixed together. It tries to find the "middle ground."
- The "Chameleon" Judge (Personalized): This robot is shown the specific history of one particular human. If it’s trying to mimic the "Chef," it only looks at how the Chef scored things in the past.
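The three setups above boil down to what examples, if any, go into the judge's prompt. Here is a minimal sketch of that idea; the rubric wording, the `(idea, score)` history format, and the function names are all illustrative assumptions, not the paper's actual prompts or data:

```python
# Illustrative sketch of the three judge setups. The rubric text and
# history format are hypothetical -- the paper's real prompts differ.

RUBRIC = "Rate this business idea's feasibility on a scale from 1 to 5."

def zero_shot_prompt(idea: str) -> str:
    # "Clueless" judge: rubric only, no human examples.
    return f"{RUBRIC}\n\nIdea: {idea}\nScore:"

def aggregate_prompt(idea: str, pooled_history: list[tuple[str, int]]) -> str:
    # "Average" judge: past ratings pooled from ALL experts, identities mixed.
    examples = "\n".join(f"Idea: {i}\nScore: {s}" for i, s in pooled_history)
    return f"{RUBRIC}\n\n{examples}\n\nIdea: {idea}\nScore:"

def personalized_prompt(idea: str, expert_history: list[tuple[str, int]]) -> str:
    # "Chameleon" judge: past ratings from ONE specific expert only.
    examples = "\n".join(f"Idea: {i}\nScore: {s}" for i, s in expert_history)
    return (f"{RUBRIC}\n\nHere is how this expert rated past ideas:\n"
            f"{examples}\n\nIdea: {idea}\nScore:")
```

The only structural difference between the aggregate and personalized judges is whose history fills the prompt: a mixed pool versus a single expert's track record.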
The Results: Why "Average" is Often Wrong
The study found that the Chameleon (Personalized) Judge was the clear winner.
- The Average Judge fails to please anyone: Because it tries to find a middle ground between a "tough grader" and an "easy grader," its scores don't actually match any real human. It’s a "jack of all trades, master of none."
- The Chameleon Judge is a mimic: It was much better at predicting exactly how a specific person would react. If the human expert was picky, the AI became picky. If the human was optimistic, the AI became optimistic.
- It even mimics the "Why": Not only did the Chameleon Judge get the numbers right, but it also wrote explanations that sounded like the human expert's reasoning.
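A toy numeric example (invented scores, not the paper's data) makes the failure mode concrete: if a tough grader and an easy grader differ by a consistent offset, a judge that predicts their pooled average is off by half that offset for both of them, while a per-rater judge can match each one exactly.

```python
# Invented ratings for illustration only: three ideas, scored by a
# tough grader and an easy grader whose standards differ by a fixed offset.
tough = [2, 3, 2]
easy  = [4, 5, 4]

# Aggregate judge: predicts the mean of the pooled ratings for each idea.
aggregate_pred = [(t + e) / 2 for t, e in zip(tough, easy)]  # [3.0, 4.0, 3.0]

def mae(pred, truth):
    # Mean absolute error between predicted and actual scores.
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

print(mae(aggregate_pred, tough))  # 1.0 -- misses the tough grader
print(mae(aggregate_pred, easy))   # 1.0 -- misses the easy grader
print(mae(tough, tough))           # 0.0 -- a personalized judge can be exact
```

The aggregate predictions land halfway between the two experts, so they match neither: the "middle ground" is a score no real rater would give.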
The Big Takeaway: Don't Force a Consensus
In the business world, we often try to force everyone to agree on a single "score" for a project. This paper suggests that forcing agreement might actually hide the truth.
If a technical expert says an idea is "risky" and a marketing expert says it's "brilliant," those two different opinions are both valuable. Instead of building an AI that tries to mash those two opinions into one boring average, we should build AI that can speak the language of each specific stakeholder.
In short: When evaluating big ideas, don't look for one "correct" answer. Look for the different perspectives, and build AI that can understand them all.