The Big Problem: AI is Great at Math, But Terrible at "Gut Feelings"
Imagine you are a talent scout for a movie studio. You have to decide which movie scripts are worth millions of dollars to produce.
- AI today is like a super-smart robot that can read a script, check the grammar, count the words, and even solve complex math problems about the budget. It can do all that perfectly.
- But, when it comes to the real question—"Is this story good? Is it original? Will people love it?"—the robot is lost. It tends to be too nice, saying "Yes!" to everything, or it just guesses randomly.
This paper argues that the hardest part of science isn't doing the experiments; it's taste. It's the ability to look at a new, untested idea and say, "This is brilliant," or "This is a waste of time." Humans have this "taste," but they can't really explain how they have it. It's like a chef who knows a soup is perfect but can't write down the exact recipe for "flavor."
The Secret Ingredient: The "Institutional Trace"
The researchers asked: If AI can't learn "taste" from a textbook, where does it live?
They realized that taste isn't stored in a person's brain; it's stored in the history of decisions made by the scientific community.
- The Analogy: Think of a massive, dusty library where every book represents a research idea. Some books got published in the "Hall of Fame" (top journals), some in the "Local Library" (good journals), and some were thrown in the trash (rejected).
- The "taste" isn't in the books themselves; it's in the pattern of where they ended up. The library has a hidden map showing which ideas the community eventually decided were winners.
The researchers called this map the "Institutional Trace." It's the digital footprint of decades of editors and reviewers saying "Yes" or "No," thousands of times over.
The Experiment: Teaching AI to Read the Map
The team tried to teach AI to read this map instead of trying to teach it rules.
The Old Way (Frontier Models): They asked the smartest AI models (like the ones you might know) to judge research ideas using a strict set of rules.
- Result: The AI performed barely better than a monkey throwing darts, scoring about 31% accuracy against the ~25% you'd expect from random guessing among four tiers. It couldn't distinguish a "masterpiece" from a "mess."
The Human Way: They asked real human experts (editors and professors) to judge the same ideas.
- Result: Humans did better, but they were inconsistent: one expert might love an idea that another hated. Even aggregating their votes only got them to about 42% accuracy.
The New Way (The "Taste" Training): They took AI models and fine-tuned them. They didn't give them rules. Instead, they fed them thousands of examples of: Here is a research idea -> Here is where it got published (Top, Good, Fair, or Trash).
- Result: The AI learned the "vibe" of the winners. It started to mimic the collective "gut feeling" of the entire scientific community.
- The Score: These trained AI models jumped to 59% accuracy. In the field of Economics, they hit 70%. They beat the smartest AI and the best human experts.
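The "taste" training above boils down to supervised examples pairing an idea with its historical outcome. A minimal sketch (the field names and prompt wording are hypothetical, not from the paper) of how one such fine-tuning record might be built:

```python
# Hypothetical sketch: turning (research idea, publication outcome) pairs
# into supervised fine-tuning records. The model never sees rules, only
# thousands of examples of where each idea historically ended up.

TIERS = ["Top", "Good", "Fair", "Trash"]  # the four outcome labels described above

def make_training_example(idea_text: str, tier: str) -> dict:
    """Build one fine-tuning record: idea in, historical outcome out."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier}")
    return {
        "prompt": f"Research idea:\n{idea_text}\n\nWhere did this idea end up?",
        "completion": tier,  # the model learns to emit the community's verdict
    }

example = make_training_example(
    "A field experiment on how deadlines change procrastination.", "Good"
)
print(example["completion"])  # -> Good
```

The key design choice is that the label is an outcome, not a rulebook: the model absorbs the pattern of past decisions rather than a checklist of quality criteria.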
Why This Works: The "Crowd" vs. The "Individual"
Here is the magic trick:
- Individual humans are noisy. One editor might reject a great idea because they were having a bad day.
- The System is clear. Over 10 years, if an idea is truly great, it eventually finds its way to the top journals. The "noise" of individual humans cancels out, leaving a clear signal of quality.
The AI didn't learn to be a genius; it learned to be a perfect historian. It looked at the history of what succeeded and learned to predict what would succeed.
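The noise-cancellation argument above is the classic averaging effect. An illustrative simulation (not from the paper; the noise level and quality scale are made up) shows why one reviewer can be far off while the system as a whole converges on the truth:

```python
# Illustrative simulation: a single reviewer's judgment is the idea's true
# quality plus personal noise; averaging many such judgments cancels the
# noise and leaves a clear signal, just like the institutional record does.
import random

random.seed(0)  # make the demo reproducible

def reviewer_vote(true_quality: float) -> float:
    """One reviewer's noisy judgment of an idea."""
    return true_quality + random.gauss(0, 1.0)  # personal bias / bad day

true_quality = 2.0  # a genuinely strong idea

one_opinion = reviewer_vote(true_quality)
many_opinions = sum(reviewer_vote(true_quality) for _ in range(1000)) / 1000

# A lone reviewer can miss badly; a thousand of them average out close to 2.0.
print(f"single reviewer error: {abs(one_opinion - true_quality):.2f}")
print(f"crowd average error:   {abs(many_opinions - true_quality):.2f}")
```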
The "Black Box" of Taste
The paper reveals something surprising: Taste isn't magic. It's data.
- We thought "scientific taste" was a mysterious human superpower that machines could never have.
- The paper shows that taste was actually deposited in the institutional record all along, waiting to be extracted. It was like a secret code hidden in the filing cabinets of every university library.
What This Means for the Future
This is a game-changer for science, especially in fields like psychology, economics, and management where you can't easily prove an idea is "true" with a math formula.
- The Bottleneck: Right now, we have too many research ideas and too few human reviewers. Good ideas get lost because there aren't enough people to read them.
- The Solution: We can use these "taste-trained" AI models as a first filter.
  - The AI can quickly scan thousands of new ideas.
  - If the AI says, "This looks like a winner," it gets fast-tracked to human experts.
  - If the AI says, "This looks like a dud," it gets filtered out.
  - Crucially, the AI knows when it's unsure. It can say, "I'm 90% sure this is great," or "I'm confused, a human should check this."
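The triage workflow above can be sketched as a simple routing function. The threshold value and tier names here are hypothetical placeholders, not numbers from the paper:

```python
# Hypothetical sketch of the first-filter workflow: a taste-trained model
# outputs a predicted tier plus a confidence, and only confident calls are
# acted on automatically; uncertain cases are routed to human reviewers.

def triage(predicted_tier: str, confidence: float) -> str:
    """Decide what happens to an idea given the model's prediction."""
    if confidence < 0.6:                    # model admits it's unsure
        return "human review"
    if predicted_tier in ("Top", "Good"):   # confident winner
        return "fast-track"
    return "filter out"                     # confident dud

print(triage("Top", 0.9))     # -> fast-track
print(triage("Trash", 0.95))  # -> filter out
print(triage("Good", 0.4))    # -> human review
```

The design point is the middle branch: the system's value comes as much from knowing when to defer to humans as from the predictions themselves.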
The Bottom Line
Science is moving from an era where we worry about generating ideas (AI is already great at that) to an era where we need to evaluate them.
This paper proves that we don't need to teach AI to be a philosopher. We just need to show it the history of what worked. By teaching machines to read the "institutional traces" of human success, we can give them the "scientific taste" they were missing, helping us find the next big breakthroughs faster than ever before.
In short: AI didn't need to learn how to think like a human; it just needed to learn how to read the library card catalog of human success.