From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise

This paper introduces a deterministic, automated pipeline that transforms raw domain corpora into completion-style benchmarks. The result is a scalable, unbiased, and LLM-independent evaluation of domain expertise in both base and instruction-tuned models, one that sidesteps benchmark contamination and multiple-choice bias.

Nitin Sharma, Thomas Wolfers, Çağatay Yıldız

Published 2026-03-09

Here is an explanation of the paper, translated into everyday language with some creative analogies.

The Big Problem: How Do We Really Know What an AI "Knows"?

Imagine you have a library of books on Quantum Physics. You want to know which of your students (or AI models) actually understands the material, and which one just memorized the answers to a specific test.

Currently, the standard way to test an AI is a multiple-choice quiz (think of the MMLU benchmark). The paper argues this is a terrible way to test real expertise, for three reasons:

  1. The "Order Matters" Bug: If you shuffle the order of the answers (A, B, C, D), the AI's score changes wildly. It's like a student who only knows how to pick "C" because it's always in the middle, not because they know the answer.
  2. The "Cheating" Problem: Many of these tests are already in the AI's training data. It's like giving a student a final exam they've already seen the answers to online. They aren't showing knowledge; they're showing memory.
  3. The "Generalist" Trap: Standard tests ask broad questions. They don't tell you if an AI is a genius at neuroscience but terrible at cardiology.

The Solution: A Custom "Fill-in-the-Blank" Factory

The authors built a deterministic pipeline (a step-by-step machine) that turns raw, messy text from a specific field (like medical journals or physics papers) into a custom test.

Think of it like this:

  • Old Way: You buy a pre-made, generic trivia game and hope it covers the specific topic you care about.
  • New Way: You take a stack of raw textbooks, feed them into a machine, and it automatically generates a unique "Fill-in-the-Blank" test based only on the words and concepts found in those books.

How the Machine Works (The 4-Step Recipe)

  1. Mining for Gold (Keyword Extraction): The system reads thousands of papers and pulls out the most important "buzzwords" (e.g., "reinforcement learning," "phylogenetic tree"). It filters out boring words like "the" or "study."
  2. Finding the Context (Sentence Matching): For every buzzword, it finds sentences in the text that talk about that word.
  3. Building the Puzzle (Prompt-Target Pairs): It takes a sentence and cuts it off right before a key term.
    • The Prompt: "Prior attempts at improving data efficiency in reinforcement learning involved the use of an Experience..."
    • The Target: "Replay"
    • The AI has to guess the missing word.
  4. The Scorecard (Ranking): Instead of asking "Did it get it right or wrong?", the system asks: "How high up on the AI's list of guesses was the correct word?"
    • If the AI puts the correct word as its #1 guess, that's a perfect score.
    • If it puts it as #500, it's struggling.
    • Why this matters: This measures confidence and knowledge without needing the AI to be a perfect conversationalist.
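
The four-step recipe above can be sketched in a few lines of Python. This is a toy illustration under my own simplifying assumptions (the function names, the stopword list, and the frequency-based keyword heuristic are mine, not the paper's actual implementation):

```python
import re
from collections import Counter

# Illustrative stopword list standing in for the paper's "boring word" filter.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "was", "study"}

def extract_keywords(corpus: str, top_k: int = 3) -> list[str]:
    """Step 1: mine frequent, non-boring terms from the raw text."""
    words = re.findall(r"[a-z]+", corpus.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 3)
    return [w for w, _ in counts.most_common(top_k)]

def build_cloze_items(corpus: str, keywords: list[str]) -> list[tuple[str, str]]:
    """Steps 2-3: find a sentence containing each keyword and cut it off
    right before the term, yielding (prompt, target) pairs."""
    items = []
    for sentence in re.split(r"(?<=[.!?])\s+", corpus):
        for kw in keywords:
            idx = sentence.lower().find(kw)
            if idx > 0:  # keyword appears mid-sentence, so a prompt exists
                items.append((sentence[:idx].rstrip(), kw))
                break
    return items

def rank_of_target(candidate_scores: dict[str, float], target: str) -> int:
    """Step 4: where does the correct word sit in the model's sorted
    guess list? 1 = the model's top guess (lower is better)."""
    ranked = sorted(candidate_scores, key=candidate_scores.get, reverse=True)
    return ranked.index(target) + 1
```

On a toy corpus about experience replay, `build_cloze_items` would produce pairs like `("Experience", "replay")`, and `rank_of_target({"replay": 0.9, "buffer": 0.5}, "replay")` returns `1`. In the real pipeline the candidate scores would come from the model's next-token probabilities rather than a hand-written dictionary.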

Why This is a Game Changer

The paper validates this method with some cool experiments:

  • The "Expert" Check: They compared their auto-generated test against a test written by human experts from a famous textbook. The results matched almost perfectly (99% correlation), strongly suggesting their machine builds a test about as well as a human expert does.
  • The "Learning" Tracker: They watched an AI learn during its training.
    • Old Metric (Perplexity): Like checking a student's handwriting. It gets neater over time, but doesn't tell you if they actually learned the math.
    • New Metric (Their Rank): Like watching the student solve problems. You can see exactly when they start understanding the specific topic.
  • The "Chatbot" Surprise: They tested "Base" models (raw, smart but blunt) vs. "Chat" models (polished, helpful, aligned with human rules).
    • The Shock: The "Chat" models often performed worse on specific domain knowledge than the raw "Base" models.
    • The Analogy: It's like taking a brilliant, grumpy professor (Base Model) and hiring a PR team to make them polite and friendly (Chat Model). In the process, the PR team accidentally made the professor forget some of their deep technical details to sound more "safe" and "general." The authors call this the "Alignment Tax."
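
One simple way to read the "Alignment Tax" numerically (a simplifying assumption on my part, not necessarily the paper's exact metric) is as a gap in average rank between the two models:

```python
# Hedged sketch: the "alignment tax" read as a gap in average rank.
# base_ranks / chat_ranks hold, per test item, the position of the
# correct word in each model's guess list (1 = top guess, lower = better).

def mean_rank(ranks: list[int]) -> float:
    """Average position of the correct word across all test items."""
    return sum(ranks) / len(ranks)

def alignment_tax(base_ranks: list[int], chat_ranks: list[int]) -> float:
    """Positive value = the chat model, on average, ranks the correct
    domain term further down its list than the base model does."""
    return mean_rank(chat_ranks) - mean_rank(base_ranks)
```

For instance, `alignment_tax([1, 3, 2], [5, 10, 3])` returns `4.0`: the chat-tuned model slipped four places on average, which is the "tax" paid for politeness.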

The Bottom Line

This paper gives us a scalable, cheat-proof, and unbiased way to test if an AI is actually an expert in a specific field (like Law, Medicine, or Physics).

Instead of relying on multiple-choice questions that can be gamed or contaminated, they built a machine that turns raw domain data into a "Fill-in-the-Blank" exam. This allows researchers to see exactly how much an AI knows, how it learns, and whether making it "safer" (via instruction tuning) accidentally makes it "dumber" at its actual job.