Evaluating LLM-Based Grant Proposal Review via Structured Perturbations

This paper evaluates LLM-based grant proposal review using structured perturbations along six quality axes. A section-by-section analysis approach outperforms the other architectures tested, but current models still struggle with clarity detection and holistic assessment, suggesting LLMs are best suited as supplementary tools rather than replacements for human reviewers.

William Thorne, Joseph James, Yang Wang, Chenghua Lin, Diana Maynard


Imagine a massive, high-stakes game show where researchers are competing for a limited pot of gold (government funding). To win, they must submit a detailed "grant proposal"—a blueprint for a future project that promises to change the world.

Right now, there are too many applicants and not enough human judges to read them all. It's a traffic jam of ideas. The authors of this paper asked: "Can we hire an AI (a Large Language Model or LLM) to be the referee?"

To find out, they didn't just ask the AI to read a proposal and give an opinion. That would be like asking a student to grade a test without an answer key. Instead, they built a controlled "trap" system to see if the AI could spot the mistakes.

Here is the breakdown of their experiment in simple terms:

1. The Setup: The "Broken Toy" Test

The researchers took six real, high-quality grant proposals (like six perfect, working toys). Then, they systematically "broke" them in specific ways to create 42 different versions.

Think of it like taking a perfect car and creating different faulty versions:

  • The "Budget" Break: They made the car cost $1 million instead of $10,000.
  • The "Timeline" Break: They said the car would be built in 2 days instead of 2 years.
  • The "Team" Break: They removed the only engineer who knew how to build the engine.
  • The "Clarity" Break: They removed all the labels and instructions, leaving the car parts in a pile with no explanation of what they do.

They then fed these "broken" proposals to three different types of AI reviewers to see if the AI would say, "Hey, this doesn't make sense!"
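To make the "breaking" concrete, here is a minimal Python sketch of the perturbation idea. The `Proposal` fields, the break functions, and their names are all illustrative assumptions, not the paper's actual code; the point is that each variant corrupts exactly one quality axis while leaving everything else intact, so a reviewer's miss can be traced to a known fault.

```python
# Illustrative sketch only: each "break" corrupts one quality axis
# of an otherwise sound proposal, producing a labeled faulty variant.
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Proposal:
    budget_gbp: int
    timeline_months: int
    team: tuple[str, ...]
    summary: str


def break_budget(p: Proposal) -> Proposal:
    # Inflate the budget far beyond what the plan justifies.
    return replace(p, budget_gbp=p.budget_gbp * 100)


def break_timeline(p: Proposal) -> Proposal:
    # Compress a multi-year plan into an implausible sprint.
    return replace(p, timeline_months=1)


def break_team(p: Proposal) -> Proposal:
    # Drop the one member with the critical expertise.
    return replace(p, team=p.team[1:])


def break_clarity(p: Proposal) -> Proposal:
    # Swap the explanation for unexplained jargon (made-up acronyms).
    return replace(p, summary="Deploy the MFT-7 via the QHR pipeline.")


PERTURBATIONS = {
    "budget": break_budget,
    "timeline": break_timeline,
    "team": break_team,
    "clarity": break_clarity,
}


def make_variants(p: Proposal) -> dict[str, Proposal]:
    # One labeled broken copy per quality axis.
    return {axis: fn(p) for axis, fn in PERTURBATIONS.items()}


if __name__ == "__main__":
    original = Proposal(
        budget_gbp=10_000,
        timeline_months=24,
        team=("lead engineer", "analyst"),
        summary="Build and evaluate a working prototype.",
    )
    for axis, broken in make_variants(original).items():
        print(axis, "->", broken)
```

Because every variant carries a ground-truth label of what was broken, the researchers can score the AI reviewers objectively instead of relying on opinion.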

2. The Three AI Reviewers

They tested three different ways the AI could read the proposals:

  • The "Speed Reader" (Baseline): The AI read the whole proposal in one giant gulp.
    • Result: It got overwhelmed. Like trying to drink from a firehose, it missed most of the broken parts.
  • The "Specialist Team" (Section-Level): The AI was told to read the proposal one section at a time (e.g., just the budget, then just the team).
    • Result: This was the winner. By focusing on one small piece at a time, it spotted the errors much better.
  • The "Council of Experts" (Personas): They created a virtual boardroom with five different AI characters: a "Money Saver," an "Ethics Police," a "Tech Hype-Man," a "Skeptic," and an "Impact Champion." They all argued, voted, and then a "Chairman" AI summarized the final decision.
    • Result: This was the most expensive and slowest method, but it didn't work any better than the Speed Reader. The extra complexity just added noise, not clarity.
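Here is a hypothetical Python sketch of the three setups. The `ask_llm` stub stands in for whatever chat-model client you use, and the prompts follow the paper's descriptions only loosely; this is not the authors' implementation.

```python
# `ask_llm` is a placeholder for any chat-model client; plug in your own.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your model of choice")


def review_whole(proposal: str) -> str:
    # "Speed Reader": one giant prompt over the full document.
    return ask_llm(f"Review this grant proposal and list its flaws:\n\n{proposal}")


def review_by_section(sections: dict[str, str]) -> dict[str, str]:
    # "Specialist Team": one focused prompt per section, so each call
    # attends to a small slice of text. This setup won in the paper.
    return {
        name: ask_llm(f"Review only the {name} section for flaws:\n\n{text}")
        for name, text in sections.items()
    }


PERSONAS = ["Money Saver", "Ethics Police", "Tech Hype-Man",
            "Skeptic", "Impact Champion"]


def review_by_council(proposal: str) -> str:
    # "Council of Experts": five personas review independently, then a
    # chair model merges their verdicts. Costliest setup, yet it did no
    # better than the baseline.
    opinions = [
        ask_llm(f"You are the {persona}. Review this proposal:\n\n{proposal}")
        for persona in PERSONAS
    ]
    return ask_llm(
        "You are the chair. Summarise these five reviews into one verdict:\n\n"
        + "\n---\n".join(opinions)
    )
```

The intuition behind the winner is simple: each section-level call sees a short, focused context, so an error can't get diluted inside one enormous prompt.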

3. The Big Surprises

The study found some interesting things about how AI thinks compared to human experts:

  • The "Obvious" vs. The "Subtle":
    • The AI was great at spotting obvious mismatches. If the proposal said "We are building a rocket" but the funding call was for "Baking bread," the AI immediately caught it.
    • The AI was terrible at spotting "Clarity" issues. If the proposal used confusing jargon or forgot to explain an acronym, the AI just glossed over it. It assumed the text made sense because it's good at guessing context, whereas a human would say, "Wait, what does this word mean? This is unclear."
  • The "Compliance" vs. "Vision" Gap:
    • Human reviewers look for the "big picture": Is this idea brilliant? Will it change the world?
    • The AI reviewers were obsessed with checklists: Did you fill out the budget form? Did you mention the ethics policy?
    • The AI was like a strict librarian checking if books are on the right shelf, while humans were the critics deciding if the books were actually good stories.

4. The Verdict: AI as a Co-Pilot, Not the Captain

The paper concludes that we shouldn't let AI replace human grant reviewers yet.

  • Why? Because AI is too inconsistent. Sometimes it's brilliant, sometimes it misses the biggest red flags. It also prioritizes "following the rules" over "judging the quality of the idea."
  • The Solution: AI is a great assistant. Imagine a human reviewer using AI as a "spell-checker for logic." The AI can quickly scan the document to say, "Hey, the budget doesn't match the timeline," or "You forgot to list the ethics approval." This frees up the human expert to focus on the hard part: deciding if the research idea is truly worth funding.
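As a thought experiment, the "spell-checker for logic" could look something like the sketch below. Everything here (field names, checks, thresholds) is assumed purely for illustration; the paper does not ship such a tool. The design point is that mechanical cross-checks surface mismatches for a human reviewer, who still makes the funding call.

```python
# Assumed illustration of a "logic spell-checker" pass, not a tool
# from the paper: cheap cross-checks flag inconsistencies for a human.
def consistency_flags(proposal: dict) -> list[str]:
    flags = []
    # Budget check: itemised costs should not exceed the stated total.
    if sum(proposal["cost_items"].values()) > proposal["total_budget"]:
        flags.append("budget: itemised costs exceed the stated total")
    # Timeline check: promised effort must fit the team's capacity.
    capacity = proposal["duration_months"] * len(proposal["team"])
    if proposal["effort_person_months"] > capacity:
        flags.append("timeline: promised effort exceeds team capacity")
    # Compliance check: required paperwork should at least appear.
    if "ethics" not in proposal["full_text"].lower():
        flags.append("compliance: no mention of ethics approval")
    return flags


if __name__ == "__main__":
    draft = {
        "cost_items": {"staff": 90_000, "equipment": 30_000},
        "total_budget": 100_000,
        "duration_months": 2,
        "team": ["principal investigator"],
        "effort_person_months": 24,
        "full_text": "We will build the system.",
    }
    for flag in consistency_flags(draft):
        print("FLAG:", flag)
```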

The Takeaway

We are currently in a "Malthusian trap" for research funding (a fancy way of saying demand is outpacing supply). We have too many ideas and too little time to read them.

This paper says: Don't let the AI drive the car yet. It will crash if you give it a long, complex road. But if you let the AI sit in the passenger seat and point out the potholes (errors) and missing signs (compliance issues), it can help the human driver reach the destination faster and more safely.