Evaluating LLM-Based Grant Proposal Review via Structured Perturbations

This paper evaluates LLM-based grant proposal review using structured perturbations along six quality axes. A section-by-section analysis approach outperforms the other architectures tested, but current models still struggle with clarity detection and holistic assessment, suggesting LLMs are best suited as supplementary tools rather than replacements for human reviewers.

William Thorne, Joseph James, Yang Wang, Chenghua Lin, Diana Maynard


Imagine a massive, high-stakes game show where researchers are competing for a limited pot of gold (government funding). To win, they must submit a detailed "grant proposal"—a blueprint for a future project that promises to change the world.

Right now, there are too many applicants and not enough human judges to read them all. It's a traffic jam of ideas. The authors of this paper asked: "Can we hire an AI (a Large Language Model or LLM) to be the referee?"

To find out, they didn't just ask the AI to read a proposal and give an opinion. That would be like asking a student to grade a test without an answer key. Instead, they built a controlled "trap" system to see if the AI could spot the mistakes.

Here is the breakdown of their experiment in simple terms:

1. The Setup: The "Broken Toy" Test

The researchers took six real, high-quality grant proposals (like six perfect, working toys). Then, they systematically "broke" them in specific ways to create 42 different versions.

Think of it like taking a perfect car and creating different faulty versions:

  • The "Budget" Break: They made the car cost $1 million instead of $10,000.
  • The "Timeline" Break: They said the car would be built in 2 days instead of 2 years.
  • The "Team" Break: They removed the only engineer who knew how to build the engine.
  • The "Clarity" Break: They removed all the labels and instructions, leaving the car parts in a pile with no explanation of what they do.

They then fed these "broken" proposals to three different types of AI reviewers to see if the AI would say, "Hey, this doesn't make sense!"
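To make the "breaking" concrete, here is a minimal Python sketch of the perturbation idea. The `Proposal` fields, the break functions, and their names are all illustrative assumptions, not the paper's actual code; the point is that each variant corrupts exactly one quality axis while leaving everything else intact, so a reviewer's miss can be traced to a known fault.

```python
# Illustrative sketch only: each "break" corrupts one quality axis
# of an otherwise sound proposal, producing a labeled faulty variant.
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Proposal:
    budget_gbp: int
    timeline_months: int
    team: tuple[str, ...]
    summary: str


def break_budget(p: Proposal) -> Proposal:
    # Inflate the budget far beyond what the plan justifies.
    return replace(p, budget_gbp=p.budget_gbp * 100)


def break_timeline(p: Proposal) -> Proposal:
    # Compress a multi-year plan into an implausible sprint.
    return replace(p, timeline_months=1)


def break_team(p: Proposal) -> Proposal:
    # Drop the one member with the critical expertise.
    return replace(p, team=p.team[1:])


def break_clarity(p: Proposal) -> Proposal:
    # Swap the explanation for unexplained jargon (made-up acronyms).
    return replace(p, summary="Deploy the MFT-7 via the QHR pipeline.")


PERTURBATIONS = {
    "budget": break_budget,
    "timeline": break_timeline,
    "team": break_team,
    "clarity": break_clarity,
}


def make_variants(p: Proposal) -> dict[str, Proposal]:
    # One labeled broken copy per quality axis.
    return {axis: fn(p) for axis, fn in PERTURBATIONS.items()}


if __name__ == "__main__":
    original = Proposal(
        budget_gbp=10_000,
        timeline_months=24,
        team=("lead engineer", "analyst"),
        summary="Build and evaluate a working prototype.",
    )
    for axis, broken in make_variants(original).items():
        print(axis, "->", broken)
```

Because every variant carries a ground-truth label of what was broken, the researchers can score the AI reviewers objectively instead of relying on opinion.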

2. The Three AI Reviewers

They tested three different ways the AI could read the proposals:

  • The "Speed Reader" (Baseline): The AI read the whole proposal in one giant gulp.
    • Result: It got overwhelmed. Like trying to drink from a firehose, it missed most of the broken parts.
  • The "Specialist Team" (Section-Level): The AI was told to read the proposal one section at a time (e.g., just the budget, then just the team).
    • Result: This was the winner. By focusing on one small piece at a time, it spotted the errors much better.
  • The "Council of Experts" (Personas): They created a virtual boardroom with five different AI characters: a "Money Saver," an "Ethics Police," a "Tech Hype-Man," a "Skeptic," and an "Impact Champion." They all argued, voted, and then a "Chairman" AI summarized the final decision.
    • Result: This was the most expensive and slowest method, but it didn't work any better than the Speed Reader. The extra complexity just added noise, not clarity.
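Here is a hypothetical Python sketch of the three setups. The `ask_llm` stub stands in for whatever chat-model client you use, and the prompts follow the paper's descriptions only loosely; this is not the authors' implementation.

```python
# `ask_llm` is a placeholder for any chat-model client; plug in your own.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your model of choice")


def review_whole(proposal: str) -> str:
    # "Speed Reader": one giant prompt over the full document.
    return ask_llm(f"Review this grant proposal and list its flaws:\n\n{proposal}")


def review_by_section(sections: dict[str, str]) -> dict[str, str]:
    # "Specialist Team": one focused prompt per section, so each call
    # attends to a small slice of text. This setup won in the paper.
    return {
        name: ask_llm(f"Review only the {name} section for flaws:\n\n{text}")
        for name, text in sections.items()
    }


PERSONAS = ["Money Saver", "Ethics Police", "Tech Hype-Man",
            "Skeptic", "Impact Champion"]


def review_by_council(proposal: str) -> str:
    # "Council of Experts": five personas review independently, then a
    # chair model merges their verdicts. Costliest setup, yet it did no
    # better than the baseline.
    opinions = [
        ask_llm(f"You are the {persona}. Review this proposal:\n\n{proposal}")
        for persona in PERSONAS
    ]
    return ask_llm(
        "You are the chair. Summarise these five reviews into one verdict:\n\n"
        + "\n---\n".join(opinions)
    )
```

The intuition behind the winner is simple: each section-level call sees a short, focused context, so an error can't get diluted inside one enormous prompt.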

3. The Big Surprises

The study found some interesting things about how AI thinks compared to human experts:

  • The "Obvious" vs. The "Subtle":
    • The AI was great at spotting obvious mismatches. If the proposal said "We are building a rocket" but the funding call was for "Baking bread," the AI immediately caught it.
    • The AI was terrible at spotting "Clarity" issues. If the proposal used confusing jargon or forgot to explain an acronym, the AI just glossed over it. It assumed the text made sense because it's good at guessing context, whereas a human would say, "Wait, what does this word mean? This is unclear."
  • The "Compliance" vs. "Vision" Gap:
    • Human reviewers look for the "big picture": Is this idea brilliant? Will it change the world?
    • The AI reviewers were obsessed with checklists: Did you fill out the budget form? Did you mention the ethics policy?
    • The AI was like a strict librarian checking if books are on the right shelf, while humans were the critics deciding if the books were actually good stories.

4. The Verdict: AI as a Co-Pilot, Not the Captain

The paper concludes that we shouldn't let AI replace human grant reviewers yet.

  • Why? Because AI is too inconsistent. Sometimes it's brilliant, sometimes it misses the biggest red flags. It also prioritizes "following the rules" over "judging the quality of the idea."
  • The Solution: AI is a great assistant. Imagine a human reviewer using AI as a "spell-checker for logic." The AI can quickly scan the document to say, "Hey, the budget doesn't match the timeline," or "You forgot to list the ethics approval." This frees up the human expert to focus on the hard part: deciding if the research idea is truly worth funding.
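As a thought experiment, the "spell-checker for logic" could look something like the sketch below. Everything here (field names, checks, thresholds) is assumed purely for illustration; the paper does not ship such a tool. The design point is that mechanical cross-checks surface mismatches for a human reviewer, who still makes the funding call.

```python
# Assumed illustration of a "logic spell-checker" pass, not a tool
# from the paper: cheap cross-checks flag inconsistencies for a human.
def consistency_flags(proposal: dict) -> list[str]:
    flags = []
    # Budget check: itemised costs should not exceed the stated total.
    if sum(proposal["cost_items"].values()) > proposal["total_budget"]:
        flags.append("budget: itemised costs exceed the stated total")
    # Timeline check: promised effort must fit the team's capacity.
    capacity = proposal["duration_months"] * len(proposal["team"])
    if proposal["effort_person_months"] > capacity:
        flags.append("timeline: promised effort exceeds team capacity")
    # Compliance check: required paperwork should at least appear.
    if "ethics" not in proposal["full_text"].lower():
        flags.append("compliance: no mention of ethics approval")
    return flags


if __name__ == "__main__":
    draft = {
        "cost_items": {"staff": 90_000, "equipment": 30_000},
        "total_budget": 100_000,
        "duration_months": 2,
        "team": ["principal investigator"],
        "effort_person_months": 24,
        "full_text": "We will build the system.",
    }
    for flag in consistency_flags(draft):
        print("FLAG:", flag)
```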

The Takeaway

We are currently in a "Malthusian trap" for research funding (a fancy way of saying demand is outpacing supply). We have too many ideas and too little time to read them.

This paper says: Don't let the AI drive the car yet. It will crash if you give it a long, complex road. But if you let the AI sit in the passenger seat and point out the potholes (errors) and missing signs (compliance issues), it can help the human driver reach the destination faster and more safely.