This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine you are a coach for a team of young athletes (medical residents) who are training to become experts. Part of their training involves writing a "playbook" (a scholarly research project) to prove they understand the game.
Traditionally, the head coach (the human expert) has to read every single playbook, write detailed notes on what was good and what was bad, and hand it back. The problem? There are so many players (170+ residents) and so many playbooks that the coach gets overwhelmed. Sometimes, a player waits two months just to get their notes back. By then, they've forgotten what they were thinking, and the learning momentum stalls.
The Big Question: Can we hire a super-fast, tireless robot assistant (Artificial Intelligence) to write these notes instead? And if we do, will the players learn just as well as they would from the human coach?
The Experiment: The "Robot Coach" vs. The "Human Coach"
The researchers at the University of Ottawa set up a massive test. They took 240 different playbooks written by residents at three different stages of their training:
- The Sketch: A rough idea (Short Report).
- The Draft: A plan with a timeline (Question & Timeline).
- The Final: The finished playbook (Final Report).
They split the work in half:
- Team Human: Real experts wrote feedback for 120 playbooks.
- Team Robot: An AI (a "brain" called LLaMA-3.1) wrote feedback for the other 120.
Then, a panel of "judge coaches" (who didn't know which feedback was written by whom) scored the notes based on five things:
- Did they understand the game? (Reasoning)
- Did they sound like a real coach? (Persona)
- Was the advice actually useful? (Quality)
- Did the player trust the advice? (Trust)
- Was the advice safe and polite? (Safety)
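For the technically curious, that blinded judging setup can be sketched in a few lines of code. This is a hypothetical illustration only, not the authors' actual pipeline: the five rubric names come from the list above, but every class, function, and variable here is invented for the example.

```python
import random
from dataclasses import dataclass, field

# The five rubric dimensions described above.
RUBRIC = ["reasoning", "persona", "quality", "trust", "safety"]

@dataclass
class FeedbackNote:
    project_id: int
    source: str                      # "human" or "ai" -- hidden from judges
    text: str
    scores: dict = field(default_factory=dict)

def blind(notes):
    """Shuffle the notes and strip authorship, so judges see only the text."""
    shuffled = random.sample(notes, k=len(notes))
    return [(i, note.text) for i, note in enumerate(shuffled)], shuffled

def record_score(note, judge_ratings):
    """Attach one judge's 1-5 ratings, one per rubric dimension."""
    assert set(judge_ratings) == set(RUBRIC)
    note.scores = judge_ratings

# Example: two notes on the same project, one per "team", scored blind.
notes = [
    FeedbackNote(1, "human", "Consider narrowing your research question..."),
    FeedbackNote(1, "ai", "Your timeline is feasible, but clarify the methods..."),
]
blinded_view, order = blind(notes)
record_score(order[0], {dim: 4 for dim in RUBRIC})
```

The key design point is that `blinded_view` carries no `source` field at all, which is what keeps the judges honest about which coach wrote which note.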
The Results: Who Won?
The results were a bit like a sports match where the outcome depends on which game you are playing.
1. The "Early Season" Struggle (Short Reports)
When the residents were just starting with rough, messy ideas (Short Reports), the Human Coach won easily.
- The Analogy: Imagine a robot trying to critique a sketch drawn on a napkin. The robot gets confused, gives generic advice like "draw better," and the player feels like, "This robot doesn't get me."
- The Score: Humans were much better at understanding the messy context and giving specific, helpful advice. The robot's feedback felt cold and vague.
2. The "Mid-Season" Improvement (Question & Timeline)
As the projects got more structured, the gap narrowed. The robot started to get the hang of the rules. It was still a bit behind the human, but it was getting closer.
3. The "Championship" Round (Final Reports)
By the time the projects were finished, the Robot Coach was almost as good as the Human Coach, and in some cases, even better!
- The Analogy: When the playbook is finished and detailed, the robot can read the whole thing perfectly. It didn't miss a single rule.
- The Surprise: The robot actually gave safer feedback than the humans. It never got angry, never used harsh words, and never made a "safety" mistake. It was the perfect, polite, non-judgmental coach.
- Specific Wins: For projects that were very data-heavy (like surveys), the robot was actually better than the humans at spotting errors and giving high-quality advice.
The "Secret Sauce" and the Catch
Why did the robot get better over time?
The researchers didn't just turn the robot on and hope for the best. They "trained" it by showing it examples of good feedback (like showing a student a model essay). They also built a system where a human could read the robot's notes and tweak them before sending them to the student. This is called a "Human-in-the-Loop" system. Think of it as the robot writing the first draft of the notes, and the human coach just signing off on them.
The Catch (Where the Robot Fails)
The robot still struggles with Quality Improvement (QI) projects.
- The Analogy: QI projects are like fixing a specific leak in a specific building's plumbing. The robot knows general plumbing rules, but it doesn't know that this specific building has weird pipes or that the manager hates loud noises. It misses the "local flavor" and context that a human who knows the hospital would catch.
The Bottom Line
Can AI replace human experts?
Not yet. If you need deep, emotional, or highly contextual advice on a messy, early-stage idea, you still need a human.
Can AI help?
Absolutely.
- Speed: The robot can write feedback in minutes, not weeks.
- Consistency: Every student gets a "base level" of good feedback, so no one falls through the cracks.
- Safety: The robot is incredibly polite and safe.
The Future Vision
The authors suggest a new way of teaching: Don't let the AI teach for us; teach students how to think with the AI.
Imagine a future where the robot gives the student a draft of feedback instantly. The student reads it, thinks, "Hmm, the robot missed this part," and then goes to the human coach to discuss the nuance. The human coach then spends their time on the deep, complex mentorship, while the robot handles the heavy lifting of checking the rules and grammar.
In short: the robot is a fantastic assistant, but it isn't ready to be the head coach just yet. With a little help from humans, though, it's getting there fast.