This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine you are a teacher with a mountain of student essays to grade. These aren't just math problems with right or wrong answers; they are reflection essays where students write about their feelings and lessons learned. Grading these is hard work. It takes time, it's expensive to hire enough teachers, and even human teachers get tired, leading to inconsistent scores.
Enter the AI Grader (specifically, Large Language Models like the ones powering chatbots). The question this paper asks is: "Can a robot grade these essays as well as a human, and what's the best way to tell the robot how to do it?"
The researchers treated this like a cooking competition. They had 51 essays (some real, some made up by another AI to test the system) and tried 29 different "recipes" (prompts and settings) to see which one produced the tastiest (most accurate) results.
Here is the breakdown of their findings in plain English:
1. The "Recipe" Matters (Prompt Engineering)
Think of the AI as a very smart but literal-minded sous-chef. If you just say, "Make a cake," it might burn it. You have to give it a specific recipe. The researchers tested different ways of giving instructions:
- The "Full Manual" vs. "Cheat Sheet" (Rubrics):
- The Finding: Giving the AI the full, detailed rulebook (the scoring rubric) worked best.
- The Analogy: It's like giving a sous-chef a 10-page recipe with exact temperatures and times. If you only give them a 1-page summary or no instructions at all, the cake (the score) gets messy. The more specific the rules, the better the grade.
- The "Show, Don't Just Tell" (Exemplars/Few-Shot Learning):
- The Finding: Showing the AI examples of essays that were already graded correctly helped it get better scores.
- The Analogy: Instead of just saying, "Write a good essay," you show the AI three sample essays and say, "See how this one got a 6? And how this one got a 2? Now grade this new one like that." These worked examples made the AI much more accurate (a rough sketch of such a prompt appears after this list).
- The "Step-by-Step" Thinking (Chain of Thought):
- The Finding: Surprisingly, asking the AI to "think step-by-step" or explain its reasoning before giving a score didn't help and sometimes made it worse.
- The Analogy: It's like asking a chef to narrate every chop and stir while cooking. It slowed them down and didn't make the food taste better. The AI just needed to know the rules, not talk through its process.
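To make the winning "recipe" concrete, here is a minimal sketch of what a rubric-plus-examples grading prompt might look like in code. Everything here is a placeholder: the rubric text, the example essays and scores, and the model name are stand-ins for illustration, not the paper's actual materials or exact setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder rubric -- the real paper uses its own reflection-essay rubric.
RUBRIC = """Score each essay from 1 (minimal reflection) to 6 (deep reflection).
6: connects the experience to personal growth with specific evidence.
1: merely summarizes events with no reflection."""

# Placeholder graded examples (the "show, don't just tell" exemplars).
EXEMPLARS = [
    {"essay": "During the project I realized I avoid conflict...", "score": 6},
    {"essay": "We did the project. It was fine.", "score": 2},
]

def build_messages(new_essay: str) -> list[dict]:
    """Assemble the full rubric, then worked examples, then the essay to grade."""
    messages = [{"role": "system",
                 "content": "You are an essay grader. Reply with a single integer score.\n\n" + RUBRIC}]
    for ex in EXEMPLARS:  # few-shot exemplars: show already-graded essays first
        messages.append({"role": "user", "content": ex["essay"]})
        messages.append({"role": "assistant", "content": str(ex["score"])})
    messages.append({"role": "user", "content": new_essay})
    return messages

response = client.chat.completions.create(
    model="gpt-4.1-mini",                      # the cheaper "Honda Civic" model
    messages=build_messages("My essay text..."),
    temperature=0,                             # keep the grading consistent
)
print(response.choices[0].message.content)     # e.g. "4"
```

Notice what is not in there: no "think step-by-step" instruction. Per the finding above, the rubric and the worked examples do the heavy lifting.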
2. The "Training" Matters (Fine-Tuning)
- The Finding: If you take a generic AI and "train" it specifically on your grading style (using 18 practice essays), it becomes a dedicated grading machine (a sketch of what that training data might look like follows this list).
- The Analogy:
- Non-Fine-Tuned AI: Like hiring a general contractor who knows how to build houses but has never built your specific style of house. They do a decent job, but you have to explain everything.
- Fine-Tuned AI: Like hiring a contractor who has built 100 houses exactly like yours. They know exactly what you want. They are the most accurate, but the "training" costs money upfront.
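For the fine-tuning route, the training data is the same kind of material: essays paired with the scores a human already gave them. Below is a rough sketch of how 18 practice essays might be packaged into a training file using OpenAI's chat-style JSONL format; the essay text, scores, and base model name are illustrative assumptions, not the paper's actual data or configuration.

```python
import json
from openai import OpenAI

# Placeholder graded essays -- the paper used 18 human-scored practice essays.
graded_essays = [
    {"essay": "During the project I realized I avoid conflict...", "score": 6},
    {"essay": "We did the project. It was fine.", "score": 2},
    # ... 16 more graded essays ...
]

# Write one chat-formatted example per line (JSONL), the format the
# OpenAI fine-tuning endpoint expects for chat models.
with open("grading_train.jsonl", "w") as f:
    for item in graded_essays:
        example = {"messages": [
            {"role": "system", "content": "Grade the reflection essay from 1 to 6. Reply with the score only."},
            {"role": "user", "content": item["essay"]},
            {"role": "assistant", "content": str(item["score"])},
        ]}
        f.write(json.dumps(example) + "\n")

client = OpenAI()
training_file = client.files.create(file=open("grading_train.jsonl", "rb"),
                                    purpose="fine-tune")
# Base model name is an assumption -- check which models currently support fine-tuning.
job = client.fine_tuning.jobs.create(training_file=training_file.id,
                                     model="gpt-4.1-mini-2025-04-14")
print(job.id)  # once the job finishes, grade new essays with the returned model name
```

Once the job finishes, the fine-tuned model can grade new essays without needing the long rubric and examples in every single prompt.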
3. The "Model" Matters (Which Robot?)
- The Finding: Newer, smarter AI models (like GPT-4.1) were much better than older ones. However, the "mini" versions (smaller, cheaper models) were surprisingly good too.
- The Analogy:
- GPT-4.1: The Ferrari. Fast, powerful, and expensive.
- GPT-4.1-mini: The reliable Honda Civic. It gets you to the destination (a good grade) just fine, at a fraction of the price.
- Old Models (GPT-3.5): The horse-drawn carriage. It works, but it's slow and makes mistakes.
4. The Cost vs. Quality Balance
The researchers looked at the price tag for grading 100 essays:
- The "Mini" Model: Cost about $0.04 for 100 essays. (A penny a grade!) It was very accurate.
- The "Full" Model: Cost about $0.21 for 100 essays.
- The "Trained" Model: Cost about $2.00 upfront to train, but if you graded 10,000 essays, the cost per essay dropped so low it became the cheapest option in the long run.
The Verdict:
- If you have a small pile of essays (e.g., 100): Just use the "Mini" model with a clear set of rules. It's cheap and nearly perfect.
- If you have a massive mountain of essays (e.g., 10,000): Spend the time and money to "train" the AI first. It becomes the most accurate and cheapest option in the long run.
The Big Takeaway
AI is now ready to be your grading assistant, and you don't need to be a genius prompt engineer to get great results.
- Give it a clear rulebook.
- Show it a few examples of good and bad grades.
- Use a modern, slightly cheaper AI model.
The result? You get grades that are almost identical to what a human teacher would give, for just pennies per 100 essays. The only catch? The AI might be a little too polite or "perfect" in its grammar, so human teachers should still double-check the final work, just to be safe.