Imagine you are a doctor trying to grade a stack of medical exam papers written by students. But here's the catch: you are too busy to read them all yourself. So, you decide to hire a team of "AI Tutors" to grade the papers for you.
This paper is about testing those AI Tutors to see if they are actually good at their jobs, specifically when the exam is in French and the subject is medicine.
Here is the breakdown of their experiment, explained with some everyday analogies:
1. The Problem: The "Human Grader" Bottleneck
In the medical world, checking if an answer is correct isn't just about matching words. It's about meaning.
- The Old Way: Imagine grading an essay by counting how many words match the teacher's answer key. If the student uses a different term for "heart attack" (like "myocardial infarction"), this word-matching approach marks the answer wrong, even though it is medically perfect.
- The Real Challenge: To grade these medical answers properly, you need a real doctor to read every single one. This is slow, expensive, and hard to scale.
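To make the "old way" concrete, here is a minimal sketch of lexical-overlap scoring (a token-level F1, one common word-matching metric). The example answers are illustrative, not taken from the paper; note how a medically perfect synonym still loses a third of the score.

```python
def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a candidate answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = set(pred_tokens) & set(ref_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "the patient suffered a heart attack"
student = "the patient suffered a myocardial infarction"

# The student's answer means exactly the same thing, yet word matching
# penalizes it for the synonym.
print(round(token_f1(student, reference), 2))  # → 0.67
```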
2. The New Idea: "The AI Judge"
The researchers asked: Can we train an AI to act like the doctor?
They used different types of AI "Judges" to decide whether a student's answer was semantically equivalent to (meant the same thing as) the correct medical answer.
They tested three types of judges:
- The Big Generalists: Famous, massive AIs (like GPT-5 or Gemini) that know a little bit about everything.
- The Medical Specialists: AIs that were specifically trained on medical textbooks (like MedGemma).
- The Small, Compact Models: Tiny, efficient AIs (like Phi-3.5) that usually aren't very smart on their own.
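Whichever judge is used, the basic mechanics are the same: wrap the question, the reference answer, and the candidate answer in a prompt and ask for a binary verdict. The sketch below is an illustrative assumption about that setup, not the paper's exact prompt; `build_judge_prompt` and `parse_verdict` are hypothetical helpers.

```python
def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Assemble the instruction sent to the judge model (any of the three types)."""
    return (
        "You are a medical examiner. Decide whether the candidate answer "
        "means the same thing as the reference answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Reply with exactly one word: EQUIVALENT or DIFFERENT."
    )

def parse_verdict(judge_reply: str) -> bool:
    """Map the judge's free-text reply to a binary grade."""
    return judge_reply.strip().upper().startswith("EQUIVALENT")

# The prompt would be sent to the judge model through its chat API;
# here we only show the plumbing around the call.
prompt = build_judge_prompt(
    "Quel diagnostic pose-t-on ?",
    "un infarctus du myocarde",
    "une crise cardiaque",
)
print(parse_verdict("EQUIVALENT"))  # → True
```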
3. The Big Surprise: The "Bias" of the Judge
The researchers discovered a weird quirk: The AI Judge's score depended heavily on who wrote the answer.
- The Analogy: Imagine a strict teacher who loves long, flowery essays. If a student writes a short, punchy answer that is 100% correct, this teacher might give them a bad grade because it "doesn't look like a good answer."
- The Finding: Some AI Judges were biased. If the answer came from a specific type of AI (like a "Qwen" model), the Judge gave it a high score. If the answer came from a different AI (like a "Llama" model) that wrote more concisely, the same Judge gave it a low score, even if the medical facts were identical.
- The Lesson: You can't just trust an AI Judge blindly; you have to know which "style" of answer they prefer.
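One simple way to surface this kind of bias is to take only answers a doctor has already marked correct, then compare the judge's acceptance rate per answer-writing model. The records below are made-up illustrations, not the paper's data; an unbiased judge would score every group roughly the same.

```python
from collections import defaultdict

def acceptance_by_source(records):
    """records: (source_model, judge_accepted) pairs for doctor-approved answers."""
    totals = defaultdict(int)
    accepted = defaultdict(int)
    for source, judge_accepted in records:
        totals[source] += 1
        accepted[source] += int(judge_accepted)
    return {model: accepted[model] / totals[model] for model in totals}

# All eight answers were labeled correct by the doctor, yet the judge
# accepts one model's phrasing far more often than the other's.
records = [
    ("qwen", True), ("qwen", True), ("qwen", True), ("qwen", False),
    ("llama", True), ("llama", False), ("llama", False), ("llama", False),
]
print(acceptance_by_source(records))  # → {'qwen': 0.75, 'llama': 0.25}
```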
4. The Winners and Losers
- The Specialists Won: The AI trained specifically on medicine (MedGemma) was the most consistent. It didn't care as much about the writing style; it just cared if the medical facts were right.
- The Big Generalists were "Too Picky": The massive, famous AIs were very good at spotting errors, but they were too strict. They often rejected correct answers just because the wording was slightly different from their expectations.
- The Small Model was "Too Nice": The tiny AI (Phi-3.5) initially gave everyone a passing grade, even when the answer was wrong. It was too eager to please.
5. The Magic Fix: "Training the Small Model"
Here is the most exciting part. The researchers took that tiny, "too nice" AI and gave it a crash course using a small amount of data (only about 184 examples) from a real doctor.
They used two training techniques:
- SFT (Supervised Fine-Tuning): Like a teacher showing the student the right answers and saying, "Do it like this."
- GRPO (Group Relative Policy Optimization): A reinforcement-learning technique. Like a coach giving feedback during practice: "That was good, but try to be a bit stricter here."
The Result: After this quick training, the tiny AI became almost as good as the massive, expensive medical specialists. It learned to stop being "too nice" and started grading accurately, all while using a fraction of the computing power.
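The two training signals described above can be sketched as follows. These are simplifying assumptions for illustration (the field names and reward scheme are hypothetical, not the paper's exact recipe): SFT shows the small judge the doctor's verdict directly, while GRPO rewards sampled verdicts that agree with the doctor.

```python
def sft_example(question, reference, candidate, doctor_verdict):
    """Format one supervised example: prompt in, the doctor's verdict out."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Equivalent? "
    )
    return {"prompt": prompt, "completion": "yes" if doctor_verdict else "no"}

def grpo_reward(judge_verdict: bool, doctor_verdict: bool) -> float:
    """Reward 1.0 when the judge's sampled verdict agrees with the doctor."""
    return 1.0 if judge_verdict == doctor_verdict else 0.0

# The "too nice" judge accepts a wrong answer; the reward signal pushes
# it toward stricter, doctor-aligned grading.
print(grpo_reward(judge_verdict=True, doctor_verdict=False))  # → 0.0
```

With only a few hundred doctor-labeled examples, SFT gives the model the target format and GRPO then tunes how strict it is.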
6. The Takeaway
- Don't trust the "Big Name" blindly: Just because an AI is huge and famous doesn't mean it's the best at grading medical answers.
- Watch out for bias: AI Judges often have favorites based on how the answer was written, not just what it says.
- Small is beautiful: You don't need a supercomputer to build a good medical grader. If you take a small AI and train it carefully with a little bit of expert help, it can do a fantastic job.
In short: The paper proves that we can build reliable, affordable tools to check medical AI answers, but we have to be careful about who we ask to do the grading and how we train them.