Imagine you are trying to buy car insurance in Quebec, Canada. Thanks to new laws, you can do it entirely online without talking to a human agent. But the insurance contracts are huge, confusing, and written in complex legal language. Most people are left trying to decipher them alone, creating a dangerous "advice gap."
This paper asks: Can Artificial Intelligence (AI) step in to act as a reliable, legally savvy guide for these people?
The researchers from Laval University built a massive test to find out. Here is the story of their experiment, explained simply with some analogies.
1. The Exam: "The Quebec Insurance Gauntlet"
To test the AI, the researchers created a secret, high-stakes exam called AEPC-QA.
- The Source: They took 807 questions from official, paper-only study guides used to certify real human insurance agents. Because these guides are only on paper and not on the internet, the AI couldn't have "cheated" by memorizing them beforehand.
- The Difficulty: The questions are tricky. They don't just ask "What is insurance?" They pose scenario questions like, "If a golfer gets hit by a ball, which legal defense is irrelevant?" Answering requires deep logic, not just guessing.
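Under the hood, a multiple-choice benchmark like this is typically scored by comparing each model's chosen option against an answer key. Here is a minimal sketch of that scoring step; the question IDs and answers are invented placeholders, not items from the actual AEPC-QA exam:

```python
# Minimal sketch of multiple-choice benchmark scoring.
# All question IDs and letters below are hypothetical examples,
# NOT items from the real AEPC-QA exam.

def score_exam(model_answers: dict, answer_key: dict) -> float:
    """Return the fraction of questions the model answered correctly."""
    correct = sum(
        1 for qid, choice in model_answers.items()
        if answer_key.get(qid) == choice
    )
    return correct / len(answer_key)

answer_key = {"Q1": "B", "Q2": "D", "Q3": "A"}     # ground-truth answers
model_answers = {"Q1": "B", "Q2": "C", "Q3": "A"}  # one model's picks

accuracy = score_exam(model_answers, answer_key)
print(f"{accuracy:.1%}")  # → 66.7%
```

With 807 questions, the same loop just runs over a longer answer key, once per model.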
2. The Contestants: 51 Different AI Brains
They pitted 51 different Large Language Models (LLMs) against this exam. Think of these models as different types of students:
- The General Geniuses: Massive, smart models trained on everything (like a student who read every book in the library).
- The Specialists: Smaller models specifically trained on French or legal topics.
- The "Reasoners": A new type of AI that pauses to "think" step-by-step before answering, like a detective solving a puzzle.
They tested them in two ways:
- Closed-Book: The AI has to answer from memory alone.
- Open-Book (RAG): The AI is allowed to look up answers in a digital library of Quebec insurance laws before answering.
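The open-book setup above is a retrieval-augmented generation (RAG) loop: score each passage in the digital library against the question, keep the top matches, and paste them into the prompt before asking the model. Here is a toy sketch using simple bag-of-words cosine similarity; the corpus snippets are illustrative placeholders, not the paper's actual pipeline or documents:

```python
# Toy RAG retrieval step: rank passages by bag-of-words cosine
# similarity to the question. The corpus text is made up for
# illustration, not taken from the Quebec legal corpus in the paper.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the question."""
    q_vec = Counter(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: cosine(q_vec, Counter(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

corpus = [
    "Article 2389: an insurance contract requires the insurer to pay an indemnity",
    "Article 2402: exclusions must be clearly indicated in the policy",
    "Recipe for maple syrup pie with a flaky crust",
]
question = "What does an insurance contract require the insurer to do?"
context = "\n".join(retrieve(question, corpus))
prompt = f"Context:\n{context}\n\nQuestion: {question}"
# `prompt` would then be sent to the LLM; closed-book mode simply
# omits the Context section and asks the question alone.
```

Real systems use learned embeddings instead of word counts, but the shape of the loop (retrieve, then generate) is the same.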
3. The Big Surprises (The Results)
🧠 Surprise #1: Thinking Beats Memorizing
The biggest winners weren't the models that had memorized the most laws. They were the "Reasoners" (like the o3 and o1 models).
- The Analogy: Imagine a student who memorized the dictionary vs. a student who knows how to solve a math problem. When the question is tricky, the one who knows how to think wins. The top AI scored nearly 79% correct, proving that logic is more important than memory for legal advice.
📚 Surprise #2: The "Knowledge Equalizer" vs. The "Distraction"
This is where it gets weird.
- The Equalizer: For some AI models that were "dumb" about Quebec law, giving them the digital library (RAG) was a miracle. Their scores jumped by over 35%. It's like giving a student a cheat sheet, and suddenly they ace the test.
- The Distraction: But for other very smart models, the library was a trap. When they tried to read the extra documents, they got confused, panicked, or started refusing to answer because the text looked "too legal" or "too risky."
- The Metaphor: Imagine a brilliant chef who can cook a perfect steak. If you suddenly hand them a 50-page manual on "How to Cook a Steak" while they are cooking, they might get so distracted by the instructions that they burn the steak or refuse to cook at all. This is called "Context Distraction."
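One way to surface both effects at once is to compare each model's closed-book and open-book scores: a positive delta means retrieval acted as an equalizer, a negative delta means it acted as a distraction. The model names and numbers below are invented for illustration, not results from the paper:

```python
# Hypothetical closed-book vs. open-book (RAG) accuracies per model.
# A negative delta means the extra documents hurt ("context distraction").
# These figures are made up to illustrate the comparison, NOT the
# paper's actual leaderboard.
scores = {
    "small-model-a": {"closed": 0.41, "open": 0.78},  # RAG as equalizer
    "big-model-b":   {"closed": 0.74, "open": 0.66},  # RAG as distraction
}

for model, s in scores.items():
    delta = s["open"] - s["closed"]
    label = "helped" if delta > 0 else "hurt"
    print(f"{model}: {delta:+.2f} ({label} by retrieval)")
```

Flagging the negative-delta models is exactly how you would spot the "burnt steak" cases in a real evaluation.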
🇫🇷 Surprise #3: The "Specialist" Trap
You might think a model specifically trained on French insurance laws would win. It didn't.
- The massive, general-purpose models (which speak many languages and know many topics) crushed the small, specialized French models.
- The Lesson: Being a "specialist" in one narrow area didn't help as much as being a "generalist" with strong reasoning skills. The small French models were so focused on the words that they missed the logic.
4. The Verdict: Are We Ready?
The paper concludes that while AI is getting very good (approaching human expert levels), we aren't ready to let it run the show alone yet.
- The Risk: If an AI gives bad advice, the insurance company is legally liable (just like a human agent).
- The Problem: The "Context Distraction" issue means that sometimes, giving an AI more information makes it worse. This is dangerous for real-world use.
- The Solution: We need a "Human-in-the-Loop." AI should do the heavy lifting and suggest answers, but a human must double-check them, especially when the AI is confused by the extra documents.
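A human-in-the-loop setup like this is often implemented as a simple gate: answers that look confident go out, while refusals or low-confidence answers are routed to a human reviewer. A minimal sketch; the threshold and routing labels are illustrative assumptions, not a design from the paper:

```python
# Sketch of a human-in-the-loop gate. The 0.9 threshold and the
# routing labels are illustrative choices, not values from the paper.
REVIEW_THRESHOLD = 0.9

def route(answer: str, confidence: float) -> str:
    """Send empty (refused) or low-confidence answers to a human."""
    if answer.strip() == "" or confidence < REVIEW_THRESHOLD:
        return "human_review"
    return "auto_send"

print(route("The policy covers hail damage.", 0.95))  # auto_send
print(route("", 0.95))                                # human_review (refusal)
print(route("Coverage applies.", 0.60))               # human_review (low confidence)
```

The gate is deliberately conservative: the failure modes described above (refusals, confusion from retrieved documents) all land in the human queue instead of reaching a customer.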
Summary in One Sentence
AI is becoming smart enough to understand complex insurance laws, but it's currently too unstable to trust blindly; it needs a human supervisor to make sure it doesn't get distracted by its own library or give dangerous advice.