Self-hosted Lecture-to-Quiz: Local LLM MCQ Generation with Deterministic Quality Control

Imagine you are a teacher who wants to turn a 50-page lecture PDF into a fun, multiple-choice quiz for your students. Usually, you might copy-paste that text into a powerful AI chatbot (like a cloud-based "genius") and ask it to write the questions. But that has two big problems:

Privacy: You are sending your private, unpublished lecture notes to a stranger's computer.
Reliability: The AI might make mistakes, like giving two answers that are both correct, or writing the same question twice, and you might not notice until it's too late.

This paper presents a solution called L2Q (Lecture-to-Quiz). Think of it as building a private, self-contained factory right in your own classroom (or on your own computer) to make these quizzes, without ever calling the outside world for help.

Here is how the process works, explained with some everyday analogies:

1. The Setup: The "Local Factory"

Instead of mailing your lecture notes to a big tech company's cloud, you run the "AI brain" (a Local LLM) right on your own machine.

The Analogy: Imagine you have a very smart, but slightly distracted, apprentice working in your kitchen. You give them your secret family recipe (the lecture PDF). They stay in your kitchen the whole time. No one else sees the recipe.

2. The Process: The "Assembly Line"

The system doesn't just ask the AI to "make a quiz." It runs a strict, step-by-step assembly line:

Step A: The Plan. The AI reads the lecture and makes a tiny shopping list of topics to cover.
Step B: The Drafting. The AI writes the questions. It's told to write them in a very specific format (like filling out a strict form), so the computer can read them easily.
Step C: The "Bouncer" (Quality Control). This is the most important part. Before a question is allowed to leave the factory, it goes through a Bouncer (a set of strict computer rules).
- The Rule: "Do you have exactly 5 options?"
- The Rule: "Is there only ONE correct answer?"
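- The Rule: "Did you accidentally write the same wrong answer twice?" (e.g., Option C and Option E both equal 5).
- The Rule: "If the answer is a decimal, did you tell the student how many digits to round to?"
(In the paper's rule set, a missing rounding instruction or two wrong answers that work out to the same number raises a warning for the teacher to review rather than forcing a redo.)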
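The structural rules are simple enough to write as ordinary code. Here is a minimal Python sketch, assuming each question is a dictionary with the hypothetical fields `options` (labels A to E mapped to text) and `correct` (a list of the labels marked correct). These field names are illustrative and do not come from the paper.

```python
def structural_problems(question: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the question passes."""
    problems = []
    options = question.get("options", {})
    correct = question.get("correct", [])

    # Rule: exactly five options labeled A through E.
    if sorted(options) != ["A", "B", "C", "D", "E"]:
        problems.append("expected exactly five options labeled A-E")

    # Rule: exactly one option marked correct, and it must be a real label.
    if len(correct) != 1 or correct[0] not in options:
        problems.append("expected exactly one correct option")

    # Rule: no option text repeated (case-insensitive, ignoring outer spaces).
    normalized = [str(text).strip().lower() for text in options.values()]
    if len(set(normalized)) != len(normalized):
        problems.append("two options have the same text")

    return problems
```

Returning a list of problems instead of a plain yes/no makes it easy to tell the AI (or the teacher) exactly what went wrong.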

3. The "Retry" Mechanism

If the Bouncer catches a mistake, the AI doesn't just get a warning; it has to redo the question immediately.

The Analogy: It's like a baker whose cake comes out lopsided. The Bouncer says, "That cake is lopsided." The baker doesn't argue; they just throw it in the trash and bake a new one. The system retries a failed question up to 3 times, stopping as soon as it passes the test.
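A bounded retry loop of this kind might look like the following sketch. The `draft` and `check` functions are hypothetical stand-ins for the AI call and the Bouncer, and reading "up to 3 times" as one first attempt plus three retries is an assumption.

```python
from typing import Callable, Optional


def generate_with_retries(
    draft: Callable[[int], dict],
    check: Callable[[dict], list],
    max_retries: int = 3,
) -> tuple[Optional[dict], int]:
    """Call draft() until check() reports no problems or the retries run out.

    Returns (question, attempts_used); question is None if every attempt failed.
    """
    attempts = 0
    for attempt in range(1 + max_retries):  # one first try plus the retries
        attempts += 1
        candidate = draft(attempt)
        if not check(candidate):  # an empty problem list means "passed"
            return candidate, attempts
    return None, attempts


# Fake drafter for demonstration: the first attempt is flawed, the second is fine.
def fake_draft(attempt: int) -> dict:
    return {"ok": attempt >= 1}


def fake_check(question: dict) -> list:
    return [] if question["ok"] else ["lopsided cake"]


question, attempts = generate_with_retries(fake_draft, fake_check)
print(question, attempts)  # prints: {'ok': True} 2
```

The cap matters: without it, a question the AI can never get right would keep the factory running forever.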

4. The Result: A "Static" Quiz

Once the questions pass the Bouncer, they are saved as a simple text file (like a spreadsheet).

Why this matters: When your students take the quiz later, they do not need an AI. They just open the file. The AI is gone. The quiz is just a piece of paper (or a digital file) that is safe, fast, and doesn't cost money to run every time a student takes it.
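To make the "static file" idea concrete, here is a small sketch that writes accepted questions to JSONL and CSV using only Python's standard library. The field names and column order are assumptions for illustration; the paper's actual file layout is not described here.

```python
import csv
import json
from pathlib import Path


def export_quiz(questions: list[dict], out_dir: str) -> tuple[Path, Path]:
    """Write the accepted questions as JSONL (one object per line) and as CSV."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    jsonl_path = out / "quiz.jsonl"
    with jsonl_path.open("w", encoding="utf-8") as f:
        for q in questions:
            f.write(json.dumps(q, ensure_ascii=False) + "\n")

    csv_path = out / "quiz.csv"
    labels = ["A", "B", "C", "D", "E"]
    with csv_path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["question", *labels, "correct", "explanation"])
        for q in questions:
            option_texts = [q["options"][label] for label in labels]
            writer.writerow([q["question"], *option_texts, q["correct"], q["explanation"]])

    return jsonl_path, csv_path
```

Nothing in these files refers back to the AI, which is why the quiz can be handed out with no model running.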

The "Seed Sweep" Experiment

The author tested this on three short, made-up lectures about entropy (one each from information theory, thermodynamics, and statistical mechanics).

They ran the factory 15 times with different random settings (like shuffling the deck of cards).
They generated 120 questions.
The Good News: Every single question passed the strict "Bouncer" checks. The system was very stable.
The "Warning" Layer: Even though every question passed the strict checks, the system flagged 8 of them for a human to double-check. Seven asked for a number without saying how to round it (for example, no "round to two decimal places"), and one had two wrong answers that worked out to the same value. The computer doesn't try to fix these itself; it raises a flag so the teacher can.
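A warning check for missing rounding instructions could be as simple as the heuristic below. This is a guess at the general idea, not the paper's rule: the keyword list and the function name `rounding_warning` are assumptions.

```python
import re

# Words that suggest the question already tells the student how to round.
ROUNDING_HINT = re.compile(
    r"round|decimal place|significant (figure|digit)|nearest", re.IGNORECASE
)
# A number with digits on both sides of a decimal point, e.g. 0.69 or -3.14.
DECIMAL_NUMBER = re.compile(r"-?\d+\.\d+")


def rounding_warning(stem: str, correct_text: str) -> bool:
    """True if the answer contains a decimal but the question gives no rounding rule."""
    has_decimal = DECIMAL_NUMBER.search(correct_text) is not None
    has_hint = ROUNDING_HINT.search(stem) is not None
    return has_decimal and not has_hint
```

A flagged question is still kept; the flag only tells the teacher where to look first.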

Why is this a big deal? (The "AI2L" Lens)

The paper connects this to a philosophy called AI2L (AI to Learn).

Privacy: Your lecture notes never leave your computer.
Accountability: You can see exactly how the question was made and check the math.
Green AI: You don't need to keep paying for expensive cloud servers every time a student takes a quiz. You generate the quiz once, and then it's free to use forever.

The Catch (Limitations)

The author is honest about what this can't do yet:

It's not a genius teacher: The computer can check if the math adds up, but it can't tell if the question is actually fair or if it tests the right concept. A human teacher still needs to look at the final list and say, "Yes, this is a good question."
It struggles with pictures: If your lecture PDF has complex charts or handwritten diagrams, the system might miss them because it only reads the text.

In a Nutshell

This paper describes a privacy-safe, self-correcting machine that turns your lecture notes into a quiz. It acts like a tireless intern that drafts the questions, a strict inspector that catches math errors, and a final editor that packages everything up so you can hand it to your students without needing any internet connection or AI subscriptions.

Here is a detailed technical summary of the paper "Self-hosted Lecture-to-Quiz: Local LLM MCQ Generation with Deterministic Quality Control" by Seine A. Shintani.

1. Problem Statement

The paper addresses the challenges of using Large Language Models (LLMs) for educational assessment authoring, specifically converting lecture PDFs into Multiple-Choice Questions (MCQs). Current "prompt-and-publish" workflows face three critical issues:

Privacy Risks: Sending proprietary lecture content to external cloud-based LLM APIs can violate data residency requirements.
Lack of Accountability: Generated content often contains "black-box" errors (e.g., multiple correct answers, duplicated distractors, or hallucinated facts) that are difficult to audit.
Operational Dependency: Relying on live LLM calls during student assessments creates latency, cost, and dependency issues.

The author proposes a solution that minimizes the "black-box" nature of AI by ensuring the final deliverable is a static, auditable artifact generated locally without external API calls.

2. Methodology: The L2Q Pipeline

The proposed system, L2Q (Self-hosted Lecture-to-Quiz), is an end-to-end, API-free pipeline designed to run on local hardware (or isolated environments like Colab) using llama.cpp and GGUF models. The pipeline consists of five deterministic stages:

PDF Ingestion & Segmentation: Extracts text from lecture PDFs and splits them into coherent chunks with page references.
Topic Planning: Generates a compact plan of key definitions and properties to ensure coverage and reduce redundancy.
MCQ Drafting: Uses a local LLM (specifically Qwen2.5-14B-Instruct quantized to Q4_K_M) to generate questions with five options (A–E), a single correct answer, and an explanation. The model is constrained via grammar-constrained decoding (JSON schema) to ensure structural validity.
Deterministic Quality Control (QC) & Retries: This is the core innovation. The system applies a two-tier rule set:
- Hard Constraints (Reject + Retry): If the output fails, the system automatically re-prompts the model (up to 3 retries). Checks include:
  - JSON schema conformance.
  - Exactly one labeled correct option.
  - De-duplication (exact and fuzzy string similarity $\ge$ 0.92).
  - Equivalence Testing: Numeric/constant options are evaluated for mathematical equivalence (tolerance $10^{-9}$) and parametric equivalence (5 random trials) to ensure the correct answer is unique.
- Warning Flags (Accept + Log): Non-fatal issues are flagged for human review, such as missing rounding instructions for approximate answers or duplicate constant distractors.
Export: The final curated set is exported as JSONL/CSV, ready for import into Learning Management Systems (LMS) or Google Forms, requiring no LLM at deployment time.
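The de-duplication and equivalence checks in the QC stage can be illustrated with a short sketch. Several choices here are assumptions, since the summary does not specify them: `difflib.SequenceMatcher` stands in for the unnamed string-similarity metric, the $10^{-9}$ tolerance is applied as both a relative and an absolute bound, parametric options are represented as Python callables of one variable, and the sampling range is arbitrary.

```python
import math
import random
from difflib import SequenceMatcher
from typing import Callable


def is_near_duplicate(a: str, b: str, threshold: float = 0.92) -> bool:
    """Exact match after normalization, or fuzzy similarity at or above the threshold."""
    a_norm = " ".join(a.lower().split())
    b_norm = " ".join(b.lower().split())
    if a_norm == b_norm:
        return True
    return SequenceMatcher(None, a_norm, b_norm).ratio() >= threshold


def constants_equal(x: float, y: float, tol: float = 1e-9) -> bool:
    """Numeric equivalence of two constant options within the tolerance."""
    return math.isclose(x, y, rel_tol=tol, abs_tol=tol)


def parametric_equal(
    f: Callable[[float], float],
    g: Callable[[float], float],
    trials: int = 5,
    tol: float = 1e-9,
    seed: int = 0,
) -> bool:
    """Treat two expressions as equivalent if they agree at `trials` random points."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.uniform(0.1, 10.0)  # assumed range; keeps log(x) defined
        if not constants_equal(f(x), g(x), tol):
            return False
    return True


def correct_is_unique(values: list[float], correct_index: int) -> bool:
    """True if no other option is numerically equal to the correct one."""
    target = values[correct_index]
    return not any(
        constants_equal(v, target) for i, v in enumerate(values) if i != correct_index
    )
```

Random-point testing cannot prove two expressions are identical, but agreement at five independent points within $10^{-9}$ makes an accidental match very unlikely, and it avoids the need for a symbolic algebra system.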

3. Key Contributions

End-to-End Self-Hosted Pipeline: A fully local workflow that converts lecture PDFs to deployable MCQs without transmitting data to external APIs.
Deterministic QC Layer: Integration of automated, machine-verifiable checks (schema, uniqueness, numeric equivalence) with bounded retries to eliminate structural failures.
Black-Box Minimization: A design philosophy where the LLM is used only for drafting, while the final output is a static, inspectable artifact. This aligns with the AI to Learn (AI2L) rubric, specifically addressing privacy, accountability, and Green AI.
Empirical Validation: A seed-sweep study on three "dummy" entropy lectures (Information Theory, Thermodynamics, Statistical Mechanics) producing a curated dataset of 24 high-quality questions.

4. Experimental Results

The study ran 15 experiments (3 lectures $\times$ 5 random seeds), targeting 8 questions per run (120 total target items).

Success Rate: The pipeline achieved a 100% acceptance rate for hard QC constraints (120/120 items).
Efficiency: Only 2 additional generation attempts were needed across all runs (Retry rate: 1.6%). The average runtime was 58.55 seconds per run (approx. 7.3 seconds per item) on an NVIDIA A100.
Quality Visibility: The warning system flagged 8/120 items (6.7%). The most common issues were:
- Missing Rounding Instructions (7 cases): Numeric answers lacked explicit rounding rules.
- Duplicate Distractors (1 case): Two incorrect options evaluated to the same constant.
Final Deliverable: A curated set of 24 questions (one seed per lecture) with zero warnings, exported in formats compatible with Google Forms and LMS.
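The reported figures can be cross-checked with a few lines of arithmetic. The variable names are illustrative, and the denominator behind the 1.6% retry rate is not stated in this summary, so both plausible candidates are computed.

```python
lectures, seeds, items_per_run = 3, 5, 8

runs = lectures * seeds                # number of experiments
target_items = runs * items_per_run    # total target questions
curated = lectures * items_per_run     # one seed per lecture in the final set

warning_rate = 8 / target_items        # flagged items over all items
seconds_per_item = 58.55 / items_per_run

# Assumption: the retry rate is either retries per item or retries per attempt.
retry_over_items = 2 / target_items
retry_over_attempts = 2 / (target_items + 2)

print(runs, target_items, curated)                            # prints: 15 120 24
print(f"{warning_rate:.1%} {seconds_per_item:.2f}s")          # prints: 6.7% 7.32s
print(f"{retry_over_items:.2%} {retry_over_attempts:.2%}")    # prints: 1.67% 1.64%
```

The run count, item count, curated-set size, warning rate, and per-item runtime all match the reported values. Dividing the 2 retries by all 122 generation attempts gives 1.64%, which would round to the reported 1.6%, though that reading of the denominator is a guess.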

5. Significance and Implications

Privacy & Data Sovereignty: By keeping inference local, institutions can process proprietary lecture materials without risking data leakage to third-party cloud providers.
Auditability & Trust: The deterministic QC layer transforms the "black box" of LLM generation into a transparent process. Every question has a traceable QC log, allowing instructors to verify correctness before deployment.
Green AI & Sustainability: The approach adheres to the "use big models only when needed" principle. Once the static quiz is generated, it can be deployed indefinitely without further energy consumption from LLM inference.
Limitations: The author acknowledges that while QC guarantees structural and mathematical validity, it cannot guarantee pedagogical validity (e.g., difficulty calibration, misconception targeting) or deep semantic faithfulness to the source text; both still require human expert review.

In conclusion, L2Q demonstrates that self-hosted, deterministic pipelines can produce reliable, deployable educational assessments, bridging the gap between the generative power of LLMs and the rigorous requirements of academic assessment.