Imagine you have a very smart, well-read robot assistant that has read almost every book and code snippet ever written. You call it an "AI Tutor." You ask it to help you with your homework.
For a long time, we've tested this robot on easy, common subjects like Python or Java (which are like "English" in the world of programming—everyone speaks them, and there are millions of books written about them). The robot does great on those! It gets A's.
But what happens when you ask the robot to help with OCaml?
OCaml is a bit like Latin or Ancient Greek in the programming world. It's a "low-resource" language. It's elegant and powerful, but fewer people speak it, and there are fewer books (data) for the robot to learn from. It's also a "functional" language, which means it thinks about problems differently than the common languages, kind of like solving a math puzzle instead of following a recipe.
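To make the "math puzzle vs. recipe" idea concrete, here is a tiny illustration (invented for this summary, not taken from the paper). Instead of following a recipe with a running counter that gets updated step by step, functional OCaml code describes what the answer *is*:

```ocaml
(* Sum the squares of a list the "math puzzle" way: fold the list
   into a single value instead of mutating a counter in a loop. *)
let sum_of_squares lst =
  List.fold_left (fun acc x -> acc + x * x) 0 lst

let () =
  (* 1 + 4 + 9 = 14 *)
  assert (sum_of_squares [1; 2; 3] = 14)
```

A robot trained mostly on loop-and-counter code from Python or Java has seen far fewer examples written in this style, which is part of why OCaml is harder for it.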
This paper is the story of researchers putting that AI Tutor through a rigorous exam in this "Latin" class to see if it's actually smart or just good at memorizing patterns.
The Three Tests (The Benchmarks)
The researchers didn't just ask the robot one question. They built three specific "exam rooms" to test it:
The "Write It" Room (λCodeGen):
- The Task: "Here is a homework assignment with 10 different problems. Write the code to solve them all."
- The Analogy: Imagine asking the robot to write a whole essay, not just a sentence.
- The Result: The top robots (like GPT-4o and o3-mini) got a B+. They were good, but not perfect. They made mistakes in logic or used forbidden techniques. The cheaper, smaller robots got F's because they couldn't even finish the sentences (their code didn't compile).
- Key Takeaway: The robot is good at writing code, but it's not a genius yet, especially in a language it doesn't speak fluently.
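For flavor, here is a hypothetical problem in the spirit of λCodeGen (our own invented example, not an actual exam question from the paper): write a function that returns the last element of a list, using only plain recursion rather than library shortcuts. "Forbidden techniques" would be things like reaching for a built-in helper when the assignment asks you to do it by hand:

```ocaml
(* Return the last element of a list using plain recursion only
   (no library helpers) -- the kind of restriction a homework
   assignment might impose. *)
let rec last = function
  | [] -> None
  | [x] -> Some x
  | _ :: rest -> last rest

let () =
  assert (last [1; 2; 3] = Some 3);
  assert (last ([] : int list) = None)
```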
The "Fix It" Room (λRepair):
- The Task: "Here is a piece of code a student wrote that is broken. It has a typo, a type error, or a logic bug. Fix it."
- The Analogy: Imagine a mechanic trying to fix a car engine.
- The Result: The robot was amazing at fixing simple typos (syntax errors) and "grammar" mistakes (type errors). It got A's here! It's like a mechanic who can instantly spot a loose bolt.
- However: When the problem was a "logic error" (the engine is running, but the car is driving backward), the robot struggled more. It's harder to fix why something is wrong than what is wrong.
- Key Takeaway: The robot is a great editor, but a less reliable engineer.
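To see why the two kinds of bugs feel so different, here is a sketch with two made-up examples (ours, not the paper's). A type error is the loose bolt: the compiler points right at it, and the robot fixes it instantly. A logic bug compiles cleanly and fails silently:

```ocaml
(* Type error (the loose bolt): the broken version was
     let double x = x + 1.0
   which mixes int (+) with a float -- the compiler rejects it
   and points at the exact spot. Fixed: *)
let double x = x *. 2.0

(* Logic bug (the car driving backward): the broken version was
     let rec sum = function [] -> 0 | x :: rest -> x * sum rest
   It compiles fine but multiplies instead of adds, so it always
   returns 0. Nothing flags the problem; you have to notice the
   *behavior* is wrong. Fixed: *)
let rec sum = function
  | [] -> 0
  | x :: rest -> x + sum rest

let () =
  assert (double 3.0 = 6.0);
  assert (sum [1; 2; 3] = 6)
```

The compiler hands the robot the first bug on a silver platter; the second one it has to diagnose from symptoms alone, which is where it struggled.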
The "Explain It" Room (λExplain):
- The Task: "Explain this complex theory about how the language works."
- The Analogy: Asking the robot to explain the philosophy of time travel.
- The Result: The top robots got A's here! They could explain the concepts clearly. But they had a bad habit of being chatty. They would give you the right answer, then write three extra paragraphs of fluff that wasn't asked for.
- Key Takeaway: The robot knows the theory, but it needs to learn to be concise.
The Big Surprises
The "Specialist" vs. The "Generalist": The researchers also tested a tool built specifically for OCaml (called BURST). You'd think the specialist would win. But the specialist only got 11% of the answers right! The general-purpose AI (the robot that reads everything) was much better, even though it wasn't an OCaml expert.
- Metaphor: It's like hiring a specialist who only knows how to fix 1990s Fords versus a general mechanic who knows how to fix almost any car. The general mechanic did a better job on this specific, weird car.
The "Small" vs. "Big" Brain: The biggest, most expensive models (like o3-mini) were the clear winners. The smaller, free models often failed so badly their code was "non-gradable" (garbage).
- Metaphor: If you ask a kindergartner to solve a calculus problem, they might guess. If you ask a PhD professor, they might get it right. The "small" models are the kindergartners here.
The "Latin" Problem: The robots performed significantly worse on OCaml than they do on Python.
- Metaphor: If you ask a polyglot who speaks 10 languages to translate a poem from a language they only know a little bit of, they will make mistakes. They rely on patterns they've seen millions of times in other languages, which doesn't always work for the unique rules of OCaml.
What Does This Mean for Students and Teachers?
- For Students: Don't just copy the robot's homework! The robot is smart, but it makes mistakes. If you use it, you need to be the "editor" who checks if the code actually works. If you rely on it blindly, you might learn the wrong things.
- For Teachers: You can't just ban the robot. It's too useful. Instead, change the tests. Ask students to critique the robot's code or find the bugs in it. Make the robot a partner, not a cheat sheet.
- For the Future: We need to teach these robots better. They are great at fixing typos and explaining concepts, but they need to get better at deep logic and understanding complex, rare languages.
The Bottom Line
The AI Tutor is a very helpful B+ student. It can fix your grammar, explain the theory, and write a decent first draft of your code. But it is not a genius, and it definitely isn't perfect. In a difficult, niche subject like OCaml, it's a powerful tool, but you still need a human brain in the driver's seat to make sure you don't crash.