The Big Idea: Can AI "Fake" a Student's Mistake?
Imagine you are a teacher creating a multiple-choice test. You have the correct answer, but you also need to create distractors (the wrong answers).
The trick is that a good distractor isn't just a random number; it has to be a mistake a real student would actually make.
- Bad distractor: "42" (random; no student's mistake actually leads here).
- Good distractor: "12" (the result if a student forgets to divide by 2, a common error).
The researchers wanted to know: Can Large Language Models (LLMs) like the ones powering this chat do this? Can they look at a math problem, figure out the right answer, and then pretend to be a confused student to generate the perfect wrong answers?
The Experiment: The "Detective" vs. The "Gambler"
The team asked two AI models (DeepSeek and GLM) to generate these wrong answers. They didn't just look at the final result; they looked at the thinking process (the "reasoning trace") the AI used to get there. In other words: would the model act like a detective, methodically reconstructing how a mistake happens, or like a gambler, tossing out plausible-looking guesses?
They created a "Taxonomy" (a checklist of steps) based on how human experts design tests. Think of it like a recipe for baking a cake.
- Step 1: Bake the cake correctly (Solve the problem).
- Step 2: Imagine what happens if you forget the sugar (Identify a mistake).
- Step 3: Bake a "sugar-less" cake (Simulate the error).
- Step 4: Taste it and decide if it looks like something a human would actually eat (Check plausibility).
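The four-step recipe above can be sketched in code. This is a minimal illustration, not the paper's implementation; the sample problem ("What is the average of 4 and 8?"), the error names, and the plausibility check are all assumptions made for the example:

```python
# Minimal sketch of the four-step distractor recipe (illustrative only).
# Sample problem: "What is the average of 4 and 8?"

def solve(a, b):
    """Step 1: bake the cake correctly (compute the right answer)."""
    return (a + b) / 2

def simulate_error(a, b, error):
    """Steps 2-3: pick a misconception, then actually run the flawed math."""
    if error == "forgot_to_divide":    # student skips the final division by 2
        return a + b
    if error == "multiplied_instead":  # student multiplies instead of adding
        return a * b / 2
    raise ValueError(f"unknown error type: {error}")

def plausible(distractor, correct):
    """Step 4: a usable distractor must at least differ from the truth."""
    return distractor != correct

correct = solve(4, 8)  # 6.0
candidates = [simulate_error(4, 8, e)
              for e in ("forgot_to_divide", "multiplied_instead")]
distractors = [d for d in candidates if plausible(d, correct)]
```

For this sample problem, the "forgot_to_divide" branch produces exactly the good distractor from earlier: 12.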
The Surprising Discovery: The AI is a "Methodical Chef"
The researchers expected the AI to just guess random wrong answers or tweak the right answer slightly (like changing a 3 to a 4).
Instead, they found the AI was acting exactly like a human expert.
Here is the process the AI followed, using our "Chef" analogy:
- The Anchor: First, the AI solved the math problem perfectly. It knew the "correct cake."
- The "What-If": Then, it said, "Okay, what if a student forgot to divide by 3?" or "What if they added instead of multiplied?"
- The Simulation: It actually ran through the math with that mistake to see what the wrong answer would be.
- The Selection: Finally, it picked the best "wrong answers" that looked most convincing.
The Metaphor:
Imagine a magician trying to teach an apprentice how to make a fake coin.
- The Old Way (Similarity-based): The apprentice just paints a real coin gold. It looks a bit off, but the flaw is cosmetic; it tells you nothing about how a bad coin actually gets made.
- The AI Way (Misconception-based): The apprentice first learns how to make a real silver coin. Then, they deliberately mess up the casting process to see what a "bad" coin looks like. They study the flaws of the bad coin to understand why it's wrong.
The AI was doing the second, much harder thing. It wasn't just guessing; it was simulating a student's brain.
Where Did the AI Fail? (The "Glitch" in the Matrix)
Even though the AI's method was brilliant, it didn't always get the result right. The researchers found the failures happened in two specific places:
- The Anchor Slipped: Sometimes, the AI tried to solve the problem correctly first, but it made a tiny calculation error in the "correct" part. If the anchor is crooked, the whole building falls.
- The Taste Test: Sometimes, the AI generated a great "wrong answer," but then it got confused about which one to pick, or it picked one that was too obvious.
The Fix:
The researchers found a simple fix. If they told the AI, "Here is the correct answer; don't compute it yourself, just use it as a base," the AI's performance jumped by 8%.
It's like telling a chef: "Don't worry about baking the perfect cake yourself; I'll give you the perfect cake. Just tell me what it would taste like if you forgot the sugar." The AI became much better at faking the mistake when it didn't have to worry about getting the right answer first.
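The anchoring fix amounts to a change in what the model is asked to do. Here is a hedged sketch of the two prompt setups; the wording and the `build_prompt` helper are hypothetical illustrations, not taken from the paper:

```python
# Sketch of the "anchor" fix: hand the model a verified correct answer
# so it only has to simulate mistakes, not solve the problem first.
# Prompt wording and helper name are illustrative assumptions.

def build_prompt(question, correct_answer=None):
    if correct_answer is None:
        # Original setup: the model must first find the answer itself,
        # and a slip here corrupts every distractor downstream.
        return ("Solve this problem, then generate three plausible "
                f"wrong answers a student might give:\n{question}")
    # Anchored setup: the correct answer is supplied up front.
    return (f"The correct answer to the problem below is {correct_answer}. "
            "Do not re-derive it. Using it as a base, generate three wrong "
            "answers that each result from a specific, common student "
            f"mistake:\n{question}")
```

The design point is simply that the anchored prompt removes the riskiest step (solving) from the model's job.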
The Takeaway
This paper is a big deal for education technology. It suggests that modern AI isn't just a "parrot" repeating facts; it can actually model human thinking, including our mistakes.
- Good News: AI can help teachers automatically create high-quality tests that catch real student misconceptions.
- The Catch: We need to give the AI a little help (like showing it the right answer first) so it doesn't get confused while trying to be "wrong."
In short: AI can now play the role of a confused student very well, as long as we give it a map of the correct path to start from.