Imagine you are a teacher who has written 5,000 new math and reading questions for elementary school kids. Before you can give these tests to students, you need to know: How hard is each question?
Traditionally, to find the answer, you have to give the test to thousands of real students, wait for the results, and do complex math to figure out which questions were too easy, which were too hard, and which were just right. This takes months, costs a lot of money, and risks "spoiling" the questions (if students see them before the real test).
This paper asks a simple question: Can we use a super-smart AI (a Large Language Model) to guess the difficulty of these questions just by reading them, saving us all that time and money?
The researchers tried two different ways to ask the AI to do this. Here is the breakdown using simple analogies.
The Two Approaches: The "Guru" vs. The "Detective Team"
Approach 1: The "Guru" (Direct Estimation)
In this method, the researchers treated the AI like a wise, all-knowing educational guru. They said to the AI:
"Here is a math question. Based on your vast knowledge of how kids learn, tell me on a scale of 1 to 100 how hard this is."
- The Result: The AI was pretty good at this! Across all the questions together, the AI's ratings correlated with real student results at roughly 0.80 to 0.83 (where 1.0 would be a perfect match).
- The Catch: The AI struggled with the youngest kids (Kindergarten and 1st Grade). It's like asking a grown-up to guess how hard a toddler's puzzle is; they might overthink it or miss the tiny details that make it hard for a 5-year-old.
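In code terms, the "Guru" approach is little more than a carefully worded prompt. The template below is invented for illustration; the paper's exact wording differs, and the actual LLM call is deliberately left out:

```python
# A hedged sketch of a "Guru"-style direct-estimation prompt.
# The wording is made up; plug the result into whatever LLM client you use.
PROMPT = """You are an expert in elementary school assessment.
Read the question below and rate its difficulty for the target grade
on a scale of 1 (very easy) to 100 (very hard). Reply with the number only.

Grade: {grade}
Question: {question}
"""

def build_guru_prompt(question: str, grade: str) -> str:
    # Fill the template for one question; the LLM returns a single number.
    return PROMPT.format(question=question, grade=grade)

print(build_guru_prompt("What is 348 + 267?", "Grade 3"))
```

The whole method lives in that one prompt: there is no training step, which is what makes it fast but also what leaves it guessing.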
Approach 2: The "Detective Team" (Feature-Based)
This method was more structured. Instead of asking for one big guess, the researchers asked the AI to act like a forensic detective. They gave the AI a checklist of specific clues to look for in every question, such as:
- Is the vocabulary fancy?
- Does the student have to do math in their head or write it down?
- Are there tricky wrong answers?
- Does it require reading a long story first?
The AI filled out this checklist for every single question. Then, the researchers took that checklist and fed it into a Machine Learning "Coach" (specifically, tree-based algorithms like Random Forests). The Coach didn't guess; it learned from thousands of past examples which clues actually mattered most.
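A minimal sketch of that checklist-plus-coach setup, assuming scikit-learn and entirely made-up feature names and data (the paper's real checklist is richer):

```python
# "Detective Team" sketch: LLM-extracted checklist features feed a
# tree-based "Coach". Feature names and data are invented stand-ins.
import random
from sklearn.ensemble import RandomForestRegressor

random.seed(0)

FEATURES = ["fancy_vocabulary", "mental_math_steps",
            "tricky_distractors", "reading_load"]

def fake_checklist():
    # Stand-in for the LLM filling out the checklist for one question,
    # scoring each clue from 0 (absent) to 3 (strong).
    return [random.randint(0, 3) for _ in FEATURES]

# Synthetic past examples: checklists plus "true" difficulties that
# depend (noisily) on the clues, mimicking already-calibrated items.
X = [fake_checklist() for _ in range(500)]
y = [sum(row) * 8 + random.gauss(0, 5) for row in X]  # roughly 0-100

coach = RandomForestRegressor(n_estimators=100, random_state=0)
coach.fit(X, y)  # the Coach learns which clues actually matter

new_item = [[3, 2, 1, 0]]  # checklist for an unseen question
predicted_difficulty = coach.predict(new_item)[0]
```

The Coach's learned feature importances are also a bonus: they tell you which clues drive difficulty, not just the final score.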
- The Result: This team won the race. By breaking the problem down into small clues, the model became far more accurate (correlations of up to 0.87). It was much better at spotting the difference between easy and hard questions, even for the youngest grades.
Why Did the "Detective Team" Win?
Think of it like this:
- The Guru tries to solve a complex puzzle in one giant leap. Sometimes they get it right, but sometimes they miss the nuance.
- The Detective Team breaks the puzzle into tiny pieces. They measure the "vocabulary weight," the "logic steps," and the "visual clues" separately. Then, the Coach combines all those tiny measurements into a much sharper picture of the difficulty.
The study found that the AI's ability to analyze specific parts of a question was far more powerful than its ability to just guess the whole thing at once.
The Surprising Findings
- Old Tricks Don't Work: The researchers tried using old-school computer methods (counting words and sentence length) to guess difficulty. It was like trying to judge a movie's quality just by counting how many times the word "the" appears. It didn't work well. The AI's "human-like" understanding of meaning was the key.
- The "Kindergarten Problem": The AI had a harder time with the easiest questions (Kindergarten/1st Grade). The researchers think this is because the range of difficulty in those grades is so small that it's hard to tell the difference between a "very easy" and a "slightly easy" question. It's like trying to tell the difference between two shades of white paint; it's much easier to tell the difference between white and black (which is what happens in higher grades).
- It's Not Magic, It's Math: The AI didn't just "know" the answer. It needed a human to teach it what to look for (the checklist) and a computer program to learn how to weigh those clues.
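For the curious, the "old tricks" in the first bullet are surface statistics like the ones below. This toy sketch (my own function and names, not the paper's) shows why they fail: two questions of very different difficulty can look identical to a word counter.

```python
# Toy "old-school" surface statistics: word counts and sentence lengths
# say almost nothing about how conceptually hard a question is.
def surface_features(text: str) -> dict:
    sentences = [s for s in text.replace("?", ".").split(".") if s.strip()]
    words = text.split()
    return {
        "word_count": len(words),
        "avg_word_length": sum(len(w.strip(".,?")) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
    }

# Same length, same shape, very different difficulty for a young child:
easy = "What is 2 plus 2?"
hard = "What is 7 times 8?"
print(surface_features(easy))
print(surface_features(hard))
```

Both questions get nearly identical scores, which is exactly why the AI's meaning-level reading beat these metrics.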
The Bottom Line: A New Workflow for Teachers
The authors suggest a new 7-step recipe for anyone making tests in the future:
1. Gather your questions.
2. Ask experts (humans) what makes a question hard.
3. Teach the AI to look for those specific things (the checklist).
4. Let the AI read every question and fill out the checklist.
5. Train a computer model to learn how those checklist items predict difficulty.
6. Test the model on new questions to see if it works.
7. Use the model to predict the difficulty of future questions before you ever show them to a student.
Why This Matters
If schools can use AI to predict how hard a test question is, they can:
- Save Money: They won't need to test thousands of students just to calibrate a few questions.
- Save Time: New tests can be ready in weeks instead of years.
- Be Fairer: They can spot tricky or confusing questions before they hurt a student's grade.
In short, the paper shows that while AI can't perfectly replace human testing yet, it is a powerful tool that can act as a super-assistant, helping educators build better, fairer, and faster tests.