A Bandit-Based Approach to Educational Recommender Systems: Contextual Thompson Sampling for Learner Skill Gain Optimization

Imagine you are a coach for a massive sports team with thousands of players. Some players are beginners, some are pros, and everyone learns at a different speed.

In a traditional classroom (or a standard online course), the coach gives everyone the exact same playbook. If the drills are too easy, the pros get bored. If they are too hard, the beginners get frustrated and quit. The coach tries to help, but with thousands of players, they simply can't watch everyone individually to see who needs what.

This paper introduces a smart, AI-powered coach that solves this problem. It doesn't just guess what a player needs; it learns by experimenting, much like a gambler trying to find the best slot machine, but with a very specific goal: making the player better, not just making them happy.

Here is the breakdown of how this "Smart Coach" works, using simple analogies:

1. The Problem: The "One-Size-Fits-All" Trap

Most online learning systems today work like a static library. If you like math, the system recommends other math books because "people who liked math also liked algebra."

The Flaw: This is like recommending a book to a 5-year-old just because their 15-year-old brother liked it. It ignores the fact that you are currently struggling with a specific concept. It also never tries anything new; it just keeps showing you the "popular" stuff, even if it's not helping you grow.

2. The Solution: The "Gambler's Coach" (Bandits)

The authors use a concept from math called Multi-Armed Bandits.

The Analogy: Imagine a casino with 1,000 slot machines. You don't know which one pays out the most.
- Exploration: You have to try different machines to see which ones work.
- Exploitation: Once you find a machine that pays well, you keep playing it.
The Twist: In a normal casino, you want to win money. In this educational system, the "money" is Skill Gain. The goal isn't to get the student to answer correctly right now; it's to find the exercise that most increases the student's estimated mastery of a skill.

3. The Secret Sauce: "Contextual" Awareness

Older systems (like the "static library") just look at what you did in the past. This new system is Contextual.

The Analogy: A regular coach says, "You missed this shot, so here is another shot."
The Smart Coach (LinTS) says, "You missed this shot. But I also know you are tired, you are confused about angles, and you usually do better in the morning. So, instead of another shot, let's try a different drill that targets your specific confusion."

It looks at a "Context Vector" (a profile of the student) including:

Who they are: (e.g., "I'm a visual learner," "I'm in 8th grade").
How they feel: (e.g., "I'm frustrated," "I'm bored").
What they know: (e.g., "I'm great at addition but bad at fractions").
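As a rough illustration, a profile like this can be flattened into a list of numbers. The sketch below is only an assumption about how that might look: the `LearnerSnapshot` class, the affect labels, and the grade scaling are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical affect labels; the real platform may record different ones.
AFFECT_STATES = ["bored", "confused", "frustrated", "concentrating"]

@dataclass
class LearnerSnapshot:
    grade: int            # who they are (assumed school grade, 1-12)
    affect: str           # how they feel (one of AFFECT_STATES)
    mastery: List[float]  # what they know (one mastery estimate per skill)

def context_vector(s: LearnerSnapshot) -> List[float]:
    """Flatten a learner snapshot into a numeric vector a bandit can consume."""
    one_hot = [1.0 if s.affect == a else 0.0 for a in AFFECT_STATES]
    # The leading 1.0 is a bias term; grade is rescaled to roughly [0, 1].
    return [1.0, s.grade / 12.0] + one_hot + list(s.mastery)

x = context_vector(LearnerSnapshot(grade=8, affect="frustrated", mastery=[0.9, 0.2]))
```

Here `x` describes an 8th grader who is frustrated, strong on the first skill, and weak on the second.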

4. The Magic Algorithm: Thompson Sampling

How does the coach decide which exercise to pick? It uses a method called Thompson Sampling.

The Analogy: Imagine the coach has a deck of cards for every single exercise. Some cards say "This will help a lot," others say "This might help a little."
Instead of picking the exercise that looks best on paper, the coach shuffles every deck, draws one card from each, and recommends the exercise whose card promises the most.
If an exercise is uncertain (its deck is mixed), it will sometimes produce a "high potential" card and get tested. If an exercise is known to be great, its deck is mostly "high potential" cards, so it is likely to be picked again.
This balances two goals: the coach tries new things to learn more, but mostly sticks to what works to get results.
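The deck-of-cards picture maps onto code fairly directly: each "deck" is a probability distribution over how much an exercise helps, and "drawing a card" is sampling from it. The sketch below is a minimal non-contextual version, assuming Gaussian rewards with a Normal-Inverse-Gamma belief per exercise. The prior values, the made-up average gains, and the names `GaussianArm`, `choose`, and `simulate` are all illustrative assumptions.

```python
import math
import random

class GaussianArm:
    """Normal-Inverse-Gamma belief about one exercise's mean reward and variance."""

    def __init__(self, mu=0.0, kappa=1.0, alpha=1.0, beta=1.0):
        # Prior hyperparameters are illustrative defaults, not values from the paper.
        self.mu, self.kappa, self.alpha, self.beta = mu, kappa, alpha, beta

    def sample_mean(self, rng):
        """Draw one 'card': a plausible mean reward under the current belief."""
        precision = rng.gammavariate(self.alpha, 1.0 / self.beta)
        variance = 1.0 / precision
        return rng.gauss(self.mu, math.sqrt(variance / self.kappa))

    def update(self, reward):
        """Conjugate update after observing one reward."""
        kappa_new = self.kappa + 1.0
        self.beta += self.kappa * (reward - self.mu) ** 2 / (2.0 * kappa_new)
        self.mu = (self.kappa * self.mu + reward) / kappa_new
        self.kappa = kappa_new
        self.alpha += 0.5

def choose(arms, rng):
    """Draw one card per exercise and pick the exercise with the best card."""
    draws = [arm.sample_mean(rng) for arm in arms]
    return max(range(len(arms)), key=lambda i: draws[i])

def simulate(true_gains, rounds=3000, noise=0.02, seed=0):
    """Toy loop with made-up average gains; returns how often each arm was picked."""
    rng = random.Random(seed)
    arms = [GaussianArm() for _ in true_gains]
    picks = [0] * len(true_gains)
    for _ in range(rounds):
        i = choose(arms, rng)
        arms[i].update(rng.gauss(true_gains[i], noise))
        picks[i] += 1
    return picks

picks = simulate([0.02, 0.05, 0.10])
```

In a toy run like this, every exercise gets tried early on, and the one with the highest made-up gain should end up with most of the picks.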

5. The Results: What Happened?

The researchers tested this on data from a real math tutoring platform, covering about 1,700 students and roughly 935,000 interactions.

The Old Way (Collaborative Filtering): Like a popular playlist. It recommended exercises that were generally popular.
The New Way (LinTS): Like a personal trainer.
The Outcome: The new system achieved roughly 15% to 21% higher average skill gain than the methods it was compared against.
- It stopped wasting time on exercises that were too easy or too hard.
- It identified a small group of "Super Exercises" that were incredibly effective for specific types of students and focused on those.
- It could help teachers see which students are struggling and why, so they can step in with help.

The Big Takeaway

This paper offers evidence that we can build digital tutors that actually adapt. Instead of forcing every student down the same path, we can use math to create a unique, winding path for every single learner.

It's the difference between a factory assembly line (everyone gets the same product) and a custom tailor (the clothes are made specifically for your body). The promise? Students gain skills faster, and teachers can scale personalized help to thousands of students at once.

1. Problem Statement

The paper addresses the challenge of scaling personalized learning in Operations Research (OR), Management Science (MS), and Analytics education. Traditional digital learning environments (e.g., MOOCs) often rely on standardized, fixed learning paths that fail to adapt to the diverse skill levels and evolving needs of individual learners.

Existing Educational Recommender Systems (ERS) predominantly use Collaborative Filtering (CF). The authors identify three critical limitations in CF for educational contexts:

Lack of Personalization: CF relies on aggregated behavioral patterns (similarity between users or items) rather than individual learner profiles.
Static Nature: CF uses static similarity measures, failing to capture the temporal evolution of a learner's knowledge state.
No Exploration: CF tends to reinforce historically popular exercises, lacking a mechanism to explore potentially beneficial but less common exercises.

The core problem is to design an adaptive recommendation framework that dynamically selects exercises to maximize learner skill gain by balancing exploration (trying new exercises to learn their value) and exploitation (using known effective exercises).

2. Methodology

A. Framework and Reward Definition

The authors frame the problem as a Contextual Multi-Armed Bandit (CMAB) problem.

Context ( $x_t$ ): A vector containing learner features (sociodemographics, academic proficiency, affective state, and disengagement indicators) and exercise attributes.
Action ( $a_t$ ): The selection of a specific exercise from a finite set.
Reward ( $r_t$ ): Unlike standard ERS that use binary correctness or user ratings, this paper defines the reward as Skill Gain.
- Skill gain is calculated as the difference in the estimated mastery of a specific cognitive skill before and after an interaction: $r_t = K(s)_t - K(s)_{t-1}$ .
- Mastery estimates are derived using Bayesian Knowledge Tracing (BKT), a probabilistic model of skill acquisition.
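To make the reward concrete, the following is a minimal sketch of one standard BKT update and the resulting skill gain. It uses the textbook four-parameter form of BKT; the parameter values (`slip`, `guess`, `learn`) are placeholders, and the paper's fitted parameters are not given in this summary.

```python
def bkt_update(p_mastery, correct, slip=0.1, guess=0.2, learn=0.1):
    """One Bayesian Knowledge Tracing step: return the updated mastery estimate."""
    if correct:
        evidence = p_mastery * (1 - slip)
        posterior = evidence / (evidence + (1 - p_mastery) * guess)
    else:
        evidence = p_mastery * slip
        posterior = evidence / (evidence + (1 - p_mastery) * (1 - guess))
    # The learner may also acquire the skill during this practice opportunity.
    return posterior + (1 - posterior) * learn

def skill_gain(p_before, correct, **params):
    """Reward r_t: mastery after the interaction minus mastery before it."""
    return bkt_update(p_before, correct, **params) - p_before

gain_if_correct = skill_gain(0.4, True)
gain_if_wrong = skill_gain(0.4, False)
```

With these placeholder parameters, a correct answer raises the mastery estimate and an incorrect one lowers it, so the reward can be negative.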

B. Proposed Algorithms

The study compares four algorithms:

User-Based Collaborative Filtering (UserCF): Predicts reward based on the weighted average of rewards from similar learners (using cosine similarity).
Item-Based Collaborative Filtering (ItemCF): Predicts reward based on the weighted average of rewards from similar exercises attempted by the target learner.
Thompson Sampling (TS): A non-contextual Bayesian algorithm. It models the reward of each exercise as a Gaussian distribution (using a Normal–Inverse–Gamma prior) and selects the exercise with the highest sampled mean reward.
Linear Thompson Sampling (LinTS): A contextual extension of TS. It assumes the expected reward is a linear function of the context vector ( $\mu_a(x) = x^T \theta_a$ ). It maintains a posterior distribution over the parameter vector $\theta_a$ for each exercise, allowing it to adapt recommendations based on specific learner features.
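The LinTS description above can be sketched as follows. This is a minimal illustration with one Bayesian linear model per exercise, assuming NumPy, a ridge-style prior, and a fixed exploration scale. The hyperparameters, the two-learner-type toy environment, and the names `LinTSArm`, `choose`, `greedy`, and `simulate` are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

class LinTSArm:
    """Bayesian linear model of one exercise's reward: E[r | x] = x^T theta_a."""

    def __init__(self, dim, prior_precision=1.0, scale=0.1):
        self.B = prior_precision * np.eye(dim)  # posterior precision (up to scale)
        self.f = np.zeros(dim)                  # sum of reward-weighted contexts
        self.scale = scale                      # exploration scale; illustrative value

    def posterior(self):
        cov = np.linalg.inv(self.B)
        cov = (cov + cov.T) / 2.0               # guard against round-off asymmetry
        return cov @ self.f, self.scale ** 2 * cov

    def sample_reward(self, x, rng):
        """Sample theta_a from its posterior and score the context with it."""
        mean, cov = self.posterior()
        theta = rng.multivariate_normal(mean, cov)
        return float(x @ theta)

    def update(self, x, reward):
        self.B += np.outer(x, x)
        self.f += reward * x

def choose(arms, x, rng):
    """Thompson step: pick the exercise with the highest sampled reward."""
    return int(np.argmax([arm.sample_reward(x, rng) for arm in arms]))

def greedy(arms, x):
    """Pick by posterior mean only (no exploration), e.g. for inspection."""
    return int(np.argmax([x @ arm.posterior()[0] for arm in arms]))

def simulate(rounds=500, noise=0.02, seed=0):
    """Toy world: exercise 0 helps learner type A, exercise 1 helps type B."""
    rng = np.random.default_rng(seed)
    true_theta = np.array([[0.10, 0.00], [0.00, 0.10]])  # made-up coefficients
    arms = [LinTSArm(dim=2) for _ in true_theta]
    for _ in range(rounds):
        x = np.eye(2)[rng.integers(2)]          # one-hot learner type
        a = choose(arms, x, rng)
        reward = true_theta[a] @ x + rng.normal(0.0, noise)
        arms[a].update(x, reward)
    return arms

arms = simulate()
```

The toy environment is chosen so that neither exercise is better on average. Only a model that conditions on the context can tell which exercise suits which learner type, which is the contrast the TS versus LinTS comparison is meant to isolate.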

C. Experimental Setup

Dataset: ASSISTments 2017, a large-scale dataset from an online mathematics tutoring system (1,708 learners, 3,162 exercises, ~935k interactions).
Preprocessing:
- Filtered for positive skill gains.
- Removed duplicate interactions (kept the final attempt).
- Excluded learners with <50 interactions to ensure stable modeling.
- Warm-start enforcement: Ensured all test users/exercises appeared in the training set.
Splitting: Temporal split (70% training, 15% validation, 15% test) to preserve the chronological order of learning.
Evaluation Metric: Mean instantaneous skill gain (average reward) on the held-out test set.
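The preprocessing and splitting steps above could look roughly like the sketch below. Several details are assumptions because the summary does not specify them: the filters are applied in the order listed, the split is a single global chronological cut rather than a per-learner one, and the `Interaction` record layout is hypothetical.

```python
from collections import Counter
from typing import List, NamedTuple

class Interaction(NamedTuple):
    learner: str
    exercise: str
    timestamp: int
    gain: float  # BKT-based skill gain for this interaction

def preprocess(log: List[Interaction], min_interactions: int = 50) -> List[Interaction]:
    """Apply the three filters in the order listed; returns rows sorted by time."""
    positive = [r for r in log if r.gain > 0]
    latest = {}
    for r in sorted(positive, key=lambda r: r.timestamp):
        latest[(r.learner, r.exercise)] = r  # later attempts overwrite earlier ones
    counts = Counter(r.learner for r in latest.values())
    kept = [r for r in latest.values() if counts[r.learner] >= min_interactions]
    return sorted(kept, key=lambda r: r.timestamp)

def temporal_split(rows: List[Interaction], train_pct: int = 70, valid_pct: int = 15):
    """Chronological 70/15/15 split with warm-start filtering of the later parts."""
    n = len(rows)
    train_end = n * train_pct // 100
    valid_end = n * (train_pct + valid_pct) // 100
    train = rows[:train_end]
    seen_learners = {r.learner for r in train}
    seen_exercises = {r.exercise for r in train}

    def warm(part):
        # Keep only rows whose learner and exercise both appear in training.
        return [r for r in part
                if r.learner in seen_learners and r.exercise in seen_exercises]

    return train, warm(rows[train_end:valid_end]), warm(rows[valid_end:])
```

Integer percentages are used for the cut points to avoid floating-point rounding at the boundaries.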

3. Key Contributions

First Empirical Evaluation of TS in ERS: The authors present this as the first study to apply Thompson Sampling specifically to educational recommendation, moving beyond the UCB (Upper Confidence Bound) approaches that dominate the existing literature.
Skill Gain as the Optimization Objective: The paper shifts the optimization metric from "correctness" or "clicks" to actual learning progress (change in BKT mastery estimates), aligning the algorithm's goal with pedagogical effectiveness.
Contextual Modeling: It demonstrates that incorporating learner context (affective state, proficiency, demographics) via LinTS outperforms non-contextual TS, by 15.2% in mean skill gain on the test set.
Comprehensive Benchmarking: The study provides a rigorous comparison between standard CF, non-contextual TS, and contextual LinTS, isolating the value of context in sequential decision-making for education.

4. Results

Performance: LinTS achieved the highest performance, outperforming all baselines.
- 15.2% improvement over standard (non-contextual) TS.
- 16.5% improvement over ItemCF.
- 20.7% improvement over UserCF.
Exploration-Exploitation Dynamics:
- UserCF showed premature convergence, locking onto a few exercises (over-exploitation).
- ItemCF spread selections too widely, lacking adaptive prioritization.
- LinTS demonstrated a dynamic shift: broad exploration in early training rounds followed by focused exploitation of a narrow set of high-value exercises in later rounds.
Contextual Value: The superior performance of LinTS over TS confirms that learner-specific features are critical for predicting which exercises will yield the highest skill gain for a specific individual.
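The percentages above are most naturally read as relative differences in the evaluation metric. Under that reading, which is an assumption since this summary does not spell out the formula, let $\bar{r}_{\text{LinTS}}$ and $\bar{r}_{b}$ be the mean test-set skill gain of LinTS and of a baseline $b$:

$$\text{Improvement over } b = \frac{\bar{r}_{\text{LinTS}} - \bar{r}_{b}}{\bar{r}_{b}} \times 100\%$$

For example, the reported 20.7% over UserCF would then correspond to $\bar{r}_{\text{LinTS}} = 1.207\,\bar{r}_{\text{UserCF}}$.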

5. Significance and Implications

Scalable Personalization: The framework enables instructors to provide individualized learning trajectories in large-scale courses (e.g., MOOCs) without manual intervention, adapting to varying quantitative skill levels.
Instructional Insights: The system can identify exercises that consistently generate high learning gains, helping instructors select better materials for in-class examples or assignments.
Targeted Intervention: By analyzing learner contexts, the system can flag students struggling with specific prerequisite skills (e.g., matrix operations) and recommend remedial exercises, facilitating differentiated instruction.
Pedagogical Alignment: By optimizing for skill gain rather than correctness, the system avoids the trap of recommending only easy exercises that learners can already solve, instead pushing learners toward their "zone of proximal development."

In conclusion, the paper presents Contextual Thompson Sampling as a theoretically grounded approach to educational recommendation that outperformed the collaborative filtering and non-contextual baselines on the ASSISTments data, offering a mechanism for maximizing learning outcomes in digital environments.