Imagine you are a head chef (a teacher) running a busy kitchen. You have a massive menu of recipes (math problems) that range from simple "toast a slice of bread" tasks to complex "create a 12-course molecular gastronomy feast" challenges.
Your job is to sort these recipes into the right bins so you can serve the right level of challenge to your diners (students). If you give a beginner a complex feast, they'll burn out. If you give an expert just toast, they'll get bored.
Recently, a new assistant arrived in your kitchen: Artificial Intelligence (AI). You asked, "Can you look at these recipes and tell me which ones are simple and which ones are complex?"
This paper is the report card on how well that AI assistant did on its first day on the job.
The Setup: The "Cognitive Demand" Menu
The researchers used a famous framework called the Task Analysis Guide (TAG). Think of this as a four-star rating system for math problems:
- Memorization (1 Star): Just reciting facts. Like saying "2 + 2 = 4" without thinking.
- Procedures Without Connections (2 Stars): Following a recipe step-by-step without knowing why the ingredients work. Like mixing cake batter because the box says so.
- Procedures With Connections (3 Stars): Using the recipe to understand why the cake rises. You are connecting the steps to the science of baking.
- Doing Mathematics (4 Stars): Creating a brand new recipe from scratch. You have to figure out the ingredients, the method, and the logic yourself. It's messy, hard, and requires deep thinking.
The Experiment: The "Out-of-the-Box" Test
The researchers didn't try to "train" the AI or give it special instructions. They just handed 11 different AI tools (some general ones like ChatGPT, some made specifically for schools like Khanmigo) a stack of 12 math problems and said, "Here is the rulebook. Tell me which star rating each problem gets."
They wanted to see what happens when a teacher uses AI right now, without spending hours learning how to "prompt engineer" (give it perfect instructions).
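To make that setup concrete, here's a minimal sketch of what a zero-shot, "out-of-the-box" request like this looks like in code. It uses the OpenAI Python client purely as an example; the study worked through the tools' own chat interfaces, and the rubric and task text below are paraphrased placeholders, not the study's actual materials.

```python
# A minimal sketch of the study's zero-shot setup, using the OpenAI
# Python client purely for illustration. The rubric and task below are
# paraphrased placeholders, not the study's actual materials.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TAG_RUBRIC = """Classify the math task into exactly one level:
1. Memorization
2. Procedures Without Connections
3. Procedures With Connections
4. Doing Mathematics
Reply with the level name and a one-sentence justification."""

task = "Solve for x: 2x + 3 = 11. Show your work."  # placeholder task

response = client.chat.completions.create(
    model="gpt-4o",  # any chat model would do; the study compared 11 tools
    messages=[
        {"role": "system", "content": TAG_RUBRIC},
        {"role": "user", "content": task},
    ],
)
print(response.choices[0].message.content)
```

Notice there are no examples, no step-by-step instructions, no coaching. Rubric in, rating out. That bareness is the point of the experiment.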
The Results: The AI is a "Safe-Middle" Assistant
The results were a mix of "not bad" and "pretty concerning."
1. The Average Score: 63%
The AI got about 6 out of 10 problems right. That's better than guessing, but far from the 100% accuracy a teacher needs to trust it blindly.
2. The "Middle-of-the-Road" Bias
This was the biggest finding. The AI was terrified of the extremes.
- When a problem was clearly simple (Memorization), the AI often said, "Oh, it's a 2-star procedure."
- When a problem was clearly hard (Doing Mathematics), the AI often said, "That's just a 3-star procedure."
- The Analogy: Imagine a weather forecaster who is afraid to say "It's a tornado" or "It's a drought." Instead, they just keep saying, "It's a cloudy day." The AI kept pushing everything into the middle categories (Procedures) because it felt safer there.
3. The "Surface-Level" Trap
The AI was great at spotting keywords but terrible at understanding thinking.
- Example: If a problem said "Show your work," the AI thought, "Oh, showing work means it's a simple procedure!"
- Reality: Sometimes "showing your work" means explaining a complex, creative solution. The AI looked at the surface text (the words) rather than the deep thinking (the brain power) required.
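To see why that trap is so easy to fall into, here is a toy sketch, deliberately far cruder than any real language model, of what "classifying by surface text" amounts to. Everything in it is invented for illustration:

```python
# A toy illustration (NOT how an LLM actually works) of the
# "surface-level trap": judging a task by its wording alone.
SURFACE_CUES = {
    "show your work": "Procedures Without Connections",
    "memorize": "Memorization",
    "solve": "Procedures Without Connections",
}

def naive_classify(task: str) -> str:
    """Label a task from keyword cues alone -- no model of the thinking."""
    for cue, label in SURFACE_CUES.items():
        if cue in task.lower():
            return label
    return "Procedures With Connections"  # the "safe middle" default

# This task demands creative, open-ended reasoning ("Doing Mathematics"),
# but the keyword matcher files it under a routine procedure anyway.
print(naive_classify(
    "Invent your own method for comparing these fractions. Show your work."
))
```

A real model is vastly more sophisticated than a keyword lookup, but the failure pattern the researchers observed looks just like this: the label tracks the wording, not the thinking.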
4. General vs. Specialized Tools
You might think the AI tools built specifically for schools (like MagicSchool or Khanmigo) would be better than the general ones (like ChatGPT or Grok).
- The Twist: They weren't. The general-purpose tools performed just as well (or just as poorly) as the school-specific ones. Being "education-branded" didn't make them smarter at this specific task.
The "Why" Behind the Mistakes
The researchers dug into the AI's "reasoning" (its explanation of why it gave a certain rating). They found the AI was confident but wrong.
- The "Plausible Lie": The AI would give a very convincing, professional-sounding explanation for why a hard problem was actually easy. It sounded like a teacher, but it was missing the point.
- The "Task G" Failure: There was one specific problem (Task G) where the AI failed 3 out of 4 times. It was a problem that looked like a real-world scenario but had a very specific instruction ("Set up a proportion"). The AI got confused by the real-world story and missed the specific instruction that made it a simple task.
What This Means for Teachers
So, should you fire your human assistant and hire the AI? No.
- Not Ready for Prime Time: You cannot let the AI grade your lesson plans or sort your math problems on its own. It will misclassify the hardest and easiest problems, which are the ones you most need to get right.
- A Good "Second Opinion": The AI is great as a drafting assistant. You can ask it, "Hey, what do you think the difficulty of this problem is?" and then you check its work. It can save you time by doing the first pass, but you must be the final judge.
- The "Prompt" Problem: The study used a very basic prompt. If teachers learn how to talk to the AI better (giving it examples, asking it to think step-by-step), the scores might go up. But right now, out of the box, it's a bit clumsy.
The Bottom Line
AI is like a new intern in the kitchen. It's eager, it knows a lot of facts, and it can read a recipe book. But it doesn't yet have the "taste" or the "intuition" to know if a dish is truly simple or truly complex. It tends to play it safe and call everything "medium."
Until we teach it to think (not just read), teachers need to keep their aprons on and do the final tasting themselves.