Beyond Scores: Explainable Intelligent Assessment Strengthens Pre-service Teachers' Assessment Literacy

This paper introduces XIA, an explainable intelligent assessment platform that uses visualized cognitive diagnostic reasoning to help pre-service teachers shift from opaque score-based judgments to evidence-based reasoning, thereby enhancing their assessment literacy through improved reflection and self-regulation.

Yuang Wei, Fei Wang, Yifan Zhang, Brian Y. Lim, Bo Jiang


Here is an explanation of the paper "Beyond Scores: Explainable Intelligent Assessment Strengthens Pre-service Teachers' Assessment Literacy," retold in simple, everyday language with creative analogies.

The Big Problem: The "Black Box" Teacher

Imagine you are a new teacher. You hand out a test, and a computer program grades it. Instead of just giving you a grade like "85%," the computer gives you a complex report full of math symbols, probability charts, and terms like "latent knowledge states."

It's like handing a chef a tasting report that only says, "The soup is 73% salty," without telling them which ingredient caused the saltiness or how to fix it. The chef (the teacher) looks at the number, shrugs, and says, "Okay, I guess the soup is salty," but they don't actually understand why.

This is the problem the researchers faced. New teachers (called pre-service teachers) are great at learning theory, but when they face these high-tech "black box" grading tools, they get stuck. They can't translate the confusing data into actual teaching strategies. They end up guessing or just looking at the final score, which doesn't help their students improve.
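To see why teachers get stuck, it helps to peek at what such a tool computes. Below is a minimal, hypothetical Python sketch of a simplified DINA-style cognitive diagnosis (not the paper's actual model; every skill name, question, and number is invented for illustration). Its output is exactly the kind of bare probability over "latent knowledge states" that leaves a teacher shrugging:

```python
import itertools

# A toy cognitive-diagnosis model -- hypothetical, NOT the paper's actual model.
# Each student either has or lacks two latent skills, and each question
# requires some subset of those skills (a simplified DINA-style setup).
SKILLS = ["fractions", "algebra"]
Q_MATRIX = {                 # which skills each question requires
    "Q1": {"algebra"},
    "Q2": {"fractions"},
    "Q3": {"fractions", "algebra"},
}
SLIP, GUESS = 0.1, 0.2       # P(miss despite mastery), P(lucky guess)

def p_correct(state, question):
    """P(correct answer | latent knowledge state), DINA-style."""
    return 1 - SLIP if Q_MATRIX[question] <= state else GUESS

def posterior(answers):
    """Posterior over every latent knowledge state, given observed answers."""
    states = [frozenset(c) for r in range(len(SKILLS) + 1)
              for c in itertools.combinations(SKILLS, r)]
    weights = {}
    for state in states:
        w = 1.0                              # uniform prior over states
        for q, correct in answers.items():
            p = p_correct(state, q)
            w *= p if correct else 1 - p
        weights[state] = w
    total = sum(weights.values())
    return {state: w / total for state, w in weights.items()}

answers = {"Q1": True, "Q2": False, "Q3": False}
for state, prob in sorted(posterior(answers).items(), key=lambda kv: -kv[1]):
    print(sorted(state), round(prob, 2))     # e.g. ['algebra'] 0.79 -- no "why"
```

The printout says how likely each combination of skills is, but nothing about which answer drove the conclusion. That missing "why" is what XIA adds.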

The Solution: The "X-Ray" Machine (XIA)

The researchers built a new tool called XIA (eXplainable Intelligent Assessment). Think of XIA not as a calculator, but as a medical X-ray machine for learning.

Instead of just saying, "This student is sick," XIA shows the doctor (the teacher) the broken bone, explains why it broke, and even lets them simulate what would happen if they put a cast on it.

XIA does this in two special ways:

  1. The "Why" (Contrastive Explanation): It answers, "Why did the computer think the student mastered this topic?" It compares the student's actual answers to a "what-if" scenario. Example: "The computer thinks the student knows Algebra because they got Question 1 right. But if they had gotten Question 1 wrong, the computer would have said they don't know it at all. So, Question 1 was the key."
  2. The "What If" (Counterfactual Explanation): It answers, "What would happen if the student knew more?" It lets the teacher tweak the data to see how the diagnosis changes. Example: "If I assume the student actually understood the concept, the computer predicts they would have answered these three tricky questions correctly." (Both ideas are sketched in the toy code below.)
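For the curious, here is a rough sketch of how those two explanation types could be computed on top of the toy model from the earlier snippet (it reuses posterior, answers, Q_MATRIX, p_correct, and SKILLS defined there; the real XIA presents these as interactive visualizations rather than code):

```python
def mastery_prob(answers, skill):
    """P(student has mastered `skill` | answers), from the toy model above."""
    return sum(p for state, p in posterior(answers).items() if skill in state)

# --- Contrastive: "Why does the model think the student knows algebra?" ---
# Flip each observed answer and see which flip moves the diagnosis the most.
base = mastery_prob(answers, "algebra")
for q in answers:
    flipped = {**answers, q: not answers[q]}
    delta = mastery_prob(flipped, "algebra") - base
    print(f"Had {q} gone the other way, P(algebra) would shift by {delta:+.2f}")
# The biggest shift points the teacher at the key piece of evidence (here, Q1).

# --- Counterfactual: "What if the student actually had both skills?" ---
assumed = frozenset(SKILLS)              # hypothetical full-mastery state
for q in Q_MATRIX:
    print(f"Assuming full mastery, P(correct on {q}) = "
          f"{p_correct(assumed, q):.2f}")
```

The contrastive loop surfaces which single answer the diagnosis hinges on; the counterfactual loop previews what the model would expect if the student's knowledge were different.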

The Experiment: Training the Trainees

The team tested this on 21 new teachers in China. They split them into three groups:

  • Group A (The Control): Got no help. Just the raw test scores.
  • Group B (The Dashboard): Got a standard dashboard with stats (like difficulty levels and error rates), but no "why" explanations.
  • Group C (The Full XIA): Got the dashboard plus the X-ray machine (the "Why" and "What If" explanations).

The Results: From Guessing to Diagnosing

The results were fascinating, like watching a student go from guessing on a test to actually understanding the subject.

  • Group A (No Help): They barely changed. They kept relying on their gut feelings and the final score. They were like a driver trying to navigate a city with a map that only shows the destination, not the roads.
  • Group B (Stats Only): They started looking at more details. They noticed, "Oh, this question was really hard for everyone," or "This student made a specific type of mistake." They were better, but they were still just looking at the data, not necessarily understanding the logic behind it.
  • Group C (Full XIA): This group had the biggest "Aha!" moment.
    • They stopped guessing: They stopped saying, "The student is bad at math." Instead, they said, "The student failed because they missed a specific prerequisite step, and here is the evidence."
    • They became better judges: Their errors in judging student ability dropped significantly. They were less likely to make wild mistakes. (A toy illustration of what "judgment error" means follows this list.)
    • They thought deeper: They started asking better questions. Instead of just accepting the computer's grade, they used the tool to challenge it: "Wait, the computer says they know this, but if I look at this specific question, it looks like they guessed. Let me check the 'What If' scenario."
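As a purely illustrative aside, "errors in judging student ability" can be pictured as the average gap between a teacher's mastery estimates and the model's diagnosis. The metric and numbers below are invented to show what that quantity means; they are not taken from the paper:

```python
# Hypothetical metric: mean absolute difference between a teacher's mastery
# estimates and the model's diagnosed mastery, both on a 0-1 scale.
def judgment_error(teacher_estimates, diagnosed):
    return sum(abs(t - d) for t, d in zip(teacher_estimates, diagnosed)) / len(diagnosed)

gut_feeling    = [0.9, 0.2, 0.8]   # score-based guesses for three students
evidence_based = [0.6, 0.4, 0.7]   # estimates after tracing the model's reasoning
diagnosed      = [0.5, 0.4, 0.7]   # the model's diagnosed mastery
print(judgment_error(gut_feeling, diagnosed))     # 0.23...
print(judgment_error(evidence_based, diagnosed))  # 0.03...
```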

The Takeaway: Teaching Teachers to Fish

The main lesson here is that giving teachers data isn't enough; you have to show them how the data is cooked.

  • If you give a teacher a raw score, they are like a person staring at a finished cake and trying to guess the recipe.
  • If you give them a dashboard, they can see the ingredients.
  • But if you give them XIA, you give them the recipe, the mixing instructions, and a simulation of what happens if you add too much sugar.

In simple terms:
This study shows that when you build AI tools that explain their reasoning (like a teacher explaining why a student got a question wrong, rather than just marking it red), new teachers learn to trust the data, understand the students better, and make smarter decisions in the classroom. They move from being "score readers" to "learning detectives."

Why This Matters for the Future

As schools use more AI to grade and track students, we don't want teachers to become passive observers who just read the computer's output. We want them to be partners with the AI. This tool shows us how to build AI that doesn't just give answers, but teaches teachers how to think, reflect, and improve their craft. It turns the "black box" into a clear window.