Imagine you are a teacher who has to grade hundreds of essays every week. It's exhausting, time-consuming, and sometimes, you might be a little too harsh on a Tuesday morning or too lenient on a Friday afternoon. This is the problem Automated Essay Scoring (AES) tries to solve: using computers to do the grading so teachers can focus on teaching.
For a long time, computers were like rigid robots that could only count how many words were in an essay or check for spelling mistakes. They couldn't really "understand" the writing. But recently, we have these super-smart AI brains called Large Language Models (LLMs) that can read, write, and reason almost like humans.
This paper is like a report card for a specific experiment: Can these new AI brains grade Austrian high school German essays as well as a human teacher?
Here is the breakdown of what they did, using some everyday analogies:
1. The Setup: The "Exam Hall"
The researchers gathered 101 real student essays from Austrian high school exams. These weren't just random paragraphs; they were specific types of writing like opinion pieces, letters to the editor, and literary analyses.
Think of the grading system like a recipe. In Austria, there is a strict "recipe" (called a Rubric) that teachers must follow. It breaks the essay down into four ingredients:
- Content (Did they say the right things?)
- Structure (Is it organized well?)
- Style (Does it sound good?)
- Language (Is the grammar correct?)
The goal was to see if an AI could follow this recipe perfectly.
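If you like to see things as code, here is a tiny sketch of that rubric as a checklist, using the Austrian scale where Grade 1 is the best and Grade 5 is a fail. The field names and the simple averaging at the end are just illustrations; the paper doesn't spell out exactly how the four criteria are combined into a final grade.

```python
from dataclasses import dataclass

@dataclass
class RubricScores:
    """One essay's sub-scores on the Austrian 1 (best) to 5 (fail) scale.
    Field names are illustrative; they mirror the four rubric criteria above."""
    content: int    # Did they say the right things?
    structure: int  # Is it organized well?
    style: int      # Does it sound good?
    language: int   # Is the grammar correct?

    def overall(self) -> int:
        # Illustrative only: a plain average of the four sub-scores, rounded.
        # The real rubric may weight or combine the criteria differently.
        return round((self.content + self.structure + self.style + self.language) / 4)

essay = RubricScores(content=2, structure=1, style=3, language=2)
print(essay.overall())  # -> 2
```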
2. The Contestants: The AI Models
They tested four different AI models (DeepSeek, Qwen, Mixtral, and Llama). Imagine these as four different student interns hired to help grade the papers.
- Mixtral was like an intern who got bored and just gave everyone a "C" (Grade 3) without reading.
- DeepSeek was an intern who was too strict, gave weird feedback, and sometimes even spoke Chinese instead of German!
- Qwen was okay but still a bit too harsh.
- Llama 3.3 was the clear winner. It was the most reliable, understood the instructions best, and gave the most human-like feedback.
3. The Coaching: How did they guide the AI?
The researchers didn't retrain the models. Instead, they tried three different prompting strategies to help the AI understand what a "good" grade looks like:
- The "Zero-Shot" Approach (The Blind Guess): They just handed the AI the essay and the rules.
- Result: The AI was confused. It didn't know what a "Grade 1" (perfect) actually looked like compared to a "Grade 5" (fail).
- The "RAG" Approach (The Library): They gave the AI a "cheat sheet" of other essays (some good, some bad) to look at while grading.
- Result: Better, but the AI still struggled to distinguish between the very best and the very worst essays. It was like giving a student a textbook but not telling them which page to read.
- The "Few-Shot" Approach (The Mentorship): This was the best method. They showed the AI a few examples: "Here is a perfect essay (Grade 1), here is a messy one (Grade 5), and here is a middle-of-the-road one (Grade 3)." Then they asked the AI to grade a new essay based on those examples.
- Result: This worked the best. It was like showing the intern a portfolio of past work so they could get a feel for the standards.
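To make the difference between the blind guess and the mentorship concrete, here is a rough sketch of how the two kinds of prompts might be put together. The rubric text, the example essays, and the function names are placeholders, not the actual prompts from the paper.

```python
# Minimal sketch, assuming a simple "send one big text prompt to the model" setup.
# The rubric text, example essays, and function names are illustrative placeholders.

RUBRIC = ("Grade each essay from 1 (sehr gut) to 5 (nicht genügend) "
          "on Content, Structure, Style and Language.")

def zero_shot_prompt(essay: str) -> str:
    # Zero-shot (the blind guess): only the rules and the essay, no examples.
    return (f"You are an Austrian German teacher.\n{RUBRIC}\n\n"
            f"Essay:\n{essay}\n\nReturn the four sub-grades and an overall grade.")

def few_shot_prompt(essay: str, anchors: list[tuple[str, int]]) -> str:
    # Few-shot (the mentorship): show a few graded example essays first,
    # so the model can calibrate what a 1, a 3 and a 5 actually look like.
    # (RAG works similarly, except the examples are fetched automatically
    # from a collection instead of being hand-picked.)
    parts = [f"You are an Austrian German teacher.\n{RUBRIC}\n"]
    for example_text, grade in anchors:
        parts.append(f"Example essay (overall Grade {grade}):\n{example_text}\n")
    parts.append(f"Now grade this new essay:\n{essay}\n\n"
                 "Return the four sub-grades and an overall grade.")
    return "\n".join(parts)

# Usage with placeholder texts:
anchors = [("<a very strong essay>", 1), ("<an average essay>", 3), ("<a failing essay>", 5)]
prompt = few_shot_prompt("<the student essay to grade>", anchors)
```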
4. The Results: Did they pass the test?
Here is the bad news: The AI is not ready to replace human teachers yet.
- The Agreement Rate: When the AI and the human teacher graded the same essay, they only agreed on the final grade about 33% of the time.
- The Sub-scores: Even when looking at a single criterion, such as "Content" or "Language," the AI only agreed with the human about 40% of the time.
Think of it like this: if the human teacher gives an essay a Grade 1 (the top mark), the AI might give it a Grade 2 or even a Grade 3. It's not totally random, but it's not accurate enough to be trusted with a real student's future.
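That 33% figure is an "exact agreement" rate: the share of essays where the AI's grade matched the teacher's grade exactly. Here is a minimal sketch of that calculation with made-up grades (the paper may also report stricter statistics, such as weighted kappa, that this sketch ignores).

```python
def exact_agreement(human: list[int], model: list[int]) -> float:
    """Share of essays where the model's grade matches the teacher's grade exactly."""
    return sum(h == m for h, m in zip(human, model)) / len(human)

def adjacent_agreement(human: list[int], model: list[int]) -> float:
    """Share of essays where the model is within one grade of the teacher."""
    return sum(abs(h - m) <= 1 for h, m in zip(human, model)) / len(human)

# Made-up grades for six essays, purely for illustration (1 = best, 5 = fail).
teacher = [1, 2, 3, 3, 4, 5]
ai      = [2, 2, 3, 4, 3, 4]
print(exact_agreement(teacher, ai))     # 0.333... (roughly the level reported)
print(adjacent_agreement(teacher, ai))  # 1.0
```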
5. Why did it fail?
The paper points out a few reasons why the AI struggled:
- The "Black Box" Problem: The AI sometimes guessed the middle ground (Grade 3) because it was scared to be too extreme.
- The "Handwriting" Hurdle: Many student essays were handwritten. The computer had to scan them and turn them into text (OCR). This process introduced errors, like turning a messy "s" into an "f," which confused the AI.
- The "One Teacher" Bias: The dataset only had grades from one human teacher. If that teacher was a bit quirky or strict, the AI learned that specific style, not the "universal" truth of grading.
The Bottom Line
This paper is a reality check. While AI has made huge leaps, it's currently more like a helpful teaching assistant than a head teacher.
It can read an essay, check for grammar, and give a rough idea of the quality. But it can't yet understand the nuance of a student's unique voice or the subtle differences between a "good" and a "great" argument in the way a human can.
The Future: The authors suggest that in the future, AI won't replace teachers. Instead, it will be a tool that does the boring work (checking spelling, counting words, suggesting structure) so teachers can spend their time giving the real human feedback that students need. But for now, we still need humans in the loop to make sure the grades are fair.