DETECT: Determining Ease and Textual Clarity of German Text Simplifications

Imagine you are a teacher trying to grade student essays. Your goal is to check if the students successfully rewrote a difficult, complex paragraph into something simple and easy to read, without losing the original meaning or making it sound like gibberish.

For a long time, graders of German text simplification have been using a broken ruler. They've been using tools like BLEU and SARI, which mostly check whether the student used the exact same words as a reference answer. If the student swapped a hard word for a simple synonym, the old ruler says, "Bad job!" even though the student actually did a great job.

The authors of this paper, a team from the University of Zurich, decided to build a new, smarter ruler called DETECT.

Here is how they built it, explained with some everyday analogies:

1. The Problem: The "Broken Ruler"

Currently, if you ask a computer to simplify a German news article, the old tools can't tell the difference between a good simplification and a bad one. They are like a judge who only cares if the defendant is wearing the same shoes as the victim, ignoring whether the defendant actually committed the crime or not.

2. The Solution: DETECT (The "Smart Grader")

The team created DETECT, a new system that looks at three specific things, just like a human teacher would:

Simplicity: Is it actually easier to read?
Meaning Preservation: Did they keep the main point, or did they accidentally delete the most important part?
Fluency: Does it sound natural, or does it sound like a robot speaking?

3. The Big Challenge: No Human Teachers Available

Usually, to teach a computer how to grade, you need human experts to grade thousands of essays first. But for German text simplification, no large collection of human-graded examples existed. It was like trying to teach a new student how to grade without any answer keys.

4. The Creative Workaround: The "AI Interns"

Since they couldn't find enough human teachers, the authors used Large Language Models (LLMs)—the same kind of AI that powers chatbots—as "AI Interns."

Here is their clever 5-step recipe:

Step 1: The Practice Test (The Dataset): They gathered a bunch of complex German sentences and found their simplified versions (like a practice test).
Step 2: The AI Students: They asked six different AI models to rewrite these sentences. Some were good, some were bad.
Step 3: The AI Judges: This is the magic part. They asked three smaller open-source AI models to act as "Head Teachers." They gave the Head Teachers a set of rules (a rubric), which had been refined with the help of GPT-4o and human feedback, and asked them to grade the work of the other AI models.
- The Twist: The Head Teachers didn't just give a single score. The rubric told them to grade simplicity, meaning, and fluency separately, so that specific mistakes, like "You added fake information!" or "You made the sentence too long!", would show up in the grade.
Step 4: The Final Exam: They took the scores from these "AI Judges" and used them to train DETECT. Think of DETECT as a student who studied the Head Teachers' grading notes so thoroughly that it can now grade essays itself.
Step 5: The Reality Check: To make sure DETECT wasn't just copying the AI Interns, they brought in real human experts to grade a final set of essays.

5. The Results: DETECT Wins!

When they compared DETECT to the old "broken rulers" (BLEU, SARI, etc.), DETECT was much closer to what the real human experts thought.

Old Tools: "This sentence is 50% similar to the original, so it's a B."
DETECT: "This sentence kept the meaning perfectly, is very easy to read, and flows well. It's an A+."

Why This Matters

The paper shows that we don't always need armies of human graders to build better AI tools. By using AI to teach AI (with a little bit of human guidance to fix the rules), we can create systems that understand meaning and accessibility, not just word counts.

In a nutshell:
The authors built a German-specific "smart grading system" for text simplification. Since no large set of human-graded examples existed, they used AI judges to create a library of graded examples. They then trained a new AI (DETECT) on these examples. The result? A tool that tracks what human experts consider simple and clear more closely than the older word-matching tools. It's like upgrading from a ruler that only measures length to a microscope that can see the quality of the ink.

Here is a detailed technical summary of the paper "DETECT: Determining Ease and Textual Clarity of German Text Simplifications."

1. Problem Statement

Automatic Text Simplification (ATS) aims to make text accessible to diverse groups, including language learners and individuals with cognitive disabilities. While German ATS research has advanced with new datasets and multilingual Large Language Models (LLMs), automatic evaluation remains a significant bottleneck.

Limitations of Current Metrics: Standard metrics like BLEU, SARI, and BERTScore rely on n-gram overlap or embedding similarity. They fail to directly measure the core criteria of simplification quality: Simplicity, Meaning Preservation, and Fluency. Consequently, they show weak correlations with human judgments.
The German Gap: While specialized, learnable metrics exist for English (e.g., LENS), they rely heavily on human-annotated corpora. No equivalent metric exists for German due to the absence of large-scale, human-annotated evaluation datasets.
Annotation Scarcity: Creating human-annotated datasets is expensive and time-consuming, hindering the development of robust evaluation tools for German.
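To illustrate why overlap-based scores can mislead, here is a minimal sketch of clipped unigram precision. It is a deliberately simplified stand-in, not BLEU or SARI themselves (no higher-order n-grams, no brevity penalty, no comparison with the source), and the German sentences are invented for illustration rather than taken from the paper.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Share of candidate tokens that also occur in the reference (clipped counts)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # A Counter returns 0 for tokens the reference does not contain.
    matched = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return matched / total


# Invented example sentences (not from the paper's data).
reference = "der arzt gab dem patienten ein medikament"
good_paraphrase = "der doktor gab dem kranken eine arznei"      # same meaning, different words
meaning_flipped = "der arzt gab dem patienten kein medikament"  # "kein" negates the meaning

print(unigram_precision(good_paraphrase, reference))
print(unigram_precision(meaning_flipped, reference))
```

In this toy case the meaning-reversing output shares more tokens with the reference than the faithful paraphrase does, so a purely overlap-based score ranks it higher.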

2. Methodology

The authors propose DETECT, the first German-specific, learnable evaluation metric. The framework adapts the LENS architecture but replaces human annotations with synthetic data generated by LLMs. The methodology follows a five-step pipeline:

A. Dataset Construction (SIMPEVALDE)

Source: The authors constructed SIMPEVALDE, a benchmark dataset combining existing German corpora (LHA-APA and DEPLAIN-APA) aligned at CEFR levels A2 and B1.
Curation: Due to alignment issues in existing corpora (e.g., hallucinations or missing information in "gold" pairs), the authors implemented a systematic filtering process using adjusted BERTScore and manual review.
Size: The final dataset contains 160 pairs (training and test sets) covering three simplification strategies: delete, split, and paraphrase.

B. Synthetic Data Generation

ATS Generation: Six different LLMs (including instruction-tuned models like LeoLM, DiscoLlama, Qwen2, Llama3, and task-specific fine-tuned models like mBART-DEPLAIN) generated simplifications for the complex sentences in SIMPEVALDE.
LLM-as-a-Judge: Instead of human annotators, three open-source distilled LLMs (Distil-Llama-8B, Distil-Qwen-7B, and Zephyr-7B) acted as judges to assign quality scores.
Prompt Engineering: The authors developed a Prompt-Final through an iterative Human-in-the-Loop process with GPT-4o. This prompt:
- Separates the evaluation into three distinct criteria (Simplicity, Meaning Preservation, Fluency) rather than a single composite score.
- Allows continuous scoring (0–100) instead of discrete levels.
- Incorporates specific German "Easy Language" (Leichte Sprache) guidelines.
- Uses a weighted aggregation formula for the total score, penalizing failures in Meaning Preservation and Simplicity more heavily than Fluency.
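The weighted aggregation can be pictured with a small sketch. The weights below are hypothetical: this summary only states that Meaning Preservation and Simplicity count more than Fluency, and a plain weighted average is just one possible form of such a formula, not necessarily the one used in the paper.

```python
# Hypothetical weights: the summary only says Meaning Preservation and
# Simplicity weigh more than Fluency; the actual values are not given here.
ASSUMED_WEIGHTS = {"simplicity": 0.4, "meaning": 0.4, "fluency": 0.2}


def total_score(simplicity: float, meaning: float, fluency: float,
                weights: dict[str, float] = ASSUMED_WEIGHTS) -> float:
    """Weighted average of three criterion scores, each on a 0-100 scale."""
    scores = {"simplicity": simplicity, "meaning": meaning, "fluency": fluency}
    for name, value in scores.items():
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} must be in [0, 100], got {value}")
    weight_sum = sum(weights[name] for name in scores)
    weighted = sum(weights[name] * value for name, value in scores.items())
    return weighted / weight_sum


print(total_score(80, 20, 80))  # weak Meaning Preservation
print(total_score(80, 80, 20))  # weak Fluency
```

With these assumed weights, a failure in Meaning Preservation lowers the total more than an equally large failure in Fluency, which is the behavior the prompt's aggregation is described as having.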

C. Model Training

Architecture: DETECT is a feed-forward neural network (FFNN) based on RoBERTa, using German-specific WECHSEL embeddings.
Input: The model takes embeddings of the complex source, the simplified output, and reference texts, along with their dot products and differences.
Objective: It is trained to predict the three separate LLM-derived scores (Simplicity, Meaning Preservation, Fluency) and a total score.
Optimization: Hyperparameters (learning rate, dropout, hidden layer size) were tuned to prevent overfitting given the smaller dataset size compared to English counterparts.
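As a rough sketch of the input described above, the function below concatenates three sentence embeddings with pairwise products and differences. It assumes that "dot products" means element-wise products and that differences are taken as absolute values, a common choice in learned-metric regressors; the exact feature layout and the name `combine_features` are assumptions, and the toy vectors stand in for real encoder outputs.

```python
def combine_features(output: list[float], source: list[float],
                     reference: list[float]) -> list[float]:
    """Concatenate embeddings with element-wise products and absolute differences."""
    if not (len(output) == len(source) == len(reference)):
        raise ValueError("all embeddings must have the same dimension")
    prod_src = [o * s for o, s in zip(output, source)]
    prod_ref = [o * r for o, r in zip(output, reference)]
    diff_src = [abs(o - s) for o, s in zip(output, source)]
    diff_ref = [abs(o - r) for o, r in zip(output, reference)]
    # Assumed layout: [output; source; reference; products; differences]
    return output + source + reference + prod_src + prod_ref + diff_src + diff_ref


# Toy 3-dimensional "embeddings"; real ones would come from the sentence encoder.
features = combine_features([1.0, 2.0, 0.0], [1.0, 0.0, 2.0], [0.5, 2.0, 0.0])
print(len(features))
```

A feed-forward regressor would then map a vector like `features` to the predicted scores.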

D. Validation

The model was validated against:

LLM-Judge Scores: The averaged scores from the three distilled models used for training.
Human-Judge Scores: A new dataset of 360 test pairs manually graded by three native German-speaking experts using a simplified RANK & RATE protocol.
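The comparison against both score sets can be sketched as follows: average the judges' scores per item, then compute a Pearson correlation against the human ratings. All numbers here are made up for illustration and are not data from the paper; the helper names are hypothetical.

```python
import math


def mean_judge_scores(per_judge: list[list[float]]) -> list[float]:
    """Average the scores of several judges item by item."""
    n_items = len(per_judge[0])
    if any(len(scores) != n_items for scores in per_judge):
        raise ValueError("every judge must score the same items")
    return [sum(col) / len(per_judge) for col in zip(*per_judge)]


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for four simplifications (not data from the paper).
judge_scores = [
    [90.0, 40.0, 70.0, 20.0],  # hypothetical judge 1
    [80.0, 50.0, 60.0, 30.0],  # hypothetical judge 2
    [85.0, 45.0, 65.0, 25.0],  # hypothetical judge 3
]
human_scores = [88.0, 35.0, 72.0, 30.0]

llm_mean = mean_judge_scores(judge_scores)
print(llm_mean)
print(round(pearson(llm_mean, human_scores), 3))
```

The same `pearson` helper could be applied to a metric's predictions and the human ratings, which is the kind of correlation the results section reports.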

3. Key Contributions

DETECT Metric: The first learnable evaluation metric specifically designed for German text simplification, scoring Simplicity, Meaning Preservation, and Fluency separately and combining them into a total score.
Synthetic Supervision Pipeline: A novel framework demonstrating that high-quality training data for evaluation metrics can be generated entirely via LLMs, bypassing the need for expensive human annotation.
Prompt Refinement: A systematic approach to refining evaluation rubrics using LLMs and human feedback, resulting in a prompt that significantly improves inter-rater agreement compared to standard rubrics.
New Benchmark: The creation of SIMPEVALDE and a corresponding human evaluation dataset (360 pairs), which is the largest German human evaluation dataset for ATS to date.

4. Results

Correlation with Human Judgment: DETECT significantly outperforms standard metrics (BLEU, SARI, BERTScore) in correlating with human judgments.
- Total Score: DETECT achieves a Pearson correlation of 0.64 with human judgments, compared to 0.55 for BERTScore and 0.32 for BLEU.
- Meaning Preservation: DETECT shows the strongest gain here, reaching 0.68 correlation, vastly surpassing BERTScore (0.48) and SARI (0.04).
- Fluency: DETECT leads with 0.35, outperforming BERTScore (0.31).
- Simplicity: This remains the weakest dimension (0.32), though DETECT still outperforms SARI and BLEU.
LLM vs. Human Agreement: The LLM-Judge scores showed a strong correlation with human scores for Meaning Preservation (r = 0.77), validating the synthetic supervision approach.
Strategy Performance: DETECT performed best on split-based simplifications and showed some degradation on paraphrase-based tasks, likely due to the semantic complexity of paraphrasing.

5. Significance and Limitations

Significance:

Scalability: The study suggests that useful evaluation metrics for settings with little annotated data (such as German ATS) can be developed without massive human-annotated datasets by leveraging LLMs for synthetic supervision.
Quality Dimensions: It highlights that standard metrics fail to capture "meaning preservation," a critical factor for accessibility, whereas DETECT successfully models this.
Guidelines: The paper provides transferable guidelines for using LLMs in "Human-in-the-Loop" settings for language accessibility tasks.

Limitations:

Domain Specificity: The model is trained and evaluated exclusively on news domain data; generalization to medical, legal, or educational texts is unproven.
Granularity: DETECT tends to cluster outputs into broad quality groups (high vs. low) rather than providing fine-grained rankings for similar-quality candidates.
LLM Instability: The reliance on distilled LLMs for supervision introduces potential instability and occasional misinterpretation of German-specific linguistic nuances.
Sentence-Level Only: The current metric applies only to sentence-level simplification, not document-level.

In conclusion, DETECT narrows the gap between automated metrics and human judgment for German text simplification through a scalable, LLM-driven methodology, with Simplicity remaining its weakest dimension.