Imagine you have a very grumpy, rude robot that loves to say mean things. Your goal is to teach this robot to be polite and kind without changing what it's trying to say. This is called Text Detoxification.
But here's the problem: How do you know if your robot is actually getting better? If you ask a human to grade every single sentence the robot writes, it takes forever and costs a fortune. So, scientists try to build "automatic graders" (computer programs) to do the grading.
This paper is like a massive report card for these automatic graders, but with a twist: instead of just checking English, they tested them on nine different languages (like Arabic, Chinese, Russian, Hindi, and others).
Here is the story of what they found, explained simply:
1. The Old Graders Were "Blind"
For a long time, the standard automatic grader was a simple tool called ChrF, which counts how many short sequences of letters a sentence shares with a "perfect" answer key. Imagine a teacher who only checks whether a student's essay uses the same letters and words as the answer key.
- The Flaw: If the robot says, "I am angry," and the perfect answer is "I feel furious," the old grader would give it a bad score because the words are different, even though the meaning is the same.
- The Result: The old grader was terrible at understanding the meaning behind the words. It was like judging a painting only by counting the number of red pixels, ignoring the picture itself.
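To see why this kind of grader is "blind", here is a minimal toy version of a ChrF-style score (the real ChrF averages several n-gram sizes and weights recall more heavily; this sketch uses character bigrams and plain F1, and the function names are mine, not the paper's):

```python
from collections import Counter

def char_ngrams(text, n):
    """All overlapping character n-grams of a string, with counts."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(hypothesis, reference, n=2):
    """A toy character-bigram F1 score: a simplified stand-in for ChrF
    (the real metric averages n-gram orders 1..6 and favors recall)."""
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())  # shared bigrams, counted once each
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Same meaning, different words: the letter-counting grader fails it.
print(simple_chrf("I am angry", "I feel furious"))   # low score
# Same words plus one extra: the grader loves it.
print(simple_chrf("I am angry", "I am angry too"))   # high score
```

The two print lines show the flaw from the story above: "I am angry" versus "I feel furious" scores near zero even though the meaning matches, while a trivial word-for-word copy scores high.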
2. The New "Super-Graders" (The XCOMET Family)
The researchers introduced new tools based on Large Language Models (LLMs). Think of these as super-smart teaching assistants who have read millions of books.
- How they work: Instead of just counting words, they look at the whole picture. They compare three things at once:
- The Rude Original (What the robot started with).
- The Polite New Version (What the robot wrote).
- The Human Ideal (What a human would have written).
- The Analogy: Imagine a judge at a cooking competition. The old grader just checked if the ingredients were on the list. The new grader tastes the dish, compares it to the original bad recipe, and checks if it tastes like the chef's perfect version.
3. The "Three-Part" Test
To grade the robot properly, the researchers realized they needed to check three things, like a three-legged stool. If one leg is missing, the stool falls over.
- Leg 1: Fluency (Is it smooth?)
- Old way: Did it sound like a robot?
- New way: The super-grader checks if the sentence flows naturally, like a human speaking, not just if the grammar is technically correct.
- Leg 2: Content (Did it keep the meaning?)
- Old way: Did it keep the same words?
- New way: Did it keep the story? If the robot was angry about a broken car, the new version should still be about a broken car, just without the swearing.
- Leg 3: Toxicity (Is it nice?)
- Old way: Did it stop using bad words?
- New way: The grader checks if the attitude changed. It compares the "badness" of the original to the new version to see if the robot actually improved.
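One common way to turn the three legs into a single grade is to multiply them, so that a failure on any one leg collapses the total, just like removing a leg from the stool. This sketch uses made-up numbers and my own function name; it illustrates the idea, not the paper's exact formula:

```python
def joint_score(fluency, content, detox):
    """Combine the three legs by multiplication.

    All inputs are assumed to be normalized to [0, 1]:
      fluency - does the rewrite read naturally?
      content - does it keep the original meaning?
      detox   - how much of the toxicity was removed?
    If any leg is near zero, the whole grade is near zero.
    """
    return fluency * content * detox

# Hypothetical robot outputs, scored on each leg:
polite_but_offtopic = joint_score(fluency=0.9, content=0.2, detox=0.95)
faithful_rewrite    = joint_score(fluency=0.9, content=0.9, detox=0.9)
print(polite_but_offtopic)  # drags far down by the weak content leg
print(faithful_rewrite)     # all three legs hold, so the grade stays high
```

Notice that the off-topic rewrite loses badly even though it is perfectly polite and fluent: being nice is not enough if the robot stopped talking about the broken car.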
4. The "Human vs. Robot" Showdown
The researchers also tested if they could just use a giant AI (like GPT-4 or Llama) to act as the judge instead of building special tools.
- The Surprise: The giant AIs were great at some things (like checking if the meaning was preserved) but sometimes struggled with others (like checking if the sentence sounded natural in specific languages).
- The Winner: The custom-built "Super-Graders" (the XCOMET models) were the most consistent champions across all nine languages. They were like the Olympic athletes of grading: reliable, fast, and accurate.
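The "giant AI as judge" setup boils down to handing the model all three texts and asking for grades. The prompt wording, score scale, and example sentences below are illustrative assumptions, not the paper's exact instructions:

```python
# A hedged sketch of "LLM as judge": the prompt template and the 1-5
# scale are my own illustration, not the paper's actual setup.
JUDGE_PROMPT = """You are grading a text detoxification system.

Original (toxic):  {source}
Rewritten:         {hypothesis}
Human reference:   {reference}

Rate the rewrite from 1 to 5 on each of:
1. Fluency: does it read naturally?
2. Content: does it keep the original meaning?
3. Toxicity: is the rude attitude gone?

Answer with three numbers, e.g. "5 4 5"."""

def build_judge_prompt(source, hypothesis, reference):
    """Fill the template with one example to send to the judge model."""
    return JUDGE_PROMPT.format(source=source,
                               hypothesis=hypothesis,
                               reference=reference)

prompt = build_judge_prompt("Your car is garbage, idiot.",
                            "I don't think your car runs well.",
                            "Your car seems unreliable.")
print(prompt)
```

The reply would then be parsed back into the three numeric legs. The catch the paper found is in the quality of those numbers: such judges tend to grade meaning preservation well but can misjudge fluency in some of the nine languages.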
5. The "Fine-Tuning" Secret Sauce
Finally, they tried taking a standard AI and giving it a crash course specifically on "how to grade detoxified text."
- The Result: This "trained" AI became incredibly good at the job, almost as good as the custom super-graders, but it was much cheaper to run. It's like taking a general doctor and training them specifically to be a heart surgeon; they become the best at that one task.
The Big Takeaway
The paper concludes that if you want to build a system that cleans up rude text on the internet (for social media, chatbots, or kids' apps), you can't just use the old, simple tools. You need smart, multi-language graders that understand the meaning and the feeling of the text, not just the spelling.
They have now released all their tools, data, and "report cards" to the public, so anyone can build better, kinder, and safer AI for the whole world.