A Dataset for Probing Translationese Preferences in English-to-Swedish Translation

This paper introduces the first freely available English-to-Swedish dataset designed to benchmark language models' tendency to prefer literal "translationese" over idiomatic phrasing, revealing that exposure to source text biases models toward unnatural translations even when context is removed.

Jenny Kunz, Anja Jarochenko, Marcel Bollmann

Published Tue, 10 Ma

Imagine you are learning a new language, say Swedish, by watching thousands of movies and reading subtitles. You start to speak, but you sound a bit like a robot who just learned the dictionary but never hung out with locals. You use words that are technically correct but sound stiff, awkward, or like a direct copy-paste from English. In the world of linguistics, this "robot accent" is called Translationese.

This paper is like a detective story where researchers built a special tool to catch AI models doing exactly this. Here is the breakdown in simple terms:

1. The Problem: The "Robot Accent"

When computers translate English to Swedish, they often produce text that is grammatically correct but feels unnatural. It's like ordering a coffee and saying, "I would like a liquid bean beverage," instead of "I'd like a coffee." It works, but it sounds weird.

The researchers found that even the smartest AI models (Large Language Models) are guilty of this. They tend to stick too closely to the English source text, resulting in Swedish that feels stiff and "translated" rather than natural and "native."

2. The Solution: A "Spot the Difference" Game

To fix this, the team created a new dataset (a collection of data) called a Minimal Pair Probe. Think of this as a "Spot the Difference" game for sentences.

For every sentence, they created two versions:

  • The "Robot" Version: A literal, stiff translation that sounds like Translationese.
  • The "Human" Version: A natural, idiomatic Swedish sentence that a native speaker would actually say.

They also added "error tags," which are like red flags the researchers put on the sentences to explain why the robot version was weird. Was it missing a word? Did it use the wrong slang? Did it translate an idiom too literally?
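To make the "spot the difference" idea concrete, one entry in such a minimal-pair dataset might look roughly like the sketch below. The field names and the Swedish example are illustrative guesses, not the paper's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class MinimalPair:
    """One 'spot the difference' item: two Swedish candidates for one English
    source sentence. Field names are illustrative, not the real dataset schema."""
    source_en: str          # original English sentence
    translationese_sv: str  # literal, "robot accent" version
    idiomatic_sv: str       # natural version a native speaker would say
    error_tags: list = field(default_factory=list)  # why the literal one sounds off

# Illustrative example: an English idiom translated word for word vs. naturally.
pair = MinimalPair(
    source_en="It's raining cats and dogs.",
    translationese_sv="Det regnar katter och hundar.",  # idiom copied literally
    idiomatic_sv="Det ösregnar.",                       # what a Swede would say
    error_tags=["literal_idiom"],
)
print(pair.error_tags)  # → ['literal_idiom']
```

The error tags are what turn the dataset from a simple A/B test into a diagnostic tool: they let researchers break results down by the *kind* of mistake the robot version makes.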

3. The Experiment: Testing the AI's Taste Buds

The researchers then fed these sentence pairs to various AI models and asked a simple question: "Which one sounds better?"

They tested the AI in two different scenarios:

  • Scenario A (The Blind Taste Test): They showed the AI just the Swedish sentences, without revealing what the original English was.
  • Scenario B (The Translation Task): They showed the AI the English sentence and said, "Translate this into Swedish," giving it the source as context.

4. The Results: The AI's Bad Habits

The findings were quite revealing:

  • The AI Loves the Robot Accent: Even when the AI wasn't forced to translate, it often preferred the stiff, "robot" version of the Swedish sentence. It seems the AI has a built-in bias toward literal, word-for-word phrasing.
  • The "Source Language" Trap: When the AI was given the English source sentence (Scenario B), it became even more likely to choose the stiff translation. It's as if the English sentence acts like a magnet, pulling the AI toward a literal translation and away from natural Swedish.
  • Context Helps, But Not Enough: Giving the AI more background story (like the sentences before the target sentence) helped it choose the natural version more often. It's like giving a translator a whole chapter of a book instead of just one sentence; they get the vibe better. However, even with a lot of context, the AI still struggled to fully ditch the "robot accent."
  • Bigger Isn't Always Better: Interestingly, making the AI models bigger and smarter didn't always fix the problem. Sometimes, the bigger models actually got worse at spotting the natural Swedish when they were trying to translate from English.

5. Why This Matters

Think of this dataset as a gym for AI. Just as a weightlifter needs specific weights to build muscle, AI models need specific, high-quality data to learn how to speak naturally.

Currently, many AI models are trained on internet data that is full of these "robot translations." This paper provides a free, open tool (a dataset) that researchers can use to measure how robotic their models sound, a first step toward models that stop sounding like robots and start sounding like real Swedes.

In a nutshell:
The authors built a "taste test" to prove that AI models sound like stiff robots when translating. They found that showing the AI the original English text makes it sound even more robotic. Their new dataset is a benchmark designed to help researchers build future AI models that speak with a natural, human voice, rather than a literal, translated one.