Imagine you have a very smart, but slightly mysterious, robot translator. You give it a sentence in English like, "The writer finished the book." The word "writer" doesn't tell you whether the person is a man or a woman. But when the robot translates this into German or Spanish, it has to pick a gender, because those languages require one (in German, for instance, it must choose between the masculine "Schriftsteller" and the feminine "Schriftstellerin").
Often, this robot guesses based on old stereotypes. If it sees the word "writer," it might just assume it's a man because that's what it saw most often in its training books.
This paper asks a simple but deep question: which specific words in the sentence lead the robot to that guess? And does the robot look at the same clues a human would?
Here is the breakdown of their investigation, explained with some everyday analogies.
1. The Detective Work: "Contrastive Explanations"
The researchers didn't just ask the robot, "Why did you do that?" (Robots can't really answer that). Instead, they played a game of "What If?"
- The Scenario: They took a sentence with a mystery gender (e.g., "The chef is cooking").
- The Trick: They forced the robot to translate it twice.
- Version A: The robot's natural choice (e.g., "The male chef").
- Version B: They manually changed the translation to the opposite gender (e.g., "The female chef").
- The Investigation: They then looked at the original English sentence and asked: "Which specific words pushed the robot toward Version A instead of Version B?"
Think of it like a tug-of-war. The robot stands in the middle, and every word in the sentence pulls on the rope. The researchers measure which words pull hard enough to drag the robot's decision to one side or the other.
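If you like seeing the machinery, here is a minimal sketch of that tug-of-war in code. It uses a gradient-times-input attribution on an off-the-shelf Hugging Face MarianMT English-to-German model; the checkpoint name, the "Der"/"Die" article pair, and the attribution method are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of a contrastive explanation ("which source words pull the
# model toward gender A instead of gender B?"). Checkpoint, token pair, and
# gradient-x-input attribution are illustrative stand-ins.
import torch
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-en-de"  # assumed checkpoint
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name).eval()

enc = tokenizer("The chef is cooking.", return_tensors="pt")

# Embed the source ourselves so we can take gradients w.r.t. the embeddings.
# (Marian scales embeddings internally when given ids, so we scale here too.)
encoder = model.get_encoder()
src_embeds = (encoder.embed_tokens(enc.input_ids)
              * encoder.embed_scale).detach().requires_grad_(True)

# One decoder step from the start token: does the model open with "Der" or "Die"?
dec_ids = torch.tensor([[model.config.decoder_start_token_id]])
logits = model(inputs_embeds=src_embeds,
               attention_mask=enc.attention_mask,
               decoder_input_ids=dec_ids).logits[0, -1]

# First-subword ids of the two contrastive articles (surface forms assumed).
id_a = tokenizer(text_target="Der").input_ids[0]  # masculine
id_b = tokenizer(text_target="Die").input_ids[0]  # feminine

# The tug-of-war: how strongly the model prefers A over B, and which source
# words that preference flows back to.
(logits[id_a] - logits[id_b]).backward()
saliency = (src_embeds.grad * src_embeds).sum(-1).squeeze(0)

for tok, score in zip(tokenizer.convert_ids_to_tokens(enc.input_ids[0]),
                      saliency.tolist()):
    print(f"{tok:>10}  {score:+.4f}")  # + pulls toward "Der", - toward "Die"
```

A positive score means that word is dragging the robot toward the masculine translation; a negative one pulls toward the feminine. That per-word scoreboard is the "explanation."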
2. The Findings: Who is the Robot Listening To?
The researchers found some fascinating overlaps and differences between how the robot thinks and how humans think.
The Good News: They Agree on the "Big Hitters"
When the researchers looked at the top words that influenced the robot, they found a huge overlap (about 85%) with the words humans said influenced their gender guesses.
- Analogy: Imagine you and a friend are trying to guess who a mystery person is based on a description. You both point to the same three clues (e.g., "wearing a suit," "driving a truck," "saying 'sir'"). You are both looking at the same evidence!
- The Result: The robot isn't completely hallucinating; it really is paying attention to the same contextual clues humans use (a toy version of that overlap check is sketched below).
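To make the overlap idea concrete, here is a toy version of that comparison. The word lists and the top-3 cutoff are invented for illustration, not taken from the paper's data.

```python
# Toy overlap check: the model's top salient words vs. the words human
# annotators flagged as gender cues. All lists here are made up.
def cue_overlap(model_top: list[str], human_cues: list[str]) -> float:
    """Fraction of the model's top cues that humans also flagged."""
    return len(set(model_top) & set(human_cues)) / len(set(model_top))

model_top = ["suit", "truck", "sir"]  # robot's top-3 clues
human_cues = ["suit", "sir"]          # clues the humans pointed to
print(f"{cue_overlap(model_top, human_cues):.0%}")  # 67% in this toy case
```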
The Bad News: They Look at the "Fine Print" Differently
While they agree on the big clues, they disagree on what kind of clues matter most.
- The Robot's Focus: The robot is obsessed with Nouns and Verbs, the main actors and the actions. If the sentence says "The engineer built a bridge," it locks onto "engineer" and "built" and decides the gender from those.
- The Human Focus: Humans are more holistic. We also weigh Proper Names, Adjectives, and even whole phrases.
- The Distance Problem:
- The Robot: It mostly cares about words sitting right next to the mystery person. If the clue is a few words away, its influence fades fast.
- Humans: We scan the whole sentence and can pick up a clue that is far away or buried in a complex phrase. We are like detectives who read the whole report, not just the first line. (The toy sketch below makes this contrast concrete.)
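Here is a toy version of that distance analysis: how far, on average, do the highlighted words sit from the ambiguous noun? The positions and scores are invented for illustration.

```python
# Toy distance measure: saliency-weighted average distance of the highlighted
# words from the ambiguous noun. All positions and scores are made up.
def mean_cue_distance(scores: dict[int, float], noun_pos: int) -> float:
    """Average |position - noun_pos|, weighted by importance score."""
    total = sum(abs(s) for s in scores.values())
    return sum(abs(s) * abs(pos - noun_pos)
               for pos, s in scores.items()) / total

# token position -> importance score (position 1 = "chef", the mystery noun)
robot_scores = {0: 0.1, 2: 0.9}          # mass concentrated next to the noun
human_scores = {0: 0.2, 2: 0.4, 7: 0.4}  # humans also credit a distant cue

print(mean_cue_distance(robot_scores, 1))  # -> 1.0: the robot stays local
print(mean_cue_distance(human_scores, 1))  # -> 3.0: humans range farther
```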
3. Why Does This Matter?
The authors argue that we can't just say, "The robot is biased, let's fix it." We need to know why it's biased.
- The "Black Box" Problem: Usually, AI is a "black box"—we see the input and the output, but we don't know what happened inside.
- The Solution: By using these "contrastive explanations" (the tug-of-war game), they opened the box. They showed that the robot's bias comes from specific words in the sentence that trigger a stereotypical response.
The Big Takeaway
This study is like a mirror. It shows us that our translation robots are learning from us (the data they were trained on). They see the same gender clues we do, but they process them in a more rigid, mechanical way.
In simple terms:
The robot isn't a magic oracle; it's a student who learned from a biased textbook. This paper helps us see exactly which sentences in that textbook made the robot think "Doctor = Man" and "Nurse = Woman." Once we know exactly which words trigger those thoughts, we can rewrite the textbook (the training data) or teach the robot to look at the whole sentence, not just the immediate neighbors, to make fairer choices.
The Goal: To move from just measuring the bias to understanding its origins, so we can finally fix it.