Imagine you have a very smart, but slightly unpredictable, student named Transformer. This student is brilliant at reading and understanding text, but here's the catch: every time you teach them a new lesson, you shuffle the order of the flashcards, change the lighting in the room, or tweak their mood slightly (this is what the paper calls "training randomness").
Even though the student learns the same material and gets the same test score, if you ask them, "Why did you choose this answer?" they might give you a completely different reason depending on how the lesson was shuffled.
This paper is like a detective story investigating why this student's explanations change so much. The authors, Romain, Jérémie, and François-Xavier, wanted to know: Does the reason for the change depend on the sentence structure, the specific topic, or the type of test?
Here is the breakdown of their findings using simple analogies:
1. The Setup: The "200 Twins" Experiment
To study this, the researchers didn't just train one student. They trained 200 "twins" of the same AI model.
- They all learned from the exact same textbook (data).
- They all got the same grade (accuracy).
- But, each twin had a slightly different "random seed" (like a different personality quirk or a different order of studying).
Then, they asked all 200 twins to explain the same sentence. They measured how much the twins agreed with each other. If they all said, "I chose this because of the word 'John'," that's stable. If one said "John," another said "the verb," and a third said "the punctuation," that's unstable.
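The measurement above can be sketched in a toy script. Here each "twin" assigns an importance score to every word in the sentence, and we check how often the twins agree on the single most important word. The scores and the majority-vote agreement metric are illustrative assumptions for this sketch, not the paper's exact models or stability measure.

```python
def top_feature(attribution):
    """Index of the word with the highest attribution score."""
    return max(range(len(attribution)), key=lambda i: attribution[i])

def agreement(attributions):
    """Fraction of 'twins' whose top word matches the majority vote.

    `attributions` is a list of per-model score lists, one score per word,
    all computed over the same input sentence.
    """
    tops = [top_feature(scores) for scores in attributions]
    majority = max(set(tops), key=tops.count)
    return tops.count(majority) / len(tops)

# Hypothetical scores over "John loves James" from four seed-varied twins
# (positions: 0 = "John", 1 = "loves", 2 = "James").
stable = [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1], [0.7, 0.1, 0.3], [0.9, 0.3, 0.2]]
unstable = [[0.9, 0.1, 0.2], [0.1, 0.8, 0.2], [0.2, 0.1, 0.9], [0.1, 0.9, 0.3]]

print(agreement(stable))    # every twin points to "John" -> 1.0
print(agreement(unstable))  # twins disagree -> 0.5
```

With the stable scores all four twins pick "John" and agreement is 1.0; with the unstable scores they split across different words and agreement drops to 0.5.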
2. Factor #1: The Sentence Structure (The "Scrambled Puzzle")
The Question: Does shuffling the words in a sentence make the explanations wobbly?
The Analogy: Imagine a sentence made of Lego bricks.
- Normal Sentence: "John loves James."
- Shuffled Sentence: "loves James John" (or just the bricks in a random pile).
The Finding: When the words were in the right order, the 200 twins all agreed perfectly on why they chose an answer. But when the words were shuffled (even if the meaning was still obvious), the twins started to disagree slightly.
- Verdict: The structure matters, but only a little bit. The AI gets a bit confused by the "noise" of the shuffled words, but it's not a huge deal.
3. Factor #2: The Topic (The "Missing Clue")
The Question: Does the specific answer choice (class) change how stable the explanation is?
The Analogy: Imagine a detective game.
- Case A (Easy): The culprit is always "John." The detective just has to find the name "John" in the room. Easy! Everyone agrees.
- Case B (Hard): The culprit is "NOT John." The detective has to look at the whole room and realize, "Hmm, John isn't here, so it must be someone else."
The Finding: This is where things get messy.
- When the answer depended on a specific, obvious word (like "John"), the explanations were very stable.
- When the answer depended on the absence of a word (like "It's NOT John"), the 200 twins started giving very different reasons. Some pointed to the beginning of the sentence, some to the end.
- Verdict: The topic has a medium impact. If the AI has to explain why something isn't there, its reasoning becomes much less consistent.
4. Factor #3: The Task (The "Subject Matter")
The Question: Does the type of job the AI is doing change the stability?
The Analogy:
- Task A (Distinct topics): Sorting papers about "Stars" vs. "Math." The words are very different (e.g., "black hole" vs. "equation"). It's like sorting apples from oranges.
- Task B (Opinion vs. Fact): Sorting news articles into "Opinion" vs. "Fact." The words are very similar. You need to read the tone and the relationship between words to tell the difference. It's like sorting red apples from slightly darker red apples.
The Finding:
- The AI was very stable when sorting the easy, distinct topics (Stars vs. Math).
- The AI was very unstable when sorting the tricky, similar topics (Opinion vs. Fact).
- Verdict: The task has the biggest impact. The harder the job is to understand, the more the AI's "reasoning" changes depending on how it was trained.
The Big Picture Conclusion
The paper tells us that AI explanations are fragile. They aren't absolute truths; how stable they are depends on:
- How the text is written (a little bit).
- What the AI is looking for (a medium amount).
- How hard the job is (a huge amount).
Why does this matter?
If you are a doctor using an AI to diagnose a patient, or a judge using it to review a case, you can't just trust the AI's "reasoning" blindly. If the AI says, "I think this is a crime because of word X," you need to know: Is that a solid reason, or did the AI just get lucky with its training shuffle?
The authors suggest that in the future, we shouldn't just look at one explanation. We should look at the distribution of explanations (train many copies of the model and check whether they all point to the same evidence) to know if we can really trust it.