From Consensus to Split Decisions: ABC-Stratified Sentiment in Holocaust Oral Histories

This paper evaluates the performance of off-the-shelf sentiment classifiers on a large corpus of Holocaust oral histories, introducing an agreement-based stability taxonomy (ABC) to stratify inter-model divergence and demonstrating that low-to-moderate agreement is primarily driven by boundary decisions around neutrality.

Daban Q. Jaff

Published 2026-04-01

Imagine you are trying to understand the emotional tone of a thousand different people telling their life stories about surviving a terrible war. You decide to use three different "emotion-detecting robots" (AI models) to read these stories and tell you if the person is feeling Happy, Sad, or Neutral.

However, there's a catch: these robots were trained on very different things. One learned from Twitter posts, another from product reviews on Amazon, and the third from general news articles. None of them were ever taught how to understand the heavy, complex, and often indirect language of Holocaust survivors.

This paper is basically a report card on how well these robots get along when they try to analyze these difficult stories. Here is the breakdown using simple analogies:

1. The Problem: The "Translator" Mismatch

Think of the Holocaust oral histories as a very complex, ancient dialect. The AI robots are like translators who only speak modern slang, business jargon, or casual texting.

  • The Result: When you ask them to translate the same sentence, they often disagree. One might say, "This is sad," while another says, "This is just a fact," and a third says, "This is angry."
  • The Finding: The paper found that these robots disagree a lot. They don't just disagree on the "sad" parts; they mostly fight over what counts as "neutral" (neither happy nor sad).

2. The Experiment: The "Three Judges"

The author didn't try to pick the "best" robot. Instead, they treated the three robots like three judges on a reality TV show.

  • The Setup: They fed the robots 107,000 chunks of speech (utterances) and nearly 600,000 sentences; a rough code sketch of this step follows the list.
  • The Goal: To see where the judges agree and where they scream at each other.
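
To make that concrete, here is a rough Python sketch of the setup: three off-the-shelf sentiment classifiers, each labeling the same utterances on its own. The checkpoint names are stand-ins chosen for illustration; this summary does not name the paper's exact models, only that they were trained on very different kinds of text.

```python
# Rough sketch: run three independent, off-the-shelf sentiment classifiers
# over the same utterances and keep each model's label.
# The checkpoints below are illustrative stand-ins, not the paper's models.
from transformers import pipeline

CHECKPOINTS = {
    "model_a": "cardiffnlp/twitter-roberta-base-sentiment-latest",   # 3-class (has Neutral)
    "model_b": "distilbert-base-uncased-finetuned-sst-2-english",    # 2-class (no Neutral)
    "model_c": "nlptown/bert-base-multilingual-uncased-sentiment",   # 1-5 star scale
}

def classify_with_all(utterances):
    """Return {model_name: [label for each utterance]}."""
    results = {}
    for name, checkpoint in CHECKPOINTS.items():
        clf = pipeline("sentiment-analysis", model=checkpoint)
        results[name] = [out["label"] for out in clf(utterances, truncation=True)]
    return results

# Toy usage with made-up sentences, just to show the shape of the output.
print(classify_with_all([
    "We walked for days without food.",
    "After the war I found my sister again.",
]))
```

Note that each stand-in model uses its own label scheme (three classes, two classes, star ratings), so in practice the outputs have to be mapped onto a shared Positive/Negative/Neutral vocabulary before they can be compared.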

3. The Solution: The "ABC" Sorting System

Since the robots disagree so much, the author created a new way to sort the results, called the ABC Taxonomy. Think of this as a traffic light system for trust (a minimal code sketch follows the list):

  • Category A (Green Light - Full Agreement): All three robots agree on the same feeling (e.g., all say "Sad").
    • Note: Because one robot can't say "Neutral," full agreement is only possible on a strong emotion (Positive or Negative). This is the "safe zone" where you can trust the result.
  • Category B (Yellow Light - Partial Agreement): Two robots agree, but the third disagrees.
    • Example: Two say "Sad," one says "Neutral." This is the "maybe zone." It happens most often (about 68% of the time).
  • Category C (Red Light - Total Chaos): All three robots give three different answers (one says Happy, one Sad, one Neutral).
    • Meaning: This is the "confusion zone." It tells us the story is so complex that the robots are completely lost.
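
The sorting rule itself fits in a few lines. Below is a minimal sketch, assuming each model's output has already been mapped onto "POS", "NEG", or "NEU"; the label names and the helper are illustrative, not the paper's code.

```python
def abc_category(label_1, label_2, label_3):
    """Map one utterance's three model labels to category "A", "B", or "C"."""
    distinct = {label_1, label_2, label_3}
    if len(distinct) == 1:
        return "A"  # green light: full agreement
    if len(distinct) == 2:
        return "B"  # yellow light: two agree, one dissents
    return "C"      # red light: three different answers

# Because one model never outputs "NEU", an all-"NEU" agreement cannot occur,
# so Category A only ever contains Positive or Negative cases.
print(abc_category("NEG", "NEG", "NEG"))  # A
print(abc_category("NEG", "NEG", "NEU"))  # B
print(abc_category("POS", "NEG", "NEU"))  # C
```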

4. The Big Discovery: The "Neutral" Trap

The paper found that the robots mostly fight over the Neutral category.

  • The Analogy: Imagine a survivor describing a horrific event in a very calm voice.
    • Robot 1 (trained on Twitter) thinks: "They are using calm words, so they must be Neutral."
    • Robot 2 (trained on reviews) thinks: "The context is terrible, so this must be Negative."
    • Robot 3 thinks: "It's a mix."
  • The Conclusion: The disagreement isn't usually about whether the story is "Good" or "Bad." It's about whether the story is "Too complex to label." The robots struggle to distinguish between "calmly describing a tragedy" and "being emotionally neutral."

5. The "Emotion Check" (The T5 Probe)

To double-check their work, the author used a fourth tool (a T5-based emotion classifier) to look at the "mood" of the stories in the different categories; a small tallying sketch follows the list below.

  • Category A (Agreement): When the robots all agreed a passage was Negative, the emotion tool confirmed it was full of anger. When they all agreed it was Positive, it was full of joy.
  • Category B & C (Disagreement): When the robots were fighting, the emotion tool showed a messy mix of feelings (anger mixed with sadness, or joy mixed with fear). This proves that the robots were fighting because the human stories were actually emotionally messy and complex.

Why Does This Matter?

This paper teaches us a valuable lesson about using AI on sensitive history:

  1. Don't trust a single AI blindly: If you use just one robot to analyze Holocaust stories, you might get the wrong answer.
  2. Disagreement is data: The fact that the robots disagree tells us something important: the stories are too complex for simple "Happy/Sad" labels.
  3. A New Strategy: Instead of trying to force an AI to be perfect, we should use a "safety net" approach. Only trust the AI when all three agree (Category A). For the rest, we should flag them as "needs human attention" because the AI is confused (see the sketch below).
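
As a sketch, that routing is only a few lines of Python; the record fields are illustrative, and the only real logic is that Category A goes through while everything else goes to a person.

```python
def route(records):
    """Split records into (trusted, needs_review) by ABC category."""
    trusted, needs_review = [], []
    for record in records:  # each record: {"text": ..., "abc": "A"/"B"/"C"}
        (trusted if record["abc"] == "A" else needs_review).append(record)
    return trusted, needs_review

trusted, needs_review = route([
    {"text": "toy passage one", "abc": "A"},
    {"text": "toy passage two", "abc": "C"},
])
print(len(trusted), "trusted;", len(needs_review), "flagged for human attention")
```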

In short: The paper shows that when AI tries to read the painful, complex stories of Holocaust survivors, it gets confused and argues with itself. The author's solution is to build a system that highlights where the AI is confused, so researchers know exactly where to be careful and where they can trust the machine.