The Core Problem: The "Bad Word" Trap

Imagine you are a security guard at a club. Your job is to stop people from being rude or harmful. Currently, most automated security guards (AI toxicity detectors) work like a metal detector at an airport.

If the metal detector beeps, it assumes there is a weapon. It doesn't care why the metal is there.

If you are holding a knife to cut a steak, it beeps.
If you are holding a knife to threaten someone, it beeps.
If you are holding a toy knife from a Halloween costume, it beeps.

Most current AI models act much like this metal detector. They scan a sentence, find "bad words" (like slurs or insults), and flag it as toxic. They treat the words themselves as the danger, regardless of who is saying them, who is listening, or what is happening around them.

The paper argues this is a flawed way to measure harm. Just because a sentence contains a "bad word" doesn't mean it's actually hurting anyone in that specific moment.

The Real Solution: The "Contextual Stress" Framework

The authors propose a new way to think about toxicity, called the Contextual Stress Framework (CSF).

Instead of asking, "Does this sentence contain bad words?" they ask: "Does this specific message, to this specific person, in this specific situation, cause stress and break the rules of the room?"

Think of it like a human bouncer who knows the context:

Scenario A: Two friends are joking around. One says a word that is usually a slur, but they are using it as a term of endearment between them. The human bouncer sees they are laughing and knows the friendship. Verdict: No harm.
Scenario B: A stranger says that same word to one of the friends during a public argument. The human bouncer sees the fear in that person's eyes. Verdict: Harmful.

The paper claims that toxicity isn't a property of the words themselves; it's a relationship between the speaker, the listener, and the situation.

Why the Old Way Fails (The "False Alarms" and "Missed Dangers")

Because current AI is like the metal detector, it makes two big mistakes:

False Positives (Catching the Innocent): It bans harmless speech because it contains "bad words."
- Example: In some communities, people reclaim offensive words to show solidarity. If an AI sees that word, it bans the post, silencing a community that is actually having fun and bonding.
False Negatives (Missing the Real Danger): It misses harmful speech that doesn't use "bad words."
- Example: A person might say, "You're so quiet, you must not have anything smart to say," in a very polite tone. It sounds civil, but it is a cruel insult designed to shut someone down. The AI sees no "bad words" and lets it pass, while the victim feels hurt.

The New Test: Measuring "Stress" Instead of "Badness"

The authors suggest we stop trying to label a sentence as "Toxic" or "Not Toxic" with a single score. Instead, we should measure Stress and Norm Violation.

Norm Violation: Did the speaker break the social rules of this specific group?
Stress: Did the listener (or the group) react with anger, fear, or withdrawal?

They tested this idea by looking at a Reddit community called r/BlackPeopleTwitter. They compared what the AI flagged as toxic with how people in the community actually reacted.

The Result: The AI and the people often disagreed. The AI flagged friendly jokes as toxic, but the people laughed. The AI missed subtle, mean-spirited comments that the people found hurtful.
The Lesson: You cannot judge harm just by reading the text; you have to see how the people react to it.

The Proposal: A New Report Card (CSF-Eval)

The paper proposes a new way to test and build these AI systems, called CSF-Eval.

Instead of giving an AI a single grade (like "90% accurate"), we should ask it to break down its thinking into five parts, like a doctor's report:

Text Risk: Does the text look dangerous on its own?
Norm Violation: Does it break the rules of this specific group?
Stress/Disruption: Is there evidence that people are upset or arguing?
Uncertainty: "I don't have enough info to know if this is bad." (The AI should admit when it is guessing.)
Policy Action: "Based on the above, here is what we should do."

The Bottom Line

The paper concludes that we need to stop pretending that harm is hidden inside a sentence waiting to be found.

Harm is created when a message is received in a specific context. To build safer online spaces, we need AI that understands the difference between a joke among friends and a weapon in a fight, rather than just a machine that counts how many "bad words" are in a room.

Technical Summary: Toxicity Detection Should Measure Contextual Harm, Not Text-Intrinsic Badness

1. Problem Statement

Current toxicity detection systems rely on a flawed abstraction: they treat toxicity as an intrinsic property of isolated text strings ($y = f(x)$). This approach collapses the critical determinants of communicative harm (speaker, audience, interaction history, normative setting, and reception) into a single decontextualized prediction.

The paper identifies two core failures resulting from this abstraction:

The Object Problem: There is no settled definition of toxicity. Legal, platform, and academic communities use overlapping but non-equivalent notions (e.g., "hateful," "abusive," "uncivil"). Consequently, the same utterance can be legally protected, removable under platform policy, and labeled toxic in one dataset but non-toxic in another, which makes benchmark progress a misleading indicator of safety.
The Proxy Problem: By operationalizing toxicity as a text-to-label mapping, detectors fail to capture situated communicative harm. This leads to systematic errors: over-flagging dialectal or reclaimed language (false positives) and missing coded, pragmatic, or context-dependent abuse (false negatives). Furthermore, these systems are brittle under meaning-preserving transformations and adversarial attacks.

The authors argue that benchmark accuracy on decontextualized labels often reflects a model's ability to learn dataset-specific annotation conventions rather than its capacity to reduce harm in real-world, situated environments.

2. Methodology and Framework: Contextual Stress Framework (CSF)

To address these issues, the authors propose the Contextual Stress Framework (CSF), which reframes toxicity not as a property of text, but as a contextual relation.

Core Definitions

Toxicity: Defined as a relation between a communicative act, an interpreting audience, and a normative setting, where a perceived norm violation induces stress or disruption.
Toxic Speech: Speech that induces stress or disruption through a perceived violation of accepted moral or communicative norms within the specific context of interpretation.

Mathematical Formulation

The framework models a communicative event as $e = (x, C, A)$, where $x$ is the communicative act, $C$ is the context, and $A$ is the audience.

Perceived Norm Violation ($\nu$): The degree to which an audience member perceives the event as violating relevant norms. This is defined as perceived violation, not objective moral truth.
Stress Response ($\sigma$): The stress or disruption induced in the audience member.
Individual Toxicity ($\tau$): A function $g(\nu, \sigma)$ that combines perceived violation and stress. The function is monotone in both arguments and assigns near-zero toxicity if either component is absent.
Event-Level Toxicity ($T$): An aggregate of individual toxicities across the relevant audience, weighted by factors such as exposure, relevance, or vulnerability.
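
The definitions above leave $g$ and the aggregation rule open. As a minimal sketch, assume $g(\nu, \sigma) = \nu \cdot \sigma$ (one simple function that is monotone in both arguments and zero when either is absent) and a weighted mean for $T$. The class name, field names, scores, and weights below are illustrative assumptions, not the paper's specification.

```python
from dataclasses import dataclass


@dataclass
class AudienceResponse:
    """One audience member's reaction to an event (hypothetical structure)."""
    violation: float      # nu in [0, 1]: perceived norm violation
    stress: float         # sigma in [0, 1]: induced stress or disruption
    weight: float = 1.0   # exposure / relevance / vulnerability weight


def individual_toxicity(violation: float, stress: float) -> float:
    """tau = g(nu, sigma). The product is an ASSUMED choice of g:
    monotone in both arguments and zero when either one is zero."""
    return violation * stress


def event_toxicity(responses: list[AudienceResponse]) -> float:
    """T: weighted mean of individual toxicities (ASSUMED aggregator)."""
    total_weight = sum(r.weight for r in responses)
    if total_weight == 0:
        return 0.0
    weighted = sum(r.weight * individual_toxicity(r.violation, r.stress)
                   for r in responses)
    return weighted / total_weight


# Same words, two audiences (made-up scores for illustration only).
friends = [AudienceResponse(0.1, 0.0), AudienceResponse(0.0, 0.0)]
hostile = [AudienceResponse(0.9, 0.8, weight=2.0), AudienceResponse(0.5, 0.2)]

print(event_toxicity(friends))  # no stress in the audience, so T is 0.0
print(event_toxicity(hostile))  # dominated by the heavily weighted target
```

Other choices of $g$ (for example, a minimum or a smoothed product) would satisfy the same two constraints; the product is used here only because it is the shortest to write down.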

Measurement Strategy

The paper distinguishes between text-intrinsic risk (lexical cues) and reception-based disruption (observable stress). For online NLP systems, where physiological data is unavailable, the framework proposes using behavioral proxies for stress, such as reply escalation, withdrawal, tone shifts, or affective language in responses.

3. Key Contributions

A. Theoretical Reframing

The paper shifts the field's focus from text classification to contextual harm measurement. It argues that context is not merely an auxiliary feature to improve prediction accuracy but is constitutive of the target variable. Toxicity is an emergent property of the interaction between text, audience, and norms.

B. The Contextual Stress Framework (CSF)

CSF provides a formal structure to separate:

Text-intrinsic cues.
Contextual assumptions.
Audience characteristics.
Perceived norm violation.
Reception/stress signals.
Uncertainty.
Policy rules.

C. CSF-Eval: A New Evaluation Agenda

The authors propose CSF-Eval, an evaluation framework that moves beyond single-label accuracy. It requires systems to output a measurement vector $M(e) = (r_{text}, \hat{\nu}, \hat{\sigma}, u, \pi)$, representing:

$r_{text}$: Text-intrinsic risk.
$\hat{\nu}$: Estimated perceived norm violation.
$\hat{\sigma}$: Estimated stress/disruption.
$u$: Uncertainty under partial observability.
$\pi$: Policy recommendation (explicitly separated from measurement).
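
One way to picture the vector is as a small record plus a separate policy function, so that $\pi$ is derived from the measurements rather than mixed into them. The sketch below is an assumption about how this could be organized: the field names, thresholds, and action labels are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """The four measured components of M(e); pi is computed separately."""
    text_risk: float       # r_text
    norm_violation: float  # nu-hat
    stress: float          # sigma-hat
    uncertainty: float     # u


def recommend_policy(m: Measurement,
                     harm_threshold: float = 0.5,
                     max_uncertainty: float = 0.6) -> str:
    """pi: a policy recommendation derived from the measurement.
    Thresholds and action names are illustrative assumptions."""
    if m.uncertainty > max_uncertainty:
        return "escalate_to_human"   # too little context to act on
    if m.norm_violation * m.stress >= harm_threshold:
        return "remove"
    return "allow"


# Made-up measurements for three situations described in the text.
reclaimed = Measurement(text_risk=0.9, norm_violation=0.1, stress=0.05, uncertainty=0.2)
veiled = Measurement(text_risk=0.1, norm_violation=0.9, stress=0.8, uncertainty=0.3)
no_context = Measurement(text_risk=0.8, norm_violation=0.5, stress=0.5, uncertainty=0.9)

for name, m in [("reclaimed", reclaimed), ("veiled", veiled), ("no_context", no_context)]:
    print(name, recommend_policy(m))
```

In this sketch the policy deliberately ignores `text_risk`, which makes the over-flagging case (high text risk, low perceived violation) come out as "allow". A real deployment could weigh text risk differently; the point is only that the rule lives outside the measurement.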

CSF-Eval evaluates systems across five contrastive slices:

Same text, different context: Testing if the system recognizes that the same words function differently based on audience and setting.
Different form, same harm: Testing if the system detects coded or pragmatic abuse without relying on overt toxic markers.
Missing context: Testing if the system expresses uncertainty or abstains when context is incomplete, rather than forcing a confident label.
Reception and disruption signals: Testing if the system uses behavioral evidence (e.g., escalation) as noisy evidence of disruption.
Measurement-policy separation: Testing if the system distinguishes between estimating harm and enforcing policy.
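
Two of these slices lend themselves to simple scoring rules. The sketch below assumes a system is a function from `(text, context)` to a score in [0, 1], returning `None` to abstain; the metric names, the toy systems, and the context labels are hypothetical and only illustrate the idea behind the "same text, different context" and "missing context" slices.

```python
from typing import Callable, Optional

# ASSUMED interface: (text, context) -> score in [0, 1], or None to abstain.
System = Callable[[str, Optional[str]], Optional[float]]


def context_sensitivity(system: System, triples) -> float:
    """Slice 1: fraction of (text, benign_ctx, harmful_ctx) triples where
    the system scores the harmful context strictly higher than the benign one."""
    hits = 0
    for text, benign_ctx, harmful_ctx in triples:
        benign = system(text, benign_ctx)
        harmful = system(text, harmful_ctx)
        if benign is not None and harmful is not None and harmful > benign:
            hits += 1
    return hits / len(triples)


def abstention_rate(system: System, texts) -> float:
    """Slice 3: fraction of context-free inputs on which the system abstains."""
    return sum(system(t, None) is None for t in texts) / len(texts)


def text_only(text: str, context: Optional[str]) -> Optional[float]:
    """Toy baseline that ignores context entirely (keyword rule)."""
    return 1.0 if "idiot" in text.lower() else 0.0


def context_aware(text: str, context: Optional[str]) -> Optional[float]:
    """Toy system: abstains without context, discounts in-group banter."""
    if context is None:
        return None
    base = text_only(text, context)
    return base * (0.1 if context == "friends_banter" else 1.0)


triples = [("You idiot", "friends_banter", "stranger_argument")]
texts = ["You idiot", "Nice work"]

for name, system in [("text_only", text_only), ("context_aware", context_aware)]:
    print(name, context_sensitivity(system, triples), abstention_rate(system, texts))
```

The text-only baseline scores 0.0 on both metrics because it cannot change its output with context and never abstains, which is the failure pattern these slices are meant to expose.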

4. Empirical Results

The authors provide an illustrative probe using data from the r/BlackPeopleTwitter subreddit to demonstrate the divergence between text-intrinsic toxicity and reception-based disruption.

Methodology: They compared the OpenAI Moderation API and Google Perspective API (text-intrinsic detectors) against PONOS (Proportion of Negative Observed Signals), a metric measuring the proportion of replies expressing negative reactions.
Findings:
- Text-intrinsic scores correlated only weakly with PONOS ($\rho \approx 0.20$).
- By contrast, the two text-intrinsic APIs correlated strongly with each other ($\rho \approx 0.87$).
- Quadrant Analysis:
  - LH (Low PONOS, High Text Toxicity): 14.5% of posts were over-flagged. These often involved in-group solidarity, reclaimed language, or dialectal humor (e.g., "That's my n***a!").
  - HL (High PONOS, Low Text Toxicity): 14.4% of posts were missed. These involved sarcasm, pragmatic antagonism, or context-specific norm violations that lacked explicit slurs.
Conclusion: Text-intrinsic risk and reception-based disruption are distinct quantities. In this probe, the detectors' scores aligned poorly with observed community disruption, particularly in dialect-rich or reclaimed-language contexts.
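
The probe's two ingredients, a reply-based PONOS score and a quadrant label, can be sketched as follows. The summary does not say how replies were judged negative or where the high/low cutoffs were set, so the keyword-based `is_negative` stub, the marker list, and the 0.5 cutoffs below are placeholder assumptions.

```python
# HYPOTHETICAL markers; the real study's negative-reaction judgment is not given here.
NEGATIVE_MARKERS = ("not funny", "rude", "stop", "hurtful")


def is_negative(reply: str) -> bool:
    """Placeholder for a negative-reaction classifier (keyword stub)."""
    text = reply.lower()
    return any(marker in text for marker in NEGATIVE_MARKERS)


def ponos(replies: list[str]) -> float:
    """Proportion of Negative Observed Signals among a post's replies."""
    if not replies:
        return 0.0
    return sum(is_negative(r) for r in replies) / len(replies)


def quadrant(ponos_score: float, text_score: float,
             ponos_cut: float = 0.5, text_cut: float = 0.5) -> str:
    """Two-letter label: first letter is PONOS, second is text toxicity.
    The 0.5 cutoffs are ASSUMED, not taken from the paper."""
    first = "H" if ponos_score >= ponos_cut else "L"
    second = "H" if text_score >= text_cut else "L"
    return first + second


# Made-up examples: a post the detector over-flags, and one it misses.
banter_replies = ["lol same", "haha love this", "that was rude"]
veiled_replies = ["That was rude", "Stop, this is hurtful", "ok"]

print(quadrant(ponos(banter_replies), text_score=0.9))  # LH: over-flagged
print(quadrant(ponos(veiled_replies), text_score=0.1))  # HL: missed
```

Here `text_score` stands in for whatever a text-intrinsic detector returns for the post; no real API call is made.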

5. Significance and Claims

The paper claims that toxicity detection must evolve from predicting dataset labels to measuring situated communicative harm. Its significance lies in:

Correcting the Measurement Target: It argues that safety-critical systems cannot treat isolated text as sufficient evidence of harm. By separating text risk from reception, CSF explains why current models over-flag dialects and miss pragmatic abuse.
Operationalizing Uncertainty: It proposes that "missing context" should be treated as a failure condition, requiring systems to express uncertainty or abstain rather than generating overconfident, potentially harmful labels.
Decoupling Measurement and Enforcement: It advocates for separating the estimation of harm (measurement) from the decision to remove or down-rank content (policy), allowing for more transparent and accountable moderation.
Benchmark Reform: It calls for the community to adopt CSF-Eval standards, requiring benchmarks to report slice-level performance (e.g., context shifts, missing data) rather than aggregate accuracy, and to explicitly document whose perspective and which contextual signals are represented.

The authors maintain a modest stance, acknowledging that toxicity cannot be measured perfectly and that full context is often unavailable in real-time deployment. However, they argue that acknowledging partial observability and modeling uncertainty is a necessary step toward safer, more robust moderation systems.
