On the Concept of Violence: A Comparative Study of Human and AI Judgments

This study systematically compares human judgments with Large Language Model classifications across 22 morally ambiguous scenarios to investigate how AI systems operationalize the concept of violence, and what that reveals about their growing role in shaping public reasoning about harm and social norms.

Original authors: Mariachiara Stellato, Francesco Lancia, Chiara Galeazzi, Nico Curti

Published 2026-02-20

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Idea: Who Defines "Violence"?

Imagine you are at a dinner party, and someone asks, "Is that action violent?"

  • Human Answer: "Well, it depends. If they did it while smiling, maybe not. If they did it in anger, yes. If they were interrupted before they could finish, maybe it doesn't count." Humans are messy, nuanced, and love to say, "It depends."
  • AI Answer: "I have processed the data. The answer is Violence." or "The answer is Not Violence." AI loves to pick a side and stick to it.

This study asked: When we ask computers to judge what is violent, do they think like us, or do they have a completely different moral compass?

How They Did It: The Radio Game Show

The researchers didn't start in a sterile lab. They started on an Italian radio show called Chiacchiericcio (which means "chatter").

  1. The Human Test: The host read 22 tricky scenarios to the listeners (like "A comedian insults an audience member" or "Protesters block a road"). Over 3,000 people voted on whether these were "Violence," "Not Violence," or "It Depends."
  2. The Robot Test: The researchers took those same 22 scenarios and asked 18 different AI chatbots (like Llama, Mistral, and others) to vote on them. They forced the AIs to pick just one answer, just like the humans. (A sketch of what that forced-choice setup might look like follows below.)
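
To make the setup concrete, here is a minimal Python sketch of how such a forced-choice classification could be wired up. This is not the authors' code: the model names, the prompt wording, and the `query_model` stub are all illustrative assumptions standing in for a real LLM backend.

```python
# Minimal sketch of the forced-choice setup, NOT the authors' actual code.
# Model names, prompt wording, and `query_model` are illustrative stand-ins.

SCENARIOS = [
    "A comedian insults an audience member during a show.",
    "Protesters block a road during a demonstration.",
    # ...the study used 22 such scenarios in total
]

MODELS = ["llama-3-8b", "mistral-7b"]  # hypothetical identifiers

PROMPT = (
    "Classify the following scenario. Reply with exactly one label: "
    "Violence, Not Violence, or It Depends.\n\nScenario: {scenario}"
)

def query_model(model: str, prompt: str) -> str:
    # Placeholder: swap in a real call to your LLM backend here.
    # A canned reply keeps the sketch runnable end to end.
    return "Violence"

def parse_label(reply: str) -> str | None:
    # Check longer labels first: "violence" is a substring of
    # "not violence", so the matching order matters.
    for label in ("Not Violence", "It Depends", "Violence"):
        if label.lower() in reply.lower():
            return label
    return None  # model refused or rambled; re-ask or discard

votes = {
    (model, scenario): parse_label(query_model(model, PROMPT.format(scenario=scenario)))
    for model in MODELS
    for scenario in SCENARIOS
}
print(votes)
```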

What They Found: The "Robot vs. Human" Clash

The results were fascinating because the robots and humans agreed on the "easy" stuff but fought over the "gray areas."

1. The "It Depends" Problem

Humans love the middle ground. When things were ambiguous, humans often voted "It Depends."

  • The AI Twist: The AIs hated the middle ground. They rarely chose "It Depends." Instead, they squeezed those ambiguous situations into either "Violence" or "Not Violence."
  • Analogy: Imagine a human looking at a gray sky and saying, "It might rain." The AI looks at the same sky and forces a binary choice: "It is raining" OR "It is sunny." The AI is less comfortable with uncertainty. (One simple way to measure that gap is sketched right after this list.)
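
To see what that "middle ground" gap looks like in numbers, here is a tiny sketch comparing how often each group of judges picks "It Depends" for a single scenario. The vote counts below are made-up placeholders for illustration only; the real tallies are in the paper.

```python
from collections import Counter

# Hypothetical vote tallies for one ambiguous scenario (NOT real data).
# The study had ~3,000 human voters and 18 AI models per scenario.
human_votes = ["Violence"] * 1200 + ["Not Violence"] * 900 + ["It Depends"] * 900
ai_votes = ["Violence"] * 10 + ["Not Violence"] * 7 + ["It Depends"] * 1

def depends_rate(votes: list[str]) -> float:
    """Fraction of judges who chose the middle option."""
    counts = Counter(votes)
    return counts["It Depends"] / len(votes)

print(f"Humans picked 'It Depends' {depends_rate(human_votes):.0%} of the time")
print(f"AIs picked 'It Depends' {depends_rate(ai_votes):.0%} of the time")
```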

2. The "Keyboard Warrior" Gap (Online Insults)

  • The Scenario: Someone sends a mean message or insults a group on social media.
  • Human View: 90% of humans said, "Yes, that is violence." They see the emotional harm and the psychological attack.
  • AI View: Only 50% of the AIs called it violence. Many AIs said, "That's just words. No one got hit with a fist."
  • The Takeaway: Humans understand that words can hurt just as much as fists. The AIs, however, seem to have a "physical bias." They are trained to recognize punches and kicks, but they struggle to recognize the violence of a nasty comment.

3. The "Interrupted Villain" Paradox

  • The Scenario: A speaker is about to say something terrible (like "We should eliminate a group of people"), but the host cuts them off before they can finish.
  • Human View: Humans said, "No, that's not violence. They didn't actually say it! They were stopped."
  • AI View: The AIs said, "Yes, that is violence."
  • The Takeaway: The AIs focused on the intent and the words that would have been said. Humans focused on the outcome (nothing bad actually happened because they were stopped). The AI is judging the idea; the human is judging the action.

Why Do the Robots Think This Way?

The study found that bigger, smarter AI models didn't necessarily act more like humans. A model with 10 billion parameters wasn't necessarily better at understanding "violence" than one with 1 billion.

Instead, it comes down to how they were trained.

  • The "Safety Filter" Effect: AI companies train their bots to be "safe" and "neutral." This often means they are taught to avoid controversial topics.
  • The Result: When an AI is unsure, it doesn't say, "I'm confused." It picks the "safest" answer its training nudges it toward. This makes it look rigid and sometimes wrong compared to the fluid, context-aware human mind.

The Big Warning: Don't Trust the Robot Judge

The authors end with a crucial warning for all of us.

The Trap: Because AI speaks so confidently and fluently, we tend to trust it like a teacher or a judge. We think, "The computer said it's not violence, so it must be true."

The Reality: The AI isn't a moral expert. It's a statistical guesser. It's like a very well-read parrot that has memorized millions of books but doesn't actually understand the pain of a human heart.

  • Search Engines (The Old Way): If you Google a question, you get 10 different links. You have to read them, compare them, and decide for yourself. You see the disagreement.
  • Chatbots (The New Way): The AI gives you one perfect answer. It hides all the disagreement and uncertainty. It makes you feel like there is only one "correct" truth.

The Bottom Line

This study shows that while AI is getting better at mimicking humans, it still misses the nuance of human morality.

  • Humans see violence in words, in silence, and in context.
  • AI sees violence mostly in physical actions and struggles with the "gray areas."

The Lesson: When you ask an AI, "Is this violent?", don't treat its answer as the final verdict. Treat it as a second opinion from a very literal, slightly confused robot. The real judgment still belongs to us.
