AILS-NTUA at SemEval-2026 Task 10: Agentic LLMs for Psycholinguistic Marker Extraction and Conspiracy Endorsement Detection

This paper introduces AILS-NTUA's agentic LLM pipeline for SemEval-2026 Task 10. The pipeline uses a decoupled design: Dynamic Discriminative Chain-of-Thought for psycholinguistic marker extraction, and an "Anti-Echo Chamber" architecture for conspiracy endorsement detection. Together, these deliver significant improvements over baselines and establish a paradigm for interpretable, psycholinguistically grounded NLP.

Panagiotis Alexios Spanakis, Maria Lymperaiou, Giorgos Filandrianos, Athanasios Voulodimos, Giorgos Stamou

Published 2026-03-06

Imagine you are a detective trying to solve a mystery in a crowded, noisy town square (the internet). People are shouting theories: some are telling the truth, some are joking, and some genuinely believe in secret plots to control the world.

Your job is two-fold:

  1. Find the clues: Identify the specific phrases that prove someone is talking about a conspiracy (like "secret group," "hidden plan," or "they are poisoning us").
  2. Judge the intent: Decide whether the person shouting believes the conspiracy or is just reporting on it (like a news anchor saying, "Some people claim the earth is flat").

This paper describes how a team of researchers from the National Technical University of Athens built a system of AI detectives to do exactly this. They call it an "Agentic LLM Pipeline."

Here is how they did it, explained with simple analogies:

1. The Problem: The "Reporter Trap"

Imagine a news reporter says, "The article claims that aliens built the pyramids."
A simple AI might get confused. It sees the words "aliens" and "pyramids" and thinks, "Aha! Conspiracy!" and marks it as a believer.
But the reporter isn't a believer; they are just repeating what someone else said. This is called the Reporter Trap. Most AI models fall into this trap because they focus on what words are used, not how they are used.

2. The Solution: A Team of Specialists

Instead of using one giant brain to do everything, the authors built a workflow with different AI agents, each playing a specific role. Think of it like a high-end law firm or a courtroom.

Part A: The Clue Hunters (Subtask 1)

Goal: Find the specific conspiracy phrases.

  • The Detective (DD-CoT): This AI doesn't just guess. It uses a technique called Dynamic Discriminative Chain-of-Thought. Imagine a detective who, before writing a report, has to argue against their own conclusion.
    • Example: "I think 'The Media' is the villain here. But wait, could 'The Media' be the victim? No, because the sentence says they 'manipulated' people. Okay, I'm sure they are the villain."
    • By forcing the AI to argue both sides, it stops making mistakes about who is doing what.
  • The Notary (Deterministic Verifier): Large Language Models are great at thinking but terrible at counting. They might say, "The clue starts at word 5," but actually, it starts at word 6. This Notary is a strict, boring robot that checks the text character-by-character to ensure the AI didn't lie about where the clue starts and ends.
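The Notary's job can be sketched in a few lines. This is a minimal illustration (not the authors' actual code) of a deterministic verifier: the LLM proposes a marker span with character offsets, and the verifier accepts them only if they reproduce the span exactly, repairing them against the raw text otherwise.

```python
# Hypothetical sketch of a deterministic span verifier. The function names
# and repair strategy are illustrative assumptions, not the paper's code.

def verify_span(text: str, span_text: str, start: int, end: int):
    """Return verified (start, end) for span_text, or None if unrecoverable."""
    # Accept the LLM's offsets only if they reproduce the span exactly.
    if text[start:end] == span_text:
        return start, end
    # Otherwise, search the raw text for the claimed span and repair offsets.
    fixed = text.find(span_text)
    if fixed != -1:
        return fixed, fixed + len(span_text)
    return None  # span not present verbatim: reject the prediction

text = "They say the media manipulated the public."
# The LLM claimed the wrong offsets; the Notary repairs them.
print(verify_span(text, "manipulated", 13, 24))  # → (19, 30)
```

Because string matching is exact and deterministic, this step can never hallucinate: a span either exists in the text at a verifiable position, or it is rejected.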

Part B: The Courtroom (Subtask 2)

Goal: Decide if the person believes the conspiracy.

To avoid the "Reporter Trap," they built an "Anti-Echo Chamber." Instead of one AI giving an opinion, they set up a Parallel Council of four distinct personalities, each of whom judges the case independently, so no juror can be swayed by another's answer:

  1. The Prosecutor: Always looks for evidence of a conspiracy. "They used the word 'cabal'! That's a conspiracy!"
  2. The Defense Attorney: Always looks for reasons not to convict. "Wait, they used the word 'claims' and 'reportedly.' They are just reporting news, not believing it."
  3. The Literalist: Only looks at the exact words. "If the text doesn't explicitly say 'I believe,' then we can't convict."
  4. The Profiler: Looks at the "vibe." "They are using all-caps and shouting. That sounds like a true believer."
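The council can be pictured as four separate, isolated calls to the same model, each with a different persona prompt and no shared context. This is a hypothetical sketch: `call_llm` is a placeholder for whatever chat API the system actually uses, and the persona wordings are invented for illustration.

```python
# Illustrative "Anti-Echo Chamber" council: four independent persona calls.
# PERSONAS text and call_llm are assumptions, not the paper's prompts or API.

PERSONAS = {
    "prosecutor": "Argue that the author ENDORSES the conspiracy.",
    "defense":    "Argue that the author is merely REPORTING or mocking it.",
    "literalist": "Judge only explicit, literal statements of belief.",
    "profiler":   "Judge tone and style: all-caps, urgency, us-vs-them.",
}

def call_llm(system_prompt: str, post: str) -> dict:
    # Placeholder: a real system would query an LLM here. We return a
    # canned verdict so the sketch runs end to end.
    return {"vote": "endorse", "reason": "stub"}

def run_council(post: str) -> dict:
    # Each juror sees only its own persona prompt and the post -- never
    # another juror's output. That isolation is the anti-echo-chamber idea.
    return {name: call_llm(prompt, post) for name, prompt in PERSONAS.items()}

verdicts = run_council("THEY are hiding the truth about the cabal!!")
print(sorted(verdicts))  # four independent verdicts, one per persona
```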

The Calibrated Judge:
After the four jurors vote, a Judge steps in. The Judge doesn't just count the votes (2 vs. 2). The Judge looks at the reasons the jurors gave.

  • If the Prosecutor says "They said 'cabal'" but the Defense says "But they said 'according to the article'," the Judge knows the Defense is right.
  • The Judge is programmed to be conservative. If there is any doubt, they rule "Not Guilty" (Not a conspiracy) to avoid falsely accusing news reporters.
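The Judge's two rules above (reported-speech evidence beats keyword matches; doubt defaults to "not guilty") can be sketched as simple aggregation logic. The cue list and thresholds below are illustrative assumptions, not the paper's actual calibration.

```python
# Simplified sketch of the calibrated judge. It weighs the jurors' *reasons*
# (here approximated by reported-speech cues in the post), not just votes,
# and breaks ties conservatively. Cue words are invented for illustration.

REPORTING_CUES = ("claims", "reportedly", "according to", "the article says")

def judge(verdicts: dict, post: str) -> str:
    endorse = sum(v["vote"] == "endorse" for v in verdicts.values())
    # Rule 1: explicit reported-speech evidence overrides keyword accusations.
    if any(cue in post.lower() for cue in REPORTING_CUES):
        return "not_conspiracy"
    # Rule 2: conservative tie-break -- convict only on a clear majority.
    return "conspiracy" if endorse > len(verdicts) / 2 else "not_conspiracy"

votes = {
    "prosecutor": {"vote": "endorse"}, "profiler":   {"vote": "endorse"},
    "defense":    {"vote": "reject"},  "literalist": {"vote": "reject"},
}
print(judge(votes, "According to the article, a cabal runs the banks."))
```

Here a 2-vs-2 split plus a reporting cue ("according to") yields "not_conspiracy", exactly the Reporter-Trap case the design targets.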

3. The "Hard Negative" Training

To teach the AI the difference between a believer and a reporter, the team used a special training trick called Contrastive Retrieval.

  • They didn't just show the AI examples of conspiracies.
  • They showed it Hard Negatives: Examples that looked exactly like conspiracies (using the same scary words) but were actually just news reports or jokes.
  • It's like training a dog to find a specific scent, but then giving it a bottle of perfume that smells exactly the same but isn't the target. The dog learns to ignore the smell and look for the context.
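The retrieval trick above can be sketched as: for each incoming post, pull the most similar training example from both labels, so the prompt always contains a look-alike "hard negative." The paper presumably uses embedding similarity; plain word overlap is used here only to keep the sketch dependency-free, and all data is invented.

```python
# Illustrative contrastive retrieval for few-shot prompting. Jaccard word
# overlap stands in for whatever similarity the real system uses.

def similarity(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))  # Jaccard overlap

def contrastive_examples(query: str, train: list[tuple[str, str]]) -> dict:
    # Best match per label: one look-alike believer, one look-alike reporter.
    best = {}
    for text, label in train:
        score = similarity(query, text)
        if label not in best or score > best[label][0]:
            best[label] = (score, text)
    return {label: text for label, (_, text) in best.items()}

train = [
    ("A secret cabal is poisoning us all", "conspiracy"),
    ("The article claims a secret cabal is poisoning us", "reporting"),
    ("Nice weather today", "reporting"),
]
print(contrastive_examples(
    "They say a secret cabal is poisoning the water", train))
```

Because both retrieved examples share the query's scary vocabulary but carry opposite labels, the model is forced to attend to framing ("the article claims...") rather than keywords.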

The Results

This system was a huge success:

  • For finding clues: It doubled the accuracy compared to a standard AI.
  • For judging intent: It improved accuracy by nearly 50%.
  • Ranking: It came in 3rd place in the world for finding clues and 10th for judging intent, beating many systems that used much larger, more expensive computers.

The Takeaway

The paper proves that you don't need a "super-brain" AI to solve complex problems. Instead, you need good organization. By breaking the job down into small, specialized roles (Detective, Notary, Prosecutor, Defense, Judge) and forcing them to argue with each other, you get a much smarter result than a single AI working alone.

It's the difference between asking one person to write a legal brief and having a whole law firm debate the case before handing it to a judge. The result is fairer, more accurate, and much harder to trick.