Imagine you are a doctor trying to decide the best treatment for a patient with a serious brain tumor. You have a super-smart AI assistant (a Large Language Model, or LLM) that has read millions of medical books. You ask it, "What should we do?"
The AI gives you an answer, but it's a bit of a mystery. It says, "Do Surgery," but when you ask why, it just mumbles a vague explanation that doesn't quite make sense. Worse, if you realize it's wrong, you can't easily fix its logic for future patients; you'd have to argue with it every single time.
This paper introduces a new framework called ArgEval to solve this problem. Think of it as upgrading the AI from a "black box" oracle into a transparent, editable rulebook.
Here is how it works, using simple analogies:
1. The Problem: The "Black Box" Chef
Current AI models are like a master chef who cooks amazing meals but refuses to show you the recipe. If the food tastes bad, you can't tell them to change the ingredients for next time; you can only complain about this specific meal. In medicine, this is dangerous. If the AI makes a mistake, we need to know why and fix the rule so it doesn't happen again.
2. The Solution: The "Master Blueprint" (ArgEval)
Instead of asking the AI to make a decision from scratch every time, ArgEval asks the AI to first build a Master Blueprint (called a "General Argumentation Framework") based on official medical guidelines.
- Step 1: Building the Map (The Ontology): Imagine the AI reads all the medical guidelines and draws a map of every possible treatment (Surgery, Radiation, Chemo, etc.). It organizes them like a family tree.
- Step 2: Writing the Rules (The Arguments): For each treatment on the map, the AI writes a "Pros and Cons" list.
- Pro: "Surgery is great for survival."
- Con: "But don't do surgery if the tumor is in a dangerous spot."
- Con: "And don't do surgery if the patient is very old."
- Crucially, the AI attaches conditions to these rules. It's like a traffic light: "If the patient is old AND the tumor is deep, the 'No Surgery' light turns red."
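The blueprint described above can be pictured as a small data structure: each argument carries a claim, an optional target it attacks, and a condition saying when it applies. This is a toy sketch under assumed names (`Argument`, `blueprint`, `site_risky`), not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Patient = Dict[str, object]  # e.g. {"age": 85, "site_risky": True}

@dataclass
class Argument:
    name: str
    claim: str                            # what the argument asserts
    attacks: Optional[str]                # which argument it argues against, if any
    condition: Callable[[Patient], bool]  # "traffic light": when does this rule fire?

# The "Pros and Cons" list for one treatment, each rule guarded by a condition.
blueprint = [
    Argument("pro_surgery", "Surgery improves survival",
             attacks=None, condition=lambda p: True),
    Argument("con_site", "Avoid surgery if the tumor is in a dangerous spot",
             attacks="pro_surgery",
             condition=lambda p: p.get("site_risky", False)),
    Argument("con_age", "Avoid surgery if the patient is very old",
             attacks="pro_surgery",
             condition=lambda p: p["age"] > 70),
]
```

The key design point is that the conditions live in the blueprint itself, not in a per-patient prompt, so they can be inspected and edited directly.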
3. The Magic: Instantiating the Blueprint
When a real patient walks in (let's call him Mr. Smith, 85 years old), the system doesn't ask the AI to guess. Instead, it takes the Master Blueprint and filters it based on Mr. Smith's details.
- The system looks at the "Surgery" blueprint.
- It sees the rule: "If Age > 70, Surgery is risky."
- It sees Mr. Smith is 85.
- Result: The "Surgery" option gets a huge red "X" (or a very low score). The "Radiation" option gets a green checkmark.
The output isn't just a guess; it's a visual argument showing exactly which rules were applied and why. It's like showing the doctor the specific pages of the rulebook that led to the decision.
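The instantiation step above can be sketched in a few lines: filter the general blueprint down to the arguments whose conditions hold for one patient, then keep the treatment options that have support and no active attacker. Again, the rule set and names here are illustrative assumptions, not the paper's real guideline encoding.

```python
# Toy blueprint: each rule names a treatment, whether it supports ("pro") or
# attacks it, and a condition on the patient record.
blueprint = [
    {"name": "pro_surgery", "target": "surgery",   "pro": True,
     "applies": lambda p: True},
    {"name": "con_age",     "target": "surgery",   "pro": False,
     "applies": lambda p: p["age"] > 70},
    {"name": "pro_radio",   "target": "radiation", "pro": True,
     "applies": lambda p: True},
]

def instantiate(patient):
    """Keep only the arguments whose conditions hold for this patient."""
    return [a for a in blueprint if a["applies"](patient)]

def surviving_options(patient):
    """An option survives if it is supported and no active argument attacks it."""
    active = instantiate(patient)
    supported = {a["target"] for a in active if a["pro"]}
    attacked = {a["target"] for a in active if not a["pro"]}
    return supported - attacked

mr_smith = {"age": 85}
print(surviving_options(mr_smith))  # {'radiation'} — surgery gets the red "X"
```

Note that no AI call happens at this stage: the decision for Mr. Smith is a deterministic lookup against pre-built rules, which is where the speed and cost savings come from.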
4. The Superpower: "Global Contestability"
This is the paper's biggest innovation. In old systems, if you found a mistake, you had to argue with the AI for that one specific patient.
With ArgEval, if you find a mistake, you can edit the Master Blueprint.
- The Scenario: Imagine the AI says "Don't do Surgery" for Mr. Smith, but you (the expert doctor) know that for this specific type of tumor, surgery is actually okay even for older patients.
- The Fix: Instead of just overriding the answer for Mr. Smith, you go into the Master Blueprint and tweak the rule: "Actually, if the tumor is in the Thalamus, Surgery is okay."
- The Ripple Effect: You save that change. Now, every single future patient with that specific tumor type will get the correct advice automatically. You fixed the logic for the whole world, not just one case.
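The fix-once, apply-everywhere idea can be sketched as editing a single rule in the shared blueprint. The thalamus exception below mirrors the scenario above; the structure and names are illustrative, not the paper's actual editing interface.

```python
# A shared blueprint: one named rule that blocks surgery for older patients.
blueprint = {
    "con_age": lambda p: p["age"] > 70,
}

def surgery_blocked(patient):
    """Surgery is blocked if any active rule in the blueprint fires."""
    return any(condition(patient) for condition in blueprint.values())

elderly_thalamus = {"age": 85, "tumor_site": "thalamus"}
print(surgery_blocked(elderly_thalamus))  # True — the old rule blocks surgery

# The expert's fix: amend the rule once, in the blueprint itself.
blueprint["con_age"] = (
    lambda p: p["age"] > 70 and p["tumor_site"] != "thalamus"
)

print(surgery_blocked(elderly_thalamus))  # False — fixed for every future case
print(surgery_blocked({"age": 85, "tumor_site": "brainstem"}))  # True — still blocked
```

Because every instantiation reads from the same blueprint, the amendment ripples to all future patients automatically; no per-case override is needed.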
5. Why This Matters
- Trust: Doctors can see the "math" behind the decision. It's not magic; it's a logical chain of rules.
- Efficiency: The AI doesn't have to "think" from scratch for every patient. It just checks the pre-built rules, which is much faster and cheaper.
- Safety: If the AI makes a mistake, humans can fix the root cause (the rule) so the AI never makes that mistake again.
Summary Analogy
Think of current AI as a fortune teller who gives you a vague prediction. You can't question their logic, and if they are wrong, they just give you a different vague prediction next time.
ArgEval is like a legal system with a written Constitution.
- Lawmakers write a clear Constitution (the Master Blueprint) based on established law (the medical guidelines).
- When a new case comes up, they apply the Constitution to get a verdict.
- If the verdict is wrong, you don't just argue with the judge; you amend the Constitution.
- Once amended, the new law applies to everyone, ensuring the system gets smarter and safer over time.
This paper shows that by using this "Constitution" approach, the AI can give doctors better, explainable, and safer advice for treating brain tumors, while using less computer power than other methods.