Imagine you are a detective trying to solve a massive case. You have a stack of evidence that is 10,000 pages thick (contracts, emails, policies), and you need to find the few specific sentences that prove someone followed the rules—or broke them.
Doing this manually is slow and exhausting. Using a super-smart AI chatbot (like a "Legal Copilot") is fast, but it's like hiring a detective who sometimes makes things up, changes their mind every time you ask the same question, and won't tell you how they reached their conclusion. In a courtroom or a government audit, that's a disaster. You need proof that is consistent, repeatable, and explainable.
This paper proposes a different kind of AI detective: The Deterministic Fuzzy Triage System.
Here is how it works, broken down into simple concepts:
1. The "Dual-Encoder" (The Matchmaker)
Think of the AI as a super-fast librarian.
- The Problem: You have a rule (e.g., "All employees must have unique passwords") and a pile of contract sentences. You need to know which sentences match that rule.
- The Solution: The AI uses two "encoders." One reads the rule, and the other reads the contract sentence. It turns both into a secret code (a vector).
- The Magic: It compares the codes. If they look similar, it gives them a high "match score." If they look different, the score is low.
- Why it's special: Unlike a fancy chatbot that guesses and wanders, this librarian is deterministic. If you give it the same rule and the same sentence today, tomorrow, or next year, it will always give you the exact same score. It's like a math equation: $2 + 2$ is always $4$. This is crucial for legal audits because you need to prove your process didn't change.
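The matching step above can be sketched in a few lines. This is a toy illustration, not the paper's trained dual encoders: the `encode` function below just buckets words with a deterministic hash so the example is self-contained and reproducible, and `cosine` compares the resulting vectors.

```python
import hashlib
import math

def encode(text, dim=8):
    # Toy stand-in for a trained encoder: bucket words with a
    # *deterministic* hash. (Python's built-in hash() is salted per
    # process, which would break run-to-run reproducibility.)
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec

def cosine(a, b):
    # Match score: 1.0 means the two vectors point the same way.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

rule = encode("All employees must have unique passwords")
clause = encode("Each employee shall use a unique password")
score = cosine(rule, clause)

# Determinism: re-running the identical comparison yields the identical score.
assert score == cosine(encode("All employees must have unique passwords"),
                       encode("Each employee shall use a unique password"))
```

The point of the sketch is the last assertion: the score is a pure function of the two texts, so an auditor re-running it gets the same number every time.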
2. The "Fuzzy Triage" (The Traffic Light)
This is the paper's biggest innovation. Most AI just says "Yes" or "No." But in law, things are rarely black and white. Sometimes a clause is a perfect match; sometimes it's a terrible match; and often, it's just "meh" (it's vague or ambiguous).
The authors built a Traffic Light System for the AI's confidence score:
- 🟢 Green Light (Auto-Compliant): The AI's match score is so high that the system treats the contract as clearly following the rule.
- Action: The computer automatically marks it as "Safe." No human needed.
- 🔴 Red Light (Auto-Non-Compliant): The AI's match score is so low that the system treats the contract as clearly breaking the rule.
- Action: The computer automatically flags it as "Danger." No human needed.
- 🟡 Yellow Light (Human Review): The AI is unsure. The score is in the middle.
- Action: The computer puts this on a "To-Do" list for a human lawyer to read.
The "Fuzzy" part: The system doesn't just guess where the lights go. The authors carefully tuned the boundaries of the "Yellow Zone" so that the computer handles 96–98% of the work automatically, while the error rate inside the automatic "Green" and "Red" zones stays below 2%. The design forces the AI to say, "I'm not sure, a human should look at this," rather than guessing and being wrong.
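In code, the traffic light is just two cut-offs on the match score. The threshold values below are illustrative placeholders, not the paper's tuned ones:

```python
def triage(score, green_min=0.85, red_max=0.30):
    # green_min / red_max are illustrative; the paper tunes these
    # boundaries so the Yellow Zone captures the ambiguous middle.
    if score >= green_min:
        return "green"   # auto-compliant, no human needed
    if score <= red_max:
        return "red"     # auto-non-compliant, no human needed
    return "yellow"      # ambiguous: route to a human lawyer

print(triage(0.92))  # green
print(triage(0.12))  # red
print(triage(0.55))  # yellow
```

Widening the Yellow Zone sends more work to humans but makes the automatic zones safer; narrowing it does the opposite. Tuning that trade-off is the calibration step the paper describes.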
3. Why not just use a "Super Smart" Chatbot?
You might ask, "Why not just use the latest, most powerful AI?"
- The Chatbot is a Black Box: It's like a magician pulling a rabbit out of a hat. You see the rabbit (the answer), but you don't know how it got there. If a regulator asks, "Why did you say this contract is safe?" the chatbot might say, "Because I felt like it," or give a different reason next time.
- The Deterministic Model is a Glass Box: It's like a calculator. You can see every step of the math. You can say, "Here is the rule, here is the contract, here is the score, and here is the threshold we set." If a judge or auditor wants to re-run the test, they get the exact same result.
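A glass-box decision can carry its whole justification with it. The sketch below is an assumption about what such an audit record might look like (the field names and IDs are invented for illustration): it returns not just the verdict but every input an auditor would need to re-run the decision.

```python
def audited_decision(rule_id, clause_id, score, green_min=0.85, red_max=0.30):
    # Same thresholding as the triage step, wrapped in a full audit record.
    if score >= green_min:
        verdict = "auto-compliant"
    elif score <= red_max:
        verdict = "auto-non-compliant"
    else:
        verdict = "human-review"
    return {
        "rule": rule_id,
        "clause": clause_id,
        "score": score,
        "thresholds": {"green_min": green_min, "red_max": red_max},
        "verdict": verdict,
    }

# Re-running with identical inputs reproduces the identical record,
# which is exactly what a judge or auditor can check.
first = audited_decision("rule-42", "clause-17", 0.91)
again = audited_decision("rule-42", "clause-17", 0.91)
assert first == again
```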
4. The Real-World Impact
Imagine a hospital checking if their software follows privacy laws (HIPAA).
- Without this system: A team of humans spends months reading thousands of documents.
- With a Chatbot: They get answers quickly, but they can't trust them for court, and they have to re-read everything to be safe.
- With this System:
- The AI instantly scans 10,000 documents.
- It marks 9,000 as "Clearly Safe" (Green) and 500 as "Clearly Broken" (Red).
- It puts only 500 "Maybe" documents (Yellow) on a lawyer's desk.
- The lawyer only reads the 500 "Maybe" ones.
- If an auditor comes in, the lawyer can show the exact settings and scores used to make those decisions.
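The workflow above can be sketched as a single batch pass that splits documents into the three buckets and hands only the Yellow ones to the lawyer (thresholds again illustrative):

```python
from collections import Counter

def triage_batch(scored_docs, green_min=0.85, red_max=0.30):
    # scored_docs: iterable of (doc_id, match_score) pairs.
    counts = Counter()
    review_queue = []  # only ambiguous documents reach a human
    for doc_id, score in scored_docs:
        if score >= green_min:
            counts["green"] += 1
        elif score <= red_max:
            counts["red"] += 1
        else:
            counts["yellow"] += 1
            review_queue.append(doc_id)
    return counts, review_queue

docs = [("d1", 0.95), ("d2", 0.10), ("d3", 0.50), ("d4", 0.88)]
counts, queue = triage_batch(docs)
print(dict(counts))  # {'green': 2, 'red': 1, 'yellow': 1}
print(queue)         # ['d3']
```

At the hospital's scale this is the same pass, just over 10,000 scored documents instead of four, with the review queue becoming the lawyer's "Maybe" pile.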
The Bottom Line
This paper argues that for high-stakes jobs like law and compliance, we don't need the "flashiest" AI that might hallucinate. We need a boring, reliable, transparent tool that knows when to stop and ask a human for help.
It's the difference between a magic wand (unpredictable, hard to explain) and a well-calibrated scale (consistent, auditable, and knows exactly when it's too heavy to lift alone). This system gives legal teams a way to use AI without losing their ability to explain their decisions to a judge.