A Benchmark for Gap and Overlap Analysis as a Test of KG Task Readiness

This paper introduces an executable, auditable benchmark for evaluating knowledge graph readiness in policy-like document analysis. It aligns natural language contracts with a formal ontology so that text-only LLMs can be systematically compared against ontology-driven pipelines on gap and overlap analysis.

Original authors: Maruf Ahmed Mridul, Rohit Kapa, Oshani Seneviratne

Published 2026-04-14

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are shopping for life insurance. You have ten different policies in front of you, written in dense, confusing legal language. You have a specific situation in mind: "What happens if I die in a car accident while intoxicated, or if I commit suicide 13 months after buying the policy?"

You need to know two things:

  1. Overlap: Which of these ten policies will actually pay out to my family?
  2. Gap: Which policies will say "No, we don't cover that," and why?

Doing this manually is a nightmare. Lawyers have to read every word, cross-reference every clause, and hope they didn't miss a tiny detail.

This paper introduces a new way to solve that problem using Knowledge Graphs (KGs) and a "test drive" called a Benchmark. Here is the breakdown in simple terms.

1. The Problem: The "Black Box" of Legal Text

The authors argue that just having a lot of data (like a massive library of insurance contracts) isn't enough. You need a way to prove that your computer system understands the rules exactly the way a human expert does.

Usually, we test AI by asking it to read a contract and guess the answer. But an AI system like a Large Language Model (LLM) is a very smart student who reads the book but sometimes makes up rules based on what sounds right, rather than what is actually written. It might say, "Oh, suicide is usually bad, so I'll say this policy denies it," even if the specific policy says it covers it after 12 months.

2. The Solution: Building a "Digital Twin" of the Contracts

Instead of just feeding the AI the raw text, the authors built a Knowledge Graph. Think of this as turning the messy, 50-page PDF of an insurance contract into a clean, organized LEGO set.

  • The TBox (The Blueprint): This is the rulebook. It defines what a "Policy," "Suicide Clause," or "Grace Period" actually means in the world of insurance. It's the instruction manual for the LEGO set.
  • The ABox (The Built Model): This is the actual LEGO structure built for each of the 10 specific contracts. Every single fact (e.g., "Contract C1 has a 24-month suicide waiting period") is a specific LEGO brick snapped into place.
  • The Traceability: Crucially, every single LEGO brick has a tiny tag attached to it that says, "I came from Page 4, Paragraph 2 of Contract C1." If the computer makes a decision, you can instantly look at the tag and see the exact source text.
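To make the TBox/ABox/traceability split concrete, here is a minimal Python sketch of facts that carry their own source tags. The class name, field names, and the second contract's data are illustrative assumptions, not the paper's actual RDF schema; only the C1 example fact comes from the text above.

```python
# Minimal sketch of ABox facts with traceability tags.
# The Fact class and field names are illustrative, not the authors' schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    subject: str      # e.g. a contract ID ("the LEGO structure")
    predicate: str    # a property the TBox "blueprint" defines
    value: object     # the asserted value
    source: str       # the tag pointing back at the source text

abox = [
    # The example fact from the text: C1's 24-month suicide waiting period,
    # tagged with its origin ("Page 4, Paragraph 2 of Contract C1").
    Fact("C1", "suicideWaitingPeriodMonths", 24, "Contract C1, page 4, paragraph 2"),
    # A second, purely hypothetical fact for illustration:
    Fact("C1", "gracePeriodDays", 30, "Contract C1, page 7, paragraph 1"),
]

# Every stored fact can instantly show its evidence:
for f in abox:
    print(f"{f.subject}.{f.predicate} = {f.value}  (source: {f.source})")
```

The point of the design is that the answer and its evidence travel together: no decision can be produced without a tag leading back to the contract text.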

3. The Test: The "Scenario Challenge"

To see if this LEGO system works, the authors created 58 "What-If" scenarios (Competency Questions).

  • Example Scenario: "The insured dies by suicide 13 months after the policy started."

They ran this scenario against their LEGO model using a precise query language (SPARQL). Because the LEGO bricks are snapped together perfectly according to the blueprint, the computer can instantly say:

  • "Contracts 1-5 and 7-10 say COVERED (because their waiting period is 12 months)."
  • "Contract 6 says DENIED (because its waiting period is 24 months)."

And because of the tags, it can show you the exact sentence in the contract that proves it.
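The authors run this check as a SPARQL query over an RDF graph. The following plain-Python sketch only imitates the idea with a dictionary lookup; the contract data mirrors the example results above (a 12-month waiting period versus Contract 6's 24-month period), while the source tags are hypothetical.

```python
# Illustrative imitation of the scenario query. The real system uses SPARQL
# over RDF; the source strings here are hypothetical placeholders.
policies = {
    "C2": {"suicide_waiting_months": 12, "source": "Contract C2, suicide clause"},
    "C6": {"suicide_waiting_months": 24, "source": "Contract C6, suicide clause"},
}

def evaluate_suicide_scenario(months_since_start: int) -> dict:
    """Return COVERED/DENIED per contract, each with its evidence tag."""
    results = {}
    for cid, p in policies.items():
        covered = months_since_start > p["suicide_waiting_months"]
        results[cid] = ("COVERED" if covered else "DENIED", p["source"])
    return results

# Scenario: death by suicide 13 months after the policy started.
for cid, (verdict, source) in evaluate_suicide_scenario(13).items():
    print(f"{cid}: {verdict}  (evidence: {source})")
```

Because the waiting periods are explicit facts rather than prose, the 13-month scenario resolves deterministically: 13 > 12 means covered, 13 < 24 means denied, and each verdict carries its evidence tag.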

4. The Showdown: LEGO vs. The "Smart Student"

The authors then asked three top-tier AI models (ChatGPT, Gemini, Claude) to read the raw text and answer the same 58 scenarios.

The Results:

  • The LEGO System (Ontology): Was 100% consistent. It never got tired, never guessed, and always pointed to the exact evidence.
  • The "Smart Student" (LLMs): They were okay at simple questions but started failing on complex ones.
    • The "Missing Clause" Trap: If a contract didn't mention a specific rule (like "no drinking"), the AI often assumed the claim was denied. The LEGO system correctly said, "This scenario doesn't apply to this contract," because the contract simply didn't have that rule.
    • The "Complex Structure" Trap: When the scenario involved a complex joint policy (two people on one plan), the AI got confused and gave different answers for different contracts, even when the logic was the same.

The Metaphor:
Imagine you are trying to find a specific ingredient in a kitchen.

  • The LLM is like a chef who looks at the pantry, smells the air, and guesses, "I bet there's no salt here," because they don't see a salt shaker. They might be right, or they might be wrong.
  • The Knowledge Graph is like a robot that has a digital map of every single jar on every shelf. It checks the map, sees the jar is missing, and says, "Confirmed: No salt." If you ask why, it points to the map coordinates.

5. Why This Matters

This paper shows that for high-stakes domains (like insurance, law, or healthcare), you can't just rely on AI that "guesses" based on patterns. You need a system that models the rules explicitly.

The "Benchmark" they created is like a standardized driving test for these AI systems. It shows that:

  1. Structure wins: Turning text into a structured map (the LEGO set) makes the AI much more reliable.
  2. Evidence is key: You can't just get an answer; you need to see the "receipt" (the source text) proving why that answer was chosen.
  3. It's reusable: While they used insurance, this same "Blueprint + LEGO + Test Drive" method can be used for healthcare laws, bank regulations, or any complex rulebook.

In short: The authors built a "truth machine" for insurance contracts. They showed that while AI is great at chatting, it needs a structured, rule-based backbone to be trusted with serious decisions where getting the answer wrong costs people their money or lives.
