Imagine you are trying to fix a leak in your house. You call a plumber, and instead of just giving you a quick answer, they pull out a massive, dusty encyclopedia of plumbing codes. But here's the catch: the answer isn't on any single page. It's scattered across three different books, and the books are connected by tiny, invisible threads. If you miss one thread, you might fix the leak but accidentally flood the basement.
This is exactly the problem the researchers at Seoul National University are tackling with their new paper, "Beyond Case Law: Evaluating Structure-Aware Retrieval and Safety in Statute-Centric Legal QA."
Here is the story of their discovery, broken down into simple concepts.
1. The Problem: The "Statutory Retrieval Gap"
Most legal AI tools today are trained like detectives: they look for similar past cases, the way you might recall an old mystery where a thief was caught the same way. This works well in common-law countries like the US or UK, where the law is built on past court decisions.
But in countries like South Korea (and many other civil-law systems), the law is built on statutes: written rules. Think of these rules not as a pile of books, but as a giant, multi-level tree.
- The Trunk: The main law (e.g., "Buildings must be safe").
- The Branches: Detailed rules (e.g., "Fire extinguishers must be here").
- The Leaves: Tiny technical specs (e.g., "The handle must be 1.2 meters high").
The Trap: When a regular AI is asked, "Can I use a removable railing here?", it looks for the word "railing" in the trunk. It finds nothing. It doesn't know to follow the "branch" down to the "leaf" where the answer actually lives. The researchers call this the Statutory Retrieval Gap. The AI is looking in the wrong room because the answer is hidden in a different part of the house, connected only by a tiny, invisible hallway (a citation).
2. The Solution: A New Map (SEARCHFIRESAFETY)
To fix this, the team built a new testing ground called SEARCHFIRESAFETY. They chose fire safety regulations as their test case because the domain is life-or-death: if the answer is wrong, people can get hurt.
They created a digital map (a graph) that connects every law to the specific technical rules it references. It's like giving the AI a GPS that doesn't just say "Go North," but says, "Go North, then turn left at the 'Enforcement Decree' sign, then walk down the 'Technical Standard' hallway."
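To make this concrete, here is a minimal sketch in Python of the difference between keyword search and citation-following. Every document ID, text snippet, and citation edge below is invented for illustration; the paper's actual graph is built from real Korean fire-safety provisions.

```python
# Toy three-level corpus: an Act (trunk), an Enforcement Decree (branch),
# and a technical standard (leaf). All IDs and text are invented.
corpus = {
    "act-art-101":   "Evacuation facilities in buildings must conform to "
                     "standards prescribed by Presidential Decree.",
    "decree-art-9":  "Guardrails for evacuation facilities must satisfy "
                     "the Technical Standard on Railings (std-railing-3).",
    "std-railing-3": "Detachable units are permitted only if they lock "
                     "automatically and withstand a 100 kg lateral load.",
}

# The citation graph: the "invisible hallways" between documents.
citations = {
    "act-art-101": ["decree-art-9"],
    "decree-art-9": ["std-railing-3"],
}

def keyword_search(terms):
    """Plain keyword matching: finds only documents that literally contain a term."""
    return [doc_id for doc_id, text in corpus.items()
            if any(t in text.lower() for t in terms)]

def graph_expand(doc_ids):
    """Follow citation edges from the initial hits until nothing new appears."""
    seen, frontier = set(doc_ids), list(doc_ids)
    while frontier:
        for ref in citations.get(frontier.pop(), []):
            if ref not in seen:
                seen.add(ref)
                frontier.append(ref)
    return sorted(seen)

# Question: "Can I install a removable guardrail on an evacuation route?"
hits = keyword_search(["removable", "guardrail", "evacuation"])
print(hits)
# ['act-art-101', 'decree-art-9']: the leaf is missed, because the
# standard says "detachable", not "removable".
print(graph_expand(hits))
# ['act-art-101', 'decree-art-9', 'std-railing-3']: following the
# citation edges pulls in the leaf where the real answer lives.
```

Keyword search stops at the vocabulary boundary; graph expansion walks the tree from trunk to leaf, which is exactly the "GPS" behavior described above.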
They tested the AI with two types of questions:
- Real Questions: "Can I install this specific railing?" (Requires finding the hidden leaf on the tree).
- Safety Questions: "Here is a half-finished map. Can you tell me the answer?" (This tests whether the AI will bluff or admit it doesn't know; a sketch of this kind of probe follows below.)
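One can picture the "half-finished map" probe like this: start from a complete document set, withhold one required piece, and check whether the model abstains. This is only a sketch of the general idea, reusing the toy `corpus` from the earlier snippet; the paper's actual perturbations and grading are more careful, and the abstention markers here are invented.

```python
import random

def make_safety_probe(question, required_ids, corpus):
    """Build a deliberately incomplete prompt by withholding one required document."""
    withheld = random.choice(required_ids)
    visible = [d for d in required_ids if d != withheld]
    context = "\n\n".join(f"[{d}] {corpus[d]}" for d in visible)
    return f"Context:\n{context}\n\nQuestion: {question}", withheld

# Crude grading rule: did the model admit it is missing something?
ABSTAIN_MARKERS = ("i don't know", "cannot answer", "insufficient information")

def abstained(model_answer):
    return any(m in model_answer.lower() for m in ABSTAIN_MARKERS)
```

A model that answers such a probe fluently, despite the withheld document, is exactly the "overconfident expert" described next.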
3. The Big Discovery: The "Overconfident Expert"
The researchers found two major things:
A. The Map Works (Retrieval)
When they gave the AI the "GPS map" (the citation graph) to help it navigate, it got much better at finding the right answers. It stopped guessing and started following the legal threads. The result shows that for statute-based law, you can't just search for keywords; you have to follow the connections.
B. The Danger of "Fake Confidence" (Safety)
This is the scary part. When the AI was given incomplete information (like the half-finished map), many of the smartest AI models didn't say, "I don't know." Instead, they hallucinated.
Imagine a student taking a test. If they don't know the answer, a safe student raises their hand and says, "I need more time." But these AI models acted like overconfident experts: they invented answers that sounded perfectly plausible but were completely fabricated.
The researchers found that when they trained the AI specifically on legal texts to make it "smarter," it actually became more dangerous in uncertain situations. It became so eager to sound like a lawyer that it stopped admitting when it was missing a crucial piece of the puzzle.
4. The Takeaway: We Need "Humble" AI
The paper concludes that building a legal AI isn't just about making it smarter or faster. It's about making it safer.
- Current AI: "I see a word match! Here is my answer!" (Even if the answer is wrong).
- Safe AI: "I see a word match, but I'm missing the technical rule that connects to it. I cannot answer this safely."
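A "safe AI" check could be as simple as verifying, before answering, that every citation edge out of the retrieved documents resolves inside the context. This is a hedged sketch on the toy `citations` graph from above, not the paper's method, and the `generate` callable is a hypothetical stand-in for the actual model call:

```python
def answer_or_abstain(retrieved_ids, citations, generate):
    """Answer only if every document cited by the context is also present."""
    missing = sorted({ref
                      for doc in retrieved_ids
                      for ref in citations.get(doc, [])
                      if ref not in retrieved_ids})
    if missing:
        return f"I can't answer this safely: cited documents {missing} are not in my context."
    return generate(retrieved_ids)  # hypothetical LLM call

# With the leaf standard missing, the system refuses instead of guessing:
print(answer_or_abstain(["act-art-101", "decree-art-9"], citations,
                        generate=lambda ids: "(model answer)"))
```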
The researchers argue that for laws that affect real-world safety (like fire codes), we need AI that knows when to stop and ask for help rather than guessing. They have built a new tool to test exactly this: Can the AI find the hidden connections, and more importantly, can it admit when it doesn't have enough information to give a safe answer?
In short: They built a better map for legal AI, but they also discovered that the smartest maps can sometimes lead you off a cliff if the driver is too confident to admit they are lost.