Here is an explanation of the paper MAWARITH, broken down into simple concepts with creative analogies.
🌟 The Big Picture: The "Family Pie" Problem
Imagine you have a giant, delicious family pie (the estate) that needs to be sliced up and given to your relatives after you pass away. But there's a catch: you can't just slice it however you want. There is a very strict, ancient, and complex rulebook (Islamic Inheritance Law) that dictates exactly who gets a slice, how big that slice is, and who gets nothing at all.
This rulebook is like a high-stakes game of chess played with fractions. One wrong move (like forgetting a cousin or miscalculating a percentage) ruins the whole game.
The paper introduces a new tool called MAWARITH to test how good Artificial Intelligence (AI) is at playing this specific game.
🧩 What is MAWARITH? (The Dataset)
Before this paper, AI researchers mostly tested AI on simple multiple-choice questions (like "Who gets the pie? A, B, or C?"). But in real life, you need to explain how you got there.
MAWARITH is a massive library of 12,500 practice problems written in Arabic. Think of it as a "Drill Sergeant" for AI.
- The Problems: Each one is a unique family scenario (e.g., "The deceased leaves a wife, two sons, a mother, and a distant uncle").
- The Solution: Unlike old tests, MAWARITH doesn't just give the answer. It provides the full step-by-step reasoning, like a teacher showing their work on a math test. It shows exactly how the AI should:
- Find the Players: Who is actually allowed to play? (Some relatives are "blocked" by closer relatives).
- Apply the Rules: Who gets a fixed slice (like 1/6) and who gets the leftovers?
- Do the Math: Calculate the exact percentages.
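The three steps above can be sketched in a few lines of exact fraction arithmetic. This is a minimal illustration for one hypothetical scenario, not the paper's code, and it hard-codes the fixed shares rather than deriving them from a rule engine:

```python
from fractions import Fraction

# Hypothetical scenario: the deceased leaves a wife, a mother, and two sons.
# Step 1 (find the players): the sons block more distant relatives, so only
# these three kinds of heirs remain.
# Step 2 (apply the rules): with children present, the wife's fixed share
# is 1/8 and the mother's is 1/6; the sons split whatever is left.
fixed_shares = {"wife": Fraction(1, 8), "mother": Fraction(1, 6)}

# Step 3 (do the math): the residue goes to the two sons equally.
residue = 1 - sum(fixed_shares.values())
shares = dict(fixed_shares)
shares["son_1"] = shares["son_2"] = residue / 2

assert sum(shares.values()) == 1  # the whole pie is distributed
print(shares)  # each son receives 17/48 of the estate
```

Using `Fraction` instead of floats matters here: the rulebook works in exact fractions, and a float like 0.166666... would make the "does it all add up to one pie?" check unreliable.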
📏 The New Scorecard: MIR-E
In the past, an AI that got the final answer right got an "A," even if it got there by guessing or using the wrong logic. That's like acing a math test on a lucky guess while every line of your shown work is wrong.
The authors created a new grading system called MIR-E. It's like a multi-stage obstacle course.
- Stage 1: Did you identify the right people?
- Stage 2: Did you block the right people?
- Stage 3: Did you calculate the shares correctly?
- Stage 4: Did you handle the "adjustments" (what happens if the slices add up to more than the whole pie, or less than the whole pie)?
If the AI fails at Stage 1, the whole score drops, because you can't calculate the rest of the pie if you don't know who is eating it.
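The "early failure drags everything down" idea can be sketched as a gated average. The stage names and gating logic below are illustrative only; the paper's MIR-E metric defines its own stages and weighting:

```python
# Hypothetical stage scores for one model answer (1.0 = fully correct).
stages = {
    "identify_heirs": 1.0,
    "apply_blocking": 0.5,
    "compute_shares": 1.0,
    "apply_adjustments": 1.0,
}

def mir_e_sketch(stages):
    """Average the stage scores, but gate each stage on the ones before
    it: a mistake upstream caps the credit for everything downstream."""
    total, gate = 0.0, 1.0
    for score in stages.values():
        gate *= score          # failure upstream limits later credit
        total += gate
    return total / len(stages)

print(mir_e_sketch(stages))  # 0.625: the stage-2 slip halves stages 2-4
```

A flat average of the same scores would give 0.875; the gated version gives 0.625, which captures the obstacle-course intuition that you can't get full credit for the math if you miscounted the eaters.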
🤖 The Race: Who Won?
The researchers tested five different AI models (some open-source, some commercial) to see who could solve these inheritance puzzles.
- The Champion: Gemini-2.5-flash (a commercial AI) was the clear winner. It scored about 90%. It was like a master chef who followed the recipe perfectly, chopped the ingredients right, and baked the pie without burning it.
- The Rest of the Pack: The other models (like LLaMA, Qwen, and Fanar) scored below 50%. They were like amateur bakers who often forgot to invite a guest, gave the wrong slice size, or tried to bake a pie that was bigger than the oven.
🚫 Why Did the Others Fail? (The "Hallucination" Problem)
The paper found that the AI models failed in very specific, human-like ways:
- The "Ghost Guest" Error: The AI would invent relatives who didn't exist or include people who were legally blocked from inheriting.
- Analogy: It's like inviting your neighbor's dog to the family dinner because the AI thought, "Oh, dogs are family too," even though the rulebook says only humans get a slice.
- The "Math Panic": Even when the AI knew who should get the pie, it messed up the fractions.
- Analogy: It knew the mother gets a slice, but instead of giving her 1/6, it gave her 1/3 because it forgot a specific rule about how many siblings were present.
- The "Language Confusion": The AI struggled to read complex Arabic descriptions of family trees.
- Analogy: If the text said "the son of the son's daughter," the AI might get confused and think there are two different people instead of one specific person.
🔍 The "Adjustment" Trap
There are two special rules in this game called ʿAwl and Radd.
- ʿAwl: If the slices add up to more than the whole pie, everyone's slice gets shrunk proportionally.
- Radd: If the slices add up to less than the whole pie, the extra gets redistributed to specific people.
The AI models were terrible at knowing when to use these rules. They often forgot to shrink the pie or forgot to redistribute the extra, leading to a messy, unfair distribution.
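Both adjustments amount to rescaling the claimed fractions so they sum to exactly one pie. The sketch below simplifies Radd (in the classical doctrine, spouses are excluded from the redistribution, which this code ignores), and the example case is a classic ʿAwl scenario:

```python
from fractions import Fraction

def adjust(shares):
    """Simplified 'Awl / Radd adjustment: rescale the claimed fractions
    so they sum to exactly 1. (Real Radd excludes spouses from the
    redistribution; this sketch skips that refinement.)"""
    total = sum(shares.values())
    if total == 1:
        return shares                      # nothing to adjust
    # 'Awl (total > 1): shrink every slice; Radd (total < 1): grow them.
    return {heir: s / total for heir, s in shares.items()}

# Classic 'Awl case: a husband (1/2) and two full sisters (2/3)
# claim 7/6 of the estate, which is more pie than exists.
over = {"husband": Fraction(1, 2), "sisters": Fraction(2, 3)}
print(adjust(over))  # husband -> 3/7, sisters -> 4/7
```

The hard part for the models was not this division, but recognizing *when* the claimed slices overshoot or undershoot the pie in the first place.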
💡 The Takeaway
This paper shows that while AI is great at writing poems or answering trivia, it still struggles with complex, rule-based logic where one small mistake ruins the whole result.
- Commercial AIs (like the one that won) seem to have better "common sense" and rule-following abilities.
- Open-source AIs need more training specifically on these strict legal rules.
The authors hope that by releasing this dataset (MAWARITH), they can help build future AIs that act like expert legal scholars, capable of solving these complex family puzzles with step-by-step accuracy, rather than just guessing the answer.
In short: They built a giant practice exam for AI to learn how to divide a family inheritance fairly, and they found that while one AI is getting an A, most others are still failing math class.