Here is an explanation of the paper MAWARITH, broken down into simple concepts with creative analogies.
🌟 The Big Picture: The "Family Pie" Problem
Imagine you have a giant, delicious family pie (the estate) that needs to be sliced up and given to your relatives after you pass away. But there's a catch: you can't just slice it however you want. There is a very strict, ancient, and complex rulebook (Islamic Inheritance Law) that dictates exactly who gets a slice, how big that slice is, and who gets nothing at all.
This rulebook is like a high-stakes game of chess played with fractions. One wrong move (like forgetting a cousin or miscalculating a percentage) ruins the whole game.
The paper introduces a new tool called MAWARITH to test how good Artificial Intelligence (AI) is at playing this specific game.
🧩 What is MAWARITH? (The Dataset)
Before this paper, AI researchers mostly tested AI on simple multiple-choice questions (like "Who gets the pie? A, B, or C?"). But in real life, you need to explain how you got there.
MAWARITH is a massive library of 12,500 practice problems written in Arabic. Think of it as a "Drill Sergeant" for AI.
- The Problems: Each one is a unique family scenario (e.g., "The deceased leaves a wife, two sons, a mother, and a distant uncle").
- The Solution: Unlike old tests, MAWARITH doesn't just give the answer. It provides the full step-by-step reasoning, like a teacher showing their work on a math test. It shows exactly how the AI should:
- Find the Players: Who is actually allowed to play? (Some relatives are "blocked" by closer relatives).
- Apply the Rules: Who gets a fixed slice (like 1/6) and who gets the leftovers?
- Do the Math: Calculate the exact percentages.
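The three steps above can be sketched in a few lines of exact fraction arithmetic. This is a minimal illustration for one hypothetical scenario, not the paper's code, and it hard-codes the fixed shares rather than deriving them from a rule engine:

```python
from fractions import Fraction

# Hypothetical scenario: the deceased leaves a wife, a mother, and two sons.
# Step 1 (find the players): the sons block more distant relatives, so only
# these three kinds of heirs remain.
# Step 2 (apply the rules): with children present, the wife's fixed share
# is 1/8 and the mother's is 1/6; the sons split whatever is left.
fixed_shares = {"wife": Fraction(1, 8), "mother": Fraction(1, 6)}

# Step 3 (do the math): the residue goes to the two sons equally.
residue = 1 - sum(fixed_shares.values())
shares = dict(fixed_shares)
shares["son_1"] = shares["son_2"] = residue / 2

assert sum(shares.values()) == 1  # the whole pie is distributed
print(shares)  # each son receives 17/48 of the estate
```

Using `Fraction` instead of floats matters here: the rulebook works in exact fractions, and a float like 0.166666... would make the "does it all add up to one pie?" check unreliable.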
📏 The New Scorecard: MIR-E
In the past, an AI that got the final answer right got an "A," even if it got there by guessing or using the wrong logic. That's like acing a math test on a lucky guess while every line of your shown work is wrong.
The authors created a new grading system called MIR-E. It's like a multi-stage obstacle course.
- Stage 1: Did you identify the right people?
- Stage 2: Did you block the right people?
- Stage 3: Did you calculate the shares correctly?
- Stage 4: Did you handle the "adjustments" (what happens if the slices add up to more than the whole pie, or less than the whole pie)?
If the AI fails at Stage 1, the whole score drops, because you can't calculate the rest of the pie if you don't know who is eating it.
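The "early failure drags everything down" idea can be sketched as a gated average. The stage names and gating logic below are illustrative only; the paper's MIR-E metric defines its own stages and weighting:

```python
# Hypothetical stage scores for one model answer (1.0 = fully correct).
stages = {
    "identify_heirs": 1.0,
    "apply_blocking": 0.5,
    "compute_shares": 1.0,
    "apply_adjustments": 1.0,
}

def mir_e_sketch(stages):
    """Average the stage scores, but gate each stage on the ones before
    it: a mistake upstream caps the credit for everything downstream."""
    total, gate = 0.0, 1.0
    for score in stages.values():
        gate *= score          # failure upstream limits later credit
        total += gate
    return total / len(stages)

print(mir_e_sketch(stages))  # 0.625: the stage-2 slip halves stages 2-4
```

A flat average of the same scores would give 0.875; the gated version gives 0.625, which captures the obstacle-course intuition that you can't get full credit for the math if you miscounted the eaters.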
🤖 The Race: Who Won?
The researchers tested five different AI models (some open-source, some commercial) to see who could solve these inheritance puzzles.
- The Champion: Gemini-2.5-flash (a commercial AI) was the clear winner. It scored about 90%. It was like a master chef who followed the recipe perfectly, chopped the ingredients right, and baked the pie without burning it.
- The Rest of the Pack: The other models (like LLaMA, Qwen, and Fanar) scored below 50%. They were like amateur bakers who often forgot to invite a guest, gave the wrong slice size, or tried to bake a pie that was bigger than the oven.
🚫 Why Did the Others Fail? (The "Hallucination" Problem)
The paper found that the AI models failed in very specific, human-like ways:
- The "Ghost Guest" Error: The AI would invent relatives who didn't exist or include people who were legally blocked from inheriting.
- Analogy: It's like inviting your neighbor's dog to the family dinner because the AI thought, "Oh, dogs are family too," even though the rulebook says only humans get a slice.
- The "Math Panic": Even when the AI knew who should get the pie, it messed up the fractions.
- Analogy: It knew the mother gets a slice, but instead of giving her 1/6, it gave her 1/3 because it forgot a specific rule about how many siblings were present.
- The "Language Confusion": The AI struggled to read complex Arabic descriptions of family trees.
- Analogy: If the text said "the son of the son's daughter," the AI might get confused and think there are two different people instead of one specific person.
🔍 The "Adjustment" Trap
There are two special rules in this game called ʿAwl and Radd.
- ʿAwl: If the slices add up to more than the whole pie, everyone's slice gets shrunk proportionally.
- Radd: If the slices add up to less than the whole pie, the extra gets redistributed to specific people.
The AI models were terrible at knowing when to use these rules. They often forgot to shrink the pie or forgot to redistribute the extra, leading to a messy, unfair distribution.
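Both adjustments amount to rescaling the claimed fractions so they sum to exactly one pie. The sketch below simplifies Radd (in the classical doctrine, spouses are excluded from the redistribution, which this code ignores), and the example case is a classic ʿAwl scenario:

```python
from fractions import Fraction

def adjust(shares):
    """Simplified 'Awl / Radd adjustment: rescale the claimed fractions
    so they sum to exactly 1. (Real Radd excludes spouses from the
    redistribution; this sketch skips that refinement.)"""
    total = sum(shares.values())
    if total == 1:
        return shares                      # nothing to adjust
    # 'Awl (total > 1): shrink every slice; Radd (total < 1): grow them.
    return {heir: s / total for heir, s in shares.items()}

# Classic 'Awl case: a husband (1/2) and two full sisters (2/3)
# claim 7/6 of the estate, which is more pie than exists.
over = {"husband": Fraction(1, 2), "sisters": Fraction(2, 3)}
print(adjust(over))  # husband -> 3/7, sisters -> 4/7
```

The hard part for the models was not this division, but recognizing *when* the claimed slices overshoot or undershoot the pie in the first place.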
💡 The Takeaway
This paper shows that while AI is great at writing poems or answering trivia, it still struggles with complex, rule-based logic where one small mistake ruins the whole result.
- Commercial AIs (like the one that won) seem to have better "common sense" and rule-following abilities.
- Open-source AIs need more training specifically on these strict legal rules.
The authors hope that by releasing this dataset (MAWARITH), they can help build future AIs that act like expert legal scholars, capable of solving these complex family puzzles with step-by-step accuracy, rather than just guessing the answer.
In short: They built a giant practice exam for AI to learn how to divide a family inheritance fairly, and they found that while one AI is getting an A, most others are still failing math class.