Imagine you are trying to teach a brilliant but slightly overconfident student how to solve a complex mystery.
In the past, we only looked at the final answer the student wrote on the test. If they named the right culprit, we gave them an A. But we didn't know how they got there. Did they actually solve the mystery step-by-step, or did they just guess the answer because they recognized a pattern from a previous story?
This paper, titled "Omanic," introduces a new way to test Large Language Models (LLMs)—the super-smart AI brains behind tools like ChatGPT. Here is the breakdown using simple analogies:
1. The Problem: The "Magic Trick" vs. Real Reasoning
Current AI models are great at math and logic, but they often take shortcuts.
- The Analogy: Imagine a magician pulling a rabbit out of a hat. You see the rabbit (the correct answer), but you don't see the trick (the reasoning).
- The Issue: Existing tests (like HotpotQA) only ask, "Where is the rabbit?" They don't ask, "Did you actually look in the hat, or did you just pull it out of your pocket?"
- The Result: We can't tell if the AI is truly thinking or just guessing based on patterns.
2. The Solution: Omanic (The "Step-by-Step" Detective Kit)
The researchers built a new dataset called Omanic. Think of this as a specialized training manual for detectives.
- The Structure: Instead of just asking one big, hard question, Omanic breaks every problem down into four smaller, connected clues.
- Clue 1: Who is the author?
- Clue 2: Where was the author born?
- Clue 3: How many years ago was that?
- Clue 4: Which political party was founded that many years ago?
- The Twist: To get the final answer, you must get the first three clues right. If you get Clue 2 wrong, the rest of the chain collapses.
- The "Math" Ingredient: They also added a requirement for math. You can't just guess; you have to do actual calculations (like counting committees or multiplying years) to connect the dots. This prevents the AI from just "feeling" the answer.
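The chain structure above can be sketched in a few lines of Python. Everything here is illustrative: the questions, facts, and function names are made up to show the dependency, not taken from the paper's actual schema.

```python
# A sketch of a four-step question chain: each step's answer feeds
# the next, so one wrong intermediate answer breaks the final result.
# All facts below are invented for illustration.

def solve_chain(steps, start):
    """Apply each lookup in order, feeding each answer into the next."""
    value = start
    for step in steps:
        value = step(value)
    return value

# Hypothetical knowledge lookups, one per clue.
author_of = {"The Mystery Novel": "A. Writer"}.get
birthplace_of = {"A. Writer": "Springfield"}.get
founded_years_ago = {"Springfield": 150}.get      # the arithmetic step
party_founded = {150: "Example Party"}.get

steps = [author_of, birthplace_of, founded_years_ago, party_founded]
print(solve_chain(steps, "The Mystery Novel"))    # Example Party

# If Clue 2 returns the wrong city, every later lookup misses and
# the whole chain collapses to no answer at all.
birthplace_wrong = {"A. Writer": "Shelbyville"}.get
broken = [author_of, birthplace_wrong, founded_years_ago, party_founded]
print(solve_chain(broken, "The Mystery Novel"))   # None
```

The design point is the dependency itself: there is no way to reach the final answer without producing every intermediate answer correctly first.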
3. The Two Parts of the Kit
The team created two versions of this dataset:
- OmanicSynth (The Practice Gym): A massive library of 10,000+ practice problems generated by computers. This is where the AI trains its muscles.
- OmanicBench (The Final Exam): A smaller, very strict set of 967 problems that were checked by human experts. This is the "real test" to see if the AI actually learned.
4. What They Discovered (The "Aha!" Moments)
When they tested the smartest AI models on this new exam, they found two surprising things:
The "Knowledge Floor" Effect:
- Analogy: Imagine trying to build a house of cards. If you have a solid table (good facts) underneath, you can build a tall tower (complex reasoning). But if the table is missing a leg (a missing fact), the whole tower falls, no matter how good your card-building skills are.
- Finding: The AI's ability to reason (Chain-of-Thought) works great only if it knows the basic facts. If it doesn't know the first fact, reasoning doesn't help at all.
The "Error Avalanche":
- Analogy: Think of a game of "Telephone." If the first person whispers the wrong message, the second person repeats the wrong message, and by the time it gets to the fourth person, the message is completely garbled.
- Finding: In multi-step reasoning, errors get worse as you go. If the AI makes a small mistake in step 1, the chance of it failing in step 4 skyrockets. The later steps are much harder because they are carrying the weight of previous mistakes.
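The compounding effect can be made concrete with a little arithmetic. The 90% per-step accuracy below is an assumed number for illustration, not a figure from the paper: if each step independently succeeds 90% of the time, a four-step chain succeeds only 0.9⁴ ≈ 65.6% of the time.

```python
# Illustrative sketch: how per-step errors compound over a chain,
# assuming steps fail independently. The 0.9 accuracy is assumed.

def chain_accuracy(per_step: float, n_steps: int) -> float:
    """Probability that every step in the chain is correct."""
    return per_step ** n_steps

for n in range(1, 5):
    print(f"{n} step(s): {chain_accuracy(0.9, n):.1%}")
# Even with a reliable-looking 90% per step, four chained steps
# drop to roughly 65.6% overall.
```

Real models are worse than this simple picture suggests, because (as the paper's finding notes) later steps also inherit wrong inputs from earlier ones rather than failing independently.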
5. The Results: Training Works!
The researchers took open-source AI models (which were struggling on the exam) and trained them on the "Practice Gym" (OmanicSynth).
- The Outcome: After training, these models got significantly better—not just on the Omanic test, but on other logic and math tests too.
- The Takeaway: This proves that if you teach an AI how to break problems down into steps and check its own facts, it becomes a better thinker overall. It's not just memorizing answers; it's learning how to think.
Summary
Omanic is a new tool that forces AI to show its work, step-by-step. It revealed that AI is great at reasoning if it knows the facts, but it struggles when facts are missing or when errors pile up. By using this new dataset to train AI, we can build models that are less likely to guess and more likely to actually solve complex problems.
Where to find it: The researchers have released all their data and code for free, so anyone can use it to build smarter, more reliable AI.