KohakuRAG: A simple RAG framework with hierarchical document indexing

Imagine you are a brilliant but slightly forgetful librarian named LLM (Large Language Model). You know a lot of facts, but you often make things up (hallucinate) or forget details from your training data. To fix this, you are given a massive library of 32 thick technical manuals about AI energy consumption. Your job is to answer 300 specific questions about them, citing exactly which page you found the answer on.

The catch? The questions are tricky. They might use different words than the books (e.g., asking for "PUE" when the book says "Power Usage Effectiveness"), and if you can't find the answer, you must admit you don't know rather than guessing.

The Kohaku-Lab team built a system called KohakuRAG to help you do exactly this job. It took first place in the WattBot 2025 Challenge by solving three main problems with some clever tricks. Here is how they did it, explained with simple analogies:

1. The Problem: The "Flat Pile" vs. The "Tree House"

The Old Way: Most systems take a book, chop it into flat, fixed-size piles of paper (chunks), and throw them in a box. If you ask a question, the system grabs whichever loose pages look most similar to it.

The Issue: You lose the story. You might grab a sentence about "solar panels" without the paragraph explaining why they are efficient. Also, if you need to cite the source, you might point to a random page number that doesn't make sense.

The KohakuRAG Solution: The "Tree House" Index
Instead of a flat pile, they organized the library like a Tree House.

The Structure: The whole book is the trunk. Chapters are branches. Paragraphs are rooms. Sentences are the furniture.
The Magic: They built "elevators" (embeddings) that go from the bottom (furniture) up to the top (trunk). If you ask about a specific chair (sentence), the system automatically knows which room (paragraph) and which floor (chapter) it belongs to.
Why it helps: When you find the answer, you know exactly where it lives in the building. You can point to the specific room, not just a random floor.

2. The Problem: The "Lost in Translation" Search

The Old Way: You ask the librarian, "How much energy does a Google data center use?" The librarian looks for that exact phrase. If the book says "Power Usage Effectiveness of Google's cloud facilities," the librarian says, "I can't find it!" because the words don't match.

The KohakuRAG Solution: The "Detective Squad"
Instead of sending one librarian to search, they send a Squad of Detectives.

The Planner: Before searching, a smart AI (the Planner) takes your question and sends out 4 different detectives.
- Detective A asks: "Google data center energy."
- Detective B asks: "Google PUE metrics."
- Detective C asks: "How efficient is Google's cloud?"
- Detective D asks: "Google sustainability report."
The Reranking: All detectives bring back piles of papers. The system then looks at the piles. If three detectives found the same page, that page is probably the right one! It's like a popularity vote. This ensures the system finds the answer even if you used the wrong words.

3. The Problem: The "Wobbly Answer"

The Old Way: You ask the librarian a question. Sometimes they give you the right answer. Sometimes, because they are a bit nervous or the lighting is bad, they give a slightly different answer, or they say "I don't know" even when the answer is right there.

The KohakuRAG Solution: The "Panel of Judges"
Instead of asking one librarian, they ask 9 different librarians (or the same librarian 9 times with a slight twist).

The Vote: They write down their answers.
The "Blank" Filter: Sometimes a librarian gets scared and writes "I don't know" (abstention) even if they saw the answer. The system is smart enough to say, "Hey, 8 other people found the answer, so we'll ignore that one scared librarian."
The Majority Rule: The system takes the answer that most people agreed on. This makes the final answer very stable and reliable.

The Secret Sauce: "Put the Question at the End"

The researchers discovered something funny about how AI reads. If you ask the question first and then hand over a long stack of documents, the AI loses track of the question by the time it finishes reading (like reading a long email and forgetting the first sentence).

The Fix: They put the Documents first, and the Question last. It's like reading the menu before ordering, rather than ordering and then reading the menu. This simple change gave their score an 80% relative improvement.
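
As a rough illustration of "documents first, question last", here is a minimal Python sketch of a prompt builder. The function name `build_prompt`, the `Context:` / `Question:` labels, and the numbered-passage layout are assumptions made for this example, not the prompt template used by KohakuRAG.

```python
from typing import List


def build_prompt(question: str, passages: List[str]) -> str:
    """Assemble a prompt with the retrieved passages first and the question last."""
    # Number each passage so the model can refer back to it when citing.
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    # The question comes after all of the context (hypothetical layout).
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

Calling `build_prompt("What is PUE?", ["passage one", "passage two"])` yields a string in which every passage appears before the question.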

The Result

By building a Tree House index, sending a Detective Squad to search, and using a Panel of Judges to vote, the KohakuRAG team became the only one to hold 1st Place on both the public and private leaderboards.

They proved that you don't need to be the biggest, most expensive AI to win; you just need a smart way to organize your library, ask the right questions, and double-check your work.

Here is a detailed technical summary of the paper "KohakuRAG: A simple RAG framework with hierarchical document indexing" by Kohaku-Lab.

1. Problem Statement

The paper addresses the limitations of standard Retrieval-Augmented Generation (RAG) systems when applied to high-precision, document-grounded question answering tasks, specifically within the context of the WattBot 2025 Challenge. This challenge requires systems to answer ~300 technical questions about AI energy consumption based on 32 reference documents (~500K tokens) with strict constraints:

Precision: Numeric answers must be within ±0.1% tolerance.
Citation: Exact source attribution is required.
Abstention: The system must explicitly abstain if evidence is insufficient rather than hallucinating.
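
To make the precision constraint concrete, the following Python sketch checks a numeric answer against a reference value. It assumes the ±0.1% tolerance is relative to the reference value; the function name `within_tolerance` and the handling of a zero reference are illustrative choices, not the challenge's official scoring code.

```python
def within_tolerance(predicted: float, expected: float, rel_tol: float = 0.001) -> bool:
    """Return True if `predicted` lies within rel_tol (0.1% by default) of `expected`."""
    if expected == 0:
        # A relative tolerance is undefined at zero, so require an exact match (assumption).
        return predicted == 0
    return abs(predicted - expected) <= rel_tol * abs(expected)
```

Under this reading, a reference PUE of 1.1 accepts 1.101 (off by about 0.09%) but rejects 1.11 (off by about 0.9%), which shows how little rounding slack the task leaves.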

Standard RAG approaches fail in this setting due to three main issues:

Flat Chunking: Partitioning documents into fixed-length segments destroys structural boundaries (sections, paragraphs), making precise citation tracking difficult.
Vocabulary Mismatch: Single-query formulations often miss relevant passages when user terminology differs from source document vocabulary (e.g., "PUE" vs. "power usage effectiveness").
Stochastic Instability: Single-pass LLM inference produces varying answers and citations across runs, leading to unreliable outputs and unnecessary abstentions when evidence exists but is hard to locate.

2. Methodology: KohakuRAG Framework

KohakuRAG introduces a three-stage pipeline designed to preserve document structure, maximize retrieval coverage, and stabilize inference.

A. Hierarchical Document Indexing

Instead of flat chunking, the framework parses documents into a four-level tree structure:

Structure: Document $\rightarrow$ Section $\rightarrow$ Paragraph $\rightarrow$ Sentence.
Bottom-Up Embedding Aggregation:
- Leaf nodes (sentences) are embedded directly using a text encoder.
- Internal nodes (paragraphs, sections) compute embeddings as a length-weighted average of their children's embeddings.
- Formula: $e_v = \frac{\sum_{c \in C(v)} |t_c| \cdot e_c}{\sum_{c \in C(v)} |t_c|}$
Benefit: This preserves semantic compositionality and allows for natural citation boundaries at any granularity level. Visual elements (figures/tables) are treated as special nodes with captions generated by Vision-Language Models (VLMs).
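
The bottom-up aggregation above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `Node` class, the `embed_tree` and `length` helpers, and the use of whitespace-token counts for $|t_c|$ are hypothetical choices for this example, and `encode` stands in for whatever text encoder is used.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Vector = List[float]


@dataclass
class Node:
    """One node of the Document -> Section -> Paragraph -> Sentence tree."""
    text: str = ""  # only leaf nodes (sentences) carry text in this sketch
    children: List["Node"] = field(default_factory=list)
    embedding: Optional[Vector] = None


def length(node: Node) -> int:
    """|t_v|: whitespace-token count of all text under a node (assumed length measure)."""
    if not node.children:
        return len(node.text.split())
    return sum(length(child) for child in node.children)


def embed_tree(node: Node, encode: Callable[[str], Vector]) -> Vector:
    """Fill in embeddings bottom-up and return the embedding of `node`."""
    if not node.children:
        # Leaf: embed the sentence text directly.
        node.embedding = encode(node.text)
        return node.embedding
    # Internal node: length-weighted average of the children's embeddings.
    child_vecs = [embed_tree(child, encode) for child in node.children]
    weights = [length(child) for child in node.children]
    total = sum(weights)  # assumes at least one non-empty sentence below this node
    node.embedding = [
        sum(w * vec[i] for w, vec in zip(weights, child_vecs)) / total
        for i in range(len(child_vecs[0]))
    ]
    return node.embedding
```

For example, a paragraph whose two sentences have 3 and 1 tokens and embeddings `[1.0, 0.0]` and `[0.0, 1.0]` receives the embedding `[0.75, 0.25]`: the longer sentence contributes three quarters of the weight.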

B. Multi-Query Retrieval with Cross-Query Reranking

To bridge the vocabulary gap, the system employs an LLM-powered Query Planner:

Query Expansion: Given a user question, the planner generates $n$ semantically related queries (rephrasing, expanding abbreviations, decomposing compound questions).
Dense Retrieval: Each query retrieves the top-$k$ nodes via cosine similarity.
Cross-Query Reranking: Results are aggregated and reranked based on:
- Frequency: How many distinct queries retrieved the node.
- Score: Cumulative similarity scores.
- Strategy: A "Combined" strategy (normalized frequency + score) is used to prioritize nodes supported by multiple query formulations.
Context Expansion: Retrieved nodes are expanded with their parent nodes (providing broader context) and sibling nodes (local context) before being fed to the LLM.
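
A minimal Python sketch of the cross-query reranking step follows. The summary above only says that the "Combined" strategy uses normalized frequency plus score, so the specific normalization here (dividing each signal by its maximum and adding the two) is an assumption, as are the function name `rerank_combined` and the input format of `(node_id, similarity)` pairs per query.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rerank_combined(
    results_per_query: List[List[Tuple[str, float]]], top_n: int = 5
) -> List[str]:
    """Rank node ids by normalized frequency plus normalized cumulative similarity."""
    freq: Dict[str, int] = defaultdict(int)
    score: Dict[str, float] = defaultdict(float)
    for hits in results_per_query:
        seen = set()
        for node_id, similarity in hits:
            score[node_id] += similarity  # cumulative similarity across queries
            if node_id not in seen:  # count each query at most once per node
                freq[node_id] += 1
                seen.add(node_id)
    if not score:
        return []
    max_freq = max(freq.values())
    max_score = max(score.values())

    def combined(node_id: str) -> float:
        # Assumed normalization: scale each signal by its maximum, then add.
        norm_score = score[node_id] / max_score if max_score > 0 else 0.0
        return freq[node_id] / max_freq + norm_score

    # Highest combined value first; ties are broken by node id for determinism.
    return sorted(score, key=lambda n: (-combined(n), n))[:top_n]
```

With three queries returning `[("p1", 0.9), ("p2", 0.8)]`, `[("p2", 0.7), ("p3", 0.95)]`, and `[("p2", 0.6)]`, node `p2` ranks first because all three queries retrieved it, even though no single query scored it highest.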

C. Ensemble Inference with Abstention-Aware Voting

To mitigate stochasticity and handle uncertainty:

Multi-Run Aggregation: The system performs $m$ independent inference runs with temperature $> 0$.
Retry Mechanism: If a run outputs an abstention (blank) but retries are available, the system increases the retrieval depth ($k$) and re-runs the query to find missing evidence.
Voting Strategy:
- Blank Filtering: Crucially, if any non-blank answer exists, blank responses are filtered out before voting. This prevents conservative runs from dominating when evidence is present but difficult to locate.
- Majority Voting: The final answer and citations are determined by majority vote among the non-blank runs.
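
The retry and voting logic described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: abstention is represented as `None`, the retry doubles $k$ (the source only says the depth is increased), ties are broken by whichever answer was seen first, and `ask` is a hypothetical callable that runs retrieval plus generation at a given depth.

```python
from collections import Counter
from typing import Callable, List, Optional


def run_with_retry(
    ask: Callable[[int], Optional[str]], k: int, max_retries: int = 2
) -> Optional[str]:
    """Run one inference pass; on abstention (None), retry with a larger depth k."""
    answer = ask(k)
    for _ in range(max_retries):
        if answer is not None:
            break
        k *= 2  # assumption: double the retrieval depth on each retry
        answer = ask(k)
    return answer


def vote(answers: List[Optional[str]]) -> Optional[str]:
    """Abstention-aware majority vote over the answers from m runs."""
    # Blank filtering: drop abstentions whenever any non-blank answer exists.
    non_blank = [a for a in answers if a is not None]
    if not non_blank:
        return None  # every run abstained, so the ensemble abstains too
    # Majority vote among the remaining answers (first-seen answer wins ties).
    return Counter(non_blank).most_common(1)[0][0]
```

For instance, `vote([None] * 5 + ["1.2"] * 3 + ["1.3"])` returns `"1.2"`: the five abstentions are discarded before counting, so they cannot outvote the runs that found evidence. To vote on answers and citations together, each run's output could be passed as a hashable tuple instead of a string.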

3. Key Contributions

Hierarchical Indexing Scheme: A novel tree-based representation with bottom-up embedding propagation that enables precise provenance tracking and structural preservation, outperforming flat chunking.
Consensus-Driven Retrieval: An LLM-powered query planner combined with cross-query reranking that leverages consensus signals to improve retrieval coverage and handle vocabulary mismatches.
Abstention-Aware Ensemble: An inference mechanism that aggregates multiple runs while explicitly filtering out unnecessary abstentions, addressing the dominant error mode (unnecessary abstention).
Empirical Insights: Demonstrated that prompt ordering (placing context before the question) and retry mechanisms contribute more significantly to performance than hybrid sparse-dense retrieval strategies in this specific domain.

4. Experimental Results

The framework was evaluated on the WattBot 2025 Challenge (32 documents, ~500K tokens).

Leaderboard Performance: KohakuRAG achieved 1st place on both the Public and Private leaderboards with a final score of 0.861. It was the only team to maintain the top position across both partitions.
Ablation Studies:
- Prompt Ordering: Reordering context before the question yielded a +80% relative improvement.
- Retry Mechanism: Provided a +69% relative improvement at low retrieval depths by recovering from false abstentions.
- Ensemble Voting: Filtering blanks in an ensemble of 9 runs added +1.2 percentage points.
- Retrieval Strategy: Hierarchical dense retrieval alone was highly competitive; adding BM25 (sparse retrieval) only provided a marginal +3.1 percentage points, suggesting that rich structural retrieval reduces the need for keyword matching.
Error Analysis: The dominant failure modes were unnecessary abstention (26.8%), reference mismatch (23.6%), and value selection errors (22.2%). The proposed retry and ensemble mechanisms directly addressed the first two.
Model Comparison: While Grok-4.1-fast showed the highest single-run performance, the ensemble approach using GPT-oss-120B and Gemini-3-pro provided the most robust and consistent results across evaluation partitions.
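
The ablation figures above mix two units, relative improvement and percentage points, so it may help to state how they differ. For a score $s \in [0, 1]$ that moves from $s_{\text{old}}$ to $s_{\text{new}}$, the usual definitions are:

$$\text{relative improvement} = \frac{s_{\text{new}} - s_{\text{old}}}{s_{\text{old}}} \times 100\%, \qquad \text{percentage-point gain} = (s_{\text{new}} - s_{\text{old}}) \times 100$$

As a purely hypothetical illustration (these numbers are not from the paper), a score rising from 0.40 to 0.72 is a +80% relative improvement but a gain of 32 percentage points. A relative figure can therefore look large when the baseline is low, which is worth keeping in mind when comparing the +80% and +69% entries with the +1.2 and +3.1 point entries.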

5. Significance

KohakuRAG demonstrates that for high-precision, citation-heavy RAG tasks, structural preservation and inference stability are more critical than simply increasing model size or adding hybrid retrieval.

Robustness: The ensemble-based approach proved superior in generalizing to unseen data (private test set), avoiding the "overfitting" to public leaderboard characteristics seen in other top teams.
Efficiency vs. Accuracy: The paper challenges the assumption that hybrid (dense + sparse) retrieval is always necessary, showing that a well-structured dense retrieval system can suffice.
Practical Impact: The open-source release of KohakuRAG provides a reproducible framework for building reliable, citation-aware RAG systems, particularly for technical domains requiring strict adherence to source material.

The work highlights that addressing the "lost in the middle" phenomenon via prompt engineering and mitigating stochasticity via ensemble voting are low-cost, high-impact strategies for advancing RAG systems.