The Big Problem: The "Lost in the Library" Dilemma
Imagine you have a super-smart librarian (an AI) who has read every book ever written. This librarian is great at writing stories and answering questions. However, there's a catch: the librarian only remembers what was in the books up to the day they stopped reading (the model's training cutoff). If you ask them about something that happened yesterday, they might make up a plausible-sounding story rather than admit they don't know. This is called hallucination.
To fix this, we give the librarian a "cheat sheet" (Retrieval-Augmented Generation, or RAG). When you ask a question, the librarian quickly looks up relevant pages in a massive library and reads them before answering.
The Traditional Problem:
Most libraries today are organized like a giant pile of unsorted papers. To find the right page, the librarian uses a "vibe check" (embedding search). They guess which papers sound like your question.
- The Flaw: If you ask, "Which funds have a specific manager?" and the search space is huge and messy, the librarian might grab 50 random papers, miss the right one, or get confused by the noise. They have to guess how many papers to grab, and if they grab too many, they get overwhelmed; too few, and they miss the answer.
The Solution: Building a "Museum" Instead of a "Pile"
The authors of this paper say, "Let's stop treating data like a pile of papers and start treating it like a Museum."
In a museum, every exhibit (a fund, a manager, a benchmark) is a specific object. Every object is connected to others by clear, labeled ropes (relationships). This is called a Graph.
They tested two ways to build this museum:
- The RDF Museum (The Triplets): A strict, scientific way of labeling everything as "Subject -> Predicate -> Object" (e.g., "AMCAP Fund -> has manager -> John Doe").
- The LPG Museum (The Labeled Property Graph): A more flexible, visual way where objects have names, colors, and tags, and you can walk from one object to another easily.
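The two museum layouts can be sketched with plain Python data structures. This is an illustration of the two data models, not the paper's code; the fund and manager names are the toy examples from above.

```python
# RDF: everything is a (subject, predicate, object) triple.
rdf_triples = [
    ("AMCAP Fund", "has_manager", "John Doe"),
    ("AMCAP Fund", "has_benchmark", "S&P 500"),
    ("John Doe", "title", "Portfolio Manager"),
]

# LPG: nodes carry labels and key/value properties; edges are typed.
lpg_nodes = {
    "fund_1":   {"labels": ["Fund"],    "props": {"name": "AMCAP Fund"}},
    "person_1": {"labels": ["Manager"], "props": {"name": "John Doe",
                                                  "title": "Portfolio Manager"}},
}
lpg_edges = [
    ("fund_1", "MANAGED_BY", "person_1"),
]

# Same facts, two shapes: RDF splits every attribute into its own triple,
# while an LPG node can hold several properties at once.
def rdf_objects(subject, predicate):
    """Look up all objects for a subject/predicate pair."""
    return [o for s, p, o in rdf_triples if s == subject and p == predicate]

print(rdf_objects("AMCAP Fund", "has_manager"))  # -> ['John Doe']
```

The difference in shape is exactly the rigidity-vs-flexibility trade-off described above: RDF's strict triple format is easy to reason about formally, while LPG's labeled, property-rich nodes are easier to walk.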
How They Did It (The Magic Tricks)
The researchers took complex, messy data (JSON files, which are like nested boxes inside boxes) and turned them into these museum maps.
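The "boxes inside boxes" step looks roughly like this: take a nested record and flatten it into explicit edges. The record below is a hypothetical stand-in for the paper's fund JSON, and the relationship names are made up for illustration.

```python
import json

# A hypothetical fund record shaped like the nested JSON the pipeline starts from.
raw = json.loads("""
{
  "fund": "AMCAP Fund",
  "manager": {"name": "John Doe"},
  "benchmark": "S&P 500"
}
""")

# Unpack the nested boxes into graph edges: one labeled rope per
# relationship, instead of one opaque blob per document.
edges = [
    (raw["fund"], "MANAGED_BY", raw["manager"]["name"]),
    (raw["fund"], "BENCHMARKED_AGAINST", raw["benchmark"]),
]

for s, rel, o in edges:
    print(f"{s} -[{rel}]-> {o}")
```

Once every record has been unpacked this way, "which funds share a benchmark?" becomes a walk along ropes rather than a search through text.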
The Translator (Text-to-Cypher):
Imagine you walk up to the museum guard and ask, "Show me all funds managed by Sarah."
- Old Way: The guard guesses which papers to pull.
- New Way (LPG): The guard has a special translator. You speak English, and the translator instantly turns your question into a precise map route (a "Cypher query") that walks directly to the "Sarah" exhibit and follows the ropes to the funds. The paper claims this translator is over 90% accurate.
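To make the translator concrete: here is what the English question might become, and what the resulting walk does. The Cypher pattern and graph schema below are illustrative assumptions (the paper's actual schema isn't shown here), and the fund names are invented.

```python
question = "Show me all funds managed by Sarah."

# What a text-to-Cypher translator might emit for a hypothetical
# (:Fund)-[:MANAGED_BY]->(:Manager) schema:
cypher = """
MATCH (f:Fund)-[:MANAGED_BY]->(m:Manager {name: 'Sarah'})
RETURN f.name
"""

# A toy in-memory graph standing in for the database:
edges = [
    ("Global Growth Fund", "MANAGED_BY", "Sarah"),
    ("Blue Chip Fund",     "MANAGED_BY", "Sarah"),
    ("Income Fund",        "MANAGED_BY", "John Doe"),
]

def funds_managed_by(name):
    """The same walk the Cypher pattern describes: follow MANAGED_BY
    ropes backwards from the manager's exhibit to the funds."""
    return sorted(f for f, rel, m in edges if rel == "MANAGED_BY" and m == name)

print(funds_managed_by("Sarah"))  # -> ['Blue Chip Fund', 'Global Growth Fund']
```

Note what the query never does: it never scores documents by similarity or guesses a count. It names a path, and the graph returns everything on that path.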
The Dynamic Search:
In the old system, you had to tell the librarian up front, "Grab me 5 papers" (the fixed "top-k" setting). If the answer needed 10, you failed. If it needed 2, you got noise.
In the Graph RAG system, you don't need to guess the number. The map knows exactly how many steps to take to find the answer. It's like following a GPS route rather than guessing how many miles you need to drive.
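The GPS-vs-guessing contrast can be shown with a toy experiment. All of the data below is assumed for illustration: fake similarity scores for the pile-of-papers search, and a tiny graph for the walk.

```python
# Pretend similarity scores from an embedding search over a messy pile:
scored_docs = [
    ("doc_noise_1", 0.91),  # sounds relevant, isn't
    ("doc_fund_a",  0.88),
    ("doc_fund_b",  0.85),
    ("doc_fund_c",  0.84),  # the full answer needs all three fund docs
]

k = 2  # the guess you were forced to make up front
top_k = [d for d, _ in sorted(scored_docs, key=lambda x: -x[1])[:k]]
# With k=2 we grab the noisy paper plus one fund, and miss the other two.

# The graph walk instead returns exactly the connected answers,
# however many there happen to be:
edges = [("FundA", "MANAGED_BY", "Sarah"),
         ("FundB", "MANAGED_BY", "Sarah"),
         ("FundC", "MANAGED_BY", "Sarah")]
answers = [f for f, _, m in edges if m == "Sarah"]

print(top_k)    # -> ['doc_noise_1', 'doc_fund_a']
print(answers)  # -> ['FundA', 'FundB', 'FundC']
```

The walk's result size is determined by the data, not by a parameter you had to guess before seeing the question.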
The Results: Who Won the Race?
The team tested three methods on 200 difficult questions about investment funds:
- Agentic RAG (The Old Way): The librarian guessing with a pile of papers.
- RDF Graph (The Scientific Museum): Very accurate, but a bit rigid.
- LPG Graph (The Flexible Museum): The clear winner.
The Scoreboard:
- Agentic RAG: Struggled badly with "Search" questions (finding lists of items). It got confused by the noise.
- RDF Graph: Did very well, beating the old way significantly.
- LPG Graph: Crushed it. It was the most accurate, especially for complex questions like "List all funds with this manager" or "Compare these two funds."
Why Did the "Museum" Win?
Think of the difference between searching a haystack vs. walking a maze.
- The Haystack (Old RAG): You throw a needle in a haystack and hope it sticks to the right one. If the haystack is huge, you might miss the needle entirely.
- The Maze (Graph RAG): The maze has walls and doors. If you want to get to the "Manager" room, there is a specific door. You don't need to guess; you just follow the path.
The paper found that when data is structured (like financial funds with specific managers, types, and benchmarks), the LPG (Labeled Property Graph) approach is like having a perfectly designed map. It allows the AI to "walk" from one fact to another without getting lost in a sea of text.
The Takeaway
This paper proves that for complex, structured data (like finance, healthcare, or legal records), we shouldn't just rely on AI guessing which text is relevant. Instead, we should build structured maps (Graphs) of that data.
By turning messy data into a connected map and teaching the AI to read that map with a special language (Cypher), we get answers that are:
- More accurate (less lying/hallucinating).
- More complete (finding all the pieces, not just a few).
- Faster (no need to guess how many documents to read).
It's the difference between asking a friend to "find me a good restaurant" in a city they've never visited, versus giving them a GPS that knows exactly where every restaurant is and how to get there.