RAGPerf: An End-to-End Benchmarking Framework for Retrieval-Augmented Generation Systems

Imagine you have a brilliant assistant (an AI) who knows a great deal about the world because they have read millions of books. But there's a catch: this assistant doesn't know about your specific company, your private documents, or the news from yesterday. If you ask them a question about your business, they might guess or make things up because they lack that specific context.

To fix this, we use something called RAG (Retrieval-Augmented Generation). Think of RAG as giving your smart assistant a giant, magical library right next to their desk. When you ask a question, the assistant first runs to the library, finds the exact pages they need, reads them, and then answers you.

The problem is, building this "library system" is incredibly complex. It involves:

Reading the books (Embedding).
Shelving them so they can be found quickly (Indexing).
Running to the right shelf (Retrieval).
Double-checking if the book is actually relevant (Reranking).
Writing the final answer (Generation).

If the system is slow, is it because the librarian is slow? Is the library too big? Is the assistant taking too long to write? Until now, developers had no good way to measure exactly where the bottleneck was.

Enter RAGPerf: The "Car Mechanic's Diagnostic Tool" for AI

The paper introduces RAGPerf, a benchmarking framework for RAG systems. Think of it like a high-tech diagnostic computer a mechanic plugs into your car. Instead of just guessing why the car is sputtering, it tells you exactly how much fuel the engine is using, how fast the wheels are spinning, and if the brakes are dragging.

Here is how RAGPerf works, broken down into simple concepts:

1. The "Traffic Simulator" (Workload Generator)

Real life isn't static. Sometimes everyone rushes to the library at once (high traffic); sometimes people are just browsing (low traffic). Sometimes the library gets new books every minute (updates); sometimes old books are thrown away (deletions).

RAGPerf has a simulator that acts like a traffic controller. It can create realistic scenarios:

"Let's pretend 1,000 people are asking questions at once."
"Let's pretend we are adding 50 new documents every second."
"Let's pretend everyone is asking about the same popular topic (like a viral news story)."

This helps developers see how their system handles stress, just like a crash test dummy helps engineers see how a car handles a collision.

2. The "Modular Lego Set" (Configurable Pipeline)

Most AI systems are built like a black box: you put data in, and an answer comes out. You can't see the gears inside.

RAGPerf treats the system like a Lego set. It breaks the process down into separate, interchangeable blocks:

The Embedder: The translator that turns words into numbers.
The Vector Database: The giant filing cabinet.
The Reranker: The editor who checks if the found documents are actually good.
The Generator: The writer who makes the final answer.

Because it's modular, you can swap out one Lego block for another. "What if we use a faster filing cabinet?" "What if we use a smarter translator?" RAGPerf lets you swap these parts, rerun the same test, and see how the change affects the speed and quality of the answer.

3. The "Stopwatch and Fuel Gauge" (Metrics)

RAGPerf doesn't just tell you if the answer is good; it tells you how expensive it is to get there. It measures two things:

Quality: Did the assistant get the facts right? (Did it find the right page in the library?)
Performance: How long did it take? How hard did the processors (CPU and GPU) have to work? How much memory did it eat?

It's like a fuel gauge that tells you, "Hey, your car is getting 30 miles per gallon, but if you switch to this other tire, you'll get 40."

Why is this a big deal?

Before RAGPerf, developers were flying blind. They knew their AI was slow, but they didn't know why.

The "Aha!" Moment: The paper's experiments showed some surprising things. For example, in text-based systems, the writing part (Generation) is usually the slowest part, not the searching part. But in systems that work with visual documents (like PDFs), looking up and double-checking the found pages (Reranking) can become a huge bottleneck.
The "Update" Problem: The tool showed that constantly adding new books to the library slows things down. Using a "temporary shelf" for new books makes them findable right away, but the fuller that shelf gets, the slower every search becomes.

The Bottom Line

RAGPerf is a tool that helps developers build better, faster, and cheaper AI assistants. It takes the mystery out of the "black box" by letting them tweak the settings, simulate real-world chaos, and see exactly where the traffic jams are.

Instead of guessing, they can now say: "Okay, we need more memory for the filing cabinet, or we need to switch to a faster writer," and they can prove it with data. It's the ultimate toolkit for making sure your AI doesn't just sound smart, but actually works smart.

Technical Summary: "RAGPerf: An End-to-End Benchmarking Framework for Retrieval-Augmented Generation Systems"

1. Problem Statement

Retrieval-Augmented Generation (RAG) has become critical for integrating external knowledge into Large Language Models (LLMs), yet optimizing RAG pipelines for production is challenging due to:

System Complexity: RAG pipelines involve multiple interacting components (embedding, indexing, retrieval, reranking, generation), making it difficult to isolate bottlenecks.
Lack of Holistic Benchmarks: Existing benchmarks (e.g., BEIR, RAGBench) focus primarily on semantic metrics (accuracy, hallucination) or individual components (vector DBs), failing to capture end-to-end system performance, resource contention, and runtime behaviors in integrated scenarios.
Static Evaluation: Most benchmarks use static datasets, ignoring the dynamic nature of real-world knowledge bases where data is continuously inserted, updated, or deleted, which significantly impacts indexing overhead and query latency.
Configuration Space: The vast number of configuration options (embedding dimensions, indexing algorithms, batch sizes, hardware resources) makes it hard for developers to make data-driven decisions without a reproducible framework.

2. Methodology: RAGPerf Framework

RAGPerf is a modular, extensible, end-to-end benchmarking framework designed to characterize RAG system behaviors. Its architecture consists of three main pillars:

A. Modular Pipeline Decomposition

The framework decouples the RAG workflow into independent, configurable components with standardized interfaces:

Embedding: Supports various chunking strategies (fixed-length, separator-based, semantic) and models (text-only models, plus multimodal models such as ColPali and CLIP).
Indexing & Vector Database: Supports major backends (LanceDB, Milvus, Qdrant, Chroma, Elasticsearch) and indexing methods (HNSW, IVF, DiskANN, quantization).
Retrieval & Reranking: Allows configuration of retrieval depth and reranking models (bi-encoders, cross-encoders, LLM-based rerankers).
Generation: Integrates with vLLM for high-throughput LLM inference, supporting various model sizes and parallelism strategies.
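
The decomposition above can be illustrated with a minimal Python sketch. It is not RAGPerf's actual interface: every class and method name below (Embedder, VectorStore, Reranker, Generator, RagPipeline, and the toy implementations) is hypothetical, and the toy components stand in for real models and databases. The point is only that once each stage sits behind a small, fixed interface, one stage can be replaced without touching the others.

```python
from dataclasses import dataclass
from typing import Protocol


# Hypothetical stage interfaces (illustrative, not RAGPerf's real API).
class Embedder(Protocol):
    def embed(self, text: str) -> list[float]: ...


class VectorStore(Protocol):
    def add(self, doc_id: str, vector: list[float], text: str) -> None: ...
    def search(self, vector: list[float], k: int) -> list[str]: ...


class Reranker(Protocol):
    def rerank(self, query: str, passages: list[str]) -> list[str]: ...


class Generator(Protocol):
    def generate(self, query: str, context: list[str]) -> str: ...


class VowelCountEmbedder:
    """Toy embedder: a 5-dim vector of vowel counts (stand-in for a model)."""

    def embed(self, text: str) -> list[float]:
        lowered = text.lower()
        return [float(lowered.count(vowel)) for vowel in "aeiou"]


class BruteForceStore:
    """Toy vector store: exhaustive dot-product search over all rows."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, list[float], str]] = []

    def add(self, doc_id: str, vector: list[float], text: str) -> None:
        self._rows.append((doc_id, vector, text))

    def search(self, vector: list[float], k: int) -> list[str]:
        def score(row):
            return sum(a * b for a, b in zip(vector, row[1]))

        ranked = sorted(self._rows, key=score, reverse=True)
        return [text for _, _, text in ranked[:k]]


class WordOverlapReranker:
    """Toy reranker: orders passages by the words they share with the query."""

    def rerank(self, query: str, passages: list[str]) -> list[str]:
        query_words = set(query.lower().split())

        def overlap(passage):
            return len(query_words & set(passage.lower().split()))

        return sorted(passages, key=overlap, reverse=True)


class TemplateGenerator:
    """Toy generator: quotes the top passage instead of calling an LLM."""

    def generate(self, query: str, context: list[str]) -> str:
        top = context[0] if context else "no context found"
        return f"{query} -> {top}"


@dataclass
class RagPipeline:
    embedder: Embedder
    store: VectorStore
    reranker: Reranker
    generator: Generator

    def ingest(self, doc_id: str, text: str) -> None:
        self.store.add(doc_id, self.embedder.embed(text), text)

    def answer(self, query: str, k: int = 3) -> str:
        candidates = self.store.search(self.embedder.embed(query), k)
        ranked = self.reranker.rerank(query, candidates)
        return self.generator.generate(query, ranked)


pipeline = RagPipeline(
    VowelCountEmbedder(), BruteForceStore(), WordOverlapReranker(), TemplateGenerator()
)
pipeline.ingest("d1", "Milvus is a vector database")
pipeline.ingest("d2", "vLLM serves large language models")
print(pipeline.answer("which vector database", k=2))
```

Because RagPipeline depends only on the four protocols, passing a different object for any one field (for example, another reranker) leaves the rest of the pipeline unchanged, which is the property a component-by-component comparison relies on.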

B. Dynamic Workload Generator

To simulate real-world scenarios, RAGPerf includes a workload generator that supports:

Operations: Concurrent Queries, Inserts, Updates, and Removals.
Update Logic: It uses an LLM to generate synthetic ground truth for updates. The generator modifies specific facts in documents and produces matching questions, which test whether the system retrieves the updated information rather than stale data.
Access Patterns: Supports Uniform and Zipfian distributions to model "hotspot" access behaviors common in production.
Multi-modal Support: Handles text, PDFs (via OCR or visual embedding), and audio (via ASR).
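
As a rough illustration of the operation mix and access patterns described above, the sketch below draws a stream of (operation, document) pairs from the Python standard library. The function names, the `op_mix` format, and the Zipf exponent `s` are assumptions made for this example, not RAGPerf's generator. It also simplifies by giving every operation, including inserts, a target drawn from a fixed ID range.

```python
import random
from collections import Counter

OPERATIONS = ("query", "insert", "update", "remove")


def zipf_weights(num_items: int, s: float = 1.0) -> list[float]:
    """Unnormalized Zipfian weights: the item at rank r gets weight 1 / r**s."""
    return [1.0 / rank**s for rank in range(1, num_items + 1)]


def generate_workload(num_ops, num_docs, op_mix, pattern="uniform", s=1.0, seed=0):
    """Return a list of (operation, doc_id) pairs.

    op_mix maps operation names to relative frequencies (hypothetical format).
    pattern selects how target documents are chosen: "uniform" or "zipfian".
    """
    unknown = set(op_mix) - set(OPERATIONS)
    if unknown:
        raise ValueError(f"unknown operations: {sorted(unknown)}")
    if pattern == "zipfian":
        doc_weights = zipf_weights(num_docs, s)  # low IDs become hotspots
    elif pattern == "uniform":
        doc_weights = [1.0] * num_docs
    else:
        raise ValueError(f"unknown access pattern: {pattern!r}")

    rng = random.Random(seed)  # seeded so a run can be reproduced
    op_names = list(op_mix)
    ops = rng.choices(op_names, weights=[op_mix[n] for n in op_names], k=num_ops)
    docs = rng.choices(range(num_docs), weights=doc_weights, k=num_ops)
    return list(zip(ops, docs))


mix = {"query": 0.8, "insert": 0.1, "update": 0.05, "remove": 0.05}
workload = generate_workload(10_000, 1_000, mix, pattern="zipfian")
print(Counter(op for op, _ in workload))
print(Counter(doc for _, doc in workload).most_common(3))
```

Under the Zipfian pattern a handful of low-ranked document IDs receive most of the traffic, while the uniform pattern spreads requests evenly. That contrast is what "hotspot" access refers to.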

C. Profiling and Metrics

RAGPerf collects two complementary sets of metrics with negligible overhead (<0.26% CPU usage):

Performance Metrics: End-to-end latency, throughput (QPS), GPU/CPU utilization, memory footprint (Host/GPU), I/O bandwidth, and PCIe throughput.
Quality Metrics: Context recall, factual consistency, and query accuracy (evaluated using the Ragas framework).
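
To make the latency side of these metrics concrete, here is a small sketch of per-stage timing. It covers only wall-clock latency and a derived throughput figure. Hardware counters such as GPU utilization or PCIe throughput need platform-specific tooling and are out of scope here. The `StageProfiler` class and its methods are hypothetical names for illustration, and the timestamps in the demo are made up.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageProfiler:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the profiler can be tested
        self.seconds = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raises.
            self.seconds[name] += self._clock() - start

    def total(self):
        return sum(self.seconds.values())

    def shares(self):
        """Fraction of total measured time spent in each stage."""
        total = self.total()
        if not total:
            return {}
        return {name: spent / total for name, spent in self.seconds.items()}

    def sequential_qps(self, num_requests):
        """Throughput if requests ran strictly one after another."""
        total = self.total()
        return num_requests / total if total else 0.0


# Demo with a scripted clock (made-up timestamps, in seconds) so the
# output does not depend on the machine running it.
ticks = iter([0.0, 0.02, 0.02, 0.20])
profiler = StageProfiler(clock=lambda: next(ticks))
with profiler.stage("retrieval"):
    pass  # vector search would run here
with profiler.stage("generation"):
    pass  # LLM inference would run here
print(profiler.shares())
print(profiler.sequential_qps(num_requests=1))
```

In real use the default `time.perf_counter` clock would be kept, and each `with` block would wrap the actual stage call. The per-stage shares are the kind of breakdown needed to say which stage dominates end-to-end latency.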

3. Key Contributions

First End-to-End RAG Benchmark: Unlike prior work that focuses on semantic quality or single components, RAGPerf evaluates the entire pipeline, capturing interactions and resource contention between components.
Dynamic Update Simulation: Introduces a novel mechanism to benchmark the impact of continuous data updates (inserts/updates/deletes) on retrieval freshness and latency, a gap in existing static benchmarks.
Granular Resource Profiling: Provides fine-grained visibility into hardware utilization (e.g., distinguishing between compute-bound and memory-bound stages) to identify specific bottlenecks.
Extensibility & Reproducibility: Offers a modular Python-based architecture with YAML configurations, supporting diverse datasets, models, and vector databases out of the box.
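
A configuration-driven design makes it straightforward to sweep over many pipeline variants. The sketch below uses a plain Python dictionary as a stand-in for what a configuration file might express. The keys and the `expand_grid` helper are hypothetical and do not reflect RAGPerf's actual schema; the values are options named in this summary.

```python
from itertools import product

# Hypothetical search space; the keys are illustrative, not RAGPerf's schema.
SEARCH_SPACE = {
    "vector_db": ["lancedb", "milvus", "qdrant"],
    "index": ["hnsw", "ivf_pq"],
    "retrieval_top_k": [5, 20],
    "generation_batch_size": [8, 32, 128],
}


def expand_grid(space):
    """Return one config dict per combination of option values."""
    keys = list(space)
    return [
        dict(zip(keys, values))
        for values in product(*(space[key] for key in keys))
    ]


configs = expand_grid(SEARCH_SPACE)
print(len(configs))  # 3 * 2 * 2 * 3 = 36 combinations
print(configs[0])
```

Even this small space yields 36 runs, and each further option multiplies the count, which is why a reproducible, scriptable harness matters more than manual tuning.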

4. Key Results & Findings

The authors evaluated RAGPerf across text, PDF, and audio pipelines using various hardware configurations (AMD EPYC CPUs, NVIDIA H100 GPUs). Key findings include:

Bottleneck Identification:
- Text Pipelines: The LLM generation stage is the dominant bottleneck (75–91% of latency), making the choice of vector database less critical for end-to-end latency.
- Multimodal Pipelines: Format conversion (OCR for PDFs, ASR for audio) and indexing are major latency contributors. For PDFs, reranking can account for up to 87% of iteration time due to high lookup costs.
- Resource Contention: CPU utilization spikes during retrieval and index building, while GPU memory is the primary constraint for generation throughput.
Update Operations:
- Using a temporary flat index allows immediate searchability of new updates but increases query latency as the flat index grows.
- Zipfian update distributions (skewed updates) result in lower latency overhead compared to uniform distributions because fewer unique entries accumulate in the flat index.
Resource Configuration Impact:
- CPU Cores: Core count has minimal impact on throughput because retrieval and indexing are not compute-intensive.
- Host Memory: Insufficient host memory forces disk-based indexing, degrading throughput by up to 85% (e.g., with only 32 GB of RAM, Milvus drops to 15.3% of its original throughput).
- GPU Memory: The primary hardware bottleneck. Limiting GPU memory restricts batch sizes and prevents loading larger LLMs, drastically reducing throughput.
Accuracy vs. Efficiency Trade-offs:
- Context Recall $\neq$ Accuracy: High retrieval recall does not guarantee high answer accuracy if the generation model lacks the capacity to utilize the retrieved context (e.g., small vision-language models).
- Batch Size: Increasing batch size improves throughput up to a point, but excessive batching consumes GPU memory needed for KV caches, forcing sequential decoding and reducing performance.
- Indexing: Quantized indices (IVF-PQ) offer the best balance of throughput, build time, and memory efficiency compared to HNSW or GPU-accelerated indices.
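
The Update Operations finding above, that skewed updates leave fewer unique entries in the temporary flat index, can be checked with a small simulation. This is a sketch under stated assumptions, not the paper's experiment: it assumes a repeated update to the same document overwrites its buffered entry, it uses a Zipf exponent of 1.0, and the function name `flat_buffer_size` is hypothetical. A flat index is searched by scanning every entry, so the number of buffered entries is a proxy for the extra per-query cost.

```python
import random


def flat_buffer_size(num_updates, num_docs, pattern, s=1.0, seed=0):
    """Count distinct documents sitting in the flat buffer after the updates.

    Assumption: a later update to the same document replaces its earlier
    buffered entry, so the buffer holds one entry per distinct document.
    """
    if pattern == "zipfian":
        weights = [1.0 / rank**s for rank in range(1, num_docs + 1)]
    elif pattern == "uniform":
        weights = [1.0] * num_docs
    else:
        raise ValueError(f"unknown update pattern: {pattern!r}")

    rng = random.Random(seed)  # seeded for reproducibility
    updated = rng.choices(range(num_docs), weights=weights, k=num_updates)
    return len(set(updated))


uniform_entries = flat_buffer_size(5_000, 10_000, "uniform")
zipfian_entries = flat_buffer_size(5_000, 10_000, "zipfian")
print("uniform:", uniform_entries)
print("zipfian:", zipfian_entries)
```

With the same number of updates, the skewed stream keeps hitting the same popular documents, so the buffer a query has to scan stays smaller than under uniform updates.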

5. Significance

RAGPerf addresses a critical gap in the AI infrastructure landscape by providing a system-level perspective on RAG optimization. Its significance lies in:

Guiding Deployment: It enables developers to make informed trade-offs between system efficiency (latency, cost) and output quality (accuracy, freshness).
Hardware Planning: It highlights that host memory and GPU memory are often more critical constraints than CPU cores for RAG performance.
Future-Proofing: By supporting dynamic workloads and multi-modal data, it prepares the community for the evolving complexity of real-world RAG applications.
Open Source: The framework is open-sourced, fostering reproducibility and standardization in RAG system evaluation.

The paper concludes that RAGPerf incurs negligible overhead while providing the necessary granularity to optimize RAG pipelines for diverse, production-grade scenarios.