QCFuse: Query-Centric Cache Fusion for Efficient RAG Inference

Imagine you are a brilliant librarian (the AI) who has been hired to answer questions based on a massive, dusty library of books (the Knowledge Base).

The Problem: The "Re-Reading" Bottleneck

Every time a customer asks a question, you have to run to the shelves, find the relevant books, and read the specific pages out loud to answer them.

The Old Way (Full Computation): Even if 100 people ask about the same book, you read the entire book from page 1 to the end every single time. It's incredibly slow and wastes your energy.
The "Smart" Way (Standard Caching): You realize that if the first few pages of the book are the same, you can just remember them. But if the customer asks about a chapter in the middle of the book, you can't use your memory of the beginning, so you have to start reading from page 1 again. This is a huge waste because, in a real library, more than 70% of the material people ask about overlaps from one question to the next!

The Current "Fix": Guessing What to Skip

Some smart librarians tried to fix this by saying, "Let's just skip the first 10% of the book and re-read the rest," or "Let's skip the pages that look different on the cover."

The Flaw: They are guessing based on the book's structure (local clues), not the customer's specific question (global awareness). They might skip the one paragraph that actually answers the question, leading to a bad answer, or they might re-read pages the customer didn't care about, wasting time.

The Solution: QCFuse (The "Query-Centric" Librarian)

QCFuse is a new system that changes how the librarian works. Instead of guessing, it asks: "What does the customer actually care about?"

Here is how it works, using a simple analogy:

1. The "CliffsNotes" (Semantic Summary Anchors)

Before the customer even arrives, the librarian creates a tiny, 3-sentence "CliffsNotes" summary for every single book in the library.

How QCFuse does it: It takes a few key "anchor" words from the context that act like a compressed summary.
The Magic: When the customer asks a question, the librarian reads the question along with these tiny summaries. This gives the librarian a "gut feeling" about which parts of the book are important, without having to read the whole book first.

2. The "Spotlight" (Critical Layer Attention)

Now, the librarian needs to decide which pages to re-read.

Old Way: They might check the first page or the last page to guess what's important.
QCFuse Way: It shines a "spotlight" on one middle layer of the AI's "brain" (a middle stage of its thinking, not the middle pages of the book). This is the sweet spot where the librarian best understands the meaning of the question, so QCFuse looks at the connection between the question and the book only at this specific level.
The Result: It instantly identifies the exact sentences (tokens) that matter most and ignores the boring filler text.

3. The "Assembly Line" (Pipelined Fusion)

Usually, if you have to go back and re-read a specific page, you have to stop the whole process, go get the page, read it, and then continue. This causes a traffic jam.

QCFuse Way: It uses a super-efficient assembly line. While the librarian is re-reading the important pages at the current level (layer), a helper is already running to the shelf to grab what is needed for the next level.
The Result: The process never stops. It's smooth, fast, and continuous.

The Result: Faster and Smarter

Because QCFuse knows exactly what the customer cares about:

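It's Faster: It skips the boring stuff and only re-reads what matters. Latency is about 40% lower than with earlier cache fusion methods, and the first word of the answer arrives about twice as fast as when everything is read from scratch.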
It's More Accurate: By ignoring irrelevant pages (noise), the librarian doesn't get confused. In fact, sometimes it answers better than reading the whole book because it focuses purely on the signal, not the noise.
It Saves Energy: The computer (GPU) doesn't have to do unnecessary math.

Summary

Think of QCFuse as a librarian who doesn't just memorize books, but understands the question so well that they can instantly point to the exact paragraph that matters, re-read only that paragraph, and hand you the answer before you've even finished blinking. It's the difference between reading a whole encyclopedia to find one fact versus using a smart search engine that knows exactly where to look.

1. Problem Statement

Large Language Models (LLMs) equipped with Retrieval-Augmented Generation (RAG) are essential for enterprise knowledge applications but face severe performance bottlenecks in high-concurrency environments.

The Bottleneck: While retrieved context chunks often overlap significantly (over 70%) across different queries, traditional prefix caching cannot reuse them due to strict prefix-matching policies. This forces LLMs to fully recompute (prefill) redundant contexts, causing the Time to First Token (TTFT) to grow quadratically with context length.
Limitations of Existing Solutions: Current "Cache Fusion" methods (e.g., CacheBlend, EPIC) attempt to merge historical KV caches and selectively recompute tokens. However, they suffer from a lack of global awareness:
- They rely on local cues (e.g., static positional heuristics or first-layer KV deviations) rather than the user query.
- They often waste computational budget on irrelevant tokens while ignoring critical ones, leading to accuracy drops.
Technical Challenges:
1. Context-Aware Query Representation: Obtaining a query representation that is grounded in the retrieved context is costly. A context-free forward pass over the query alone yields ungrounded representations, while loading the full context KV caches disrupts the pipelined execution required for efficiency.
2. Pipeline-Friendly Attention Analysis: Analyzing attention across all layers (as in ProphetKV) causes pipeline stalls due to cross-layer dependencies, while relying solely on the final layer (as in FusionRAG) provides an incomplete semantic view.
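
To make the prefix-matching limitation concrete, here is a minimal sketch in plain Python. It models each request as an ordered list of chunk IDs. The helper names `prefix_hits` and `chunk_hits` and the chunk IDs are hypothetical, chosen only for illustration; they are not part of SGLang or of any cache fusion system.

```python
def prefix_hits(cached, request):
    """Count chunks reusable under strict prefix matching."""
    hits = 0
    for old, new in zip(cached, request):
        if old != new:
            break  # the first mismatch ends the reusable prefix
        hits += 1
    return hits


def chunk_hits(cached, request):
    """Count chunks that were seen before, regardless of position."""
    seen = set(cached)
    return sum(1 for chunk in request if chunk in seen)


# Two queries retrieve mostly the same chunks, but in a different order.
first_request = ["doc_a", "doc_b", "doc_c", "doc_d"]
second_request = ["doc_c", "doc_a", "doc_b", "doc_e"]

reusable_by_prefix = prefix_hits(first_request, second_request)
reusable_by_chunk = chunk_hits(first_request, second_request)
```

In this toy case three of the four chunks overlap, yet strict prefix matching reuses none of them. Chunk-level reuse is not free either: a chunk's cached KV was computed without the chunks that now precede it, which is why cache fusion methods selectively recompute some tokens.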

2. Methodology: QCFuse System

QCFuse is a query-centric KV cache fusion system implemented on the SGLang framework. It introduces a four-phase workflow designed to balance accuracy and pipeline efficiency:

A. Offline Pre-computation and Anchor Extraction

Process: Before online processing, the system pre-computes KV caches for all context chunks in the RAG database and stores them on SSDs.
Anchor Tokens: It extracts a small fraction of tokens with the highest key-norm magnitudes from each chunk. These "anchors" serve as compressed semantic summaries.
Storage: These lightweight anchors are stored in CPU memory to minimize latency, avoiding the need to load full KV caches during the query phase.
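
The anchor selection described above can be sketched as a top-k pick by key norm. This is a minimal illustration, not the system's implementation: the `fraction` parameter, the use of a plain L2 norm over a single key vector per token, and the decision to return anchors in their original order are all assumptions, since the summary does not say which layer or heads the norm is computed over.

```python
import math


def key_norm(key):
    """L2 norm of one token's key vector."""
    return math.sqrt(sum(x * x for x in key))


def select_anchor_indices(keys, fraction):
    """Return positions of the tokens with the largest key norms.

    `fraction` is an assumed tuning knob; at least one anchor is kept.
    """
    budget = max(1, int(len(keys) * fraction))
    ranked = sorted(range(len(keys)), key=lambda i: key_norm(keys[i]), reverse=True)
    # Assumption: restore token order so the anchors can serve as a prefix.
    return sorted(ranked[:budget])


# Toy chunk: 5 tokens with 2-dimensional keys.
chunk_keys = [[0.1, 0.2], [3.0, 4.0], [0.5, 0.5], [1.0, 0.0], [6.0, 8.0]]
anchors = select_anchor_indices(chunk_keys, fraction=0.4)
```

With these toy keys the two largest norms belong to tokens 1 and 4, so those two positions become the chunk's anchors.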

B. RAG Retrieval and Context-Aware Query Probing

Process: When a query arrives, the system retrieves relevant chunks. Instead of a context-free forward pass, it injects the CPU-resident anchor tokens as lightweight prefixes alongside the query into the GPU.
Benefit: This creates a context-enhanced query representation without triggering massive data transfers from the SSD, maintaining pipeline efficiency.
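
As a rough picture of the probing step, the sketch below assembles a probe input from the anchors of the retrieved chunks followed by the query. Everything here is hypothetical: `anchor_store`, `build_probe_input`, and the string placeholders stand in for whatever representation the real system keeps in CPU memory, which this summary does not specify.

```python
# Hypothetical CPU-resident store: chunk ID -> that chunk's anchor entries.
anchor_store = {
    "chunk_1": ["a1", "a2"],
    "chunk_2": ["b1"],
    "chunk_3": ["c1", "c2"],
}


def build_probe_input(store, retrieved_ids, query_tokens):
    """Prefix the query with the anchors of the retrieved chunks only."""
    prefix = []
    for chunk_id in retrieved_ids:
        prefix.extend(store[chunk_id])
    return prefix + list(query_tokens)


probe = build_probe_input(anchor_store, ["chunk_3", "chunk_1"], ["who", "wrote", "it"])
```

The point of the design is the size of the probe: it grows with the number of anchors, not with the full length of the retrieved chunks, so no SSD read is needed before the query's forward pass.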

C. Critical-Layer Attention Analysis

Process: The system performs attention analysis using only the Key (K) cache of a single "critical" middle layer (identified empirically to offer superior semantic localization).
Mechanism: It computes the attention weights between the query and this specific layer's K cache.
Selection: The resulting weights identify the Top-$N$ context tokens most relevant to the query. These indices guide the subsequent recomputation.
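
The selection step can be sketched as scaled dot-product attention between a query vector and one layer's keys, followed by a top-$N$ pick. This is a simplified illustration under stated assumptions: it uses a single query vector and a single head, whereas the real system presumably aggregates over query tokens and attention heads in a way this summary does not describe. The function name `top_n_tokens` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_n_tokens(query, layer_keys, n):
    """Rank context tokens by attention weight at one layer and keep the top n."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in layer_keys]
    weights = softmax(scores)
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:n]), weights


query_vec = [1.0, 0.0]
critical_layer_keys = [[0.1, 0.9], [2.0, 0.0], [0.0, 1.0], [1.5, 0.5]]
selected, attn = top_n_tokens(query_vec, critical_layer_keys, n=2)
```

Only the keys of one layer are needed, so the analysis does not have to wait for the caches of other layers to be loaded.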

D. Pipelined Cache Reconstruction

Process: The GPU performs discrete token recomputation for the selected Top-$N$ tokens.
Optimization: This follows a strict pipeline: while the GPU reconstructs tokens for layer $i$, the system simultaneously prefetches the KV cache for layer $i+1$ from the SSD.
Output: The updated, context-enriched KV matrix is fed into the SGLang decoding engine for low-latency response generation.
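
The overlap between compute on layer $i$ and the fetch for layer $i+1$ can be sketched with a background worker. This is only a structural illustration: `load_layer_kv` and `recompute_layer` are hypothetical stand-ins that return strings, and a real system would likely rely on asynchronous I/O and GPU streams instead of a Python thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


def load_layer_kv(layer):
    """Stand-in for reading one layer's KV cache from SSD."""
    return f"kv[{layer}]"


def recompute_layer(layer, kv, token_ids):
    """Stand-in for recomputing the selected tokens at one layer."""
    return f"{kv}+recomputed{token_ids}"


def pipelined_reconstruct(num_layers, token_ids):
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = loader.submit(load_layer_kv, 0)  # warm-up fetch for layer 0
        for layer in range(num_layers):
            kv = pending.result()  # wait for this layer's cache
            if layer + 1 < num_layers:
                # Start fetching layer i+1 before computing layer i.
                pending = loader.submit(load_layer_kv, layer + 1)
            outputs.append(recompute_layer(layer, kv, token_ids))
    return outputs


result = pipelined_reconstruct(num_layers=3, token_ids=[1, 3])
```

The loop only blocks on a fetch when the previous layer's compute finished before the data arrived, which is the situation the pipeline is meant to avoid.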

3. Key Contributions

Anchor-Based Lightweight Query Probing: A novel method to inject semantic summaries (anchors) into the query forwarding process. This enables context-aware query representations without breaking the pipeline or incurring high I/O costs.
Semantic Localization via Critical-Layer Profiling: The identification of a specific middle layer whose attention distribution serves as a reliable proxy for global token importance. This avoids the pipeline stalls of cross-layer analysis and the semantic incompleteness of last-layer analysis.
System Implementation: A high-performance system built on SGLang featuring a custom location-aware sparse attention kernel (implemented in Triton) that supports discrete token recomputation while adhering to causal constraints.
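
One way to read "discrete token recomputation while adhering to causal constraints" is that each recomputed token keeps its original position and may attend only to context positions at or before it. The sketch below builds that mask in plain Python. It is an interpretation for illustration, not the Triton kernel itself, and `causal_rows` is a hypothetical name.

```python
def causal_rows(selected_positions, context_length):
    """For each recomputed token, mark which context positions it may attend to."""
    mask = {}
    for pos in selected_positions:
        # A token at position `pos` sees positions 0..pos and nothing after.
        mask[pos] = [j <= pos for j in range(context_length)]
    return mask


# Recompute only tokens 1 and 3 of a 5-token context.
mask = causal_rows([1, 3], context_length=5)
```

Because the selected positions are scattered, the rows of this mask are not contiguous in the sequence, which is presumably why a dense causal attention kernel cannot be reused as-is.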

4. Experimental Results

Evaluations were conducted on an A100 GPU using Llama3.1-8B, Qwen3-8B, and Mistral-v0.3-7B across the MuSiQue, 2WikiMQA, and HotpotQA datasets.

Efficiency:
- 2× Speedup in TTFT compared to full computation.
- 40% latency reduction compared to existing cache fusion baselines (CacheBlend, EPIC, etc.) while maintaining equivalent accuracy.
Accuracy:
- Achieves ROUGE-L scores 2.3 to 3.5 points higher than CacheBlend.
- Matches the accuracy of full computation at a 40% recomputation ratio.
- Attention Denoising: On the HotpotQA dataset, QCFuse outperformed full computation by 0.8 points by effectively removing attention interactions with irrelevant tokens.
Comparison: Outperforms query-centric baselines like QCAll (full layer analysis) in latency and QCLast (last-layer only) in accuracy.
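
The two efficiency figures above use different units (a speedup factor and a percentage reduction) and different baselines, so they should not be compared directly. As a rough conversion, assuming both are measured on the same latency metric, a fractional latency reduction $r$ and a speedup factor $s$ are related by:

$$ s = \frac{1}{1 - r}, \qquad r = 1 - \frac{1}{s} $$

A 40% reduction ($r = 0.4$) therefore corresponds to a speedup of about $1.67\times$ over the cache fusion baselines, while a $2\times$ speedup ($s = 2$) corresponds to a 50% reduction relative to full computation.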

5. Significance

QCFuse represents a significant advancement in optimizing RAG inference for enterprise environments. By shifting from local heuristics to query-centric global awareness, it eases the trade-off between computational efficiency and generation accuracy: in the reported experiments it lowers latency without giving up accuracy.

Scalability: It enables near-real-time responses over massive document collections by effectively utilizing KV cache reuse in dynamic RAG scenarios.
Practicality: The system is built on an existing LLM serving framework (SGLang) and supports flexible configuration of the recomputation ratio to balance speed and quality.
Future Impact: The "attention denoising" capability suggests that selective recomputation can actually improve model reasoning by filtering out noise, offering a new direction for LLM optimization beyond simple caching.