Hierarchical Embedding Fusion for Retrieval-Augmented Code Generation

This paper introduces Hierarchical Embedding Fusion (HEF), a two-stage framework that compresses repository code into a reusable hierarchy of dense vectors and maps them to learned pseudo-tokens. The result is low-latency, repository-aware code generation with accuracy comparable to traditional retrieval methods at a fraction of the inference cost.

Nikita Sorokin, Ivan Sedykh, Valentin Malykh

Published 2026-03-10

Imagine you are a master chef trying to cook a complex dish (writing code) for a specific restaurant (a software project). To cook perfectly, you need to know the restaurant's secret recipes, the specific brands of ingredients they use, and the layout of their kitchen.

In the world of AI coding, this "restaurant knowledge" is the code repository—thousands of files containing the project's history, rules, and style.

The Problem: The "Too Much Information" Trap

Traditional AI coding assistants try to solve this by reading the restaurant's entire recipe book before they start cooking.

  • The Old Way (Snippet Injection): The AI grabs huge chunks of raw text from the project and pastes them into its prompt. It's like the chef trying to read 500 pages of a cookbook while simultaneously chopping onions. It's slow, the kitchen gets messy (noise), and the chef often forgets what they were doing because there's too much to read.
  • The Graph Way: Other systems try to map out the kitchen like a subway map (a graph) to find connections. This is accurate but requires building a new map every time you order a dish, which takes forever.

The Solution: HEF (The "Smart Summary" System)

The paper introduces Hierarchical Embedding Fusion (HEF). Think of this as hiring a super-efficient sous-chef who prepares a "cheat sheet" for the main chef.

Here is how HEF works, broken down into three simple steps:

1. The Offline Prep: Building the "Cheat Sheet"

Before the main chef ever starts cooking, the sous-chef (the Fuser) goes through the entire restaurant's recipe book.

  • Instead of copying the whole book, the sous-chef reads a few pages, summarizes them into a single "flavor note" (a dense vector), and then summarizes those notes into a "menu summary."
  • They keep doing this, creating a hierarchy:
    • Level 1: Summaries of individual functions (like "How to chop an onion").
    • Level 2: Summaries of whole files (like "The entire Salad Station").
    • Level 3: Summaries of the whole project (like "The Restaurant's Vibe").
  • This entire process happens offline. It's done once, stored away, and doesn't slow down the actual cooking.
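The offline step above can be sketched in a few lines. This is a toy illustration only: the paper's Fuser is a learned model, while here a hash-based stand-in encoder and simple mean-pooling play its role, and the file names and snippets are made up for the example.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Stand-in encoder: hash the text into a deterministic unit vector.
    (A real system would use a trained code-embedding model here.)"""
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def fuse(children: list[np.ndarray]) -> np.ndarray:
    """Toy 'Fuser': pool child vectors into one parent summary vector."""
    v = np.mean(children, axis=0)
    return v / np.linalg.norm(v)

# Hypothetical repository: files mapped to their function snippets.
repo = {
    "utils.py": ["def slugify(s): ...", "def chunks(xs, n): ..."],
    "models.py": ["class User: ...", "class Order: ..."],
}

# Level 1: one dense vector per function ("flavor notes").
func_vecs = {f: [embed(s) for s in snippets] for f, snippets in repo.items()}
# Level 2: fuse function vectors into one vector per file.
file_vecs = {f: fuse(vs) for f, vs in func_vecs.items()}
# Level 3: fuse file vectors into a single project-level vector.
project_vec = fuse(list(file_vecs.values()))

print(project_vec.shape)  # (8,)
```

Because every level is just vectors derived from the level below, the whole hierarchy can be computed once, stored, and reused across queries, which is exactly what makes the prep "offline."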

2. The Online Order: The "Pseudo-Token" Magic

Now, a customer orders a dish (the AI needs to write a line of code).

  • The main chef looks at what they are currently writing and asks the sous-chef: "What do I need to know from the rest of the restaurant to finish this?"
  • The sous-chef instantly grabs the most relevant "flavor notes" from the cheat sheet.
  • The Magic Trick: Instead of handing the chef 500 pages of text, the sous-chef converts those notes into 32 "magic tokens" (pseudo-tokens).
    • Imagine these tokens are like compressed flavor packets. One packet contains the essence of a whole file. The chef doesn't need to read the file; they just taste the packet and instantly "know" the context.
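The online step can be sketched the same way. Everything here is an illustrative assumption, not the paper's implementation: cosine-similarity retrieval over the stored vectors, a random matrix standing in for the learned projection, and toy dimensions (the one number taken from the text is the 32 pseudo-tokens).

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, MODEL_DIM, N_PSEUDO = 8, 16, 32  # toy sizes; 32 pseudo-tokens per the paper

# Hypothetical learned projection: maps pooled summary vectors into
# N_PSEUDO embeddings in the LLM's input space.
W = rng.standard_normal((N_PSEUDO, EMB_DIM, MODEL_DIM)) * 0.1

def retrieve_top_k(query: np.ndarray, index: np.ndarray, k: int = 4) -> np.ndarray:
    """Cosine-similarity lookup over the precomputed hierarchy."""
    sims = index @ query / (np.linalg.norm(index, axis=1) * np.linalg.norm(query))
    return index[np.argsort(sims)[-k:]]

def to_pseudo_tokens(retrieved: np.ndarray) -> np.ndarray:
    """Pool the retrieved summary vectors, then project them into
    pseudo-token embeddings that get prepended to the LLM's prompt."""
    ctx = retrieved.mean(axis=0)           # (EMB_DIM,)
    return np.einsum("e,ked->kd", ctx, W)  # (N_PSEUDO, MODEL_DIM)

index = rng.standard_normal((100, EMB_DIM))  # offline hierarchy, flattened
query = rng.standard_normal(EMB_DIM)         # embedding of the code being written
pseudo = to_pseudo_tokens(retrieve_top_k(query, index))

print(pseudo.shape)  # (32, 16)
```

The key property the sketch shows: no matter how large the index grows, the model always receives exactly 32 pseudo-token embeddings, so the prompt length the LLM must process stays constant.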

3. The Result: Fast and Accurate

Because the chef only has to process 32 magic packets instead of thousands of words, the cooking happens in a fraction of a second.

  • Speed: It's 13 to 26 times faster than the old graph-based methods.
  • Quality: Even though the chef isn't reading the whole book, the "flavor packets" are so rich in information that the dish tastes just as good as if they had read the whole thing.

Why This is a Big Deal

  • No More "Context Window" Anxiety: You don't have to worry about the AI forgetting things because the prompt got too long. The "cheat sheet" handles the memory.
  • Robustness: If the sous-chef grabs a slightly irrelevant flavor packet (a bad piece of context), it doesn't ruin the dish. The system is designed to ignore the noise, whereas the old methods would get confused by it.
  • Scalability: Whether the restaurant is a small café or a massive hotel chain, the chef only ever has to read 32 packets. The size of the project doesn't slow down the cooking.

In a Nutshell

HEF is like upgrading from a librarian who hands you a stack of 500 books to a genius assistant who reads all 500 books, distills the wisdom into 32 sticky notes, and hands those to you. You get all the knowledge you need, instantly, without the headache of reading the whole library.