Hierarchical Embedding Fusion for Retrieval-Augmented Code Generation

This paper introduces Hierarchical Embedding Fusion (HEF), a two-stage framework that compresses repository code into a reusable hierarchy of dense vectors and maps them to learned pseudo-tokens. The result is low-latency, repository-aware code generation with accuracy comparable to traditional retrieval methods at a fraction of the inference cost.

Nikita Sorokin, Ivan Sedykh, Valentin Malykh

Published 2026-03-10

Imagine you are a master chef trying to cook a complex dish (writing code) for a specific restaurant (a software project). To cook perfectly, you need to know the restaurant's secret recipes, the specific brands of ingredients they use, and the layout of their kitchen.

In the world of AI coding, this "restaurant knowledge" is the code repository—thousands of files containing the project's history, rules, and style.

The Problem: The "Too Much Information" Trap

Traditional AI coding assistants try to solve this by reading the restaurant's entire recipe book before they start cooking.

  • The Old Way (Snippet Injection): The AI grabs huge chunks of raw text from the project and pastes them into its prompt. It's like the chef trying to read 500 pages of a cookbook while simultaneously chopping onions. It's slow, the kitchen gets messy (noise), and the chef often forgets what they were doing because there's too much to read.
  • The Graph Way: Other systems try to map out the kitchen like a subway map (a graph) to find connections. This is accurate but requires building a new map every time you order a dish, which takes forever.

The Solution: HEF (The "Smart Summary" System)

The paper introduces Hierarchical Embedding Fusion (HEF). Think of this as hiring a super-efficient sous-chef who prepares a "cheat sheet" for the main chef.

Here is how HEF works, broken down into three simple steps:

1. The Offline Prep: Building the "Cheat Sheet"

Before the main chef ever starts cooking, the sous-chef (the Fuser) goes through the entire restaurant's recipe book.

  • Instead of copying the whole book, the sous-chef reads a few pages, summarizes them into a single "flavor note" (a dense vector), and then summarizes those notes into a "menu summary."
  • They keep doing this, creating a hierarchy:
    • Level 1: Summaries of individual functions (like "How to chop an onion").
    • Level 2: Summaries of whole files (like "The entire Salad Station").
    • Level 3: Summaries of the whole project (like "The Restaurant's Vibe").
  • This entire process happens offline. It's done once, stored away, and doesn't slow down the actual cooking.
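The offline step above can be sketched in a few lines. This is a toy illustration only: the paper's Fuser is a learned model, while here a hash-based stand-in encoder and simple mean-pooling play its role, and the file names and snippets are made up for the example.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Stand-in encoder: hash the text into a deterministic unit vector.
    (A real system would use a trained code-embedding model here.)"""
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def fuse(children: list[np.ndarray]) -> np.ndarray:
    """Toy 'Fuser': pool child vectors into one parent summary vector."""
    v = np.mean(children, axis=0)
    return v / np.linalg.norm(v)

# Hypothetical repository: files mapped to their function snippets.
repo = {
    "utils.py": ["def slugify(s): ...", "def chunks(xs, n): ..."],
    "models.py": ["class User: ...", "class Order: ..."],
}

# Level 1: one dense vector per function ("flavor notes").
func_vecs = {f: [embed(s) for s in snippets] for f, snippets in repo.items()}
# Level 2: fuse function vectors into one vector per file.
file_vecs = {f: fuse(vs) for f, vs in func_vecs.items()}
# Level 3: fuse file vectors into a single project-level vector.
project_vec = fuse(list(file_vecs.values()))

print(project_vec.shape)  # (8,)
```

Because every level is just vectors derived from the level below, the whole hierarchy can be computed once, stored, and reused across queries, which is exactly what makes the prep "offline."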

2. The Online Order: The "Pseudo-Token" Magic

Now, a customer orders a dish (the AI needs to write a line of code).

  • The main chef looks at what they are currently writing and asks the sous-chef: "What do I need to know from the rest of the restaurant to finish this?"
  • The sous-chef instantly grabs the most relevant "flavor notes" from the cheat sheet.
  • The Magic Trick: Instead of handing the chef 500 pages of text, the sous-chef converts those notes into 32 "magic tokens" (pseudo-tokens).
    • Imagine these tokens are like compressed flavor packets. One packet contains the essence of a whole file. The chef doesn't need to read the file; they just taste the packet and instantly "know" the context.
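The online step can be sketched the same way. Everything here is an illustrative assumption, not the paper's implementation: cosine-similarity retrieval over the stored vectors, a random matrix standing in for the learned projection, and toy dimensions (the one number taken from the text is the 32 pseudo-tokens).

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, MODEL_DIM, N_PSEUDO = 8, 16, 32  # toy sizes; 32 pseudo-tokens per the paper

# Hypothetical learned projection: maps pooled summary vectors into
# N_PSEUDO embeddings in the LLM's input space.
W = rng.standard_normal((N_PSEUDO, EMB_DIM, MODEL_DIM)) * 0.1

def retrieve_top_k(query: np.ndarray, index: np.ndarray, k: int = 4) -> np.ndarray:
    """Cosine-similarity lookup over the precomputed hierarchy."""
    sims = index @ query / (np.linalg.norm(index, axis=1) * np.linalg.norm(query))
    return index[np.argsort(sims)[-k:]]

def to_pseudo_tokens(retrieved: np.ndarray) -> np.ndarray:
    """Pool the retrieved summary vectors, then project them into
    pseudo-token embeddings that get prepended to the LLM's prompt."""
    ctx = retrieved.mean(axis=0)           # (EMB_DIM,)
    return np.einsum("e,ked->kd", ctx, W)  # (N_PSEUDO, MODEL_DIM)

index = rng.standard_normal((100, EMB_DIM))  # offline hierarchy, flattened
query = rng.standard_normal(EMB_DIM)         # embedding of the code being written
pseudo = to_pseudo_tokens(retrieve_top_k(query, index))

print(pseudo.shape)  # (32, 16)
```

The key property the sketch shows: no matter how large the index grows, the model always receives exactly 32 pseudo-token embeddings, so the prompt length the LLM must process stays constant.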

3. The Result: Fast and Accurate

Because the chef only has to process 32 magic packets instead of thousands of words, the cooking happens in a fraction of a second.

  • Speed: It's 13 to 26 times faster than the old graph-based methods.
  • Quality: Even though the chef isn't reading the whole book, the "flavor packets" are so rich in information that the dish tastes just as good as if they had read the whole thing.

Why This is a Big Deal

  • No More "Context Window" Anxiety: You don't have to worry about the AI forgetting things because the prompt got too long. The "cheat sheet" handles the memory.
  • Robustness: If the sous-chef grabs a slightly irrelevant flavor packet (a bad piece of context), it doesn't ruin the dish. The system is designed to ignore the noise, whereas the old methods would get confused by it.
  • Scalability: Whether the restaurant is a small café or a massive hotel chain, the chef only ever has to read 32 packets. The size of the project doesn't slow down the cooking.

In a Nutshell

HEF is like upgrading from a librarian who hands you a stack of 500 books to a genius assistant who reads all 500 books, distills the wisdom into 32 sticky notes, and hands those to you. You get all the knowledge you need, instantly, without the headache of reading the whole library.