One Model Is Enough: Native Retrieval Embeddings from LLM Agent Hidden States

This paper proposes giving LLM agents native retrieval capabilities by projecting their hidden states directly into the embedding space through a lightweight head, which eliminates the need for a separate embedding model while retaining 97% of the baseline's retrieval quality.

Bo Jiang

Published Tue, 10 Ma

Here is an explanation of the paper "One Model Is Enough," using simple language and everyday analogies.

The Problem: The "Double-Check" Bottleneck

Imagine you are a highly intelligent assistant (an AI) helping a customer. The customer asks a complex question, and you realize you need to look up some facts in a giant library to answer correctly.

The Old Way (Current Standard):

  1. Think: You (the AI) think about the question and write down a search query on a piece of paper, like "best hiking trails in Colorado."
  2. Translate: You hand that paper to a separate translator (an Embedding Model). This translator reads your sentence and turns it into a secret code (a vector) that the library's computer can understand.
  3. Search: The library computer uses that code to find the right books.

The Flaw:
This is like writing a letter, then hiring a second person to rewrite that same letter in a different language just so a third person can read it. The first person (the AI) already understood the meaning perfectly while they were thinking. Writing the sentence down and then re-translating it is a waste of time and energy. It adds a "middleman" that slows everything down.
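For the programmatically inclined, the old two-model pipeline can be sketched roughly like this. Everything here is a toy stand-in (the hardcoded query, the fake hash-seeded encoder), not the paper's actual code; the point is just the shape of the three steps:

```python
import numpy as np

def agent_write_query(question: str) -> str:
    """Step 1 (Think): the agent writes its search query out as plain text.
    (A real agent would generate this with the LLM; hardcoded here.)"""
    return "best hiking trails in Colorado"

def embedding_model_encode(text: str, dim: int = 8) -> np.ndarray:
    """Step 2 (Translate): a *separate* embedding model turns the text into
    the 'secret code' (a vector). Faked here with a text-seeded unit vector."""
    rng = np.random.default_rng(sum(map(ord, text)))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def search(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 2) -> np.ndarray:
    """Step 3 (Search): rank documents by cosine similarity
    (all vectors are unit-norm, so a dot product suffices)."""
    scores = doc_vecs @ query_vec
    return np.argsort(-scores)[:k]
```

Step 2 is the "middleman" the paper removes: a whole second model whose only job is re-encoding text the agent already understood.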

The Solution: "Native Retrieval"

The authors propose a clever shortcut: Why hire the translator at all?

Instead of writing a sentence and handing it to a translator, they attach a tiny, lightweight "adapter" (a projection head) directly to the AI's brain.

The New Way:

  1. Think: The AI thinks about the question.
  2. Direct Access: As the AI is thinking, its internal "brain waves" (hidden states) are already full of the meaning. The tiny adapter instantly grabs those brain waves and converts them directly into the secret code the library needs.
  3. Search: The library computer gets the code immediately.

The Result: The "translator" (the separate embedding model) is fired. The AI does the thinking and the searching on its own, using its own internal language.
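A minimal sketch of the "adapter" idea, assuming the head is a simple linear projection applied to the agent's hidden state and normalized to a unit vector (the paper's actual head architecture may differ):

```python
import numpy as np

class ProjectionHead:
    """Hypothetical lightweight adapter: maps the agent's internal hidden
    state (dimension d_model) directly into the retriever's embedding
    space (dimension d_embed), skipping the separate embedding model."""

    def __init__(self, d_model: int, d_embed: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Small random init; in practice these weights are learned (see below).
        self.W = rng.normal(scale=d_model ** -0.5, size=(d_model, d_embed))
        self.b = np.zeros(d_embed)

    def __call__(self, hidden_state: np.ndarray) -> np.ndarray:
        z = hidden_state @ self.W + self.b
        # Unit-normalize, as retrieval vectors typically are for cosine search.
        return z / np.linalg.norm(z)
```

In use, the agent's last hidden state for the query context goes straight through this head into the same vector index the library already has, with no text-to-vector round trip in between.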

How Did They Teach the AI to Do This?

You can't just tell the AI, "Hey, use your brain waves." You have to teach it how to translate its own thoughts into the library's secret code.

The authors used a method called Knowledge Distillation (like a master chef teaching an apprentice).

  • The Teacher: The old, separate translator model (which is very good at making the secret code).
  • The Student: The tiny adapter attached to the AI.

They trained the student using three specific "lessons" (Loss Functions):

  1. Alignment (The Mirror): "Make your code look exactly like the Teacher's code." (This ensures the basic meaning is right).
  2. Contrastive (The Sorting Hat): "Make sure your code for 'hiking' is very different from your code for 'swimming'." (This ensures the AI keeps different ideas distinct).
  3. Rank Distillation (The Librarian's Preference): "If the Teacher thinks Book A is better than Book B, you should think Book A is better than Book B too." (This teaches the AI how to prioritize search results).
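A rough sketch of what the three lessons could look like as loss functions. The specific formulations below (MSE for alignment, InfoNCE with in-batch negatives for contrastive, KL over document scores for rank distillation) are common choices in the literature and are assumptions here, not taken verbatim from the paper:

```python
import numpy as np

def alignment_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Lesson 1 (The Mirror): match the teacher's embedding directly (MSE)."""
    return float(np.mean((student - teacher) ** 2))

def contrastive_loss(student_q: np.ndarray, doc_embs: np.ndarray,
                     pos_idx: int, temp: float = 0.05) -> float:
    """Lesson 2 (The Sorting Hat): InfoNCE — the query embedding should
    score its matching document above all the other (negative) documents."""
    scores = doc_embs @ student_q / temp
    log_probs = scores - np.log(np.sum(np.exp(scores)))
    return float(-log_probs[pos_idx])

def rank_distill_loss(student_scores: np.ndarray,
                      teacher_scores: np.ndarray) -> float:
    """Lesson 3 (The Librarian's Preference): KL divergence between the
    teacher's and student's softmax distributions over candidate documents,
    so the student inherits the teacher's ranking preferences."""
    t = np.exp(teacher_scores) / np.sum(np.exp(teacher_scores))
    s = np.exp(student_scores) / np.sum(np.exp(student_scores))
    return float(np.sum(t * (np.log(t) - np.log(s))))
```

In training, the three would be combined as a weighted sum and backpropagated only through the adapter, leaving the agent's own weights untouched.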

The Results: Fast and Almost Perfect

They tested this on a conversational search benchmark (QReCC), where the AI has to handle multi-turn conversations.

  • Speed: Because they cut out the middleman, the search became 21 times faster. It went from taking 43 milliseconds to just 2 milliseconds.
  • Quality: The search results were 97% as good as the old, slow method. It's a tiny drop in quality, but a massive gain in speed.

The Catch (Limitations)

  • Training Cost: To teach the AI this new trick, you still need the "Teacher" model during the training phase. You can't get rid of the second model until the AI has learned its lesson.
  • Family Ties: They tested this using two models from the same "family" (Qwen). It might be harder to teach an AI from one family to speak the language of a totally different family's library.
  • Not Perfect Yet: While 97% is great, it's not 100%. In very rare or weird situations, the old method is still slightly better.

The Big Takeaway

This paper proves that you don't need two models to do one job. By teaching the AI to use its own internal thoughts as a search query, we can make AI agents significantly faster and simpler, without needing a separate "translator" model to slow us down. It's like realizing you don't need a dictionary to speak your own language; you just need to learn how to use the words you already know.