One Model Is Enough: Native Retrieval Embeddings from LLM Agent Hidden States

This paper proposes giving LLM agents native retrieval capabilities by projecting their hidden states directly into the embedding space through a lightweight head, which eliminates the need for a separate embedding model while retaining 97% of the baseline's retrieval quality.

Bo Jiang

Published Tue, 10 Ma

Here is an explanation of the paper "One Model Is Enough," using simple language and everyday analogies.

The Problem: The "Double-Check" Bottleneck

Imagine you are a highly intelligent assistant (an AI) helping a customer. The customer asks a complex question, and you realize you need to look up some facts in a giant library to answer correctly.

The Old Way (Current Standard):

  1. Think: You (the AI) think about the question and write down a search query on a piece of paper, like "best hiking trails in Colorado."
  2. Translate: You hand that paper to a separate translator (an Embedding Model). This translator reads your sentence and turns it into a secret code (a vector) that the library's computer can understand.
  3. Search: The library computer uses that code to find the right books.

The Flaw:
This is like writing a letter, then hiring a second person to rewrite that same letter in a different language just so a third person can read it. The first person (the AI) already understood the meaning perfectly while they were thinking. Writing the sentence down and then re-translating it is a waste of time and energy. It adds a "middleman" that slows everything down.
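For the programmatically inclined, the old two-model pipeline can be sketched roughly like this. Everything here is a toy stand-in (the hardcoded query, the fake hash-seeded encoder), not the paper's actual code; the point is just the shape of the three steps:

```python
import numpy as np

def agent_write_query(question: str) -> str:
    """Step 1 (Think): the agent writes its search query out as plain text.
    (A real agent would generate this with the LLM; hardcoded here.)"""
    return "best hiking trails in Colorado"

def embedding_model_encode(text: str, dim: int = 8) -> np.ndarray:
    """Step 2 (Translate): a *separate* embedding model turns the text into
    the 'secret code' (a vector). Faked here with a text-seeded unit vector."""
    rng = np.random.default_rng(sum(map(ord, text)))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def search(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 2) -> np.ndarray:
    """Step 3 (Search): rank documents by cosine similarity
    (all vectors are unit-norm, so a dot product suffices)."""
    scores = doc_vecs @ query_vec
    return np.argsort(-scores)[:k]
```

Step 2 is the "middleman" the paper removes: a whole second model whose only job is re-encoding text the agent already understood.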

The Solution: "Native Retrieval"

The authors propose a clever shortcut: Why hire the translator at all?

Instead of writing a sentence and handing it to a translator, they attach a tiny, lightweight "adapter" (a projection head) directly to the AI's brain.

The New Way:

  1. Think: The AI thinks about the question.
  2. Direct Access: As the AI is thinking, its internal "brain waves" (hidden states) are already full of the meaning. The tiny adapter instantly grabs those brain waves and converts them directly into the secret code the library needs.
  3. Search: The library computer gets the code immediately.

The Result: The "translator" (the separate embedding model) is fired. The AI does the thinking and the searching on its own, using its own internal language.
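A minimal sketch of the "adapter" idea, assuming the head is a simple linear projection applied to the agent's hidden state and normalized to a unit vector (the paper's actual head architecture may differ):

```python
import numpy as np

class ProjectionHead:
    """Hypothetical lightweight adapter: maps the agent's internal hidden
    state (dimension d_model) directly into the retriever's embedding
    space (dimension d_embed), skipping the separate embedding model."""

    def __init__(self, d_model: int, d_embed: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Small random init; in practice these weights are learned (see below).
        self.W = rng.normal(scale=d_model ** -0.5, size=(d_model, d_embed))
        self.b = np.zeros(d_embed)

    def __call__(self, hidden_state: np.ndarray) -> np.ndarray:
        z = hidden_state @ self.W + self.b
        # Unit-normalize, as retrieval vectors typically are for cosine search.
        return z / np.linalg.norm(z)
```

In use, the agent's last hidden state for the query context goes straight through this head into the same vector index the library already has, with no text-to-vector round trip in between.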

How Did They Teach the AI to Do This?

You can't just tell the AI, "Hey, use your brain waves." You have to teach it how to translate its own thoughts into the library's secret code.

The authors used a method called Knowledge Distillation (like a master chef teaching an apprentice).

  • The Teacher: The old, separate translator model (which is very good at making the secret code).
  • The Student: The tiny adapter attached to the AI.

They trained the student using three specific "lessons" (Loss Functions):

  1. Alignment (The Mirror): "Make your code look exactly like the Teacher's code." (This ensures the basic meaning is right).
  2. Contrastive (The Sorting Hat): "Make sure your code for 'hiking' is very different from your code for 'swimming'." (This ensures the AI keeps different ideas distinct).
  3. Rank Distillation (The Librarian's Preference): "If the Teacher thinks Book A is better than Book B, you should think Book A is better than Book B too." (This teaches the AI how to prioritize search results).
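A rough sketch of what the three lessons could look like as loss functions. The specific formulations below (MSE for alignment, InfoNCE with in-batch negatives for contrastive, KL over document scores for rank distillation) are common choices in the literature and are assumptions here, not taken verbatim from the paper:

```python
import numpy as np

def alignment_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Lesson 1 (The Mirror): match the teacher's embedding directly (MSE)."""
    return float(np.mean((student - teacher) ** 2))

def contrastive_loss(student_q: np.ndarray, doc_embs: np.ndarray,
                     pos_idx: int, temp: float = 0.05) -> float:
    """Lesson 2 (The Sorting Hat): InfoNCE — the query embedding should
    score its matching document above all the other (negative) documents."""
    scores = doc_embs @ student_q / temp
    log_probs = scores - np.log(np.sum(np.exp(scores)))
    return float(-log_probs[pos_idx])

def rank_distill_loss(student_scores: np.ndarray,
                      teacher_scores: np.ndarray) -> float:
    """Lesson 3 (The Librarian's Preference): KL divergence between the
    teacher's and student's softmax distributions over candidate documents,
    so the student inherits the teacher's ranking preferences."""
    t = np.exp(teacher_scores) / np.sum(np.exp(teacher_scores))
    s = np.exp(student_scores) / np.sum(np.exp(student_scores))
    return float(np.sum(t * (np.log(t) - np.log(s))))
```

In training, the three would be combined as a weighted sum and backpropagated only through the adapter, leaving the agent's own weights untouched.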

The Results: Fast and Almost Perfect

They tested this on a conversational search benchmark (QReCC), where the AI has to handle multi-turn conversations.

  • Speed: Because they cut out the middleman, the search became 21 times faster. It went from taking 43 milliseconds to just 2 milliseconds.
  • Quality: The search results were 97% as good as the old, slow method. It's a tiny drop in quality, but a massive gain in speed.

The Catch (Limitations)

  • Training Cost: To teach the AI this new trick, you still need the "Teacher" model during the training phase. You can't get rid of the second model until the AI has learned its lesson.
  • Family Ties: They tested this using two models from the same "family" (Qwen). It might be harder to teach an AI from one family to speak the language of a totally different family's library.
  • Not Perfect Yet: While 97% is great, it's not 100%. In very rare or weird situations, the old method is still slightly better.

The Big Takeaway

This paper proves that you don't need two models to do one job. By teaching the AI to use its own internal thoughts as a search query, we can make AI agents significantly faster and simpler, without needing a separate "translator" model to slow us down. It's like realizing you don't need a dictionary to speak your own language; you just need to learn how to use the words you already know.