SE-Search: Self-Evolving Search Agent via Memory and Dense Reward

Imagine you are trying to solve a very tricky riddle. You could try to guess the answer from your own memory, but you might get it wrong. Or, you could ask a librarian (a search engine) for help.

The Old Way (Traditional Search Agents):
Imagine you ask the librarian, "Who won the 1998 World Cup?" The librarian hands you a stack of 100 books.

The Problem: Most of those books are about soccer history, but 90 of them cover other years in the 1990s, the wrong country, or are just random noise. You have to read through all of them, get confused, and maybe still get the answer wrong.
The Training: The teacher (the AI trainer) only tells you, "Good job" or "Bad job" after you write your final answer. They don't tell you why you picked the wrong books or why your question to the librarian was too vague. You learn slowly and inefficiently.

The New Way (SE-Search):
The authors of this paper created a smarter agent called SE-Search. Think of it as a super-intelligent detective who has learned how to be a master researcher. It uses three main tricks to get better at finding answers:

1. The "Mental Filter" (Memory Purification)

Instead of dumping a messy stack of 100 books on your desk, SE-Search acts like a strict editor.

How it works: Every time the librarian brings back new information, SE-Search immediately asks: "Is this actually useful for solving the riddle?"
The Analogy: If the librarian brings a book about "Soccer in 1998," SE-Search keeps the page about the final match and throws away the pages about the weather or the players' birthdays. It writes down only the key facts in a clean, organized notebook (its "Memory"). This prevents the detective from getting overwhelmed by junk.

2. The "Atomic Question" Strategy (Atomic Query)

In the old days, detectives might ask one giant, confusing question like, "Tell me everything about the 1998 World Cup winner, the coach, and the score." The librarian gets confused and gives a messy answer.

How it works: SE-Search breaks the big problem into tiny, simple pieces. It asks, "Who won in 1998?" Then, "Who was the coach?" Then, "What was the score?"
The Analogy: Instead of trying to eat a whole pizza in one bite (which makes you choke), SE-Search takes small, bite-sized pieces. This ensures it gets the right ingredients for every part of the answer without getting lost.

3. The "Gold Star" System (Dense Rewards)

This is the biggest change in how the AI learns.

The Old Way: The teacher waits until the very end to give a grade. If you got the answer wrong, you don't know if it was because you asked the wrong question, read the wrong book, or just wrote the answer poorly.
The New Way: SE-Search gets a "Gold Star" (a reward) for every single good step it takes.
- Did you ask a clear, short question? Gold Star!
- Did you successfully filter out the junk from the books? Gold Star!
- Did you follow the rules of the game? Gold Star!
- Did you get the final answer right? Big Gold Star!
The Result: Because it gets feedback constantly, it learns much faster how to be a good researcher. It stops wasting time on bad questions and starts focusing on what actually matters.

The Results

When they tested this new detective (SE-Search) against the old ones:

It was faster: It asked fewer questions to get the same answer.
It was smarter: It got the right answer much more often, especially on hard, multi-step riddles (like "Who was the coach of the team that won the 1998 World Cup?").
It grew with size: Just like a human gets smarter as they get older, this AI got even better when they made it bigger (using more powerful computer brains).

In Summary:
SE-Search is like upgrading from a chaotic intern who dumps a pile of papers on your desk to a professional research assistant who filters the noise, asks precise questions, and learns from every small mistake along the way. It doesn't just "search"; it evolves into a better searcher every time it tries.

Here is a detailed technical summary of the paper "SE-Search: Self-Evolving Search Agent via Memory and Dense Reward".

1. Problem Statement

While Retrieval-Augmented Generation (RAG) mitigates hallucinations in Large Language Models (LLMs) by incorporating external knowledge, traditional fixed RAG pipelines lack the flexibility to dynamically decide when to search and what to search for. Recent autonomous search agents attempt to address this but face three critical challenges:

Noisy Search Results: Agents often retrieve top-K documents containing irrelevant or noisy information, which degrades reasoning quality.
Limited Search Diversity & Frequency: Existing methods tend to generate repetitive queries or fail to adjust search frequency based on question complexity, leading to inefficient evidence acquisition.
Sparse Evolutionary Feedback: Current Reinforcement Learning (RL) approaches (e.g., Search-R1) often rely on sparse, binary rewards at the final answer level. They lack fine-grained feedback on query formulation, memory management, and search frequency, resulting in suboptimal training signals.

2. Methodology: SE-Search

The authors propose SE-Search, a self-evolving search agent that adopts a "Think-Search-Memorize" strategy. The system is trained using Group Relative Policy Optimization (GRPO) with a novel Dense Reward framework.

Core Components

Memory Purification:
- Mechanism: Instead of forwarding all retrieved documents directly to the LLM for reasoning, the agent extracts salient evidence and updates a dedicated internal memory state ( $m_t$ ) after each search step.
- Process: The agent uses a prompt template to filter, consolidate, and revise previous memory with new retrieved knowledge ( $k_t$ ).
- Goal: This prevents the accumulation of noise and ensures the reasoning trajectory relies on a coherent, evolving knowledge base rather than raw, unfiltered documents.
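The memory update described above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: in the paper the purification step is carried out by the LLM through a prompt template, whereas here a simple keyword-overlap filter stands in for it. The names `purify`, `run_agent`, `content_words`, and `STOPWORDS` are hypothetical, and the queries are passed in as a list instead of being generated by the model at each turn.

```python
import re
from typing import Callable, List, Set

# Hypothetical stopword list, only so the toy filter ignores filler words.
STOPWORDS = {"the", "a", "an", "of", "in", "who", "what", "was", "is"}


def content_words(text: str) -> Set[str]:
    """Lowercased word tokens of `text`, minus stopwords."""
    return {w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS}


def purify(memory: List[str], retrieved: List[str], question: str) -> List[str]:
    """Toy stand-in for the memory update from m_{t-1} and k_t to m_t.

    Keeps a retrieved sentence only if it shares a content word with the
    question and is not already stored. The paper performs this step with
    an LLM prompt, not with keyword matching.
    """
    question_words = content_words(question)
    updated = list(memory)
    for sentence in retrieved:
        if content_words(sentence) & question_words and sentence not in updated:
            updated.append(sentence)
    return updated


def run_agent(
    question: str,
    queries: List[str],
    search: Callable[[str], List[str]],
    max_turns: int = 3,
) -> List[str]:
    """Search with each query in turn and purify the memory after every search."""
    memory: List[str] = []
    for query in queries[:max_turns]:
        retrieved = search(query)                      # k_t: raw search results
        memory = purify(memory, retrieved, question)   # m_t: purified memory
    return memory
```

The point of the sketch is the data flow: reasoning reads from the purified `memory`, never from the raw `retrieved` list, so irrelevant sentences do not pile up across turns.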
Atomic Query Training:
- Mechanism: To improve evidence acquisition, the agent is guided to generate multiple distinct, short "atomic" queries rather than long, complex ones.
- Constraints: A counting algorithm enforces query length bounds and diversity (measured by embedding similarity).
- Goal: This encourages diverse exploration of the search space and prevents redundant searches.
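A rough sketch of how such a length and diversity check might look is given below. It is an assumption-laden illustration: the summary says diversity is measured by embedding similarity, but this sketch substitutes a bag-of-words cosine similarity so that it runs without a model. The thresholds `MAX_WORDS` and `MAX_SIMILARITY` and the function names are hypothetical and do not come from the paper.

```python
import math
from collections import Counter
from typing import List

MAX_WORDS = 8          # hypothetical upper bound on query length, in words
MAX_SIMILARITY = 0.8   # hypothetical threshold above which a query is redundant


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words vectors (stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_atomic_queries(queries: List[str]) -> List[str]:
    """Keep queries that are short enough and not near-duplicates of earlier ones."""
    accepted: List[str] = []
    for query in queries:
        n_words = len(query.split())
        if n_words == 0 or n_words > MAX_WORDS:
            continue  # empty or too long to count as atomic
        if any(bow_cosine(query, prev) > MAX_SIMILARITY for prev in accepted):
            continue  # too similar to a query that was already accepted
        accepted.append(query)
    return accepted
```

The number of accepted queries, `len(select_atomic_queries(...))`, is the kind of count that a query-level reward could then use.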
Dense Rewards:
Unlike sparse rewards, SE-Search employs a composite reward function ( $R_{Dense}$ ) comprising four fine-grained components to stabilize training:
- Outcome Reward ( $R_{ans}$ ): Uses the token-level F1 score between predicted and ground-truth answers rather than strict Exact Match, providing graded feedback.
- Memory Reward ( $R_{mem}$ ): Measures "Cover Exact Match" (CEM) between the agent's stored memory and the ground-truth answer, incentivizing the extraction of relevant facts.
- Query Reward ( $R_{query}$ ): Penalizes excessive or redundant queries if the answer is correct, and encourages more diverse queries if the answer is incorrect.
- Format Reward ( $R_{format}$ ): Penalizes structural violations (e.g., missing tags, exceeding max search turns) to prevent mode collapse.
- Decay Strategy: Separately from the four components above, the weight of the query reward is gradually reduced via a cosine decay schedule during training to shift focus from exploration to exploitation.
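To make the pieces concrete, here is a minimal sketch of the two answer-level metrics and a cosine-decayed query weight. How the paper actually combines the four terms is not stated in this summary, so the weighted sum in `dense_reward`, the weight `w_mem`, and all function names are assumptions for illustration only. The query and format rewards are taken as precomputed inputs.

```python
import math
from collections import Counter


def f1_score(prediction: str, truth: str) -> float:
    """Token-level F1 between a predicted and a ground-truth answer."""
    pred, gold = prediction.lower().split(), truth.lower().split()
    common = sum((Counter(pred) & Counter(gold)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold)
    return 2 * precision * recall / (precision + recall)


def cover_exact_match(memory: str, truth: str) -> float:
    """1.0 if the ground-truth answer appears verbatim in the memory, else 0.0."""
    return 1.0 if truth.lower() in memory.lower() else 0.0


def cosine_decay(step: int, total_steps: int) -> float:
    """Weight that falls from 1.0 at step 0 to 0.0 at total_steps (total_steps > 0)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def dense_reward(
    prediction: str,
    truth: str,
    memory: str,
    r_query: float,
    r_format: float,
    step: int,
    total_steps: int,
    w_mem: float = 0.5,  # hypothetical weight, not taken from the paper
) -> float:
    """Assumed combination: a plain weighted sum of the four reward terms."""
    return (
        f1_score(prediction, truth)
        + w_mem * cover_exact_match(memory, truth)
        + cosine_decay(step, total_steps) * r_query
        + r_format
    )
```

With this shape, the query term dominates early exploration and fades out by the end of training, while the answer, memory, and format terms stay at full strength throughout.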

Optimization Objective

The agent optimizes a policy $\pi_\theta$ to maximize the expected dense reward, which jointly scores the final answer, the search queries, and the memory content. Training uses GRPO, which avoids the need for a separate value estimator (critic) by normalizing rewards across a group of sampled trajectories.
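The group normalization at the heart of GRPO can be sketched as follows. This covers only the advantage computation, not the full policy-gradient update, and the function name is hypothetical. The use of the population standard deviation and the small `eps` term are assumptions; implementations differ on both details.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each trajectory's reward against its own sampled group.

    The group mean plays the role a learned critic would otherwise play:
    trajectories above the mean get a positive advantage, those below a
    negative one. `eps` guards against division by zero when all rewards
    in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, a group of four trajectories with rewards `[1.0, 0.0, 1.0, 0.0]` yields advantages of roughly `+1` and `-1`, while a group in which every trajectory earns the same reward yields all zeros and therefore no learning signal.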

3. Key Contributions

Self-Evolving Agent Architecture: Proposes SE-Search, which autonomously adapts its search behavior through a "Think-Search-Memorize" loop, effectively filtering noise and evolving its internal knowledge state.
Three Novel Mechanisms:
1. Memory Purification: A template-driven method to distill useful evidence from noisy retrievals.
2. Atomic Query Strategy: A mechanism to enforce query diversity and appropriate search frequency.
3. Dense Reward System: A multi-component reward function (Outcome, Memory, Query, Format) that provides fine-grained supervision for RL training.
Empirical Validation: Reports significant performance gains across seven diverse QA benchmarks, supporting the effectiveness of the approach in both single-hop and multi-hop scenarios.

4. Experimental Results

The authors evaluated SE-Search (specifically the 3B parameter version, SE-Search-3B) against strong baselines (including Search-R1, AutoRefine, and O2-Searcher) on seven benchmarks: NQ, TriviaQA, PopQA (Single-hop) and HotpotQA, 2Wiki, Musique, Bamboogle (Multi-hop).

Performance: SE-Search-3B achieved an average Exact Match (EM) accuracy of 0.420, outperforming the previous state-of-the-art Search-R1 by 10.8 absolute points (33.8% relative gain).
Multi-Hop Gains: The model showed particularly strong improvements on complex multi-hop tasks. For example, it improved HotpotQA by 4.5 percentage points and Bamboogle by 8 percentage points over AutoRefine.
Ablation Studies:
- Memory Purification significantly boosted performance on multi-hop datasets (e.g., +71.43% on Musique).
- Atomic Queries further enhanced diversity and coverage.
- Dense Rewards were crucial for training stability and convergence.
Scaling Laws: The method scales effectively with model size. Moving from 3B to 14B parameters consistently improved both EM and F1 scores.
Behavioral Analysis:
- Efficiency: As training progressed, the average number of search calls decreased (from 1.53 to 1.32) while accuracy increased, indicating the agent learned to search more efficiently.
- Adaptivity: The agent dynamically adjusted search frequency, performing ~1.54 searches for complex multi-hop questions but ~1.0 for simple single-hop questions.
- Query Quality: SE-Search generated shorter, more diverse queries (lower similarity ratio) compared to baselines, leading to higher-quality retrieval.

5. Significance

SE-Search represents a significant advancement in autonomous search agents by addressing the "noise accumulation" and "sparse feedback" problems inherent in current RAG and RL-based search systems.

Paradigm Shift: It moves away from fixed retrieval pipelines to a dynamic, self-correcting loop where the agent actively manages its own memory and search strategy.
Training Efficiency: The dense reward system provides richer learning signals, allowing the agent to learn complex search behaviors (like when not to search) more effectively than binary reward systems.
Generalizability: The approach is model-agnostic (demonstrated across 3B, 7B, and 14B models) and improves performance across a wide range of question complexities, making it a robust solution for real-world information-seeking tasks.

Limitations: The current implementation relies on a static corpus (no live web search), struggles with extremely complex browsing tasks (e.g., code execution or deep page navigation), and requires manual tuning of reward hyperparameters.
