OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

OpenSeeker is the first fully open-source search agent to achieve frontier-level performance. By combining fact-grounded, scalable, and controllable QA synthesis with denoised trajectory synthesis, it reaches that level with a compact training set of just 11.7k samples, democratizing access to high-quality search-agent research and surpassing both leading open-source and industrial competitors.

Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu, Yuzhu Cai, Siheng Chen

Published 2026-03-17

OpenSeeker: The Open-Source "Search Detective" That Shatters the Corporate Wall

Imagine the internet as a massive, chaotic library with billions of books, but most of them are written in riddles, filled with fake news, or hidden behind locked doors. For a long time, only a few giant tech companies (like Google, OpenAI, and Alibaba) had the "Master Keys" to find the right answers quickly. They built their own super-smart search assistants, but they kept their secret recipe (the training data) locked in a vault, saying, "You can use our robot, but you can't see how we built it."

This paper introduces OpenSeeker, a project by a team of university researchers who decided to say, "No more secrets." They built a search agent that is just as good as the corporate giants, but they are giving away the entire recipe, including the ingredients and the cooking instructions, to the public.

Here is how they did it, explained through simple analogies:

1. The Problem: The "Black Box" Kitchen

Until now, building a top-tier search agent was like trying to bake a Michelin-star cake without ever seeing the recipe.

  • The Giants: They have massive kitchens, millions of dollars, and secret ingredients (data). They make great cakes, but they won't tell you what's in them.
  • The Academics: They were trying to bake with flour they found on the street. The result? The cakes were dry, or they tasted like cardboard. The community was stuck because they lacked high-quality, transparent data.

2. The Solution: OpenSeeker's Two Secret Weapons

The researchers didn't just scrape random questions off the internet. They invented two clever tricks to generate their own "perfect training data."

Trick #1: The "Web Map Puzzle" (Fact-Grounded QA Synthesis)

Imagine you want to teach a detective how to solve a complex mystery. If you just say, "Find the killer," they might guess. If you say, "The killer was at the park, then went to the bakery, and the baker saw him," it's too easy.

OpenSeeker creates a digital treasure hunt:

  1. The Map: They start with a real page on the internet (a seed).
  2. The Expansion: They look at all the links connected to that page, then the links connected to those pages, building a small "web map" of related facts.
  3. The Obfuscation (The Magic): This is the genius part. They take the clear facts on the map and blur them. Instead of saying "The baker is named Bob," they say, "The person who makes the bread."
  4. The Result: They create a question that forces the AI to hop from page to page (multi-hop reasoning) to connect the dots. It's like giving the AI a puzzle where the pieces are scattered across the whole internet, and it has to figure out how they fit together.

Why this matters: It stops the AI from just guessing or memorizing facts. It forces it to actually search and think.
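
For readers who like to see the machinery, here is a toy sketch of that treasure-hunt pipeline. It is my own illustration, not the authors' released code: the tiny in-memory WEB graph, the page names, and the obfuscation map are all made up.

```python
# Toy web graph: page -> (facts on that page, outgoing links).
WEB = {
    "bakery_page": (["Bob bakes the bread at Riverside Bakery."], ["bob_page"]),
    "bob_page":    (["Bob was born in Leeds."],                   ["leeds_page"]),
    "leeds_page":  (["Leeds sits on the River Aire."],            []),
}

# Obfuscation map: swap a concrete entity for an indirect description.
OBFUSCATE = {"Bob": "the person who bakes the bread at Riverside Bakery"}


def expand(seed: str, hops: int) -> list[str]:
    """Breadth-first link expansion from a seed page, collecting facts."""
    frontier, seen, facts = [seed], set(), []
    for _ in range(hops + 1):
        nxt = []
        for page in frontier:
            if page in seen or page not in WEB:
                continue
            seen.add(page)
            page_facts, links = WEB[page]
            facts.extend(page_facts)
            nxt.extend(links)
        frontier = nxt
    return facts


def obfuscate(question: str) -> str:
    """Blur named entities so the question can't be answered in one lookup."""
    for name, blur in OBFUSCATE.items():
        question = question.replace(name, blur)
    return question


facts = expand("bakery_page", hops=2)
assert "Leeds sits on the River Aire." in facts  # the answer is reachable

print(obfuscate("Which river runs through the city where Bob was born?"))
# -> Which river runs through the city where the person who bakes the bread
#    at Riverside Bakery was born?   (answer: the River Aire, three hops away)
```

The real pipeline does this at web scale, with an LLM doing the blurring and the question writing; the sketch only captures the shape of it: expand, collect, obfuscate, ask.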

Trick #2: The "Noise-Canceling Headphones" (Denoised Trajectory Synthesis)

When a human searches the web, they get hit with a wall of noise: pop-up ads, irrelevant articles, and confusing text. If you teach an AI by showing it the raw, messy internet, it gets confused.

OpenSeeker uses a two-step teaching method:

  1. The Teacher (The Clean Version): First, a smart AI acts as a teacher. It looks at the messy search results, summarizes the important parts, and throws away the junk. It then writes down the perfect steps to solve the problem based on this clean summary.
  2. The Student (The Messy Version): Now, the student AI (OpenSeeker) is trained. But here's the twist: The student is shown the messy, raw version of the search history, but it has to predict the perfect steps the teacher wrote.

The Analogy: Imagine learning to drive in a car with a foggy windshield (the raw data). The teacher tells you exactly when to turn, but they are looking through a clear window. You have to learn to ignore the fog and figure out the turn yourself. By the end, your brain learns to "see through the noise" automatically.
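
Here is a toy sketch of how one such training pair could be assembled. Again, this is my own illustration, not the released pipeline: the NOISE_MARKERS filter stands in for an LLM summarizer, and the teacher callable stands in for the teacher model.

```python
NOISE_MARKERS = ("[AD]", "Subscribe now", "Cookie notice")


def summarize(raw_observation: str) -> str:
    """Teacher-side cleanup: drop lines that look like ads or boilerplate."""
    kept = [line for line in raw_observation.splitlines()
            if not any(marker in line for marker in NOISE_MARKERS)]
    return "\n".join(kept)


def build_training_example(question, raw_history, teacher):
    """Pair raw, noisy inputs with steps the teacher wrote from clean inputs."""
    clean_history = [summarize(obs) for obs in raw_history]
    gold_steps = teacher(question, clean_history)  # teacher sees only clean text
    # The student is later fine-tuned to map the *raw* history to these steps,
    # so it learns to see through the noise on its own.
    return {"question": question,
            "observations": raw_history,   # messy input the student will see
            "target_steps": gold_steps}    # clean supervision from the teacher


# Usage with a stub teacher:
example = build_training_example(
    "Which river runs through Leeds?",
    ["[AD] Buy now!\nLeeds sits on the River Aire.\nCookie notice"],
    teacher=lambda q, hist: ["search('Leeds river')", "answer('River Aire')"],
)
print(example["target_steps"])
```

The design choice worth noticing is the asymmetry: the teacher never sees the fog, and the student never sees the clean window. That mismatch is exactly what forces the student to learn denoising.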

3. The Results: The Underdog Wins

The team trained their model (OpenSeeker) using only 11,700 of these specially crafted puzzles. That's a tiny amount of data compared to the billions used by big companies.

  • The Scoreboard: In tests, OpenSeeker didn't just do well; it beat the competition.
    • It beat other open-source models by a huge margin.
    • It even beat Tongyi DeepResearch (a model from Alibaba built with massive resources, a complex training pipeline, and reinforcement learning) on Chinese search tasks.
    • It did all this with plain supervised fine-tuning (SFT) and a single training run, while the giants used supercomputers and years of work.
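
The SFT objective itself is the standard one: next-token prediction, with the loss masked so the model is only graded on the target steps and not on the prompt. Below is a minimal sketch of that common recipe, assuming a HuggingFace-style causal LM whose forward pass returns .logits; it is not code from the OpenSeeker release.

```python
import torch
import torch.nn.functional as F

IGNORE = -100  # label value that cross_entropy skips


def sft_loss(model, prompt_ids: torch.Tensor, target_ids: torch.Tensor):
    """Next-token loss computed only on the target (response) tokens."""
    input_ids = torch.cat([prompt_ids, target_ids]).unsqueeze(0)  # (1, seq)
    labels = input_ids.clone()
    labels[:, : prompt_ids.size(0)] = IGNORE   # don't train on the prompt

    logits = model(input_ids).logits           # (1, seq, vocab)
    # Shift by one so the model at position t predicts the token at t+1.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=IGNORE,
    )


# Usage (model name is a placeholder, not the paper's base model):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained("your-base-model")
# loss = sft_loss(model, prompt_ids, target_ids)
# loss.backward()
```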

4. Why This Changes Everything

This paper is a game-changer for three reasons:

  1. Democratization: It proves you don't need billions of dollars to build a world-class search agent. You just need smart data.
  2. Transparency: They released the code, the model, and the data. Anyone can look at how they made the puzzles, anyone can train their own version, and anyone can improve it.
  3. The End of the "Data Moat": The big companies thought their secret data was their only advantage. OpenSeeker showed that if you synthesize data intelligently (like a master chef making a stock from scratch), you can beat them even without their secret ingredients.

The Bottom Line

OpenSeeker is like a group of students in a garage who figured out how to build a Ferrari engine using a blueprint they made themselves, while the big car companies were still guarding their blueprints. They are handing that blueprint to the world, saying, "Here, build your own. Let's make the future of search open, transparent, and collaborative for everyone."
