Imagine you have a brilliant but slightly overconfident student named Alex. Alex is great at writing essays and solving logic puzzles, but he has a fatal flaw: he thinks he knows everything.
If you ask Alex, "Who is the CEO of X Corp right now?" and he doesn't actually know, he won't say, "I don't know." Instead, he will confidently make up a name, like "Bob Smith," because he wants to give you a complete answer. This is what AI researchers call a hallucination.
The paper introduces a new way to train AI to stop making things up. The authors call it "Distilling Reasoning Without Knowledge."
Here is how it works, explained through a simple analogy.
The Problem: The "Know-It-All" Student
Most AI models today are like Alex. They try to do three things at once:
- Think about the problem.
- Remember facts from their training.
- Answer the question.
When the answer isn't in their memory, they panic and guess. Even if you tell them, "Go look it up on Google," they often get confused about what to search for, or they mix their own guesses with the search results.
The Solution: The "Project Manager" and the "Research Team"
The authors propose breaking the AI into two distinct roles, like a construction site:
- The Student Planner (The Project Manager): This is a small, fast, and cheap AI. Its only job is to figure out what needs to be done. It does not know any facts. It doesn't even know who the CEO of X Corp is. It just knows how to ask the right questions.
- The Tools (The Research Team): These are external tools like Google Search, calculators, and databases. They do the actual fetching of facts.
The Magic Trick: How They Train the Student
Usually, when you teach a student, you give them the answer key. "The answer is Elon Musk."
This paper does the opposite.
They use a super-smart "Teacher" AI to create a training dataset, but they delete the answers.
- Teacher: "Here is a question: 'Who is the CEO of X Corp?'"
- Teacher's Output (The Plan):
- Step 1: Search Google for "Current CEO of X Corp."
- Step 2: Extract the name of the CEO from the search results.
- Step 3: If a name comes back, calculate how long that person has held the role.
- Step 4: If nothing comes back, say "I don't know."
- Crucial Detail: The Teacher never tells the Student who the CEO actually is. It only teaches the Student how to ask the question.
The "Student" AI is then trained to copy this planning style. It learns to break a big question into small, searchable steps, but it is strictly forbidden from memorizing the answers.
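To make the idea concrete, here is a minimal sketch of what one distillation training example might look like. The field names and plan schema are illustrative assumptions, not the paper's exact format; the key point they illustrate is that the target the Student learns from contains the plan, never the final fact.

```python
# Hypothetical distillation example: the Student is trained to
# reproduce the plan, while the Teacher's answer is deleted.
training_example = {
    "question": "Who is the CEO of X Corp?",
    # The planning steps the Student learns to imitate:
    "target_plan": [
        {"step": 1, "action": "search", "query": "current CEO of X Corp"},
        {"step": 2, "action": "extract", "what": "person name from results"},
        {"step": 3, "action": "answer", "fallback": "I don't know"},
    ],
    # The factual answer is stripped before training:
    "answer": None,
}

def is_knowledge_free(example: dict) -> bool:
    """Check that no factual answer leaked into the training target."""
    return example["answer"] is None

print(is_knowledge_free(training_example))  # True
```

The invariant enforced here (an answer-free training target) is what keeps the planner from memorizing facts in the first place.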
The Workflow: How It Works in Real Life
When you ask the new system a question, here is the dance it performs (see Figure 1 in the paper):
- The Plan: The Student Planner looks at your question and writes a JSON "to-do list." It says, "I need to search for X, then calculate Y." It doesn't know the answer yet.
- The Search: The system takes that list and runs the searches on Google (using a tool called SerpAPI).
- The Extraction: A separate module reads the messy Google results and pulls out the clean facts (e.g., "Elon Musk is the CEO").
- The Assembly: A final module takes the plan and the facts and writes the answer.
If the search comes back empty, the system says, "I couldn't find the answer," instead of making one up.
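The four stages above can be sketched as a simple pipeline. The function names and the plan format here are assumptions for illustration only; in the actual system the planner is a trained model and the search stage calls a real tool (SerpAPI), but the division of labor is the same.

```python
# A minimal sketch of the plan -> search -> extract -> assemble workflow.

def plan(question: str) -> list[dict]:
    # The Student Planner emits a to-do list; it holds no facts itself.
    return [{"action": "search", "query": f"current answer to: {question}"}]

def run_searches(steps: list[dict]) -> list[str]:
    # Stand-in for the search tool (SerpAPI in the paper).
    return []  # pretend the search came back empty

def extract_facts(results: list[str]) -> list[str]:
    # The extraction module pulls clean facts out of messy raw results.
    return [r.strip() for r in results if r.strip()]

def assemble(question: str, facts: list[str]) -> str:
    # The assembly module answers only from retrieved facts,
    # never from the planner's memory.
    if not facts:
        return "I couldn't find the answer."
    return f"Based on what I found: {facts[0]}"

def answer(question: str) -> str:
    steps = plan(question)
    results = run_searches(steps)
    facts = extract_facts(results)
    return assemble(question, facts)

print(answer("Who is the CEO of X Corp?"))  # "I couldn't find the answer."
```

Because the fallback lives in the assembly stage, an empty search produces an honest "I don't know" rather than a guess.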
Why Is This Better?
The authors tested this on SEAL-0, a "nightmare" benchmark designed to break AI. These are questions so tricky that even the smartest AI models usually score near 0%, because they hallucinate or get stuck in loops.
- Old Way (Monolithic AI): Tries to think and remember everything at once. Gets confused, guesses, and fails. (Accuracy: ~1.8%)
- Prompted AI (Asking a normal AI to "think step-by-step"): Better, but still gets confused about what to search for. (Accuracy: ~6.3%)
- The New Framework (The Specialized Planner): Because the planner was trained only on how to ask questions, not on facts, it is incredibly efficient. It knows exactly what to search for and doesn't waste time guessing.
- Result: It jumped to 10.8% accuracy (a huge win on a near-impossible test) and was 3x faster than the others.
The Takeaway
Think of this framework as teaching an AI to be a great librarian rather than a great encyclopedia.
- An Encyclopedia tries to hold all the answers in its head. If it forgets, it lies.
- A Librarian knows exactly which books to pull off the shelf to find the truth. If the book isn't there, the Librarian admits they don't know.
By separating the "thinking" (planning) from the "knowing" (retrieving), the authors created an AI that is more reliable, faster, and much less likely to lie to you. They proved that for AI to be truly trustworthy, it needs to learn how to look for answers, not just memorize them.