Gradually Excavating External Knowledge for Implicit Complex Question Answering

Imagine you are trying to solve a very tricky riddle, like: "Did any citizen of San Antonio vote for Boris Johnson?"

If you ask a standard AI (a Large Language Model or LLM) this question, it might get confused. It knows who Boris Johnson is, and it knows where San Antonio is, but it doesn't immediately connect the dots: San Antonio is in the US and Boris Johnson is a British politician, so a US citizen couldn't vote for him. The AI might just guess or say, "I don't know," because it's trying to answer everything in one giant leap of logic without checking its facts.

This paper introduces a new method called GEEK (Gradually Excavating External Knowledge) to fix this. Think of GEEK not as a super-smart genius who knows everything instantly, but as a diligent detective who solves cases step-by-step.

Here is how GEEK works, using some everyday analogies:

1. The Detective vs. The Oracle

The Old Way (Standard AI): Imagine an Oracle sitting on a mountain. You ask a question, and it tries to spit out the answer immediately from its memory. If the answer isn't in its memory, or if the question is too complex, it fails. It's like trying to solve a math problem in your head without writing anything down.
The GEEK Way: Imagine a detective with a notepad and a library card. When asked a hard question, the detective doesn't guess. Instead, they break the problem down:
1. "Who is Boris Johnson?" (Checks the library).
2. "Where is San Antonio?" (Checks the library).
3. "Can US citizens vote in UK elections?" (Checks the library).
4. "Okay, now I can solve the riddle."

2. The Three Tools in the Detective's Kit

GEEK uses three specific "tools" (modules) that work together:

The Brain (Core Model): This is the detective's brain. It looks at the question and decides, "Do I know the answer? No? Then I need to break this down into smaller questions." It picks the next step, like deciding to look up a fact or do a logic check.
The Librarian (Retriever): When the Brain says, "I need to know about Boris Johnson," the Librarian runs to the massive library (the internet/Wikipedia) and grabs the top 10 most relevant pages.
The Summarizer (Extractor): The Librarian brings back 10 long, boring pages. The Summarizer reads them quickly and writes down just the one sentence that matters: "Boris Johnson is the former Prime Minister of the United Kingdom."

3. The "Gradual Excavation" Process

The name "Gradually Excavating" is a perfect metaphor. Imagine digging for gold.

Step 1: You dig a small hole. You find a rock. You realize, "This isn't gold, but it tells me I'm in the right mountain range."
Step 2: You dig a bit deeper based on that rock. You find a map.
Step 3: The map tells you exactly where to dig next.
Result: You find the gold (the answer).

GEEK does this with information. It digs up a fact, uses that fact to change its strategy, digs up another fact, and keeps going until the answer becomes obvious.

4. Exploring Different Paths (Strategy Exploration)

Sometimes, a detective might think, "Maybe I should check the library first," while another thought says, "No, let's check the police records first."
GEEK is smart enough to try multiple paths at once. It creates a few different "what-if" scenarios (strategies) and follows them all. If one path leads to a dead end, it abandons it. If another path leads to the answer, it takes that one. It's like sending out four different scouts to find the treasure; eventually, one of them will find the right map.

Why is this a Big Deal?

It's Smarter, Not Bigger: Usually, to make AI smarter, companies make the AI "bigger" (giving it more memory and processing power), which costs a fortune. GEEK shows you can get super-smart results with a smaller, cheaper model just by giving it a better process (the detective workflow).
It's Honest: Because GEEK shows its work (the sub-questions and the facts it found), you can see why it gave the answer. It is less likely to hallucinate (make things up) because it builds its answer on retrieved evidence.
The Results: On a tough test called "StrategyQA," GEEK got 78.17% accuracy. That's notable because it used a model that is less than 6% of the size of the giant models used by competitors. It's like a compact car winning a race against a massive truck.

In Summary

GEEK is a framework that teaches AI to stop guessing and start investigating. Instead of trying to be a god who knows everything instantly, it acts like a curious human who asks small questions, checks the facts, and slowly builds up the knowledge needed to solve the hardest puzzles.

Here is a detailed technical summary of the paper "Gradually Excavating External Knowledge for Implicit Complex Question Answering" (GEEK).

1. Problem Definition

The paper addresses the challenge of Open-Domain Implicit Complex Question Answering.

Implicit Complexity: Unlike standard QA, these questions require multi-step logical reasoning where the decomposition strategy is not explicitly stated in the question text. The model must infer sub-questions and the logical path to solve them.
Knowledge Limitations: Large Language Models (LLMs) often fail because:
1. Outdated/Uncovered Knowledge: Their pre-trained parameters lack specific, up-to-date, or niche domain facts.
2. One-Shot Generation: Generating a final answer in a single pass restricts the model's ability to comprehensively explore the solution space or correct intermediate reasoning errors.
The Gap: Existing methods often rely on static prompts or assume a linear progression of reasoning, failing when the question requires dynamic strategy adjustment based on newly acquired external facts.

2. Methodology: The GEEK Framework

The authors propose GEEK (Gradually Excavating External Knowledge), an iterative framework where an LLM actively acquires external information and refines its reasoning strategy step-by-step.

Core Architecture

GEEK consists of three interacting modules:

Core Model (Controller): A pretrained LLM (Flan-T5-11B) that acts as the decision-maker. It observes the current "Question State" ( $Q_t$ ) and selects an action from an action space.
Retriever: A neural retriever (DPR) that fetches relevant paragraphs from an external corpus (e.g., Wikipedia) based on the current sub-question.
Extractor: A specialized model (FiD architecture) that condenses retrieved paragraphs into concise, factual sentences to update the knowledge base.
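
To make the hand-off between the Retriever and the Extractor concrete, here is a minimal sketch. It assumes a toy word-overlap scorer in place of DPR and a pick-the-best-sentence heuristic in place of the FiD extractor; the corpus, function names, and scoring are hypothetical and are not the paper's implementation.

```python
import re

# Hypothetical three-paragraph corpus standing in for Wikipedia.
CORPUS = [
    "Boris Johnson is a British politician. He served as Prime Minister of the United Kingdom.",
    "San Antonio is a city in Texas. Texas is a state in the United States.",
    "The Alamo is a historic site. It is located in San Antonio.",
]

def tokens(text):
    """Lowercased word set; a crude stand-in for a learned dense encoder."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(sub_question, corpus, k=2):
    """Toy retriever: rank paragraphs by word overlap (DPR would compare dense vectors)."""
    q = tokens(sub_question)
    ranked = sorted(corpus, key=lambda p: len(q & tokens(p)), reverse=True)
    return ranked[:k]

def extract(sub_question, paragraphs):
    """Toy extractor: keep the single sentence that overlaps most with the sub-question."""
    q = tokens(sub_question)
    sentences = [
        s.strip()
        for p in paragraphs
        for s in re.split(r"(?<=\.)\s+", p)
        if s.strip()
    ]
    return max(sentences, key=lambda s: len(q & tokens(s)))

sub_question = "Who is Boris Johnson?"
top_paragraphs = retrieve(sub_question, CORPUS)
fact = extract(sub_question, top_paragraphs)
print(fact)
```

The point of the two-stage design is that the Core Model never sees the long retrieved paragraphs, only the short fact that comes out of the Extractor.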

The Iterative Process

The system operates in a loop where the Question State ( $Q_t$ ) is updated with historical sub-questions and their corresponding facts. At each step, the Core Model selects one of four actions:

AddDecomp: Generates a new sub-question ( $d_t$ ) to decompose the problem.
- Innovation: Uses a "pre-answer trick" where the model generates a full chain of future sub-questions and pseudo-answers to ensure strategy coherence, though only the immediate step is executed.
Retrieve & Extract: If the sub-question requires external knowledge, the Retriever fetches top- $k$ paragraphs, and the Extractor summarizes them into a fact ( $f_t$ ).
SelfAnswer: If the sub-question is purely logical or the answer is already known from accumulated facts, the model answers directly without retrieval.
FinalAnswer: Once sufficient knowledge is gathered, the model synthesizes all facts to output the final answer (Yes/No).
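
The four actions above can be sketched as a small control loop. This is an illustration under stated assumptions: `scripted_core` is a hypothetical fixed-plan stand-in for the LLM controller, `lookup` is a hypothetical dictionary standing in for Retrieve & Extract, and the pre-answer trick is not modeled.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional

class Action(Enum):
    ADD_DECOMP = auto()
    RETRIEVE_EXTRACT = auto()
    SELF_ANSWER = auto()
    FINAL_ANSWER = auto()

@dataclass
class QuestionState:
    """Q_t: the question, answered sub-questions with facts, and one pending sub-question."""
    question: str
    history: list = field(default_factory=list)  # (sub_question, fact) pairs
    pending: Optional[str] = None

# Hypothetical facts that retrieval plus extraction would return.
FACTS = {
    "Who is Boris Johnson?": "Boris Johnson is a former Prime Minister of the United Kingdom.",
    "Where is San Antonio?": "San Antonio is a city in the United States.",
}

def lookup(sub_question):
    """Stand-in for the Retriever and Extractor."""
    return FACTS[sub_question]

def scripted_core(state):
    """Stand-in for the LLM controller: follows a fixed plan instead of generating one."""
    plan = [
        ("Who is Boris Johnson?", True),
        ("Where is San Antonio?", True),
        ("Are the United Kingdom and the United States the same country?", False),
    ]
    step = len(state.history)
    if step == len(plan):
        return Action.FINAL_ANSWER, "No"
    sub_question, needs_lookup = plan[step]
    if state.pending is None:
        return Action.ADD_DECOMP, sub_question
    if needs_lookup:
        return Action.RETRIEVE_EXTRACT, ""
    return Action.SELF_ANSWER, "No"

def solve(question, core, retrieve_and_extract, max_steps=10):
    """Run the loop until FinalAnswer is chosen or the step budget runs out."""
    state = QuestionState(question)
    for _ in range(max_steps):
        action, text = core(state)
        if action is Action.ADD_DECOMP:
            state.pending = text
        elif action is Action.RETRIEVE_EXTRACT:
            state.history.append((state.pending, retrieve_and_extract(state.pending)))
            state.pending = None
        elif action is Action.SELF_ANSWER:
            state.history.append((state.pending, text))
            state.pending = None
        else:  # Action.FINAL_ANSWER
            return text, state
    return None, state

answer, final_state = solve(
    "Did any citizen of San Antonio vote for Boris Johnson?", scripted_core, lookup
)
print(answer, len(final_state.history))
```

In the real system the controller decides each action from the current state rather than from a fixed plan; the sketch only shows how the state grows as sub-questions are added and resolved.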

Strategy Exploration (SE)

To handle the ambiguity of multiple valid reasoning paths, GEEK employs Strategy Exploration.

Instead of a single linear path, the model uses beam search to branch into multiple decomposition strategies (e.g., $n=4$ ) at key decision points.
These branches evolve independently, exploring a "latent solution tree."
The final answer is determined by a majority vote across the successful branches, significantly improving robustness.
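
A minimal sketch of the voting step follows. It assumes each branch is run to completion by a caller-supplied function and returns "Yes", "No", or `None` for a dead end; `explore`, `run_branch`, and the sample outcomes are hypothetical, and the beam search that creates the branches is not modeled here.

```python
from collections import Counter

def vote(branch_answers):
    """Majority vote over branches that finished; None marks a dead end."""
    finished = [a for a in branch_answers if a is not None]
    if not finished:
        return None
    # On a tie, Counter.most_common keeps the answer that was seen first.
    return Counter(finished).most_common(1)[0][0]

def explore(question, strategies, run_branch):
    """Run every candidate strategy independently, then vote on the results."""
    return vote([run_branch(question, s) for s in strategies])

# Hypothetical outcomes for n=4 branches: one disagrees, one hits a dead end.
outcomes = {"s1": "No", "s2": "Yes", "s3": None, "s4": "No"}
final = explore(
    "Did any citizen of San Antonio vote for Boris Johnson?",
    list(outcomes),
    lambda question, strategy: outcomes[strategy],
)
print(final)
```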

3. Key Contributions

Novel Pipeline: Introduction of GEEK, a framework that dynamically adjusts solving strategies by progressively excavating external knowledge, rather than relying on static prompts or one-shot generation.
Strategy Space Exploration: The ability to branch into multiple reasoning paths during the solving process, allowing the model to explore different strategies and combine their answers through a majority vote.
Efficiency and Performance: Demonstrating that a significantly smaller model (11B parameters) can outperform much larger models (up to 540B) by effectively leveraging external knowledge and iterative reasoning.

4. Experimental Results

The method was evaluated on the StrategyQA dataset, a benchmark for open-domain multi-step implicit questions.

Accuracy: GEEK achieved 78.17% accuracy.
Comparison:
- It sets a new State-of-the-Art (SOTA) for LLMs in the ~10B parameter scale.
- It outperforms competitors with ~300B parameters (e.g., PaLM, Gopher) and even surpasses methods using massive models like PaLM2 (340B) in specific configurations, despite using less than 6% of the parameters of its largest competitors.
- It significantly outperforms vanilla LLMs (e.g., ChatGPT with CoT) and other retrieval-augmented baselines (e.g., RR, Visconde).
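
As a quick check on the size comparison, the snippet below computes what fraction of each competitor's parameter count an 11B model represents, using only the sizes quoted in this summary. The dictionary labels are ad hoc, and "~300B" is treated as exactly 300 for the arithmetic.

```python
geek_params_b = 11  # Flan-T5-11B, in billions of parameters

# Competitor sizes as quoted in this summary, in billions of parameters.
competitors_b = {"~300B-class": 300, "PaLM2": 340, "largest (540B)": 540}

# GEEK's size as a percentage of each competitor's size.
ratios = {
    name: round(100 * geek_params_b / size, 2)
    for name, size in competitors_b.items()
}
for name, pct in ratios.items():
    print(f"{name}: {pct}%")
```

Every ratio comes out below 4%, which is consistent with the "less than 6%" figure.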
Ablation Studies:
- Removing Retrieval & Extraction dropped accuracy to ~71%.
- Removing Strategy Exploration dropped accuracy from 78.17% to 75.98%, indicating that exploring multiple paths contributes about two percentage points.
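
The size of each ablation effect can be read off with simple subtraction. Note that the no-retrieval figure is only given approximately ("~71%"), so the first gain below is a rough estimate; the variable names are illustrative.

```python
full = 78.17               # full GEEK accuracy (%)
without_se = 75.98         # Strategy Exploration removed (%)
without_retrieval = 71.0   # Retrieval & Extraction removed; reported only as "~71%"

# Gains are in percentage points, rounded to hide floating-point noise.
retrieval_gain = round(full - without_retrieval, 2)
se_gain = round(full - without_se, 2)
print(f"Retrieval & Extraction: about +{retrieval_gain} points")
print(f"Strategy Exploration: +{se_gain} points")
```

On these numbers, external knowledge accounts for the larger share of the improvement, with Strategy Exploration adding a smaller gain on top.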
Quality Assessment: A ChatGPT-based evaluation showed that 62.45% of GEEK's generated reasoning steps were preferred over human-annotated baselines in terms of informativeness and faithfulness.

5. Significance and Conclusion

Paradigm Shift: The paper challenges the notion that solving complex reasoning tasks requires massive model scaling. Instead, it advocates for organic knowledge excavation and strategic planning.
Explainability: Unlike black-box generation, GEEK provides a transparent, step-by-step reasoning process supported by retrieved evidence, making the decision-making traceable.
Future Impact: The work suggests that future advancements in QA may rely less on pre-training data volume and more on the ability of models to iteratively plan, retrieve, and refine strategies dynamically.

Limitations: The authors acknowledge that hallucination is still possible due to the black-box nature of neural networks, logic errors can occur even with correct answers, and the scarcity of open-domain complex QA datasets limits broader validation.