SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models

The paper introduces SynthWorlds, a scalable framework that constructs parallel real and synthetic corpora with identical structures to disentangle and evaluate the distinct contributions of reasoning and parametric knowledge in language models, revealing a persistent performance gap even when knowledge is augmented.

Ken Gu, Advait Bhat, Mike A Merrill, Robert West, Xin Liu, Daniel McDuff, Tim Althoff

Published Wed, 11 Ma

Imagine you are trying to test if a student is actually smart (good at reasoning) or just well-read (good at memorizing facts).

Usually, when we give a student a test, they can cheat by using their memory. For example, if you ask, "Who is the president of the USA?", a smart student might reason through the election process, but a memorizer just says "Joe Biden" because they've heard it a thousand times. It's hard to tell if they figured it out or just recalled it.

The paper "SynthWorlds" introduces a brilliant way to solve this problem. Here is the simple breakdown:

1. The Problem: The "Memory Cheat"

Current AI models are like students who have read the entire internet. When we test them, they often win not because they are good at logic, but because they remember the answer from their training data.

  • Real World: You ask, "Who is the CEO of Apple?" The AI knows this because it memorized it.
  • The Issue: We don't know if the AI can reason its way to the answer if it didn't already know it.

2. The Solution: Building Two Parallel Universes

The researchers built a framework called SynthWorlds. Think of it as creating two identical video games side-by-side:

  • World A (The Real World): This is our normal world. The characters are real people (like Elon Musk), the cities are real (like New York), and the facts are true. The AI can use its "memory" here.
  • World B (The Synthetic World): This is a mirror world that looks and feels exactly the same, but everything has been renamed.
    • Elon Musk is now "Zog the Space Traveler."
    • New York is now "Metro City."
    • Apple is now "FruitTech."
    • Crucially: The relationships are identical. Zog still owns FruitTech. Metro City is still in the USA.

In World B, the AI has zero memory of "Zog" or "Metro City" because it never saw them in its training data. If the AI gets the answer right in World B, it must be using pure reasoning.
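The parallel-world construction above boils down to a simple idea: rename every entity, but leave the relational structure untouched. Here is a minimal sketch in Python, using the article's toy examples (the names, relations, and mapping are illustrative, not taken from the paper's actual pipeline):

```python
# Hypothetical sketch of SynthWorlds-style parallel corpus construction:
# swap every real entity for an invented counterpart while keeping the
# relational structure (who owns what, what is where) identical.

real_to_synthetic = {
    "Elon Musk": "Zog the Space Traveler",
    "New York": "Metro City",
    "Apple": "FruitTech",
}

# Facts as (subject, relation, object) triples -- the "structure" of the world.
real_facts = [
    ("Elon Musk", "owns", "Apple"),
    ("New York", "located_in", "USA"),
]

def to_synthetic(facts, mapping):
    """Rename entities in each triple; relations stay exactly the same."""
    rename = lambda entity: mapping.get(entity, entity)
    return [(rename(s), r, rename(o)) for s, r, o in facts]

synthetic_facts = to_synthetic(real_facts, real_to_synthetic)
print(synthetic_facts)
# [('Zog the Space Traveler', 'owns', 'FruitTech'), ('Metro City', 'located_in', 'USA')]
```

Because only the names change, any puzzle solvable in World A is solvable by the same chain of reasoning in World B; the only thing World B removes is the option of recalling the answer from memory.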

3. The Experiment: The "Knowledge Advantage Gap"

The researchers gave the AI the same puzzles in both worlds.

  • In World A: The AI might say, "I know Elon Musk is the CEO because I read it online!" (Memorization).
  • In World B: The AI has to look at the clues provided in the text and think, "Okay, the text says Zog owns FruitTech, so Zog is the CEO." (Reasoning).

They measured the difference in scores between the two worlds. This difference is called the "Knowledge Advantage Gap."

  • Big Gap: The AI is mostly just reciting facts. It fails in the new world because it can't think without its memory crutch.
  • Small Gap: The AI is actually good at reasoning. It can solve the puzzle even when the names are fake.
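As a toy illustration, the gap is just the score difference between the two worlds (the accuracy numbers below are invented for the example, not results from the paper):

```python
# Hypothetical illustration of the "Knowledge Advantage Gap": the score
# difference between the real world (where memorized facts can help) and
# the synthetic world (where only reasoning is available).

def knowledge_advantage_gap(real_accuracy, synthetic_accuracy):
    """A larger positive gap means the model leaned more on memorized facts."""
    return real_accuracy - synthetic_accuracy

gap = knowledge_advantage_gap(real_accuracy=0.82, synthetic_accuracy=0.55)
print(f"Knowledge Advantage Gap: {gap:.2f}")  # prints "Knowledge Advantage Gap: 0.27"
```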

4. The Findings: The Crutch is Still There

The researchers tested this with advanced AI models using two types of tasks:

  1. Multi-hop Question Answering: "Who is the boss of the person who invented the lightbulb?" (Requires connecting dots).
  2. Page Navigation: "Click through these links to get from 'Zog' to 'FruitTech'." (Requires planning a path).
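The first task, multi-hop question answering, amounts to chaining facts the model has never seen in training. A minimal sketch, continuing the article's synthetic-world examples (the intermediate entity "Quorn" and the relation names are invented for illustration):

```python
# Two-hop QA over a tiny synthetic-world fact store: answering
# "Who is the boss of the person who invented the lightbulb?"
# requires following one fact into another, not recalling the answer.

facts = {
    ("lightbulb", "invented_by"): "Quorn",               # hop 1
    ("Quorn", "works_for"): "Zog the Space Traveler",    # hop 2
}

def answer_two_hop(entity, hop1, hop2, kb):
    """Hop 1: find the intermediate entity; hop 2: follow the next relation."""
    intermediate = kb[(entity, hop1)]
    return kb[(intermediate, hop2)]

boss = answer_two_hop("lightbulb", "invented_by", "works_for", facts)
print(boss)  # prints "Zog the Space Traveler"
```

In the synthetic world, neither lookup can be skipped by memory: the model must actually perform both hops, which is exactly the behavior the benchmark is designed to isolate.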

What they found:

  • The Gap Persists: Even when they gave the AI a "cheat sheet" (retrieving documents to read during the test), the AI still performed much better in the Real World than the Synthetic World.
  • The "Shortcut" Effect: In the Real World, the AI often took shortcuts. It didn't read the whole document; it just recalled the fact. In the Synthetic World, it had to actually read and think, which is harder.
  • Augmentation Helps, But Doesn't Fix It: Giving the AI access to search engines or extra text helped it a little, but it didn't close the gap completely. The AI still relied too much on what it had memorized before.

The Big Takeaway

SynthWorlds is like a "lie detector" for AI intelligence. It shows that even our smartest AI models often act like parrots with a massive memory rather than true thinkers. They struggle to reason when they can't rely on their "cheat sheet" of memorized facts.

This framework gives scientists a controlled way to build better AI that can actually think in new, unfamiliar situations, rather than just reciting what it learned in school.