Here is an explanation of the paper HypoSpace, broken down into simple concepts with creative analogies.
The Big Problem: "The Mystery with Many Answers"
Imagine you walk into a room and find a broken vase on the floor. You see the pieces, but you didn't see who broke it.
- Hypothesis A: The cat knocked it over.
- Hypothesis B: The wind blew the curtain and hit it.
- Hypothesis C: A burglar sneaked in and dropped it.
All three stories fit the evidence perfectly. In science, this is called underdetermination. The data (the broken vase) doesn't point to just one truth; it allows for many different truths.
The problem with current AI (Large Language Models or LLMs) is that they are great at finding one good answer, but terrible at finding all the possible answers. They tend to get stuck on the first idea that comes to mind and keep repeating it, even if other valid ideas exist.
The Solution: HypoSpace (The "Idea Explorer" Benchmark)
The authors created a new test called HypoSpace. Instead of asking an AI, "What is the answer?" they ask, "Show me every possible answer you can think of that fits the clues."
Think of HypoSpace as a giant, locked treasure chest where the key is a specific set of clues. The AI is a treasure hunter. The goal isn't just to find a key; it's to find every single key that opens the chest.
To measure how good the AI is, they use three simple scores:
- Validity (Is it a real key?): Does the AI's idea actually fit the clues? If the AI says "The moon broke the vase," that's invalid. If it says "The cat," that's valid.
- Uniqueness (Is it a new key?): Did the AI just repeat the same idea 10 times? Or did it come up with 10 different ideas?
- Recovery (Did it find the whole chest?): If there are exactly 100 keys that fit, did the AI find all 100? Or did it only find 5 and then stop?
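The three scores boil down to simple set arithmetic. The sketch below is a hypothetical formulation, not the paper's exact definition (in particular, how Uniqueness is normalized may differ), but it captures the three questions above:

```python
def score_hypotheses(samples, ground_truth):
    """Score generated hypotheses against the known full answer set.

    samples: hypotheses the model produced (repeats and invalid ones allowed)
    ground_truth: the complete set of hypotheses that truly fit the clues
    """
    valid = [h for h in samples if h in ground_truth]  # "real keys"
    unique_valid = set(valid)                          # distinct "real keys"

    validity = len(valid) / len(samples) if samples else 0.0
    uniqueness = len(unique_valid) / len(valid) if valid else 0.0
    recovery = len(unique_valid) / len(ground_truth)   # share of the chest found
    return validity, uniqueness, recovery

# Toy instance: three true explanations, five guesses with repeats and one invalid
truth = {"cat", "wind", "burglar"}
guesses = ["cat", "cat", "wind", "cat", "moon"]
v, u, r = score_hypotheses(guesses, truth)  # 0.8, 0.5, ~0.67
```

Notice the pattern HypoSpace is built to expose: a model can score a perfect Validity while Uniqueness and Recovery stay low, simply by repeating one correct answer.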
The Three "Training Gyms"
To test the AI, they built three specific puzzle rooms where they know the exact number of correct answers in advance:
- The Detective Game (Causal Inference): You see a graph of events (like "Pressing button A makes light B turn on"). The AI has to draw all the possible wiring diagrams that could explain this.
- The Gravity Puzzle (3D Reconstruction): You see a shadow of a 3D block structure. The AI has to guess all the different ways blocks could be stacked to create that exact shadow, obeying the laws of gravity.
- The Genetic Recipe (Boolean Logic): You see how mixing two ingredients creates a result (e.g., "Red + Blue = Purple"). The AI has to write all the possible "recipes" (math formulas) that explain why that happens.
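A tiny brute-force version of the Boolean-recipe idea shows why the benchmark can know the exact number of correct answers in advance: with few enough ingredients, every possible recipe can be enumerated and checked against the observations. The instance below is invented for illustration (the paper's tasks are richer); a recipe over two binary inputs is just a 4-entry truth table:

```python
from itertools import product

# Observed mixes: ((ingredient_a, ingredient_b), result), all binary.
# The combination (1, 1) is deliberately never observed.
observations = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0)]

def matches(table, obs):
    """Does this truth table reproduce every observed mix?"""
    return all(table[2 * a + b] == out for (a, b), out in obs)

# Enumerate all 16 possible 2-input truth tables and keep the ones that fit
valid_recipes = [t for t in product([0, 1], repeat=4) if matches(t, observations)]
# Because (1, 1) was never observed, exactly two recipes survive:
# "always 0" and "AND(a, b)" -- the data underdetermines the answer.
```

This is underdetermination in miniature: the unobserved case leaves two equally valid recipes, and a model that reports only one has missed half the hypothesis space.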
What They Found: The "Echo Chamber" Effect
When they tested the smartest AI models (like GPT-5, Claude, etc.), they found a consistent flaw:
- The AI is a "One-Hit Wonder": its Validity score is almost always high; it reliably finds a correct answer.
- But it gets stuck in a loop: as the puzzle gets harder (more possible answers), the Uniqueness and Recovery scores crash.
The Analogy: Imagine a DJ playing music. A good DJ plays a whole album with different songs. These AIs are like a DJ who finds one great song, plays it, and then just plays that same song over and over again, even though the record store is full of other great tracks. They get "stuck" on a few popular ideas and ignore the rest.
Why Does This Happen?
The paper explains that AI models are trained to predict the most likely next word. This makes them "peaked." They are like a person who only eats their favorite food. Even if there are 1,000 delicious dishes in the world, they will only order the one they know they like.
When the "hypothesis space" (the number of possible answers) gets huge, the AI's favorite answers become a tiny drop in the ocean. It keeps sampling the same few drops and never explores the rest of the ocean.
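The "drop in the ocean" effect can be simulated directly. The sketch below uses purely illustrative numbers: it compares a peaked sampler, where a few favorite hypotheses hoard almost all the probability, against a uniform one, counting how many distinct hypotheses each recovers on the same sampling budget:

```python
import random

random.seed(0)
space = list(range(100))               # 100 equally valid hypotheses ("the ocean")

# A peaked model: three favorite answers hold almost all the probability mass
peaked_weights = [50.0 if h < 3 else 0.1 for h in space]

def unique_found(weights, n_samples=100):
    """Draw n_samples hypotheses and count how many distinct ones turn up."""
    draws = random.choices(space, weights=weights, k=n_samples)
    return len(set(draws))

peaked = unique_found(peaked_weights)  # keeps resampling the same few drops
uniform = unique_found([1.0] * 100)    # sweeps far more of the ocean
```

Every draw from the peaked sampler is "valid" in this toy, yet it recovers only a small fraction of the space, which is exactly the Validity-high, Recovery-low signature the benchmark observes.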
The Fix: "Complexity Stratified Decoding"
The authors tried a simple trick to fix this. Instead of letting the AI pick whatever it wants, they forced it to organize its search by difficulty.
- The Old Way: "Give me 100 ideas." (The AI gives 100 variations of the same simple idea).
- The New Way: "Give me 10 simple ideas, 10 medium ideas, and 10 complex ideas."
This forced the AI to look in the "complex" part of the room that it usually ignores. It helped the AI find more unique answers, showing that the problem wasn't that the AI couldn't think of them, but that it never bothered to look for them.
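The old-way/new-way contrast can be mocked up with a toy sampler that mimics a "peaked" model whose favorite simple hypotheses absorb nearly all probability. Everything here (the tiers, the weights, the sizes) is invented for illustration, not taken from the paper:

```python
import random

random.seed(1)
space = list(range(90))                # 90 equally valid hypotheses

def tier(h):
    return h // 30                     # hypothetical complexity tiers: 0, 1, 2

# The model's favorites are all simple (tier 0) and dominate the weights
weights = [100.0 if h < 3 else 0.1 for h in space]

def sample(pool, n):
    """Draw n hypotheses from a pool, following the model's skewed preferences."""
    return set(random.choices(pool, weights=[weights[h] for h in pool], k=n))

# The old way: one big ask -- the simple favorites dominate every draw
flat = sample(space, 30)

# The new way: force 10 draws from each complexity tier
stratified = set()
for t in range(3):
    stratified |= sample([h for h in space if tier(h) == t], 10)
# Stratified asks reach tiers the flat ask almost never touches
```

The stratified search recovers more distinct hypotheses from the same budget, because the quota forces draws in regions where the model's own preferences would never take it.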
The Real-World Test
They even tested this on real genetic data from yeast (tiny organisms). They found that even in real science, there are often dozens of valid explanations for how genes interact. The AI models showed the same "stuck" behavior here: they found one valid explanation but missed most of the others.
The Takeaway
HypoSpace isn't about making AI smarter at solving puzzles; it's about diagnosing how they think. It reveals that current AIs are excellent at finding one truth, but terrible at exploring the landscape of all possible truths.
For science to advance, we need AI that doesn't just give us the first answer it thinks of, but acts like a curious scientist who explores every corner of the room to make sure we haven't missed a better explanation.