Imagine you are trying to figure out how good a new student is at math. You give them a worksheet with 100 problems. They get 98 right. You might think, "Wow, this student is a genius!"
But here's the catch: What if the student didn't actually do the math? What if they just memorized the look of the questions? If you change the numbers slightly, or ask the same question in a different way, they might fail completely because they were just matching patterns, not actually reasoning.
This is the problem with testing current Large Language Models (LLMs) like GPT-4 or Claude. They are great at getting high scores on standard tests, but we don't really know why they get them right or where their limits actually are.
Enter X-RAY.
The authors of this paper built a special "medical scanner" for AI brains. Instead of just giving the AI a test and seeing if it passes, X-RAY dissects the test itself to see exactly what kind of thinking the AI is doing.
Here is how it works, broken down into simple concepts:
1. The "Recipe" vs. The "Dish"
Most AI tests are like serving a dish. You give the AI a problem (the dish), and it gives you an answer. If the answer is right, you give it a gold star.
X-RAY changes the game. It treats every problem like a recipe.
- The Old Way: "Make a cake." (The AI might guess the ingredients based on what it's seen before).
- The X-RAY Way: The researchers use computer code to build the recipe step-by-step. They can tweak the recipe precisely.
- Recipe A: "Add 2 eggs."
- Recipe B: "Add 2 eggs and 1 cup of sugar."
- Recipe C: "Add 2 eggs, 1 cup of sugar, and bake at a temperature that changes every minute."
Because the recipe is built by a computer, they know for a fact that the answer is correct. They can then ask the AI to solve Recipe A, then B, then C, and see exactly where it starts to get confused.
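The recipe idea above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration (the function `make_problem` and the arithmetic format are my invention, not the paper's actual generator): because the code builds the problem step by step, the ground-truth answer comes for free, and adding one more step gives you Recipe A, B, C of increasing difficulty.

```python
import random

def make_problem(num_constraints: int, seed: int = 0):
    """Build an arithmetic 'recipe' with a known answer.

    Hypothetical sketch: each added constraint is one more
    operation, so difficulty grows one step at a time.
    """
    rng = random.Random(seed)
    value = rng.randint(1, 9)
    steps = [f"Start with {value}."]
    for _ in range(num_constraints):
        n = rng.randint(1, 9)
        op = rng.choice(["add", "multiply by"])
        if op == "add":
            value += n
        else:
            value *= n
        steps.append(f"Then {op} {n}.")
    # Because we built the problem ourselves, we know the
    # correct answer without needing any human labeling.
    return " ".join(steps) + " What is the result?", value

question, answer = make_problem(num_constraints=3)
```

The same seed always yields the same problem, so a failure can be reproduced exactly; bump `num_constraints` and you can watch precisely where a model starts to break.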
2. The Two Types of "Hard"
The paper discovered that AI models break in two very different ways, like a car hitting a bump versus a car hitting a wall.
Type 1: Tightening the Constraints (The "Bump")
Imagine you are looking for a specific person in a crowd.
- Easy: "Find a man."

- Harder: "Find a man wearing a red hat."
- Hardest: "Find a man wearing a red hat, holding a blue umbrella, and standing near a tree."
AI models cope with this kind of difficulty. Each extra constraint just narrows down the same search. It's like adding more rules to a game; the AI can usually handle it.
Type 2: Restructuring the Solution (The "Wall")
Now, imagine the rules of the game change entirely.
- Easy: "Find a man."
- Hard: "Find a man, but the man is actually a robot, and you have to find the battery inside him."
This isn't just adding a rule; it changes the shape of the problem. The paper found that AI models crash hard here. They can handle "more of the same," but they struggle when the fundamental logic of the problem changes.
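The two kinds of "hard" can be shown concretely. In this hypothetical Python sketch (the `people` data and field names are invented for illustration), tightening just adds filters to the same search, while restructuring changes what kind of thing the answer even is:

```python
# Invented toy data for illustration only.
people = [
    {"name": "Ann", "hat": "red", "umbrella": "blue", "near_tree": True},
    {"name": "Bob", "hat": "red", "umbrella": None, "near_tree": False},
    {"name": "Cal", "hat": "blue", "umbrella": "blue", "near_tree": True},
]

# Type 1, tightening the constraints: same search, just more filters.
tight = [p for p in people
         if p["hat"] == "red" and p["umbrella"] == "blue" and p["near_tree"]]

# Type 2, restructuring the solution: the answer is no longer a person
# at all but something nested inside one; the shape of the output changes.
robots = [{"name": "Dex", "parts": {"battery": "AA", "arm": "steel"}}]
battery = next(r["parts"]["battery"] for r in robots)
```

A model that has only ever pattern-matched "filter a list harder" can keep up with the first kind of change but hits a wall on the second, because no amount of narrowing the old search produces the new kind of answer.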
3. The "Checkerboard" Effect
When the researchers mapped out how different AI models performed, they saw something weird.
Some models (like the newer "reasoning" models) looked like a checkerboard.
- They would get a problem right.
- Change one tiny thing in the problem.
- They would get it wrong.
- Change it back slightly.
- They would get it right again.
This is like a person who is great at driving on a sunny day, but if a single cloud passes overhead, they panic and stop. It shows their reasoning is "brittle"—it relies on specific patterns rather than a deep understanding of how things work.
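The checkerboard test itself is simple to sketch. Here is a hypothetical Python version (the generator, the deliberately brittle stand-in "solver," and all names are my own, not the paper's code): generate many tiny variants of each problem size, score each one, and lay the results out as a pass/fail grid. A robust reasoner produces solid rows; a brittle one produces the checkerboard.

```python
import random

def make_sum_problem(n_terms: int, seed: int):
    """Tiny hypothetical generator: a sum with a known answer."""
    rng = random.Random(seed)
    terms = [rng.randint(1, 9) for _ in range(n_terms)]
    return " + ".join(map(str, terms)) + " = ?", sum(terms)

def brittle_solver(question: str) -> int:
    """Stand-in for a pattern-matching model: it computes the
    true sum, then 'fails' whenever the sum is odd, simulating
    a model that only recognized some patterns."""
    gold = sum(int(t) for t in question.replace(" = ?", "").split(" + "))
    return gold if gold % 2 == 0 else gold + 1

def robustness_grid(solver, sizes, seeds):
    """Rows = problem size, columns = tiny seed variants."""
    grid = []
    for k in sizes:
        row = []
        for s in seeds:
            q, gold = make_sum_problem(k, s)
            row.append(solver(q) == gold)
        grid.append(row)
    return grid

grid = robustness_grid(brittle_solver, sizes=[2, 3, 4], seeds=range(6))
```

Printing `grid` for the brittle solver shows passes and failures scattered across neighboring cells, even though each column differs from the next only by the random seed.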
4. Why This Matters (The "Contamination-Free" Kitchen)
A big problem in AI today is "cheating." Because models are trained on huge swaths of the internet, they have likely "read" the test questions before they were ever given them. It's like a student memorizing the answer key.
X-RAY solves this by generating brand new problems on the fly using math and logic rules that no human has ever seen. It's like a chef who invents a new dish every time you order, so the AI can't possibly have memorized the answer.
5. Training the AI to Think Better
The paper also showed that you can use X-RAY to teach AI better.
Instead of just showing the AI the final answer, they showed the AI the step-by-step logic (the "verified recipe") that a computer solver used to get the answer.
- Result: The AI learned to actually follow the logic, not just guess the answer. It became more robust and less likely to hallucinate (make things up).
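The training idea can be sketched as follows. This is a hypothetical Python illustration (the `solved_trace` function and its output format are invented; the paper's actual trace format may differ): because the generator builds the problem, it can also emit every intermediate step, so each line of the "verified recipe" is guaranteed correct rather than scraped from a human answer.

```python
import random

def solved_trace(n_terms: int, seed: int):
    """Build a training example whose step-by-step reasoning
    comes from the generator itself, so every intermediate
    line is verified, not just the final answer."""
    rng = random.Random(seed)
    terms = [rng.randint(1, 9) for _ in range(n_terms)]
    running = 0
    steps = []
    for t in terms:
        steps.append(f"{running} + {t} = {running + t}")
        running += t
    return {
        "question": " + ".join(map(str, terms)) + " = ?",
        "steps": steps,      # verified reasoning chain
        "answer": running,   # guaranteed-correct final answer
    }

example = solved_trace(n_terms=4, seed=0)
```

Fine-tuning on `question` plus `steps` (instead of `question` plus bare `answer`) is what pushes the model to imitate the logic rather than guess the result.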
The Big Takeaway
The paper argues that we need to stop treating AI like a black box that just spits out answers. We need to understand the structure of its thinking.
- Current Benchmarks: "Did you get an A?" (Yes/No)
- X-RAY: "Here is exactly where your logic broke, and here is the specific type of thinking you are missing."
By using this "X-Ray" vision, we can build AI that doesn't just mimic human reasoning but actually understands the deep structure of problems, making it safer and more reliable for things like medicine, engineering, and science.