Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings

This paper introduces a new dataset of 1,200 Python reasoning problems that uses static and dynamic analysis to incorporate real-world complexities—such as custom types and complex dependencies—revealing that existing benchmarks primarily focus on low-complexity tasks and fail to accurately evaluate the practical reasoning abilities of large language models.

Original authors: Changshu Liu, Alireza Ghazanfari, Yang Chen, Reyhaneh Jabbarvand

Published 2026-04-27

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors; for technical accuracy, refer to the original paper.

Imagine you are testing a student’s ability to solve math problems. To make it easy, you only give them simple addition and subtraction: 2 + 2 or 10 − 5. The student gets 100% every time! You might conclude, "This student is a mathematical genius!"

But then, you take that same student into a real-world engineering office. Suddenly, they are faced with complex calculus, physics equations, and multi-step architectural blueprints. They freeze. They fail.

This paper argues that we have been doing exactly that to Artificial Intelligence.

The Problem: The "Easy Math" Trap

Currently, when scientists test how well AI (like ChatGPT) can "reason" through computer code, they use "toy problems." These are tiny, isolated snippets of code that are very simple—like the 2 + 2 math problems mentioned above. Because the AI can solve these easily, we mistakenly believe they are ready to handle real-world software engineering.

The researchers found that current benchmarks are missing the "messiness" of the real world:

  • The "Family Tree" Problem: In real code, one function calls another, which calls another, creating a long chain of dependencies.
  • The "Mystery Box" Problem: Real code doesn't just use simple numbers; it uses complex "objects" (think of a "User" object that contains a name, an age, a list of friends, and a history of purchases).
  • The "Instruction Manual" Problem: Real code uses external tools and libraries (APIs) that the AI has to understand on the fly.
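To see why these three traits make reasoning harder, here is a tiny, invented Python sketch (not taken from the benchmark) that combines all of them—a chain of calls, a custom object, and a standard-library helper. Predicting its printed output requires tracing through every layer at once:

```python
# A toy illustration (NOT from RE2-Bench) of the three "messy" traits.
from dataclasses import dataclass, field
from collections import Counter

@dataclass
class User:                      # the "Mystery Box": a custom object, not a plain number
    name: str
    purchases: list = field(default_factory=list)

def top_purchase(user):          # the "Instruction Manual": relies on a library API (Counter)
    return Counter(user.purchases).most_common(1)[0][0]

def summarize(user):             # the "Family Tree": one function calls another
    return f"{user.name} mostly buys {top_purchase(user)}"

u = User("Ada", ["tea", "tea", "books"])
print(summarize(u))              # prints: Ada mostly buys tea
```

A model that only pattern-matches on single lines can read each function fine; the difficulty is composing all three pieces to predict the final string.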

The Solution: RE2-Bench (The "Real-World Simulator")

The researchers created a new testing ground called RE2-Bench. Instead of using "toy" code, they went to GitHub—the massive library where real programmers store their work—and pulled out actual, working code from real projects.

To make sure the test was fair and scientific, they did something clever: they categorized the problems into two levels:

  1. Lower Complexity (LC): The "warm-up" problems.
  2. Higher Complexity (HC): The "boss level" problems that include all those messy real-world details.
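The LC/HC labels come from the paper, but a hypothetical contrast makes the distinction concrete. An LC-style problem is a single function over primitive values; an HC-style problem makes the answer depend on object state accumulated across chained calls (both snippets below are invented for illustration):

```python
# An LC-style problem: one function, primitive inputs, one step of reasoning.
def lc_task(x, y):
    return x * 2 + y

# An HC-style problem: the result depends on mutable object state
# and a nested helper call, so it must be traced step by step.
class Cart:
    def __init__(self):
        self.items = []
    def add(self, price):
        self.items.append(price)
        return self                      # enables method chaining
    def total(self):
        return sum(self.items) + self._fee()
    def _fee(self):
        return 2 if len(self.items) > 1 else 0   # fee kicks in after the 2nd item

print(lc_task(3, 4))                     # prints: 10
print(Cart().add(5).add(6).total())      # prints: 13  (5 + 6 + fee of 2)
```

The second call is only a few lines longer, yet it forces the model to track the list's contents, the chaining, and the conditional fee—exactly the kind of "boss level" state-tracking the HC tier is built around.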

The Shocking Result: The Great Performance Drop

When they put the world’s best AI models through this "Real-World Simulator," the results were a wake-up call.

As the code got more realistic, the AI's performance didn't just dip—it plummeted. For certain tasks, the AI's accuracy dropped by as much as 48% when moving from simple code to real-world code.

It turns out that while the AI is great at "reading" a single line of code, it struggles immensely when it has to "trace" a logic path through a complex web of interconnected parts. It’s like the difference between reading a single sentence and trying to follow a complex, branching plot in a 500-page mystery novel.

Why Does This Matter?

If we want to use AI to help write software, fix bugs, or build apps, we can't keep testing them with "toy" problems. If we do, we are building on a false sense of security.

This paper provides a new, much harder "final exam" for AI. It forces developers to stop celebrating "easy wins" and start focusing on teaching AI how to navigate the beautiful, tangled, and complex reality of actual programming.
