Agentic Code Reasoning

This paper introduces "semi-formal reasoning," a structured prompting methodology that enables LLM agents to accurately analyze code semantics and verify patches without execution, significantly improving performance across fault localization, patch equivalence, and code question answering tasks.

Shubham Ugare, Satish Chandra

Published 2026-03-05

Imagine you are a detective trying to solve a mystery in a massive, ancient library (the codebase). You have two different maps (patches) that claim to fix a broken door. Your job is to figure out: Do these two maps lead to the same destination, or will one of them get you stuck in a wall?

Usually, to know for sure, you would have to physically walk the path, try the door, and see what happens. In the world of software, this means running the code and testing it. But running tests is slow, expensive, and sometimes impossible (like trying to test a bridge before it's built).

This paper introduces a new way for AI detectives to solve these mysteries without ever leaving the library. They call this "Agentic Code Reasoning."

Here is the simple breakdown of how they did it and why it matters.

The Problem: The "Guessing Game"

Previously, AI agents tried to solve these puzzles using "Chain of Thought." Think of this as a detective talking to themselves: "Hmm, this looks like that, so it probably works."

The problem? The AI is too confident. It often guesses based on how things look rather than how they work.

  • The Mistake: In the paper's example, the AI saw two code snippets that looked like they did the same math. It said, "Yes, they are the same!"
  • The Reality: One snippet used a special tool from the library that only worked on "dates," while the other used a standard tool. The AI didn't check the tool's manual; it just guessed. The result? One map led to a dead end (a crash), and the other worked.
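To make the pitfall concrete, here is a hypothetical pair of patches (invented for illustration, not taken from the paper) in the spirit of the date-tool example: both look like "add N days," but one only works on dates and the other only on plain numbers.

```python
from datetime import date, timedelta

def add_days_v1(start, days):
    # Patch A: date-specific arithmetic; requires `start` to be a date.
    return start + timedelta(days=days)

def add_days_v2(start, days):
    # Patch B: plain addition; fine for numbers, but date + int is a TypeError.
    return start + days

# Surface-level, both are "start + something". Semantically they diverge:
print(add_days_v1(date(2026, 3, 5), 2))   # 2026-03-07
try:
    add_days_v2(date(2026, 3, 5), 2)      # crashes: one map hits a wall
except TypeError:
    print("Patch B crashes on dates")
```

A guesser that pattern-matches on the `start + ...` shape calls these equivalent; a reasoner that checks what each operator actually accepts does not.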

The Solution: The "Semi-Formal Certificate"

The authors realized that if they forced the AI to act like a lawyer instead of a guesser, it would get much smarter.

They introduced "Semi-Formal Reasoning." Instead of letting the AI ramble, they gave it a strict fill-in-the-blank template (a certificate). To win the case, the AI must provide:

  1. Premises: "Here is exactly what this code changes."
  2. The Trace: "I followed the path step-by-step. I saw that function A calls function B, which calls function C."
  3. The Conclusion: "Therefore, the result is X."
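One way to picture such a certificate (the field names below are illustrative, not the paper's exact schema) is a structured record the model must fill in completely before it is allowed to answer:

```python
from dataclasses import dataclass

@dataclass
class Certificate:
    """Illustrative semi-formal certificate: every field must hold
    evidence before a conclusion may be stated."""
    premises: list[str]   # exact claims about what the code changes
    trace: list[str]      # the step-by-step path followed (A calls B calls C)
    conclusion: str       # the verdict, justified by the fields above

    def is_complete(self) -> bool:
        # An empty field means a skipped step: stop and look again.
        return bool(self.premises) and bool(self.trace) and bool(self.conclusion)

cert = Certificate(
    premises=["Patch B replaces timedelta addition with int addition"],
    trace=["caller() -> add_days() -> date.__add__(int) raises TypeError"],
    conclusion="Patches are NOT equivalent: B crashes on date inputs",
)
print(cert.is_complete())  # True
```

The point of the structure is not the data type itself but the constraint: a conclusion with empty premises or an empty trace is simply not accepted.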

The Analogy:

  • Standard Reasoning is like a student saying, "I think the answer is 42 because it feels right."
  • Semi-Formal Reasoning is like a student saying, "I know the answer is 42 because I showed my work: $10 \times 4 = 40$, and $40 + 2 = 42$. Here is the calculator proof."

If the AI tries to skip a step or make a claim without evidence, the template forces it to stop and look again. It acts as a certificate of truth.
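That "stop and look again" mechanism can be sketched as a simple gate (a hypothetical check, not the paper's implementation): an answer is rejected until every section of the template appears, so an unsupported claim never gets through.

```python
REQUIRED_SECTIONS = ("PREMISES:", "TRACE:", "CONCLUSION:")

def accept(answer: str) -> bool:
    # Every required heading must appear; a missing one is a skipped step.
    return all(section in answer for section in REQUIRED_SECTIONS)

draft = "CONCLUSION: the patches are equivalent"   # a claim without evidence
print(accept(draft))   # False -> the agent must go back and gather evidence

full = ("PREMISES: patch B uses int addition instead of timedelta\n"
        "TRACE: caller() -> add_days() -> date.__add__(int)\n"
        "CONCLUSION: not equivalent")
print(accept(full))    # True -> the certificate is complete
```

A real system would also check that each section is non-empty and internally consistent; the gate above only shows the shape of the enforcement.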

The Three Tests

The researchers tested this "Lawyer AI" on three different types of cases:

  1. The Patch Equivalence Test (Are these fixes the same?):

    • The Challenge: Two developers fix the same bug. Are their solutions identical?
    • The Result: The "Lawyer AI" got it right 93% of the time, compared to 78% for the "Guesser AI." It caught subtle differences that the other AI missed.
  2. The Fault Localization Test (Where is the broken part?):

    • The Challenge: A program crashes. Find the exact line of code causing it without running the program.
    • The Result: The structured approach helped the AI find the broken line much faster and more accurately, improving its success rate by about 12%.
  3. The Code Question Test (What does this code do?):

    • The Challenge: Answer complex questions about how a specific piece of software behaves.
    • The Result: The AI became much better at answering correctly, jumping from 78% to 87% accuracy.

Why This Matters (The "So What?")

Imagine you are training a robot to be a software engineer. Usually, to teach the robot, you have to let it try a fix, run the tests, and see if it passes. This takes hours and requires powerful computers.

With this new method, the AI can simulate the test results in its head with high accuracy.

  • Speed: It doesn't need to wait for the computer to run the tests.
  • Cost: It saves money on computing power.
  • Safety: It can verify code before it's even built, preventing disasters.

The Bottom Line

This paper shows that if you give an AI a structured checklist and force it to show its work like a lawyer, it stops guessing and starts truly understanding code. It's a way to make AI smarter, more reliable, and capable of doing deep analysis without needing to "run" the software first.

In short: Don't just ask the AI, "Is this right?" Ask it, "Show me the evidence, step-by-step, and then tell me." The answer will be much better.