Here is an explanation of the CODETASTE paper, using simple language and creative analogies.
The Big Picture: The "Messy House" Problem
Imagine you hire a super-smart, incredibly fast robot to clean your house.
- The Good News: The robot is amazing at fixing specific problems. If you say, "The kitchen sink is leaking, please fix it," the robot finds the leak, tightens the pipe, and stops the water. It works perfectly.
- The Bad News: Over time, the robot starts making a mess. It leaves tools on the counter, stacks dirty dishes in the living room, and builds a weird, wobbly tower of boxes in the hallway just to store a single shoe. The house still functions (you can still live there), but it's becoming a chaotic nightmare.
In the world of software, this robot is an AI Coding Agent. It can write code to fix bugs or add features, but it often creates "technical debt"—messy, duplicated, or confusing code that makes the software hard to maintain later.
Human developers have a special skill called Refactoring. This isn't about fixing a broken pipe; it's about reorganizing the whole kitchen so the pots are easier to reach, the spices are labeled, and the floor is clear. It's about making the house better, not just functional.
The big question this paper asks is: Can these AI robots learn to clean up their own messes and reorganize the house the way a human expert would?
The Solution: CODETASTE (The "Taste Test" for Code)
To answer this, the researchers built a benchmark called CODETASTE. Think of this as a rigorous "Taste Test" for AI chefs.
Instead of just asking the AI to cook a meal, they gave it a specific challenge: "Here is a messy kitchen. Please reorganize it exactly how a human chef would, without changing the taste of the food." (In code terms: restructure the program without changing what it actually does.)
How They Built the Test
- The Source Material: They looked at thousands of real-world software projects (like GitHub repositories) and found 100 examples where human developers did a massive, complex reorganization (refactoring) of the code.
- The Setup: For each example, they created a "sandbox" (a safe, isolated digital room) where the AI could try to fix the code.
- The Rules: They didn't just check if the code "worked." They used special "detective rules" (static analysis) to see if the AI actually removed the bad patterns (like clutter) and added the good patterns (like organization).
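To make the "detective rules" idea concrete, here is a toy sketch of what a static-analysis check might look like. The rule, function names, and code snippets below are hypothetical illustrations (the paper's actual checks are more sophisticated): the check only passes if a "cluttered" pattern was fully removed and a "clean" pattern was introduced.

```python
import ast

# Hypothetical names, purely for illustration: the "bad" pattern is a
# legacy helper that should vanish; the "good" pattern is a new shared
# helper that should appear after the refactoring.
BAD_CALL = "copy_paste_helper"
GOOD_CALL = "shared_helper"

def count_calls(source: str, name: str) -> int:
    """Count how many times a function called `name` is invoked."""
    tree = ast.parse(source)
    return sum(
        1
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == name
    )

def passes_rule(before: str, after: str) -> bool:
    """Pass only if the clutter existed, is now gone, and order appeared."""
    return (
        count_calls(before, BAD_CALL) > 0       # the mess was really there...
        and count_calls(after, BAD_CALL) == 0   # ...and was fully removed,
        and count_calls(after, GOOD_CALL) > 0   # and the clean pattern appeared.
    )

before = "x = copy_paste_helper(a)\ny = copy_paste_helper(b)"
after = "x = shared_helper(a)\ny = shared_helper(b)"
print(passes_rule(before, after))  # → True
```

The key design point: a check like this cares about *structure*, not just behavior. Code that still "works" but keeps the clutter (the wiped counter, the wobbly box tower) fails the rule.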
The Two Rounds of the Game
The researchers tested the AI in two different ways:
Round 1: The "Follow the Recipe" Track (Instructed)
- The Scenario: You give the AI a detailed, step-by-step recipe: "Move all the spices to the left cabinet, label them alphabetically, and throw away the broken jars."
- The Result: The AI did pretty well! The best models (like GPT-5) followed the instructions about 70% of the time. They could execute the plan if told exactly what to do.
Round 2: The "Clean Up This Mess" Track (Open)
- The Scenario: You walk into the messy kitchen and just say, "This place is a disaster. Please make it better." You don't tell them how to do it.
- The Result: The AI largely failed, scoring under 8%.
- Instead of reorganizing the whole kitchen, the AI might just wipe a single counter or fix a tiny typo on a label.
- It couldn't figure out what the human would have chosen to fix. It lacked the "judgment" to see the big picture.
Key Findings & Surprises
1. The "Plan First" Trick
The researchers discovered that if they forced the AI to write a plan before it started cleaning, it got much better.
- Analogy: Imagine telling the robot, "Don't just start moving boxes. First, draw a map of how the kitchen should look, then show me the map. If the map looks good, then start moving."
- This "Plan-then-Act" approach helped the AI understand the goal better, doubling its success rate in the messy kitchen scenario.
2. The Cost of Perfection
The AI that did the best job (GPT-5) was also the most expensive. It spent a lot of "money" (computing power) trying to be precise. The cheaper, faster models were lazy and often just did a "search and replace" (like using a giant sledgehammer to fix a loose screw), which broke things.
3. The "Human Gap"
Even the smartest AI models are still far from human-level judgment when it comes to deciding what needs to be fixed. They are great at following orders but terrible at spotting problems on their own.
Why Does This Matter?
If we want AI to be a true partner in software development, it can't just be a "fix-it" bot. It needs to be a "gardener" that knows how to prune the bushes, water the plants, and keep the garden beautiful over the long term.
CODETASTE shows us that:
- Current AI is good at doing what it's told.
- Current AI is bad at deciding what needs to be done.
The paper concludes that for AI to truly replace or assist human developers in the long run, we need to teach it not just how to write code, but how to think like an architect—to see the mess, understand the structure, and make the hard choices to keep the software healthy for years to come.
The Takeaway
The paper is a reality check. AI is a powerful tool, but right now, it's like a very fast, very obedient intern who needs a manager to tell them exactly what to do. It hasn't yet learned to look at a messy room, sigh, and say, "I know exactly how to fix this," all on its own. CODETASTE is the measuring stick to help us teach it that skill.