Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents

This paper presents a comprehensive survey of 178 benchmarks for Code Large Language Models and Agents through a tiered Software Development Life Cycle (SDLC) framework. It reveals a significant imbalance: benchmarks heavily favor the implementation phase while neglecting requirements and design, and most lack anti-contamination strategies. The authors argue that future research must bridge the gap between theoretical capabilities and practical effectiveness.

Kaixin Wang, Tianlin Li, Xiaoyu Zhang, Chong Wang, Weisong Sun, Yang Liu, Aishan Liu, Xianglong Liu, Chao Shen, Bin Shi

Published Mon, 09 Ma

Imagine the world of software development as a massive, complex construction project. You have the Software Development Life Cycle (SDLC), which is like the blueprint for building a skyscraper. It has distinct stages:

  1. Requirements: Figuring out what the building needs to do.
  2. Design: Drawing the blueprints.
  3. Implementation: Actually laying the bricks and pouring the concrete (writing the code).
  4. Testing: Checking for cracks and making sure the lights work.
  5. Maintenance: Fixing leaks and renovating years later.

Now, enter the Code Large Language Models (CodeLLMs). Think of these as super-intelligent, AI-powered construction workers. They can read your instructions and build things incredibly fast.

This paper is essentially an audit of the report cards we use to grade these AI workers. The authors examined 178 different tests (benchmarks) used by researchers and found some major problems with how the grading is done.

Here is the breakdown in simple terms:

1. The "Gym Class" Imbalance

The biggest finding is that our tests are completely unbalanced.

  • The Problem: Imagine if you only tested a construction worker on how fast they can lay bricks (Implementation), but never asked them to read the blueprints (Design), figure out what the client wants (Requirements), or fix a broken pipe later (Maintenance).
  • The Reality: The paper found that 61% of all tests focus only on "laying bricks" (writing code).
  • The Neglect: Only 5% of tests check if the AI can understand what the client actually wants, and a tiny 3% check if they can design the building.
  • The Metaphor: It's like a driving school that only tests if you can parallel park, but never tests if you can read a map, follow traffic signs, or handle a flat tire.

2. The "Cheating" Problem (Data Contamination)

  • The Problem: To get good at a test, you have to study for it. But in the AI world, the "study material" is the internet.
  • The Metaphor: Imagine a student taking a math test. If the teacher accidentally left the answer key on the desk, and the student memorized it before the test, they would get a perfect score. But they didn't actually learn math; they just memorized the answers.
  • The Reality: Many of these AI tests use old code that the AI models have already seen while they were being trained. The AI isn't "thinking"; it's just recalling what it saw before. This makes the AI look smarter than it really is. The paper notes that very few tests have good "anti-cheating" strategies to stop this.
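To make the "cheating" idea concrete, here is a minimal sketch (not the paper's actual method) of one common way researchers probe for contamination: measuring n-gram overlap between a benchmark problem and a document from the training corpus. High overlap suggests the model may have simply memorized the answer. All names here are illustrative.

```python
# Illustrative contamination check: word-level n-gram overlap between
# a benchmark item and a training document. This is a simplified stand-in
# for real anti-contamination strategies, not the surveyed paper's method.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in `text`."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also appear in the
    training document. Near 1.0 means the test was likely 'seen' before."""
    bench = ngrams(benchmark_item, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(training_doc, n)) / len(bench)

problem = "def fib ( n ) : return n if n < 2 else fib ( n - 1 ) + fib ( n - 2 )"
corpus_doc = "def fib ( n ) : return n if n < 2 else fib ( n - 1 ) + fib ( n - 2 )"
print(overlap_ratio(problem, corpus_doc))  # → 1.0 (fully contaminated)
```

The fix the paper calls for is simpler in spirit: keep writing fresh problems that cannot appear in any training corpus, so memorization scores zero by construction.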

3. The "Robot vs. Human" Gap

  • The Problem: Most tests are set up like a simple game of "Question and Answer." You ask a question, the AI gives an answer, and you grade it.
  • The Metaphor: Real construction isn't a quiz. It's a conversation. You say, "Build a wall," the worker builds it, you say, "No, make it higher," and they fix it. They might need to grab a hammer, check a level, or talk to the electrician.
  • The Reality: Current tests mostly check if the AI can give a single correct answer. They rarely test if the AI can work in a team, use tools, or fix mistakes over time (which is what "Agents" are supposed to do).
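The "conversation vs. quiz" distinction can be sketched in a few lines. Below is a toy agentic evaluation loop, assuming a hypothetical `generate` callable standing in for a model: instead of grading one answer, the harness runs the candidate code, feeds failures back, and lets the model revise. This is an illustration of the idea, not any specific benchmark's harness.

```python
# Toy agentic evaluation loop: the "model" (a plain callable here) sees
# test feedback and may fix its mistakes, unlike a single Q&A round.

def run_tests(code: str) -> tuple[bool, str]:
    """Execute candidate code against a toy hidden test for add(a, b)."""
    namespace = {}
    try:
        exec(code, namespace)
        assert namespace["add"](2, 3) == 5
        return True, "all tests passed"
    except Exception as exc:
        return False, f"test failure: {exc!r}"

def agentic_eval(generate, task: str, max_rounds: int = 3) -> bool:
    """Multi-round loop: generate, run tests, feed the failure back, retry."""
    feedback = ""
    for _ in range(max_rounds):
        code = generate(task, feedback)
        passed, feedback = run_tests(code)
        if passed:
            return True
    return False

# A stand-in "model" that gets it wrong once, then corrects itself.
attempts = iter(["def add(a, b): return a - b",   # buggy first try
                 "def add(a, b): return a + b"])  # fixed after feedback
print(agentic_eval(lambda task, fb: next(attempts), "write add(a, b)"))  # → True
```

A single-shot benchmark would have scored the first attempt as a failure and stopped; the agentic loop credits the model for recovering, which is closer to how real development works.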

4. The Language Bias

  • The Problem: The tests are heavily biased toward Python (a popular programming language).
  • The Metaphor: It's like a driving test that only happens on sunny days with smooth asphalt. But what if the AI has to drive in the rain, on gravel, or in a different country?
  • The Reality: We don't know if these AI workers are good at building with "Rust" or "Go" (other modern languages) because we haven't tested them enough in those environments.

What Should We Do Next?

The authors suggest we need to upgrade our "driving tests" for AI:

  • Test the whole cycle: Don't just test coding; test planning, designing, and fixing.
  • Stop the cheating: Create fresh, new tests that the AI hasn't seen before.
  • Test the real world: Instead of simple quizzes, give the AI complex projects where they have to use tools, talk to other systems, and fix their own mistakes over time.
  • Include privacy: Make sure the AI doesn't accidentally leak secret company data while working.

In a nutshell: We have built some amazing AI construction workers, but we are currently only testing them on how fast they can stack blocks. We need to start testing them on how well they can design, build, and maintain entire skyscrapers in the real world.