CONCUR: Benchmarking LLMs for Concurrent Code Generation

This paper introduces CONCUR, a novel benchmark of 115 concurrency-specific problems designed to evaluate, and expose the limitations of, Large Language Models when generating complex concurrent code. It addresses a critical gap left by existing benchmarks, which focus almost exclusively on sequential code.

Jue Huang, Tarek Mahmud, Corina Pasareanu, Guowei Yang

Published 2026-03-05

Imagine you have a team of incredibly smart, super-fast robots (Large Language Models, or LLMs) that are learning to write computer code. We've already tested them on writing sequential code—which is like telling a robot to make a sandwich step-by-step: Get bread, put peanut butter, put jelly, close sandwich. If the robot follows the steps, the sandwich is good.

But what happens when you ask these robots to write concurrent code?

The Problem: The Busy Kitchen Analogy

Concurrent code is like a busy restaurant kitchen where five chefs are trying to cook a meal at the same time.

  • They all need to use the same stove.
  • They all need to grab the same knife.
  • They are all shouting orders at once.

If they aren't perfectly coordinated, chaos ensues:

  • Deadlock: Chef A is waiting for Chef B to put down the knife, but Chef B is waiting for Chef A to turn off the stove. They both freeze. The meal never gets made.
  • Race Condition: Two chefs try to add salt to the soup at the exact same millisecond. One adds a teaspoon, the other adds a cup. The soup is ruined, and no one knows who did it.
  • Starvation: One chef is so busy that the other four never get a chance to cook.
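The knife-and-stove deadlock above takes only a few lines of Java to reproduce. This is an illustrative sketch of ours (the chef/knife/stove names are not from the paper): two threads acquire the same two locks in opposite order. With a plain `lock()` (or nested `synchronized` blocks) both chefs would freeze forever; using `tryLock` with a timeout lets the program detect and report the standoff instead of hanging.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class DeadlockDemo {
    static final ReentrantLock knife = new ReentrantLock();
    static final ReentrantLock stove = new ReentrantLock();

    // Each chef grabs one resource, pauses (so both are mid-task),
    // then tries to grab the other. With plain lock() this freezes
    // forever; tryLock with a timeout lets us observe the standoff.
    static Thread chef(String name, ReentrantLock first, ReentrantLock second) {
        return new Thread(() -> {
            first.lock();
            try {
                Thread.sleep(200);  // hold the first resource for a while
                if (second.tryLock(100, TimeUnit.MILLISECONDS)) {
                    second.unlock();
                } else {
                    System.out.println(name + ": would deadlock here");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                first.unlock();
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        Thread a = chef("Chef A", knife, stove);  // knife, then stove
        Thread b = chef("Chef B", stove, knife);  // stove, then knife
        a.start(); b.start();
        a.join(); b.join();
    }
}
```

The standard fix is equally short: make every chef acquire the resources in the same global order (always knife before stove), so the circular wait can never form.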

The Gap: Existing tests for these robots only check if they can make a sandwich (sequential code). They don't test if the robots can manage a chaotic kitchen (concurrent code). A robot might write code that looks perfect on paper but causes a kitchen disaster when actually running.

The Solution: CONCUR (The "Kitchen Simulator")

The authors of this paper created a new test called CONCUR. Think of it as a high-tech kitchen simulator designed specifically to see if these AI robots can handle the chaos of a multi-chef environment.

Here is how they built it:

  1. The Menu (The Dataset): They took 43 classic "kitchen coordination" problems from a textbook (like "How to manage a shared pot of soup") and created 72 slightly different versions of them. This gives them 115 unique challenges to test the robots.
  2. The Rules (The Prompt): They gave the robots very strict instructions: "You must use only standard tools (Java 8), you must have exactly 3 chefs working, and you must finish in 10 minutes."
  3. The Simulator (The Evaluation): This is the most important part. Instead of just reading the code to see if it looks right, they ran the code through a super-powerful simulator called Java PathFinder (JPF).
    • Imagine a time-traveling inspector who runs the kitchen scenario thousands of times in a split second.
    • He tries every possible order in which the chefs could move: Chef A moves first, then B... or B moves first, then A...
    • If the code crashes, freezes, or messes up in any of these scenarios, the robot fails.

What They Found (The Results)

They tested 23 of the smartest AI robots on this new test. Here is what happened:

  • The "Good" News: Many robots could write code that looked correct and didn't crash immediately.
  • The "Bad" News: When the simulator ran the code thousands of times, most robots failed.
    • Many robots forgot to actually create multiple chefs (they wrote code for one chef doing everything, which isn't "concurrent").
    • Many robots created "deadlocks" where the chefs got stuck waiting for each other.
    • Many robots created "race conditions" where the data got corrupted.
  • The "Fake" Score: They checked a popular scoring system called CodeBLEU. This system is like a spellchecker that compares the robot's code to a human-written reference solution, checking whether the tokens and structure look similar.
    • The Shock: A robot could score highly on CodeBLEU (its code looked very much like the human's) and still cause a kitchen disaster (a deadlock) when actually run. The spellchecker couldn't see the chaos.
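That gap is easy to show concretely. In the sketch below (our illustration, not code from the paper), the two counter classes differ by a single `synchronized` keyword, so any text-similarity metric would score them as nearly identical. Yet one is correct and the other has a race condition: `count++` is really a read, an add, and a write, and two threads can interleave those steps and lose an update.

```java
public class LostUpdates {
    // Race: count++ is read-modify-write, so two chefs can both read
    // the same value and one increment silently vanishes.
    static class RacyCounter {
        int count = 0;
        void increment() { count++; }
    }

    // One keyword different, and now correct: increments are atomic
    // with respect to each other.
    static class SafeCounter {
        int count = 0;
        synchronized void increment() { count++; }
    }

    public static void main(String[] args) throws InterruptedException {
        SafeCounter safe = new SafeCounter();
        Thread[] chefs = new Thread[3];
        for (int i = 0; i < chefs.length; i++) {
            chefs[i] = new Thread(() -> {
                for (int j = 0; j < 100_000; j++) safe.increment();
            });
            chefs[i].start();
        }
        for (Thread t : chefs) t.join();
        System.out.println("safe count: " + safe.count);  // always 300000
        // A RacyCounter run the same way usually ends below 300000,
        // but not on every run -- which is why running the code once
        // (or eyeballing its text) proves nothing about concurrency.
    }
}
```

A metric that rewards surface similarity cannot distinguish these two classes; a model checker that explores the interleavings can.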

The Big Takeaway

This paper tells us that just because code looks right doesn't mean it works right when multiple things happen at once.

  • Old Way: Check if the code compiles and looks similar to human code. (Like checking if a sandwich recipe has the right words).
  • New Way (CONCUR): Simulate the code running in a chaotic, multi-threaded environment to see if it actually survives the chaos. (Like actually cooking the meal with five chefs to see if the kitchen explodes).

The authors conclude that we need better tools to test AI on complex, multi-tasking jobs. We can't just trust the "spellcheckers" anymore; we need the "kitchen simulators" to ensure our AI isn't just writing pretty code, but code that actually works in the real, messy world.