Imagine you are trying to build a complex piece of furniture with a very smart, but slightly forgetful, robot assistant. You give it instructions step-by-step.
- Turn 1: "Build a table with four legs."
- Turn 2: "Actually, make the legs out of oak, and paint them blue."
- Turn 3: "Oh, and by the way, the table needs to be round, not square."
In a perfect world, the robot would remember the oak legs, the blue paint, and the round shape, and build exactly what you asked for: a round blue oak table.
But in the real world, this paper argues that today's AI coding assistants (like the ones in GitHub Copilot or Cursor) often act like that forgetful robot. They might build a square table, forget the blue paint, or suddenly decide to use pine wood instead of oak, even though you told them otherwise three steps ago.
The authors of this paper call these mistakes "Interaction Smells." Just like a bad smell in a room tells you something is wrong without you seeing the source, these "smells" in a conversation with an AI tell you the collaboration is going off the rails.
Here is a simple breakdown of their study:
1. The Problem: The "Bad Smells" in the Conversation
The researchers looked at thousands of real conversations between humans and AI to figure out exactly how these robots mess up. They found three main categories of "smells":
- The "Vague Boss" Smell (User Intent Quality): Sometimes the human is too vague.
- Analogy: You tell the robot, "Make it fancy." The robot doesn't know if "fancy" means gold plating, velvet cushions, or just a nice font. It guesses, and usually guesses wrong.
- The "Forgetful Assistant" Smell (Historical Instruction Compliance): The AI ignores rules you set earlier.
- Analogy: You told the robot, "Never use red paint." In the next step, it paints the table red because it forgot your rule. Or, you said, "Use oak," and it switches to plastic.
- The "Confused Robot" Smell (Historical Response Violation): The AI contradicts itself or breaks things it already fixed.
- Analogy: You ask the robot to fix a wobbly leg. It fixes the leg, but in doing so, it accidentally knocks over the whole table (breaking previous work). Or, it gives you the exact same answer it gave you five minutes ago, even though you asked a new question.
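The three categories above form a simple taxonomy, which could be sketched as a toy classifier. The enum names and the keyword-matching logic here are illustrative assumptions for this summary, not the paper's actual detection method:

```python
# Hypothetical sketch of the three "interaction smell" categories described above.
# The enum names mirror the categories in this summary; the keyword matching is a
# stand-in for real detection logic, purely for illustration.
from enum import Enum

class InteractionSmell(Enum):
    USER_INTENT_QUALITY = "the user's request is vague or underspecified"
    HISTORICAL_INSTRUCTION_COMPLIANCE = "the AI ignores a rule set in an earlier turn"
    HISTORICAL_RESPONSE_VIOLATION = "the AI contradicts or breaks its own earlier work"

def classify(turn_note: str) -> InteractionSmell:
    """Toy keyword-based classifier, for demonstration only."""
    if "vague" in turn_note:
        return InteractionSmell.USER_INTENT_QUALITY
    if "ignored rule" in turn_note:
        return InteractionSmell.HISTORICAL_INSTRUCTION_COMPLIANCE
    return InteractionSmell.HISTORICAL_RESPONSE_VIOLATION

# The "forgetful assistant" example from above:
smell = classify("ignored rule: painted the table red despite 'no red' in turn 1")
print(smell.name)  # HISTORICAL_INSTRUCTION_COMPLIANCE
```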
2. The Investigation: Who is the Worst Offender?
The researchers tested six of the most popular AI models (including GPT-4, Gemini, and various versions of Qwen) to see how often they made these mistakes.
The findings were surprising:
- The biggest problem wasn't that humans were bad at asking questions. The "Vague Boss" smells were actually rare.
- The biggest problem was that the AI forgot its own rules. The most common "smell" was "Must-Do Omission." This is when the AI is asked to add a new feature but forgets to keep following the old rules (like "keep the paint blue").
- The "Broken Fix" problem: The AI often tries to fix a small bug but accidentally breaks a feature that was working perfectly fine.
3. The Solution: The "InCE" Framework
To fix this, the authors built a new system called InCE (Invariant-aware Constraint Evolution). Think of InCE as a Project Manager standing between you and the robot.
Here is how the Project Manager works:
- The "Rule Book" (Invariant Extraction): Before the robot starts typing code, the Project Manager reads through your entire conversation and writes down a strict "Rule Book" of everything you've ever asked for (e.g., "Must be blue," "Must use oak," "No red paint").
- The "Safety Inspector" (Proactive Smell Detector): Before the robot sends its answer to you, the Project Manager checks the draft against the Rule Book.
- Inspector: "Hey, you painted it red! That violates the 'No Red' rule from Turn 1."
- Robot: "Oh, my bad. Let me fix that."
- Inspector: "Also, you forgot to keep the oak legs. Fix that too."
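The "Project Manager" loop above can be sketched in a few lines: collect every constraint the user has stated across the conversation, then check each draft against that rule book before it ships. The function names and the `RULE:` string format are assumptions made up for this sketch, not the paper's actual InCE implementation:

```python
# Minimal sketch of an InCE-style check, assuming a toy representation where
# user constraints are tagged "RULE:" and a draft is a set of properties it
# satisfies. All names and formats here are illustrative, not from the paper.

def extract_invariants(conversation: list[str]) -> set[str]:
    """The 'Rule Book': collect every constraint stated across all turns."""
    return {
        turn.removeprefix("RULE:").strip()
        for turn in conversation
        if turn.startswith("RULE:")
    }

def inspect_draft(draft_properties: set[str], rules: set[str]) -> list[str]:
    """The 'Safety Inspector': return every rule the draft fails to honor."""
    return sorted(rule for rule in rules if rule not in draft_properties)

conversation = [
    "RULE: legs are oak",
    "RULE: paint is blue",
    "RULE: table is round",
]
draft = {"legs are oak", "table is round"}  # the model forgot the blue paint
violations = inspect_draft(draft, extract_invariants(conversation))
print(violations)  # ['paint is blue'] -> sent back to the model to fix
```

The key design point is that the check runs before the response reaches the user, so the model gets a chance to repair violations instead of shipping them.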
4. The Results
When they used this "Project Manager" system:
- The AI got much better at remembering its rules.
- It stopped breaking things that were already working.
- It stopped giving the same answer twice in a row.
- Most importantly: The AI successfully finished the tasks much more often (the "Task Success Rate" went up).
The Big Takeaway
This paper teaches us that building software with AI isn't just about asking for code; it's about managing the conversation.
Currently, AI models are great at writing a single paragraph of code but terrible at remembering the whole story of a long project. The solution isn't just to make the AI "smarter"; it's to build systems that act as a memory keeper, constantly reminding the AI of the rules it promised to follow, so that every new step builds on the last one without destroying the foundation.