Imagine you are the manager of a busy restaurant. You have a brilliant chef (the software developer) and a team of waiters (the testers). The problem? The customers (the business owners) keep giving you vague orders like, "I want a burger that tastes like summer."
If you just tell the chef "Make a summer burger," they might make a salad with a tomato on top. If you tell the waiter to write down exactly how to make it, they might spend hours writing a 50-page manual, or worse, forget to mention that the bun needs to be toasted.
This is the problem of Software Testing in the real world. It's hard to translate vague ideas into precise instructions that a computer can follow.
This paper is about using AI (specifically Large Language Models, or LLMs) to act as a super-smart translator. The researchers wanted to see if AI could take a vague customer request and instantly write a clear, step-by-step recipe (called a Behavior-Driven Development, or BDD, scenario) that the chef and waiters can follow without confusion.
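To make "recipe" concrete: a BDD scenario is written in the structured Given/When/Then format of the Gherkin language. Here is a hypothetical example, sketched in Python alongside a minimal structural check; the scenario text and the `validate_scenario` helper are illustrative, not taken from the paper.

```python
# A hypothetical BDD scenario in Gherkin's Given/When/Then style,
# plus a minimal structural check. Illustrative only.
SCENARIO = """\
Scenario: Customer orders the summer burger
  Given the menu contains a "Summer Burger"
  When the customer orders a "Summer Burger"
  Then the kitchen receives an order with a toasted bun
  And the burger is served with grilled pineapple
"""

def validate_scenario(text: str) -> bool:
    """Check that the scenario has a title and Given/When/Then steps."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    has_title = any(line.startswith("Scenario:") for line in lines)
    has_steps = all(
        any(line.startswith(keyword) for line in lines)
        for keyword in ("Given", "When", "Then")
    )
    return has_title and has_steps

print(validate_scenario(SCENARIO))  # True
```

The point of the format is exactly the restaurant analogy: every step is explicit, so the chef (developer) and the waiters (testers) read the same unambiguous instructions.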
Here is the breakdown of their adventure, explained simply:
1. The Setup: Building a "Training Gym"
Before testing the AI, the researchers needed a gym to train it. They couldn't just use fake examples; they needed real ones.
- The Dataset: They gathered 500 real-life stories from a software company called IntelligenceBank. These were actual requests from customers, the detailed notes the company wrote about them, and the final "recipes" (BDD scenarios) the humans had written.
- The Goal: They wanted to see if an AI could look at just the "Customer Request" and write a "Recipe" that was just as good as the one a human expert wrote.
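Each record in such a dataset pairs the inputs with the human-written target. A sketch of what one record might look like, with hypothetical field names and content (the paper's actual schema may differ):

```python
# One hypothetical dataset record; field names and content are
# illustrative, not the paper's actual schema.
record = {
    "customer_request": "As a customer, I want a burger that tastes like summer.",
    "detailed_notes": [
        "The bun must be toasted.",
        "The burger includes grilled pineapple.",
    ],
    "human_bdd_scenario": (
        "Scenario: Order the summer burger\n"
        "  Given the menu contains a Summer Burger\n"
        "  When the customer orders it\n"
        "  Then it is served on a toasted bun with grilled pineapple"
    ),
}

# The experiment: feed "customer_request" (and optionally "detailed_notes")
# to the model, then compare its output against "human_bdd_scenario".
print(sorted(record.keys()))
```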
2. The Contestants: The AI Models
They put three famous AI models in the ring:
- GPT-4: The all-rounder, known for being very smart and following instructions well.
- Claude 3: The careful thinker, known for being very precise and good at long conversations.
- Gemini: The creative one, known for handling lots of information at once.
3. The Experiments: How did they play the game?
The researchers didn't just ask the AIs to "do it." They tried different ways of asking (called Prompting) to see what worked best.
- The "Zero-Shot" (Just Ask): They gave the AI the request and said, "Write a recipe." No examples, no hints.
- The "Few-Shot" (Show Me): They gave the AI the request plus a few examples of good recipes to copy the style.
- The "Chain-of-Thought" (Think First): They told the AI, "First, think about the steps, then write the recipe."
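The three styles differ only in how the prompt is assembled. A minimal sketch, with illustrative wording (the paper's exact prompt templates are not reproduced here):

```python
# Three prompting styles for the same request. Wording is illustrative.
REQUEST = "I want a burger that tastes like summer."
EXAMPLES = [
    ("I want a warm winter drink.",
     "Scenario: Order a winter drink\n  Given ...\n  When ...\n  Then ..."),
]

def zero_shot(request: str) -> str:
    # Just ask: no examples, no hints.
    return f"Write a BDD scenario for this request:\n{request}"

def few_shot(request: str) -> str:
    # Show a few request -> scenario pairs first, then the new request.
    shots = "\n\n".join(f"Request: {r}\nScenario:\n{s}" for r, s in EXAMPLES)
    return f"{shots}\n\nRequest: {request}\nScenario:"

def chain_of_thought(request: str) -> str:
    # Ask the model to reason about the steps before writing the answer.
    return (f"Write a BDD scenario for this request:\n{request}\n"
            "First, think step by step about the behaviour being tested, "
            "then write the final scenario.")
```

Same request, three different prompts: that is the whole experimental variable in this part of the study.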
The Result? It depended on the AI's personality!
- GPT-4 was the "Genius who doesn't need help." It worked best when you just asked it directly (Zero-Shot).
- Claude 3 was the "Student who needs a study guide." It did best when you asked it to think step-by-step (Chain-of-Thought).
- Gemini was the "Visual learner." It did best when you showed it examples first (Few-Shot).
4. The Secret Ingredient: What you feed the AI matters most
This was the biggest surprise. The researchers tried feeding the AI different types of information:
- Scenario A: Just the short "Customer Request" (e.g., "I want a summer burger").
- Scenario B: Just the "Detailed Notes" (e.g., "Use a toasted bun, add grilled pineapple, serve at 20°C...").
- Scenario C: Both together.
The Verdict:
- If you gave the AI only the short request, it wrote terrible recipes. It was too vague.
- If you gave the AI only the detailed notes, it wrote excellent recipes.
- Conclusion: The AI is smart, but it can't read minds. It needs detailed instructions. If humans write good, detailed notes, the AI can do the heavy lifting. If humans are lazy with their notes, the AI will fail.
5. The Judges: Who is right?
How did they know if the AI recipes were good?
- Computer Judges: They used automated metrics to compare the AI's recipe to the human's recipe. Did they use the same words? (Text Similarity). Did they mean the same thing? (Semantic Similarity).
- Human Judges: They hired 6 real experts to taste-test the recipes.
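"Same words" can be approximated with a simple word-overlap score; "same meaning" usually requires comparing embeddings. Here is a sketch of the word-overlap side only, using Jaccard similarity over word sets (the paper's actual metrics may be different formulas; this is just to show the flavour of a "Computer Judge"):

```python
def word_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets: 1.0 = identical vocabulary."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

human = "Given a toasted bun When ordered Then serve grilled pineapple"
ai = "Given a toasted bun When the customer orders Then add grilled pineapple"

# A score between 0 and 1: high overlap, but the score says nothing
# about whether the steps are actually logical or useful.
print(round(word_overlap(human, ai), 2))
```

This is exactly why such judges can mislead: two scenarios can share most of their words yet differ in logic, or use different words while testing the same behaviour.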
The Twist:
The "Computer Judges" (math) were often wrong. They thought the AI that used the most similar words was the best. But the Human Judges preferred the AI that wrote the most logical and useful recipe, even if the words were slightly different.
- Winner: Claude 3 was rated highest by the humans.
- The New Star: They found that a specific AI called DeepSeek was actually the best "Computer Judge." It agreed with the human experts much better than the math formulas did.
6. The Settings: Turning the Dials
AI models have knobs like "Temperature" (how creative/random the word choices are) and "Top_p" (nucleus sampling: the model only picks from the smallest set of candidate words whose combined probability reaches p).
- The Finding: For writing recipes, creativity is the enemy.
- The best results happened when they turned the "Creativity" knob all the way down (Temperature = 0). They wanted the AI to be a robot, not a poet. They wanted the exact same perfect recipe every time, not a "surprise" recipe.
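Why does Temperature = 0 make the output deterministic? Temperature rescales the model's scores before a word is sampled; as it approaches zero, the highest-scoring option always wins (greedy decoding). A toy sketch with made-up scores:

```python
import math
import random

def sample_next(logits, temperature):
    """Pick an option: greedy at temperature 0, randomly weighted otherwise."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring option.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [score / temperature for score in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # numerically stable softmax
    return random.choices(range(len(logits)), weights=weights)[0]

scores = [2.0, 1.0, 0.5]  # made-up scores for three candidate words
print(sample_next(scores, temperature=0))  # always index 0: deterministic
```

At higher temperatures the weights flatten out and lower-scoring words get picked more often, which is exactly the "surprise recipe" behaviour the researchers wanted to avoid.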
The Big Takeaway (The "So What?")
This paper tells us that AI is ready to help write software tests, but we have to use it correctly:
- Don't expect magic from vague ideas: You still need to write detailed requirements. If you do that, the AI can save you hours of work.
- Pick the right tool for the job: Don't just pick the "famous" AI. Try different ways of asking (prompts) to see which one fits your team's style.
- Keep it boring: For this specific task, turn off the "creative" mode. You want precision, not art.
- Use AI to check AI: The researchers found that one specific AI (DeepSeek) is really good at grading the work of other AIs, which could save companies a lot of money on human reviewers.
In short: AI is like a super-fast, super-literate sous-chef. If you give it a vague order, it will guess. But if you give it a detailed recipe card, it will chop, cook, and plate the dish faster than you can blink, leaving you free to enjoy the meal (or build the next feature).