From Law to Gherkin: A Human-Centred Quasi-Experiment on the Quality of LLM-Generated Behavioural Specifications from Food-Safety Regulations

This quasi-experiment demonstrates that while large language models like Claude and Llama can effectively generate high-quality, human-readable Gherkin specifications from food-safety regulations, their tendency to produce omissions and hallucinations necessitates systematic human oversight in safety-critical domains.

Shabnam Hassani, Mehrdad Sabetzadeh, Daniel Amyot

Published Thu, 12 Ma

Imagine you are a chef trying to build a robot that cooks food. But before you can program the robot, you have to read a 50-page government rulebook about food safety. These rules are written in "legalese"—dry, complex, and designed to apply to any technology, not just your specific robot.

Translating those rules into instructions the robot can understand is like trying to turn a thick, dusty encyclopedia into a simple, step-by-step recipe card. Doing this by hand is slow, boring, and easy to mess up. If you miss a step, the robot might serve you a poisonous meal, and you could get in huge trouble.

This paper is about testing a new "smart assistant" (AI) to see if it can do this translation job for us.

Here is the story of their experiment, broken down simply:

1. The Goal: The "Translator" Test

The researchers wanted to see if Large Language Models (LLMs)—the same kind of AI that powers chatbots like Claude and Llama—could read food safety laws and automatically write "Gherkin" specifications.

  • What is Gherkin? Think of Gherkin as a special "recipe card" language. Instead of writing code, you write simple sentences like: "Given the egg is frozen, When I check the bacteria, Then it must be under 50,000."
  • Why Gherkin? It's a bridge. It's written in plain English so humans can understand it, but it's structured enough that computers can read it to run automatic tests.
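To make this concrete, here is what a Gherkin "recipe card" looks like. This scenario is purely illustrative: the rule, the product, and the 5 percent threshold are made up for explanation, not quoted from the regulations used in the study:

```gherkin
Feature: Moisture limit for dried whole egg
  # Illustrative example only -- the regulation and threshold below
  # are invented for this article, not taken from the paper's dataset.

  Scenario: Dried whole egg exceeds the permitted moisture content
    Given a batch of dried whole egg is sampled
    When the moisture content of the sample is measured
    And the measured moisture content is above 5 percent
    Then the batch must be rejected as non-compliant
```

Tools such as Cucumber or behave can bind each Given/When/Then line to a snippet of test code, which is what makes these cards machine-checkable as well as human-readable.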

2. The Experiment: The "Taste Test"

The researchers didn't just ask the AI to do it; they set up a rigorous "taste test" with human experts.

  • The Ingredients: They took 30 real food safety laws (like rules about how much water is allowed in dried eggs).
  • The Chefs: They hired two AI chefs: Claude and Llama. Each AI turned the same 30 laws into its own set of Gherkin "recipe cards," yielding 60 cards in total.
  • The Judges: They recruited 10 human judges (mostly computer science students who know how to write these recipe cards).
  • The Task: Each judge looked at 12 of the AI-generated cards and rated them on five things:
    1. Relevance: Did it actually talk about the law?
    2. Clarity: Was it easy to understand?
    3. Completeness: Did it miss any steps?
    4. Singularity: Did it focus on just one thing per card, or did it mix everything up?
    5. Time Savings: Would this save me time if I used it?

3. The Results: The AI is a Great "Draftsman"

The results were surprisingly good, but with a catch.

  • The Good News: The AI did an amazing job. The human judges gave the AI-generated cards very high scores.

    • 95% were relevant.
    • 100% were clear.
    • 94% were complete.
    • The judges felt it would save them a massive amount of time.
  • The Verdict: Both AI chefs (Claude and Llama) performed almost exactly the same. Neither was clearly "better" than the other.
  • The Bad News (The "Hallucinations"): Even though the scores were high, the human judges found some dangerous errors when they looked closely.

    • The "Ghost" Ingredients (Hallucinations): Sometimes the AI invented rules that didn't exist. For example, one AI wrote a rule saying the robot should "display a warning light" if the temperature was wrong. But the original law never mentioned a warning light! The AI just made it up because it thought that's what a robot should do.
    • The "Missing" Ingredients (Omissions): Sometimes the AI forgot a crucial step. It might forget to mention that a label must be in both English and French, even though the law required it.
    • The "Mixed-Up" Recipe (Singularity): Sometimes the AI tried to do too much in one step. Instead of one card for "checking weight" and another for "checking bacteria," it mashed them together into one confusing card.
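The "mixed-up recipe" problem can even be spotted mechanically. The toy sketch below is not from the paper (the judges rated singularity by hand); it just illustrates the idea by counting how many Then outcomes a scenario bundles together. All function names here are invented for this example:

```python
# Toy illustration (not the paper's method): flag Gherkin scenarios
# that bundle several outcomes into one card, a rough proxy for the
# "singularity" criterion the human judges rated.

def count_then_steps(scenario_lines):
    """Count Then steps, including And steps that continue a Then."""
    count = 0
    in_then = False
    for line in scenario_lines:
        keyword = line.strip().split(" ", 1)[0]
        if keyword == "Then":
            in_then = True
            count += 1
        elif keyword == "And" and in_then:
            count += 1
        elif keyword in ("Given", "When"):
            in_then = False
    return count

def lacks_singularity(scenario_lines, limit=1):
    """A card with more than `limit` outcomes mixes concerns."""
    return count_then_steps(scenario_lines) > limit

# A "mixed-up" card: weight and bacteria checks mashed together.
mixed_card = [
    "Given a batch of dried egg is sampled",
    "When the batch is inspected",
    "Then the weight must be within the permitted range",
    "And the bacteria count must be under 50,000",
]
print(lacks_singularity(mixed_card))  # prints True: two outcomes in one card
```

A real review would still need a human, since no counter can tell a genuinely compound legal requirement from a lazy mash-up, which is exactly the paper's point.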

4. The Conclusion: AI is a Co-Pilot, Not the Captain

The paper concludes that AI is fantastic at writing the first draft, but it cannot be trusted to finish the job alone.

  • The Analogy: Think of the AI as a very fast, very creative intern. The intern can write a 10-page report in 10 seconds that looks perfect. But if you don't read it carefully, the intern might have invented a fake statistic or forgotten a key paragraph.
  • The Takeaway: In safety-critical fields (like food safety, medicine, or aviation), you cannot just let the AI run the show. You must have a human expert review every single line to catch the "ghost ingredients" and "missing steps."

In short: The AI can do the heavy lifting and save us hours of work, but we still need to be the editors to make sure the final product is safe and accurate.