Can ChatGPT Generate Realistic Synthetic System Requirement Specifications? Results of a Case Study

This case study shows that ChatGPT, guided by iterative prompt engineering, can generate realistic synthetic system requirement specifications across multiple industries. However, the resulting artifacts still contain significant flaws, so thorough expert evaluation remains necessary; LLM-based quality assessments alone cannot be trusted.

Alex R. Mattukat, Florian M. Braun, Horst Lichter

Published Wed, 11 Ma

Imagine you are a chef trying to teach a new cooking class. You need to show your students how to make a "perfect lasagna." But there's a problem: you can't show them a real lasagna because the recipe is a top-secret family heirloom, and the real ingredients are locked in a vault.

So, you ask a very smart, but slightly hallucinating, robot assistant (let's call him "ChatGPT") to write a recipe for a fake lasagna that looks and tastes just like the real thing. The question is: Can this robot write a fake recipe that is so good, even a professional chef can't tell the difference?

This paper is the story of a team of researchers who tried exactly that, but instead of lasagna, they were writing System Requirement Specifications (SyRSs).

What is a "System Requirement Specification"?

Think of a SyRS as the blueprint for a building. Before you build a skyscraper, you need a massive document that says:

  • "The building must have 50 floors."
  • "It must hold 1,000 people."
  • "It must survive an earthquake."
  • "The elevators must be fast."

In the real world, these blueprints are written by humans for real companies. But researchers who want to study how to build better blueprints can't always see the real ones because companies keep them secret (confidentiality).

The Experiment: The Robot Architect

The researchers wanted to know if they could use a "Black-Box" AI (like ChatGPT) to generate Synthetic System Requirement Specifications (SSyRSs). These are fake blueprints that look real but are made up.

They set up a strict challenge:

  1. No Real Blueprints: The AI couldn't look at any real documents.
  2. No Human Experts: The AI couldn't ask a human architect for help.
  3. 10 Different Industries: They asked the AI to write blueprints for 10 different worlds: E-commerce, Healthcare, Logistics, Finance, etc.

How They Did It (The "Prompt" Game)

The researchers didn't just say, "Write a blueprint." That would be like asking a child to "draw a house" and expecting a skyscraper. Instead, they used a game of refinement:

  1. The Template: They gave the AI a strict form to fill out (like a fill-in-the-blank worksheet).
  2. The Persona: They told the AI, "Pretend you are a grumpy, experienced construction manager."
  3. The Critic: After the AI wrote a blueprint, the AI had to grade itself. It had to say, "Is this realistic? If not, why?"
  4. The Loop: They did this 10 times. Every time the AI made a mistake, the researchers tweaked the instructions (the "prompt") to make the next version better.
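Stripped of the metaphor, the four steps above form a generate–critique–refine cycle. Here is a minimal, runnable sketch of that cycle; `call_llm` is a stand-in for any chat-completion API (stubbed with canned replies so the sketch is self-contained), and the persona text, prompt wording, and the 8/10 acceptance threshold are illustrative assumptions, not the paper's actual prompts.

```python
# Generate–critique–refine loop, as described above.
# NOTE: call_llm is a stub standing in for a real chat-completion API
# (e.g. ChatGPT); prompts and threshold are illustrative, not the paper's.

PERSONA = "You are a senior requirements engineer with 20 years of experience."
TEMPLATE = "Fill in: purpose, scope, functional and non-functional requirements."

def call_llm(prompt: str) -> str:
    # Canned replies keep the sketch self-contained and runnable.
    if prompt.startswith("Rate the realism"):
        return "6"  # pretend the model scores its own draft 6/10
    domain = prompt.split("Domain: ")[-1].splitlines()[0]
    return f"Draft SyRS for domain: {domain}"

def generate_spec(domain: str, rounds: int = 3) -> str:
    # Steps 1 and 2: template + persona constrain what the model writes.
    prompt = f"{PERSONA}\n{TEMPLATE}\nDomain: {domain}"
    draft = call_llm(prompt)
    for _ in range(rounds):
        # Step 3: the critic — the model grades its own draft.
        score = int(call_llm(f"Rate the realism of this spec 1 to 10:\n{draft}"))
        if score >= 8:  # accept once the self-grade is high enough
            break
        # Step 4: the loop — tweak the prompt and try again.
        prompt += "\nThe previous draft was unrealistic; be more specific."
        draft = call_llm(prompt)
    return draft

print(generate_spec("Healthcare"))  # → Draft SyRS for domain: Healthcare
```

With a real API behind `call_llm`, the researchers' extra twist is that the *human* also tweaks the prompt between runs, not just the loop itself.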

The Results: The "Uncanny Valley" of Blueprints

After generating 300 fake blueprints, they invited 87 real human experts (actual architects and engineers) to take a look.

The Good News:

  • 62% of the experts said, "Hey, this looks pretty realistic! I could almost use this."
  • The structure was perfect. The AI knew exactly where to put the "Safety Requirements" and the "Budget Constraints."
  • It was great at sounding confident and professional.

The Bad News (The Catch):

  • The "Hallucination" Trap: While the blueprints looked real, they were often full of hidden nonsense.
    • Analogy: Imagine a blueprint that says on page 2, "All exterior walls must be glass," and on page 40, "The building must be built inside a volcano." Each requirement looks fine on its own; read together, they are impossible to satisfy.
  • The Robot's Confidence: The AI was overconfident. It would write a fake rule with 100% certainty, even if the rule made no sense.
  • The Robot's Self-Grading was Flawed: When the AI graded its own work, it was inconsistent. One time it gave a blueprint a 9/10, and the next time (with the same blueprint), it gave it a 4/10. It couldn't be trusted to check its own work.
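One simple way to expose that inconsistency is to grade the *same* document several times and measure how much the scores disagree. The sketch below illustrates the idea; `llm_grade` is a stub that simulates a noisy grader (a real check would call the actual model each time), and the score range, trial count, and seed are illustrative assumptions.

```python
import random
import statistics

def llm_grade(spec: str, rng: random.Random) -> int:
    # Stub: simulates re-asking the model "rate this spec 1-10".
    # The noise mimics the run-to-run inconsistency the study observed;
    # a real check would issue an actual API call here.
    return rng.randint(4, 9)

def grading_spread(spec: str, trials: int = 10, seed: int = 42) -> float:
    # Grade the identical spec repeatedly and measure how much
    # the scores disagree with each other.
    rng = random.Random(seed)
    scores = [llm_grade(spec, rng) for _ in range(trials)]
    return statistics.pstdev(scores)

# A large spread means the grader contradicts itself on identical input,
# so its scores cannot serve as a quality gate on their own.
spread = grading_spread("The building must have 50 floors...")
print(f"score std-dev over 10 gradings: {spread:.2f}")
```

A near-zero spread would suggest the grader is at least consistent (though not necessarily correct); the 9/10-then-4/10 behavior reported above corresponds to a large spread.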

The Big Takeaway

The researchers concluded that ChatGPT is a fantastic "draftsman," but a terrible "inspector."

  • Can it generate realistic-looking specs? Yes, to a certain extent. It can create a document that looks like a professional blueprint.
  • Can we trust it without a human? Absolutely not.

The paper warns us that if you use AI to write these important documents, you must have a human expert double-check them. The AI is like a student who has memorized the textbook perfectly but has never actually built a house. It can write a perfect essay about building a house, but if you ask it to actually build one, it might try to use glue instead of cement.

Summary in One Sentence

ChatGPT can write a fake blueprint that looks real enough to fool a quick glance, but it's full of hidden errors and lies, so a human expert must always check the work before you start building.