Imagine you have a very smart, fast-talking assistant who claims they can draw complex blueprints for a factory just by listening to you describe how the factory works. You say, "First, the raw materials come in, then they get washed, then they go to the oven," and the assistant instantly draws a diagram.
This is exactly what the researchers in this paper tried to build: an AI "Copilot" that turns your spoken or written descriptions into BPMN diagrams (which are just fancy, standardized blueprints for business processes).
The team built a tool called KICoPro and asked five real-life experts (people who draw these blueprints for a living) to try it out. They wanted to know: Is this thing actually useful, or is it just a cool toy?
Here is the story of what they found, explained simply:
1. The "Nice Face, Shaky Hands" Problem
The experts liked the look and feel of the tool. It was easy to chat with, the buttons worked, and it felt friendly. On a "usability" test, it scored a decent 67 out of 100.
- The Analogy: Imagine a car with a beautiful leather interior, a smooth steering wheel, and a great radio. You love sitting in it. But when you try to drive it, the engine sputters, the brakes are unreliable, and the map is often wrong.
- The Result: The experts said, "I like driving this car, but I wouldn't trust it to get me to the hospital in an emergency." Their trust score was only 48 out of 100. They didn't believe the AI would get the job right every time.
2. The "Mind Reading" Struggle
The experts found a weird problem: They knew what they wanted (a blueprint), but they didn't know how to ask for it.
- The Analogy: It's like ordering a custom cake from a baker. You say, "I want a cake." The baker gives you a plain sponge. You say, "I want a chocolate cake with strawberries." The baker gives you a burnt chocolate cake with no strawberries. You realize you need to be a "cake whisperer" to get what you want, but the baker never asks, "Did you want vanilla or chocolate?"
- The Finding: The AI never asked clarifying questions. If the description was vague, the AI just guessed. If the process was long and complicated, the AI got confused and only drew half the picture.
3. The "Chunking" Hack
To get good results, the experts had to do extra mental work. They couldn't just describe the whole factory at once. They had to break their big idea into tiny, bite-sized pieces and ask the AI to draw them one by one.
- The Analogy: Instead of telling a painter, "Paint me the whole Grand Canyon," you have to say, "Paint the sky," then "Paint the rocks," then "Paint the river."
- The Problem: This made the experts tired. They had to hold the whole picture in their heads and stitch the pieces together themselves. The tool was supposed to save them work, but it actually made them work harder to "fix" the AI's mistakes.
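For the technically curious, the "chunking" workaround can be sketched in a few lines of code. This is purely illustrative: `generate_fragment` is a hypothetical stand-in for one call to the AI copilot (in the study, the experts did all of this manually in the chat window), and the stitching step at the end is exactly the extra mental work the experts complained about.

```python
def split_into_chunks(description: str) -> list[str]:
    """Break a long process description into bite-sized steps."""
    return [step.strip() for step in description.split(",") if step.strip()]


def generate_fragment(chunk: str) -> str:
    """Hypothetical stand-in for one AI call: small chunk in, diagram piece out."""
    return f"[task: {chunk}]"


def chunked_modeling(description: str) -> str:
    """Ask for the diagram piece by piece, then stitch the pieces together.

    Holding the overall picture in your head and doing this stitching
    yourself is the hidden cost of the chunking hack.
    """
    fragments = [generate_fragment(chunk) for chunk in split_into_chunks(description)]
    return " -> ".join(fragments)


print(chunked_modeling("raw materials come in, they get washed, they go to the oven"))
# -> [task: raw materials come in] -> [task: they get washed] -> [task: they go to the oven]
```

Note that nothing here checks whether the stitched-together pieces actually form a valid process, which is precisely the gap the experts had to fill by hand.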
4. The "Silent Partner"
The experts noticed that the AI often broke the rules of the road (the BPMN standard). Sometimes it drew connections that shouldn't be there, or left out important details, like who is responsible for a task.
- The Analogy: It's like a GPS that sometimes tells you to drive into a lake because it didn't check the map carefully.
- The Fear: In a real business, if you trust the AI and it draws a wrong blueprint, you might build a factory that doesn't work. The experts said, "I can't trust this for important decisions yet."
5. What the Experts Dreamed Of
Even with the flaws, the experts had big ideas for how this tool could be amazing in the future:
- The "Sketch-to-Draft" Bot: Imagine drawing a messy picture on a napkin, and the AI turns it into a professional blueprint instantly.
- The "Quality Police": The AI could check your existing blueprints to make sure you didn't break any company rules.
- The "Local Brain": A version of the AI trained specifically on your company's history, so it knows your specific jargon and rules.
The Big Takeaway
The main lesson from this paper is: Just because a tool is easy to use doesn't mean it's trustworthy.
You can have a chatbot that feels great to talk to (high usability), but if it keeps making mistakes or guessing wrong (low reliability), professionals won't use it for serious work.
The Conclusion: We can't just test AI with computers (checking if the code is right). We have to test it with humans to see if they feel safe using it. The future of AI in business isn't just about making it smarter; it's about making it clearer, more honest about its mistakes, and better at asking questions before it starts drawing.