User-driven development and evaluation of an agentic framework for analysis of large pathway diagrams: Plain-Language Explanation

Original authors: Corradi, M., Djidrovski, I., Ladeira, L., Staumont, B., Verhoeven, A., Sanz Serrano, J., Rougny, A., Vaez, A., Hemedan, A., Mazein, A., Niarakis, A., de Carvalho e Silva, A., Auffray, C., Wilighagen

Published 2026-03-12

Preprint hosted on bioRxiv.

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you have a massive, ancient library filled with millions of books about how the human body works. These aren't just normal books; they are intricate, hand-drawn maps showing how every single cell, chemical, and organ talks to one another. Scientists call these "molecular interaction maps."

The problem? These maps are so huge and complicated that even experts get lost in them. It's like trying to find a specific street in a city the size of a continent without a GPS.

This paper is about building a smart, talking GPS for these biological maps. They call this GPS "Llemy."

Here is the story of how they built it, how they tested it, and what they learned, explained simply:

1. The Problem: Getting Lost in the Jungle

Scientists have created beautiful, detailed maps of how our bodies fight disease, process food, or react to drugs. But these maps are hard to read. If you want to know, "What happens to my liver if I take this specific pill?" you might have to spend hours hunting through thousands of lines of data.

2. The Solution: A "Digital Librarian" (Llemy)

The team decided to use Large Language Models (LLMs), the same kind of AI technology that powers chatbots, to act as a guide. They built a system called Llemy.

Think of Llemy as a super-smart tour guide who checks the relevant map every time you ask a question, rather than answering from memory.

You ask a question: "Show me how bile acids are processed in the liver."
Llemy looks at the map: It doesn't just guess; it actually reads the specific map you selected.
Llemy answers: It gives you a summary, points out the exact parts of the map you need to look at, and even tells you which scientific papers back up the information.

3. How They Built It: The "Hackathon" Kitchen

Instead of building this in a quiet office and hoping it works, the team did something different. They held a two-day "hackathon" (a coding marathon) with the people who actually make and use these maps (liver-toxicity specialists, biologists, and map curators), alongside AI specialists.

The Analogy: Imagine a chef designing a new kitchen tool. Instead of working alone, the chef invites working cooks into the kitchen to try it on real ingredients and say, "This is too hard to chop," or "I need a sharper knife."
The team built a rough prototype, let the experts try to break it, and then fixed it immediately based on what the experts said.

4. The Test Drive: 25 Users Take the Wheel

Once the team had a working version, they invited 25 experts to "test drive" Llemy. The experts asked the system all sorts of questions, from simple ones ("What is this protein?") to complex ones ("What happens if we block this pathway?").

After every answer, the users gave the system a report card with three grades:

Accuracy: Was the information correct?
Conciseness: Was it short and to the point?
Reliability: Did it provide good links to the source so you could check the work?

5. What They Found: The Good, The Bad, and The Slow

The results were a mix of excitement and caution:

The Good: The system was great at summarizing complex maps. It acted like a helpful translator, turning confusing diagrams into plain English. Users felt it saved them time.
The Bad:
- Speed: When the system took too long to think, users got annoyed and gave it lower grades. It's like waiting too long for a waiter to bring your menu.
- Hallucinations: Sometimes the AI made things up, or failed to recognize something because it was given under a different name (for example, a gene's official name instead of its common abbreviation).
- Inconsistency: If you asked the exact same question twice, you might get two slightly different answers. This is a common quirk of AI, but it makes scientists nervous.

6. The Future: Making the GPS Better

The paper concludes that while Llemy is a promising start, it needs more work.

The Roadmap: They plan to make it faster, fix the "made-up facts," and make sure it always points to the right source material.
The Big Picture: They want to move away from expensive, closed AI systems to openly available AI models (ones that anyone can download and run) so that scientists everywhere can use and improve this tool without paying huge fees.

The Bottom Line

This paper isn't just about a new software tool; it's about a new way of building science tools. Instead of scientists building tools in isolation and hoping users like them, they built this tool with the users, testing it constantly.

Llemy is like a prototype for a future where complex biological data isn't locked behind a wall of jargon, but is accessible to anyone with a simple question, guided by a smart, honest, and helpful AI companion.

1. Problem Statement

Biomedical knowledge repositories, specifically molecular interaction maps (e.g., pathway diagrams, knowledge graphs), are growing in number, size, and complexity. While these resources (hosted on platforms like MINERVA) are crucial for hypothesis generation and experimental design, they present significant barriers to navigation:

Complexity: Maps are manually curated, follow strict standards (SBGN/SBML), and contain dense interconnections that are difficult for novice users or even experts to traverse efficiently.
Fragmentation: Repositories are distributed with varying scopes and granularities, lacking unified access interfaces.
Retrieval Gap: While Large Language Models (LLMs) are emerging as tools for summarizing and analyzing structured knowledge, there is a lack of dedicated, evaluated solutions specifically designed to interact with and interpret interactive molecular interaction maps. Existing solutions often lack the ability to ground LLM outputs in specific diagram elements or provide verifiable references.

2. Methodology

The authors developed Llemy, an LLM-based agentic framework, using a rigorous user-driven development lifecycle.

A. Development Process

Hackathon Prototyping: The system was initially prototyped during a two-day hackathon involving hepatotoxicologists, curators, computational biologists, and LLM experts.
Iterative Refinement: Domain experts provided specific user prompts (including edge cases like false premises and out-of-scope queries) to guide the system's architecture and the tuning of its prompt instructions.
System Architecture:
- Backend: Built in Python, with LangChain (v0.3.27) as the agent framework.
- Frontend: Developed with Streamlit (v1.50.0).
- Model: Utilized GPT-4.1-nano (commercial LLM) for the synthesis agent.
- Workflow:
  1. Prompt Enrichment: User input is enriched with instructions for scientific focus and literature citation.
  2. Parallel Retrieval: Two agents operate in parallel:
    - One fetches map data (nodes, edges, annotations) from the MINERVA Platform via API.
    - One performs deep research using Perplexity (for external context).
  3. Synthesis: A third agent combines the enriched prompt, map data, and external research to generate a response.
  4. Post-processing: The output is formatted to include clickable links to specific elements within the original map for traceability.
- Deployment: Hosted on a cloud platform (VHP4Safety) and available as a Docker container. API keys are session-based for security.
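
The first three workflow steps above can be sketched with plain `asyncio`. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`enrich_prompt`, `fetch_map_data`, `run_external_research`, `synthesise`) and their stubbed return values are hypothetical, and the real system uses LangChain agents, the MINERVA API, Perplexity, and an LLM in their place. The post-processing step is omitted.

```python
import asyncio


def enrich_prompt(user_prompt: str) -> str:
    """Step 1: prepend instructions on scientific focus and citation (hypothetical wording)."""
    return (
        "Answer with a scientific focus and cite supporting literature.\n"
        f"Question: {user_prompt}"
    )


async def fetch_map_data(prompt: str) -> dict:
    """Step 2a: stand-in for the agent that queries the map platform's API."""
    await asyncio.sleep(0)  # placeholder for network I/O
    # Illustrative values only, not real map content.
    return {"nodes": ["CYP7A1"], "edges": [], "annotations": []}


async def run_external_research(prompt: str) -> str:
    """Step 2b: stand-in for the agent that gathers external context."""
    await asyncio.sleep(0)  # placeholder for network I/O
    return "external context (stub)"


def synthesise(prompt: str, map_data: dict, research: str) -> str:
    """Step 3: stand-in for the LLM call that merges all inputs."""
    return f"{prompt}\nMap nodes: {', '.join(map_data['nodes'])}\nContext: {research}"


async def answer(user_prompt: str) -> str:
    enriched = enrich_prompt(user_prompt)
    # Both retrieval agents run concurrently; gather returns results in call order.
    map_data, research = await asyncio.gather(
        fetch_map_data(enriched), run_external_research(enriched)
    )
    return synthesise(enriched, map_data, research)


result = asyncio.run(answer("How are bile acids processed in the liver?"))
print(result)
```

Running the two retrieval steps concurrently means the total wait is roughly that of the slower agent rather than the sum of both, which matters given how sensitive user ratings were to response time.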

B. Evaluation Study

Participants: 25 users recruited from the Disease Maps Community (including developers, curators, and end-users).
Data Collection:
- Prompt Dataset: 157 individual user prompts with system responses and user ratings (1–5 scale) for Accuracy, Conciseness, and Reliability.
- Summary Dataset: A post-study survey from 19 users regarding overall usability, time savings, and output variability.
Statistical Analysis:
- Prompts were categorized into three types: Summarise, Find, and Analyse.
- A Cumulative Link Mixed Model (CLMM) was used to analyze the impact of response time on performance metrics, accounting for within-user correlation.
- Dunn's test with Holm correction was used for post-hoc pairwise comparisons across prompt categories.
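
To make the Holm correction concrete, here is a pure-Python sketch of the step-down adjustment applied to three pairwise comparisons. The raw p-values are made up for illustration and are not results from the study; the helper name `holm_adjust` is likewise an assumption, and a statistics package would normally be used instead.

```python
def holm_adjust(p_values: list[float]) -> list[float]:
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Indices sorted by ascending raw p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease along the ordering.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Made-up raw p-values for the three pairwise comparisons (not the study's results).
raw = {
    "Summarise vs Find": 0.01,
    "Summarise vs Analyse": 0.04,
    "Find vs Analyse": 0.03,
}
adjusted = dict(zip(raw, holm_adjust(list(raw.values()))))
for pair, p in adjusted.items():
    print(f"{pair}: adjusted p = {p:.3f}")
```

Holm's method is uniformly at least as powerful as a plain Bonferroni correction while still controlling the family-wise error rate, which is why it is a common choice for a small set of post-hoc comparisons like this one.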

3. Key Contributions

Llemy Framework: The first agentic system specifically designed to explore and analyze large, interactive molecular interaction maps by grounding LLM responses in curated map data.
User-Centric Design: A novel development pipeline where domain experts co-designed the system via a hackathon and provided granular feedback on specific prompts, ensuring the tool addresses real-world scientific needs.
Traceability Mechanism: The system automatically generates clickable links to specific map elements (nodes/reactions) in the output, allowing users to verify the source of the LLM's claims directly within the MINERVA platform.
Benchmarking Framework: Established a methodology for evaluating LLM performance in the context of interactive diagram exploration, identifying specific failure modes (e.g., synonym handling, hallucinated references).
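
The traceability idea can be illustrated with a small post-processing sketch that wraps known element names in links. Everything specific here is an assumption: the URL template, the `link_elements` helper, the Markdown link syntax, and the element identifier are hypothetical, since the summary does not give the actual MINERVA link format.

```python
import re

# Hypothetical URL template: the real link format is not given in the text.
LINK_TEMPLATE = "https://example.org/map/{map_id}?element={element_id}"


def link_elements(text: str, map_id: str, element_ids: dict[str, int]) -> str:
    """Replace each known element name in `text` with a Markdown link to that element."""
    if not element_ids:
        return text
    # Longest names first so a longer name is matched before any shorter prefix of it.
    names = sorted(element_ids, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")

    def to_link(match: re.Match) -> str:
        name = match.group(1)
        url = LINK_TEMPLATE.format(map_id=map_id, element_id=element_ids[name])
        return f"[{name}]({url})"

    return pattern.sub(to_link, text)


linked = link_elements(
    "CYP7A1 converts cholesterol.",
    map_id="demo",
    element_ids={"CYP7A1": 101},  # made-up identifier
)
print(linked)
```

Because links are built only from identifiers that were actually retrieved from the map, a name the model invented would stay unlinked in this sketch, which gives the reader a visible cue about what is grounded in the map.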

4. Results

Performance Metrics:
- Medians: Accuracy (4/5), Reliability (4/5), Conciseness (3/5).
- Response Time: A significant negative association was found between response time and perceived quality ($\beta = -0.34$, $p < 0.001$). Longer delays were associated with lower ratings across all metrics.
- Task Categories:
  - Summarise: Received the highest mean scores.
  - Find: Showed the broadest distribution of scores and lower evaluations, likely because retrieving specific elements from large, densely connected maps is difficult.
  - Analyse: Performed similarly to "Find" but with fewer requests.
Qualitative Feedback:
- Strengths: Users praised the system's ability to provide comprehensive summaries and correctly identify pathway connections.
- Weaknesses:
  - Factual Errors: Occasional hallucinations of map content or reaction references.
  - Nomenclature Issues: Failure to recognize entities when queries used HGNC names instead of common abbreviations (synonym handling).
  - Context: Lack of domain-specific framing (e.g., failing to specify organ context).
  - Reliability: Inconsistent hyperlink behavior (broken links or non-resolving references) and variable output structure.
Variability: 95% of users reported moderate to high variability in output for similar queries (rating $\geq 3$ on a 1–5 scale), a known limitation of current commercial LLMs.
Usability: Over 80% of users rated the system's usability as high (4 or 5), and 75% reported time savings.
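
As a rough illustration of how such summary figures are derived, the sketch below computes per-category medians and the share of ratings at or above a threshold. The records are made up to match the described shape of the prompt dataset (category plus three 1–5 ratings); they are not the study's data, and the helper names are assumptions.

```python
from collections import defaultdict
from statistics import median

# Made-up records: (prompt category, accuracy, conciseness, reliability), each rated 1-5.
records = [
    ("Summarise", 5, 4, 4),
    ("Summarise", 4, 3, 5),
    ("Find", 2, 3, 3),
    ("Find", 4, 2, 4),
    ("Analyse", 3, 3, 4),
]

METRICS = ("accuracy", "conciseness", "reliability")


def median_by_category(rows):
    """Group ratings by prompt category and metric, then take the median of each group."""
    grouped = defaultdict(lambda: defaultdict(list))
    for category, *scores in rows:
        for metric, score in zip(METRICS, scores):
            grouped[category][metric].append(score)
    return {
        category: {metric: median(values) for metric, values in metrics.items()}
        for category, metrics in grouped.items()
    }


def share_at_least(values, threshold):
    """Fraction of ratings that are greater than or equal to the threshold."""
    return sum(v >= threshold for v in values) / len(values)


medians = median_by_category(records)
print(medians)
print(share_at_least([1, 3, 4, 5], threshold=3))
```

Medians are the natural summary here because the ratings are ordinal: the distance between a 2 and a 3 need not equal the distance between a 4 and a 5.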

5. Significance and Future Outlook

Lowering Barriers: Llemy demonstrates that LLMs can effectively reduce the complexity barrier for navigating biological pathway maps, acting as a powerful entry point for knowledge retrieval.
Open Research & Sustainability: The authors advocate for a transition from commercial LLMs to open-weight models to ensure reproducibility, reduce costs, and support the open research environment. They highlight the need for dedicated infrastructure and benchmark datasets to compare open vs. commercial models.
Roadmap:
- Short-term: Improve response times and reference accuracy.
- Medium-term: Implement dedicated workflows for specific tasks (Summarise, Find, Analyse) and integrate with the MINERVA GUI via plugins.
- Long-term: Adopt graph-based retrieval-augmented generation (RAG) to overcome context window limitations, and enable programmatic integration using the Model Context Protocol (MCP).
Community Engagement: The study emphasizes that continuous user-driven development and open-ended benchmarking are essential for keeping these tools aligned with the evolving needs of the systems biology community.
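
One way to picture the long-term graph-retrieval idea is to pass the model only the neighbourhood of the entities a question mentions, instead of the whole map. The sketch below is an assumption about what such retrieval could look like, not the authors' design: the `k_hop_subgraph` helper and the toy graph are hypothetical.

```python
from collections import deque


def k_hop_subgraph(adjacency: dict[str, list[str]], seeds: list[str], k: int) -> set[str]:
    """Breadth-first search returning every node within k edges of any seed node."""
    visited = {s for s in seeds if s in adjacency}
    queue = deque((s, 0) for s in visited)
    while queue:
        node, depth = queue.popleft()
        if depth == k:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, depth + 1))
    return visited


# Toy undirected map with made-up node names.
toy_map = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}
context_nodes = k_hop_subgraph(toy_map, seeds=["A"], k=2)
print(sorted(context_nodes))
```

The hop limit `k` trades completeness for size: a larger `k` captures more of the surrounding pathway but sends more text to the model, so it would need tuning against the context window.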

In conclusion, this paper presents a user-tested framework for integrating LLMs with structured biological data, offering a blueprint for future tools that bridge the gap between complex biomedical knowledge graphs and human interpretation.

User-driven development and evaluation of an agentic framework for analysis of large pathway diagrams