Mozi: The "Governed Autopilot" for Drug Discovery

Imagine trying to build a skyscraper. You have a brilliant architect (the AI) who can dream up amazing designs, but if you let them run wild without a foreman, they might try to build a bridge out of jelly or forget to check if the foundation is solid. In the world of drug discovery, a mistake isn't just a collapsed building; it's a wasted decade and billions of dollars.

This paper introduces Mozi, a system designed to act as a "Chief of Staff" for AI scientists. Rather than letting the AI chat and guess, it keeps the AI inside a strict, safe, and well-organized process.

Here is how Mozi works, explained through simple analogies:

1. The Problem: The "Wild West" AI

Current AI agents are like enthusiastic interns who read a lot of books but have never worked in a lab.

The Issue: If you ask a standard AI to "find a cure for Alzheimer's," it might hallucinate (make things up), pick the wrong tools, or get stuck in a loop. In drug discovery, one small mistake early on (like picking the wrong protein) ruins everything that comes after. It's like trying to bake a cake with the wrong flour; no matter how good the frosting is, the cake is ruined.
The Bottleneck: We need AI that is creative but also follows strict safety rules and scientific procedures.

2. The Solution: Mozi's Two-Layer Architecture

Mozi solves this by splitting the work into two distinct teams, like a General and a Factory.

Layer A: The General (The Control Plane)

Think of this as the Project Manager or the Traffic Cop.

Role: It doesn't do the heavy lifting. Instead, it listens to your request (e.g., "Find a drug for Sepsis") and breaks it down into a strict checklist.
The Rules: It acts as a gatekeeper. It decides who is allowed to do what.
- Analogy: If the "Research Intern" asks to use the expensive, dangerous nuclear reactor (a complex simulation), the General says, "No, you can only use the library books."
- Self-Correction: If the intern makes a mistake, the General stops the process, says, "Wait, that didn't work," and re-plans the route before moving forward. It prevents the AI from drifting off course.

Layer B: The Factory (The Workflow Plane)

Think of this as the Assembly Line or the Master Chefs.

Role: This is where the actual work happens. Mozi has pre-built, step-by-step "recipes" for drug discovery (like Target Identification, Finding Hits, and Optimizing Leads).
The Safety Net: These recipes are rigid. They ensure that if Step 1 (finding a protein) isn't perfect, Step 2 (testing drugs) never starts.
The Human Checkpoint: Crucially, at the most dangerous or uncertain moments, the system pauses. It calls a human expert (a real scientist) to say, "Hey, we found this protein. Is this the right one?" The human hits "Go," and the machine continues. This turns the AI from a "black box" into a co-scientist.

3. How It Works in Real Life: The "Drug Discovery Pipeline"

Mozi treats drug discovery like a relay race with four distinct legs. The baton (the data) must be passed perfectly between runners.

Target Identification (Finding the Enemy): The AI looks at a disease (like Parkinson's) and finds the specific protein causing the trouble.
- Mozi's trick: It checks multiple databases and asks a human, "Are we sure this is the right target?" before moving on.
Hit Identification (Finding the Bullet): It searches millions of chemical compounds to find ones that might stick to that protein.
- Mozi's trick: It uses two strategies at once: one that creates new molecules from scratch and another that screens existing libraries. It filters out the junk immediately.
Hit-to-Lead (Polishing the Bullet): It takes the best candidates and tweaks them to make them stronger and safer.
- Mozi's trick: It runs strict "safety tests" (like checking if the drug would poison the liver). If a candidate fails, it's tossed out automatically.
Lead Optimization (The Final Polish): It fine-tunes the best candidate to ensure it works in the human body, can be manufactured, and isn't toxic.

4. Why Mozi is a Game-Changer

The paper tested Mozi on real-world scenarios (Crohn's disease, Parkinson's, and Sepsis) and compared it to other AI systems.

Reliability: While other AIs often get confused or hallucinate, Mozi's "General" keeps it on track. If a computer simulation crashes, Mozi catches the error, logs it, and keeps going without the whole system failing.
Speed & Scale: In the Parkinson's test, Mozi screened 377,760 compounds in about 35 minutes, far faster than a human team could manage by hand.
Quality: The drugs Mozi designed were not just random guesses. In computer-based (in silico) evaluation, they were chemically sound, showed improved safety profiles, and were competitive with an existing clinical benchmark compound.

The Bottom Line

Mozi is the bridge between "Creative AI" and "Rigid Science."

Before Mozi, using AI for drug discovery was like giving a toddler a scalpel and saying, "Go perform surgery." It might work by luck, but it's dangerous.
With Mozi, it's like giving the toddler a scalpel but putting them in a surgical theater with a strict head surgeon watching every move, ready to step in if things go wrong.

It transforms the AI from a chatty, unreliable conversationalist into a reliable, governed co-scientist that can help us discover life-saving medicines faster and more safely than before.

1. Problem Statement

The paper addresses the critical bottleneck in deploying Large Language Model (LLM) agents for high-stakes scientific domains, specifically drug discovery. While tool-augmented LLMs promise to unify reasoning and computation, their application in pharmaceutical pipelines is hindered by two primary issues:

Unconstrained Tool-Use Governance: Generic LLM agents often suffer from "hallucinations" regarding tool parameters, violate strict Standard Operating Procedures (SOPs), and lack role-based access control, leading to unsafe or invalid execution in regulated environments.
Poor Long-Horizon Reliability: In complex, multi-stage workflows (e.g., Target Identification $\to$ Lead Optimization), early-stage errors or hallucinations multiplicatively compound, causing downstream failures. Purely generative agents lack the state management and deterministic rigor required to maintain scientific validity over long trajectories.

2. Methodology: The Mozi Architecture

Mozi introduces a Dual-Layer Architecture designed to bridge the flexibility of generative AI with the deterministic rigor of computational biology. The system operates on the principle of "free-form reasoning for safe tasks, structured execution for long-horizon pipelines."

Layer A: The Control Plane (Governance & Orchestration)

This layer acts as a hierarchical supervisor-worker system to manage unstructured reasoning and ensure safety.

Supervisor-Worker Hierarchy: A central Supervisor Agent decomposes high-level user intents into minimal, bounded plans. It delegates tasks to specialized Workers (e.g., Research Agent, Computation Agent) with isolated context windows.
Governed Action Spaces: Instead of open-ended exploration, the system enforces Role-Based Tool Isolation. Tools are filtered based on agent clearance (e.g., a "Research" agent cannot trigger expensive docking simulations).
Reflection & Replanning: The Supervisor employs a reflection mechanism to evaluate step completion. If a step fails or yields insufficient information, it triggers dynamic replanning rather than blindly continuing, preventing error propagation.
Intent Routing: A prompt-based router classifies tasks into Knowledge Retrieval, Single-Stage Tasks, or End-to-End Workflows, directing them to the appropriate execution path.
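
The role-based tool isolation and the reflect-then-replan behavior described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Mozi's implementation: the role names, tool names, the `ROLE_TOOLS` table, the `Step` type, and the `run_plan` and `reassign_to_computation` functions are all hypothetical, and a raised exception stands in for the Supervisor's judgment that a step failed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

# Hypothetical tools; real ones would wrap databases or simulators.
TOOLS: Dict[str, Callable[[str], str]] = {
    "literature_search": lambda query: f"papers about {query}",
    "docking": lambda ligand: f"docking score for {ligand}",
}

# Hypothetical clearance table: which tools each worker role may call.
ROLE_TOOLS: Dict[str, FrozenSet[str]] = {
    "research": frozenset({"literature_search"}),
    "computation": frozenset({"literature_search", "docking"}),
}


@dataclass(frozen=True)
class Step:
    role: str
    tool: str
    arg: str


def call_tool(role: str, tool: str, arg: str) -> str:
    """Run a tool only if the calling role is cleared for it."""
    if tool not in ROLE_TOOLS.get(role, frozenset()):
        raise PermissionError(f"role {role!r} may not call {tool!r}")
    return TOOLS[tool](arg)


Replanner = Callable[[Step, Exception, List[Step]], List[Step]]


def run_plan(plan: List[Step], replan: Replanner, max_replans: int = 2) -> List[str]:
    """Execute steps in order; on failure, ask `replan` for a new remainder."""
    queue, results, replans = list(plan), [], 0
    while queue:
        step = queue.pop(0)
        try:
            results.append(call_tool(step.role, step.tool, step.arg))
        except Exception as err:
            if replans >= max_replans:
                raise  # give up instead of looping forever
            replans += 1
            queue = replan(step, err, queue)
    return results


def reassign_to_computation(step: Step, err: Exception, rest: List[Step]) -> List[Step]:
    """Toy replanner: retry the failed step with the computation worker."""
    return [Step("computation", step.tool, step.arg)] + rest


plan = [
    Step("research", "literature_search", "Parkinson's disease"),
    Step("research", "docking", "ligand-1"),  # not permitted for this role
]
outputs = run_plan(plan, reassign_to_computation)
```

In this toy run the second step is refused because the research role has no clearance for docking, and the replanner hands it to the computation role instead of letting the plan continue blindly. The `max_replans` cap is one simple way to keep a failing plan from looping indefinitely.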

Layer B: The Workflow Plane (Stateful Execution)

This layer operationalizes canonical drug discovery stages as Composable Stateful Skill Graphs, structured as directed acyclic graphs (DAGs).
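
One way to picture a stateful skill graph is as a set of nodes that each declare which state keys they need and which they produce, with an optional human approval gate before results are committed. The sketch below is an assumption-laden illustration rather than Mozi's code: `Node`, `run_graph`, the `needs_approval` flag, and the stage outputs are hypothetical names and values. It relies on `graphlib.TopologicalSorter` from the Python standard library (3.9+), which also rejects cyclic graphs.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Set

State = Dict[str, object]


@dataclass
class Node:
    name: str
    requires: Set[str]            # input contract: keys that must exist
    provides: Set[str]            # output contract: keys the node must add
    run: Callable[[State], State]
    needs_approval: bool = False  # hypothetical human checkpoint flag


def run_graph(nodes: List[Node], state: State,
              approve: Callable[[str, State], bool]) -> State:
    """Run nodes in dependency order, validating contracts at each hand-off."""
    by_name = {n.name: n for n in nodes}
    # A node depends on every other node that provides one of its inputs.
    deps = {n.name: {m.name for m in nodes
                     if m is not n and m.provides & n.requires}
            for n in nodes}
    for name in TopologicalSorter(deps).static_order():
        node = by_name[name]
        missing = node.requires - set(state)
        if missing:
            raise ValueError(f"{name}: missing inputs {sorted(missing)}")
        out = node.run(dict(state))
        absent = node.provides - set(out)
        if absent:
            raise ValueError(f"{name}: missing outputs {sorted(absent)}")
        if node.needs_approval and not approve(name, out):
            raise RuntimeError(f"{name}: rejected at human checkpoint")
        state = {**state, **out}  # commit only validated, approved output
    return state


# Made-up two-stage graph, deliberately listed out of order.
nodes = [
    Node("hit_id", {"target"}, {"hits"},
         lambda s: {"hits": ["mol-A", "mol-B"]}),
    Node("target_id", {"disease"}, {"target"},
         lambda s: {"target": "protein-X"}, needs_approval=True),
]
final = run_graph(nodes, {"disease": "sepsis"}, approve=lambda name, out: True)
```

Because the hit-finding node requires a `target` key, it cannot run until target identification has produced one and a reviewer has approved it. If the `approve` callback returns `False`, the run stops before any downstream stage starts.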

Skill Graphs: Abstract scientific protocols (Target ID, Hit ID, Hit-to-Lead, Lead Optimization) are encoded as graphs with explicit state interfaces. This ensures valid data flow and prevents "state-loss" between steps.
Data Contracts & Format Adapters: Nodes in the graph enforce strict input/output schemas. Adapters programmatically validate and clean data (e.g., ensuring PDB files are formatted correctly) before passing them to computational tools, mitigating "garbage in, garbage out" failures.
Human-in-the-Loop (HITL) Checkpoints: Critical decision boundaries (e.g., selecting a protein structure or finalizing a candidate list) include mandatory human validation gates. Experts can approve, reject, or roll back the process, ensuring alignment with clinical tractability.
Parallel Strategies: For Hit Identification, Mozi employs a Dual-Stream Strategy:
- Generative Path: Uses diffusion models (e.g., DiffSBDD) for de novo design.
- Screening Path: Uses deep learning-based high-throughput virtual screening (HTVS) on commercial libraries.
- Results are fused, deduplicated, and re-ranked.
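
The final fuse, deduplicate, and re-rank step of the dual-stream strategy can be illustrated as follows. This is a sketch under assumptions the summary does not confirm: that both streams emit a comparable score where higher is better, and that SMILES strings are already canonical so identical molecules share one string. `Candidate` and `fuse` are hypothetical names, and the molecules and scores are made up.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Candidate:
    smiles: str   # assumed canonical, so equal molecules compare equal
    score: float  # assumed comparable across streams; higher is better
    source: str   # "generative" or "screening"


def fuse(generated: Iterable[Candidate], screened: Iterable[Candidate],
         top_k: Optional[int] = None) -> List[Candidate]:
    """Merge both streams, keep the best entry per molecule, sort by score."""
    best: Dict[str, Candidate] = {}
    for cand in list(generated) + list(screened):
        current = best.get(cand.smiles)
        if current is None or cand.score > current.score:
            best[cand.smiles] = cand
    ranked = sorted(best.values(), key=lambda c: c.score, reverse=True)
    return ranked if top_k is None else ranked[:top_k]


generated = [Candidate("CCO", 0.61, "generative"),
             Candidate("c1ccccc1", 0.80, "generative")]
screened = [Candidate("CCO", 0.72, "screening"),
            Candidate("CCN", 0.55, "screening")]
ranked = fuse(generated, screened)
```

Here `CCO` appears in both streams, so only its higher-scoring entry survives. In practice the deduplication key would need a real canonicalization step from a cheminformatics toolkit, which is omitted here to keep the example dependency-free.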

Infrastructure: Model Context Protocol (MCP)

Mozi utilizes MCP to federate heterogeneous tools (databases like UniProt/PDB, computational tools like AutoDock Vina, and generative models) into a unified service layer, abstracting complexity while maintaining provenance tracking.

3. Key Contributions

Governed Autonomy Framework: Proposes a novel dual-layer architecture that separates high-level governance (Layer A) from structured workflow execution (Layer B), solving the trade-off between agent flexibility and scientific rigor.
Stateful Skill Graphs: Encodes the entire drug discovery pipeline as state-aware DAGs, ensuring reproducibility and preventing error accumulation in long-horizon tasks.
Role-Based Tool Governance: Implements hard-coded tool filtering and isolation modes (Strict vs. Permissive) to prevent unauthorized tool usage and ensure resource safety.
PharmaBench: Introduces a curated benchmark of 88 tasks covering the full drug discovery spectrum, from quantitative tasks drawn from the Therapeutics Data Commons (TDC) to expert reasoning scenarios drawn from Humanity's Last Exam (HLE), to evaluate agent reliability.
End-to-End Case Studies: Demonstrates the system's ability to navigate complex chemical spaces for Crohn's disease, Parkinson's disease, and Sepsis, successfully integrating HITL checkpoints and handling computational failures gracefully.
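
Handling computational failures gracefully usually comes down to containing each failure at the smallest unit of work. The sketch below shows that pattern for a batch of docking jobs. It is a generic illustration, not Mozi's code: `dock_all` and `fake_dock` are hypothetical, and `fake_dock` stands in for a real docking call whose interface is not described in this summary.

```python
import logging
from typing import Callable, Dict, Iterable, List, Tuple

logger = logging.getLogger("skill_graph.docking")


def dock_all(ligands: Iterable[str], dock: Callable[[str], float]
             ) -> Tuple[Dict[str, float], List[Tuple[str, str]]]:
    """Dock each ligand independently; record failures instead of aborting."""
    scores: Dict[str, float] = {}
    failures: List[Tuple[str, str]] = []
    for ligand in ligands:
        try:
            scores[ligand] = dock(ligand)
        except Exception as err:  # contain the failure to this one ligand
            logger.warning("docking failed for %s: %s", ligand, err)
            failures.append((ligand, str(err)))
    return scores, failures


def fake_dock(ligand: str) -> float:
    """Stand-in for a real docking call; fails on one made-up input."""
    if ligand == "bad-ligand":
        raise RuntimeError("simulated docking crash")
    return -7.5  # placeholder score, not a real result


scores, failures = dock_all(["lig-1", "bad-ligand", "lig-2"], fake_dock)
```

The failed ligand is logged and returned in `failures` while the other two still receive scores, so a downstream node can decide whether enough results survived to continue.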

4. Results

Benchmark Performance (PharmaBench):
- Mozi outperformed the baseline Biomni, achieving higher classification accuracy and lower regression error (symmetric mean absolute percentage error, SMAPE) across both 30B and 235B parameter model backends.
- On the HLE Drug Discovery Subset (28 complex reasoning tasks), Mozi (using Qwen3-235B) achieved 17.86% exact-match accuracy, surpassing general-purpose models like Gemini-2.5-Pro (10.71%) and other specialized agents.
Case Study Efficiency:
- Parkinson's Disease: Screened 377,760 compounds in ~35 minutes using a parallelized drug-target interaction (DTI) prefilter, identifying high-affinity hits.
- Error Containment: In the Sepsis case, the system handled 7 AutoDock Vina failures locally within the skill graph without crashing the entire pipeline.
- Multi-Objective Optimization: In the Parkinson's case, the system detected severe hERG toxicity liabilities in intermediate hits and autonomously navigated the chemical space to generate a novel scaffold with an improved safety profile that outperformed the clinical benchmark DNL-201 on in silico metrics.
Comparative Analysis: Compared against platforms like BIOS and K-Dense, Mozi demonstrated superior adherence to physicochemical constraints (e.g., molecular weight for BBB penetration) and more robust end-to-end execution, whereas other systems often failed to generate novel molecules or produced designs violating fundamental drug-likeness rules.
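
To make "drug-likeness rules" concrete, the sketch below checks precomputed descriptors against Lipinski's rule of five, a widely used heuristic (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors, with one violation commonly tolerated). This summary does not state which rules or thresholds Mozi applies, so treat this as an illustration only. The `Descriptors` type, the `passes` function, the optional `max_mol_weight` cap, and all descriptor values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Descriptors:
    name: str
    mol_weight: float  # Da
    logp: float
    h_donors: int
    h_acceptors: int


def rule_of_five_violations(d: Descriptors) -> List[str]:
    """List which of Lipinski's four criteria the molecule breaks."""
    violations = []
    if d.mol_weight > 500:
        violations.append("mol_weight > 500")
    if d.logp > 5:
        violations.append("logp > 5")
    if d.h_donors > 5:
        violations.append("h_donors > 5")
    if d.h_acceptors > 10:
        violations.append("h_acceptors > 10")
    return violations


def passes(d: Descriptors, max_violations: int = 1,
           max_mol_weight: Optional[float] = None) -> bool:
    """Apply the rule of five plus an optional, stricter weight cap."""
    if max_mol_weight is not None and d.mol_weight > max_mol_weight:
        return False
    return len(rule_of_five_violations(d)) <= max_violations


# Made-up descriptor values for illustration only.
small = Descriptors("cand-1", 380.0, 3.1, 2, 6)
bulky = Descriptors("cand-2", 720.0, 6.2, 6, 12)
medium = Descriptors("cand-3", 450.0, 2.0, 1, 5)
```

The optional `max_mol_weight` argument shows how a tighter, project-specific constraint (for example, a weight limit chosen for brain-penetrant compounds) could be layered on top of the general heuristic. The value passed to it is a placeholder to be set by the project, not a figure from the paper.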

5. Significance

From "Fragile Conversationalist" to "Reliable Co-Scientist": Mozi transforms LLM agents from fragile tools prone to hallucination into governed, auditable partners capable of executing complex, multi-day scientific workflows.
Regulatory Readiness: By enforcing strict SOPs, role-based access, and full trajectory auditability, Mozi addresses the critical need for compliance and safety in regulated pharmaceutical R&D.
Scalability & Robustness: The separation of concerns allows the system to scale to massive chemical spaces (hundreds of thousands of compounds) while maintaining stability through stateful graphs and HITL interventions.
Future of AI in Science: The paper establishes a blueprint for "governed autonomy," suggesting that the future of scientific AI lies not just in better reasoning models, but in architectures that constrain and structure those models within deterministic, verifiable workflows.

In conclusion, Mozi represents a significant step forward in AI for Science (AI4S), providing a practical, robust framework for deploying autonomous agents in the high-stakes, dependency-heavy environment of drug discovery.

Mozi: Governed Autonomy for Drug Discovery LLM Agents