Mimosa Framework: Toward Evolving Multi-Agent Systems for Scientific Research

Imagine you are trying to solve a massive, complex puzzle, like building a skyscraper or curing a disease. In the past, scientists (and the computers helping them) had to follow a rigid, pre-written instruction manual. If the manual said "Step 1: Mix chemicals," they did it. If Step 1 failed because the chemicals were different this time, the whole project crashed. The computer couldn't say, "Hey, maybe we should try a different order?" or "Let's ask a different expert for help."

Mimosa is a new, open-source framework that changes the game. Instead of following a rigid manual, Mimosa is like a self-improving, adaptive project manager that builds its own team and rewrites its own instructions as it goes.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Brittle" Robot

Current AI scientists are like robots with a single, unchangeable script.

The Issue: If you ask a standard AI to do a complex scientific task (like analyzing drug data), it tries to do it all in one long line of thinking. If it gets confused halfway through, it often hallucinates (makes things up) or forgets what it was doing.
The Analogy: Imagine asking a single person to build a house, design the plumbing, bake the cake for the party, and write the speech for the opening ceremony, all while remembering every single detail from the start. They would get overwhelmed, forget the blueprint, and likely fail.

2. The Solution: A Dynamic "Dream Team"

Mimosa doesn't use one robot; it builds a team of specialized agents (little AI workers) that change based on what the job needs.

The Meta-Orchestrator (The Architect): This is the boss. When you give it a task (e.g., "Find a new material for solar panels"), it doesn't just guess. It looks at its library of past projects to see if it has done something similar. If not, it invents a new team structure on the spot.
- Analogy: If you need to fix a car, the Architect doesn't just call a mechanic. It calls a mechanic, an electrician, and a painter, and tells them exactly how to talk to each other. If the car is actually a boat, it instantly swaps the mechanic for a marine engineer.

3. The Secret Sauce: "Trial, Error, and Evolution"

This is the most magical part. Mimosa doesn't just run the plan once and hope for the best. It evolves.

The Loop:
1. Build: The Architect creates a workflow (a plan).
2. Run: The team of agents tries to do the task using real scientific tools (like software for chemistry or biology).
3. Judge: A "Judge" AI watches the whole process. It doesn't just look at the final answer; it watches how they worked. Did they talk to each other well? Did they use the right tools? Did they reach the goal?
4. Refine: If the Judge says, "You failed because Agent A didn't pass the data to Agent B correctly," the Architect rewrites the plan. It might fire Agent A, hire a new one, or change how they communicate.
5. Repeat: It tries the new plan, and keeps going (up to 10 rounds) until the Judge's score is high enough or the rounds run out.
The Analogy: Think of it like a band practicing for a concert.
- Round 1: They play the song. The drummer is too loud, and the singer forgot the lyrics.
- The Critic: A producer (the Judge) says, "Drummer, play softer. Singer, look at the sheet music."
- Round 2: They play again. The guitar is out of tune.
- The Critic: "Guitarist, tune up. Also, let's swap the order of the chorus."
- Round 3: They play again. It's getting better.
- Round 4: They nail it.
- Mimosa runs this loop automatically, with no human in the producer's chair, until the "song" (the scientific experiment) scores well enough or the round limit is reached.

4. Why This Matters for Science

Science is messy. Experiments fail, data looks weird, and new tools are invented every day.

Old Way: If a new tool appears, the old AI system breaks because it wasn't programmed to use it.
Mimosa Way: Mimosa uses a protocol called MCP (Model Context Protocol). Think of this as a universal power strip. Any new scientific tool (a new microscope, a new database, a new coding library) can be plugged in. Mimosa's "Architect" sees the new plug and can work the tool into the workflow.

5. The Results

The researchers tested Mimosa on ScienceAgentBench, a tough test with 102 different scientific challenges (from biology to psychology).

The Winner: Using a specific AI model (DeepSeek-V3.2), Mimosa solved 43.1% of the tasks.
The Comparison:
- A single AI trying to do it alone? Only 38.2% success.
- A static team (no learning)? 32.4% success.
- Mimosa (Evolving Team): 43.1% success.
The Takeaway: With this model, letting the AI team "learn" from its mistakes and reorganize itself solved more tasks than either a single AI or a fixed team.

6. Open Source and Transparency

Unlike some "black box" AI systems where you don't know how they work, Mimosa is open-source.

The Analogy: It's like giving everyone the recipe and the kitchen tools, not just the finished cake.
Auditability: Every step Mimosa takes is recorded. If a scientist uses Mimosa to discover a new drug, they can look at the "logbook" and see exactly how the AI got there. This helps with a huge problem in science called the "reproducibility crisis," where other scientists can't repeat the work because the steps were too vague.

Summary

Mimosa is a framework that turns AI from a rigid robot into a flexible, self-correcting scientific partner. It builds its own teams, learns from its failures, adapts to new tools, and keeps a detailed record of everything it does. It's a step toward a future where AI doesn't just follow orders, but actively helps scientists discover new things by constantly improving its own methods.

1. Problem Statement

Current Autonomous Scientific Research (ASR) systems, despite leveraging Large Language Models (LLMs), face two critical limitations that hinder their adaptability in real-world scientific settings:

Architectural Rigidity: Most systems rely on fixed, static workflows and predefined toolsets. They cannot reconfigure agent coordination or tool usage when experimental conditions change, new tools are introduced, or intermediate results suggest alternative analytical paths.
Long-Horizon Execution Failures: Single-agent systems suffer from "semantic drift" (loss of focus, hallucinations, and attention dilution) as context windows grow over extended reasoning trajectories. They often fail to recover from early errors or adapt to non-linear discovery processes.

The core challenge is to create a system that can dynamically adjust its coordination and tool use as scientific tasks evolve, without losing context or becoming brittle.

2. Methodology: The Mimosa Framework

Mimosa is an open-source, evolving multi-agent framework designed to automatically synthesize and iteratively refine task-specific workflows. It is built on five distinct layers:

A. Core Architecture

Planning Layer (Optional): Decomposes high-level scientific goals into granular, repeatable tasks. (In the reported experiments, this layer was bypassed to focus on individual task evolution).
Tool Discovery Layer (Layer 1): Utilizes the Model Context Protocol (MCP) and a companion platform called Toolomics. This allows the system to dynamically scan and enumerate available computational tools (e.g., statistical libraries, simulation software) as discoverable services. Tools are containerized and isolated, ensuring reproducibility and security.
Meta-Orchestration Layer (Layer 2): The brain of the system. It synthesizes multi-agent workflows and iteratively refines them via single-incumbent local search.
- Initialization: For a new task, it queries a workflow archive for semantically similar past tasks (using embedding similarity). If a match is found, it retrieves and mutates that workflow; otherwise, it synthesizes a workflow de novo.
- Evolutionary Loop: The orchestrator proposes structural mutations (adding/removing agents, rewiring edges, refining prompts) based on feedback from a "Judge."
Agent Execution Layer (Layer 3): Executes the workflow using smolagents (Hugging Face's library for code-generating agents).
- Agents write and execute Python code directly to invoke tools, rather than relying on rigid JSON schemas. This allows for complex operations like looping over parameter grids, preprocessing data, and chaining library calls (e.g., RDKit, BioPython) dynamically.
- Agents operate within sandboxed containers to ensure isolation.
Judge Layer (Layer 4): An LLM-based judge evaluates the execution trace against four criteria:
- Goal Alignment
- Agent Collaboration Efficiency
- Output Quality
- Answer Plausibility
- Note: The judge provides a directional signal for optimization but does not determine the final benchmark success rate (which is calculated externally).
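
The archive lookup described under Initialization can be sketched as a nearest-neighbor search over task embeddings. This is a minimal illustration, not Mimosa's actual implementation: the function names, the toy 3-dimensional embeddings, and the 0.8 similarity cutoff are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_workflow(task_embedding, archive, min_similarity=0.8):
    """Return (workflow_id, similarity) for the closest archived task.

    Returns None when nothing in the archive is similar enough, which is
    the case where a workflow would be synthesized from scratch instead.
    The cutoff value is a hypothetical choice for this sketch.
    """
    best = None
    for workflow_id, embedding in archive.items():
        sim = cosine_similarity(task_embedding, embedding)
        if best is None or sim > best[1]:
            best = (workflow_id, sim)
    if best is not None and best[1] >= min_similarity:
        return best
    return None


# Toy archive: hypothetical workflow ids mapped to made-up embeddings.
archive = {
    "docking_pipeline": [0.9, 0.1, 0.0],
    "gis_raster_stats": [0.0, 0.2, 0.9],
}

match = retrieve_workflow([1.0, 0.0, 0.0], archive)  # close to docking_pipeline
miss = retrieve_workflow([0.0, 1.0, 0.0], archive)   # nothing similar enough
```

In this sketch a hit would be handed to the mutation step, while a miss (None) signals that a new workflow has to be built de novo.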

B. Workflow Evolution Mechanism

The system treats workflow design as a search problem. It starts with an initial workflow ($W_0$) and performs iterative refinement:

Mutate: The meta-orchestrator generates a neighbor workflow ($W'$) by applying a single structural edit (e.g., changing an agent's prompt or rewiring data flow).
Execute & Score: $W'$ is executed, and the Judge assigns a score.
Select: If $W'$ scores higher than the current incumbent ($W_n$), it becomes the new incumbent.
Repeat: This continues for a fixed number of iterations or until a score threshold (0.9) is reached.
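
The four steps above amount to single-incumbent local search, which can be sketched as follows. This is an illustration under stated assumptions, not the framework's code: the unweighted mean over the four Judge criteria, the toy workflow representation (a set of agent names), and the toy mutate and judge functions are all hypothetical. The 0.9 threshold and the budget of 10 iterations follow the figures mentioned in this summary.

```python
import random

CRITERIA = (
    "goal_alignment",
    "collaboration_efficiency",
    "output_quality",
    "answer_plausibility",
)


def aggregate(scores):
    """Collapse per-criterion scores in [0, 1] into one number.

    An unweighted mean is an assumption; the source does not say how
    the four criteria are combined.
    """
    return sum(scores[c] for c in CRITERIA) / len(CRITERIA)


def local_search(initial, mutate, execute_and_judge, max_iterations=10, threshold=0.9):
    """Keep one incumbent; accept a mutated neighbor only if it scores strictly higher."""
    incumbent = initial
    incumbent_score = aggregate(execute_and_judge(incumbent))
    history = [incumbent_score]
    for _ in range(max_iterations):
        if incumbent_score >= threshold:
            break
        candidate = mutate(incumbent)
        candidate_score = aggregate(execute_and_judge(candidate))
        if candidate_score > incumbent_score:
            incumbent, incumbent_score = candidate, candidate_score
        history.append(incumbent_score)
    return incumbent, incumbent_score, history


# --- Toy stand-ins (hypothetical) so the loop can run without any LLM ---
NEEDED = {"loader", "analyst", "plotter", "reviewer"}
POOL = sorted(NEEDED | {"translator"})


def toy_mutate(workflow, rng):
    """Single structural edit: add one agent or remove one agent."""
    agents = set(workflow)
    missing = [a for a in POOL if a not in agents]
    if missing and (not agents or rng.random() < 0.7):
        agents.add(rng.choice(missing))
    else:
        agents.remove(rng.choice(sorted(agents)))
    return frozenset(agents)


def toy_judge(workflow):
    """Score role coverage, with a small penalty for unneeded agents."""
    coverage = len(NEEDED & workflow) / len(NEEDED)
    penalty = 0.1 * len(workflow - NEEDED)
    score = max(0.0, coverage - penalty)
    return {c: score for c in CRITERIA}


rng = random.Random(0)
best, best_score, history = local_search(
    frozenset({"loader"}), lambda w: toy_mutate(w, rng), toy_judge
)
```

Because a neighbor is accepted only when it strictly improves on the incumbent, the recorded score history never decreases; the trade-off is that the search can stall at a local optimum.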

3. Key Contributions

Dynamic Workflow Synthesis: Unlike static pipelines, Mimosa generates and evolves DAG-structured workflows where agent roles, communication edges, and tool allocations are mutable.
Tool-Agnostic Integration via MCP: By leveraging MCP and containerized servers, Mimosa can integrate heterogeneous tools (from local scripts to remote HPC resources) without modifying its core logic.
Code-as-Action Execution: Agents generate executable Python code to interact with tools, enabling more sophisticated data manipulation and tool chaining than schema-constrained JSON approaches.
Auditability and Reproducibility: The framework logs every execution trace and archives workflows, preserving the full analytical history for inspection and replication.
Open-Source Foundation: Both Mimosa and Toolomics are released under the Apache 2.0 license to foster community-driven ASR.
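
To make "DAG-structured workflows" with mutable roles, edges, and tool allocations more concrete, here is a small data-structure sketch. The class names, fields, and example agents are hypothetical and are not taken from the Mimosa codebase; the ordering relies on Python's standard-library graphlib module (Python 3.9+).

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter


@dataclass
class Agent:
    name: str
    prompt: str
    tools: list = field(default_factory=list)


@dataclass
class Workflow:
    agents: dict = field(default_factory=dict)  # name -> Agent
    edges: set = field(default_factory=set)     # (upstream, downstream) pairs

    def add_agent(self, agent):
        self.agents[agent.name] = agent

    def connect(self, upstream, downstream):
        if upstream not in self.agents or downstream not in self.agents:
            raise KeyError("both endpoints must be registered agents")
        self.edges.add((upstream, downstream))

    def execution_order(self):
        """Return agent names in dependency order.

        Raises graphlib.CycleError if the edges do not form a DAG.
        """
        sorter = TopologicalSorter({name: set() for name in self.agents})
        for upstream, downstream in self.edges:
            sorter.add(downstream, upstream)
        return list(sorter.static_order())


# Hypothetical three-agent workflow.
wf = Workflow()
wf.add_agent(Agent("loader", "Load and clean the dataset.", tools=["pandas"]))
wf.add_agent(Agent("analyst", "Fit the model and compute statistics."))
wf.add_agent(Agent("reporter", "Summarize the findings."))
wf.connect("loader", "analyst")
wf.connect("analyst", "reporter")
wf.connect("loader", "reporter")

order = wf.execution_order()
```

Under this representation, a structural mutation is just an edit to `agents` or `edges` (or to an agent's `prompt` or `tools`), and checking that the result is still acyclic is a single call to `execution_order()`.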

4. Experimental Results

The framework was evaluated on ScienceAgentBench, a benchmark comprising 102 data-driven discovery tasks across four disciplines (bioinformatics, chemistry, geography, psychology).

Performance Metrics:
- DeepSeek-V3.2 (Iterative-Learning): Achieved a 43.1% Success Rate (SR), surpassing both single-agent baselines (38.2%) and static multi-agent configurations (32.4%). It also achieved the highest CodeBERTScore (0.921).
- GPT-4o: Improved from 3.8% (single-agent) to 21.6% (iterative-learning), a more than fivefold gain.
- Claude Haiku 4.5: Improved from single-agent (7.8%) to one-shot multi-agent (31.3%) but saw a slight degradation with iterative learning (30.3%), indicating that evolutionary benefits are model-dependent.
Evolutionary Trends:
- Performance gains were consistent across iterations 1–8, with diminishing returns and a slight decline at iteration 10, suggesting a performance ceiling for the current single-incumbent search strategy.
- Cost Efficiency: DeepSeek-V3.2 achieved the best balance, reaching 43.1% SR at ~$1.7 per task, significantly cheaper than using frontier reasoning models like OpenAI's o1-preview (which reportedly costs >10x more for similar performance).
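
As a quick sanity check on the reported percentages, the short script below converts the DeepSeek-V3.2 success rates into approximate task counts out of 102 and computes the relative GPT-4o gain. The task counts are implied by rounding and are an assumption: the summary reports percentages only, and rates averaged over several runs would not map to whole tasks.

```python
TOTAL_TASKS = 102


def implied_task_count(success_rate_percent, total=TOTAL_TASKS):
    """Nearest whole number of solved tasks for a given success rate."""
    return round(success_rate_percent / 100 * total)


def relative_gain(before_percent, after_percent):
    """How many times larger the second rate is than the first."""
    return after_percent / before_percent


# Success rates as reported above for DeepSeek-V3.2.
deepseek = {"single_agent": 38.2, "static_multi_agent": 32.4, "iterative": 43.1}
counts = {name: implied_task_count(rate) for name, rate in deepseek.items()}

# GPT-4o: single-agent 3.8% versus iterative-learning 21.6%.
gpt4o_gain = relative_gain(3.8, 21.6)

print(counts)
print(f"GPT-4o relative gain: {gpt4o_gain:.1f}x")
```

Read this way, the gap between the iterative configuration and the single-agent baseline is on the order of five tasks out of 102, which is worth keeping in mind when interpreting differences of a few percentage points.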

5. Significance and Implications

Paradigm Shift: Mimosa moves ASR from "brittle, expert-designed pipelines" to "adaptive multi-agent frameworks" that learn from experience.
Model-Architecture Interaction: The results reveal that the benefits of multi-agent decomposition and iterative learning are not universal; they depend heavily on the underlying model's instruction-following robustness. Optimal ASR systems must be tailored to specific model capabilities.
Resource Efficiency: The framework demonstrates that high-performance scientific automation can be achieved using cost-effective, non-reasoning models (like DeepSeek-V3.2) when combined with effective workflow evolution, challenging the notion that only the most expensive models can solve complex scientific tasks.
Reproducibility: By enforcing containerized execution and full trace logging, Mimosa addresses the reproducibility crisis in scientific computing, making automated research steps auditable and verifiable.

Conclusion

Mimosa represents a significant step toward truly autonomous scientific research by combining dynamic tool discovery, code-generating agents, and iterative workflow evolution. It successfully demonstrates that evolving multi-agent topologies can outperform static configurations, provided the system is paired with a capable execution model and a robust evaluation loop. Future work will focus on open-ended exploration strategies, cross-task generalization, and validating the framework's ability to replicate published scientific studies.