MOSAIC: A Unified Platform for Cross-Paradigm Comparison and Evaluation of Homogeneous and Heterogeneous Multi-Agent RL, LLM, VLM, and Human Decision-Makers

Imagine you are the director of a massive, chaotic movie set. On one side, you have Robots (Reinforcement Learning agents) that are incredibly fast, calculate millions of moves per second, and speak only in numbers and code. On the other side, you have Genius Writers (Large Language Models) who are brilliant at strategy and storytelling but speak only in sentences and paragraphs. Then, you have Human Actors who react emotionally and unpredictably, and Visionary Artists (Vision-Language Models) who see the world as a mix of pictures and words.

The problem? Until now, these groups couldn't work together in the same scene. The robots didn't understand the writers' sentences, the writers couldn't process the robots' numbers, and the humans had no way to interact with either of them on the same stage. They were like different species living on isolated islands, never able to compare who was actually the best at the game.

Enter MOSAIC.

Think of MOSAIC as the Universal Translator and Stage Manager for this movie set. It's a new open-source platform that finally lets these different "species" of decision-makers play the same game, side-by-side, under the exact same rules.

Here is how it works, broken down into simple concepts:

1. The "Glass Wall" Protocol (The Workers)

Imagine each agent (the robot, the writer, the human) is in their own soundproof booth. They have their own favorite tools and languages.

The Problem: If you try to force a writer to speak code, they break. If you force a robot to write a poem, it crashes.
The MOSAIC Solution: MOSAIC builds a "glass wall" (an IPC protocol) around each booth. The agents stay in their own booths, using their own tools exactly as they were designed. MOSAIC just acts as the messenger, translating the game state into the language the agent understands and translating their answer back into a format the game understands.
The Analogy: It's like a diplomatic summit where every country keeps its own language and customs, but a team of expert interpreters ensures everyone understands the agenda perfectly without anyone having to change their culture.

2. The "Universal Remote" (The Operator)

In the past, if you wanted to test a robot against a human, you had to build a custom controller for every single combination.

The MOSAIC Solution: MOSAIC introduces a "Universal Remote" called an Operator. Whether you are controlling a super-fast robot, a slow-thinking AI writer, or a human pressing a keyboard, MOSAIC treats them all as just "Agent #1," "Agent #2," etc.
The Analogy: It's like a video game console that has a single controller port. You can plug in a standard controller, a specialized racing wheel, or even a dance pad. The console doesn't care what you plug in; it just knows how to read the signals coming from the port.

3. The "Fair Play" Arena (Cross-Paradigm Evaluation)

This is the most important part. MOSAIC allows researchers to run a "fair fight."

The Scenario: Imagine a soccer game.
- Team A: Two super-fast robots trained to play soccer.
- Team B: Two AI writers who have never played soccer but are reading the rules and trying to figure it out.
- Team C: Two humans.
The MOSAIC Magic: MOSAIC runs all these teams on the exact same field, with the exact same weather, and the exact same starting ball position (using "shared seeds").
The Result: You can finally see the truth. Does the robot win because it's faster? Or does the AI writer win because it understands the strategy better? Or does the human win because they are creative? Before MOSAIC, you couldn't compare them fairly because they were playing on different fields.

4. The "Director's View" (Visual Interface)

MOSAIC isn't just code; it has a visual dashboard.

The Feature: You can watch the game in real-time. You see the robot's view (a grid of numbers), the AI writer's view (text descriptions), and the human's view (the actual game graphics) all on one screen.
The Analogy: It's like a sports broadcast where you can switch between the camera angles of the players, the coach, and the referee simultaneously to see exactly what each of them is thinking and doing at the same moment.

Why Does This Matter?

For years, scientists have been studying robots, AI writers, and humans separately. They've been asking, "Who is the best?" but they were comparing apples, oranges, and elephants.

MOSAIC puts them all in the same fruit bowl. It allows us to:

Test Teamwork: Can a robot and a human be a great team? Can an AI writer help a robot solve a puzzle it's too rigid to figure out?
Find Weaknesses: Maybe AI writers are great at strategy but terrible at reacting quickly. Maybe robots are fast but can't adapt to new rules.
Build Better AI: By seeing how humans and different types of AI interact, we can build future systems that combine the speed of robots with the creativity of humans and the reasoning of AI writers.

In short, MOSAIC is the great equalizer. It replaces mismatched comparisons with a shared stage, so we can see how different types of intelligence perform, and work together, under the same rules.

1. Problem Statement

Despite the rapid maturation of Reinforcement Learning (RL), Large Language Models (LLMs), and Vision-Language Models (VLMs), these paradigms have largely evolved in isolation. Existing infrastructure faces three critical limitations:

Fragmented Agent Interfaces: RL agents expect tensor observations and integer actions; LLM/VLM agents expect text prompts and generate text responses; human operators require interactive GUIs. No single platform unifies these distinct input/output modalities.
Lack of Fair Comparison: Current benchmarks (e.g., AgentBench, BALROG) and RL frameworks (e.g., RLlib, CleanRL) do not allow agents from different paradigms to operate within the same environment instance under identical conditions (e.g., shared random seeds). This makes it impossible to isolate the effect of the decision-making paradigm from environmental variance.
Inadequate Ad-Hoc Teamwork (AHT) Support: While AHT research studies cooperation with unknown teammates, prior work assumes all agents share the same observation and action representations. Real-world scenarios increasingly involve heterogeneous teams (e.g., an RL bot cooperating with an LLM or a human), which current tools cannot simulate.

2. Methodology: The MOSAIC Architecture

MOSAIC is an open-source platform designed to bridge these gaps using a three-tier architecture that separates orchestration, communication, and execution.

A. Three-Tier Architecture

Orchestration Layer (Qt6 GUI): Acts as the authoritative control plane. It spawns and supervises worker processes, manages the event loop, and provides a real-time visual interface. It handles command routing (reset, step, train) and aggregates telemetry but contains no algorithmic logic.
Communication Layer (IPC Protocol): Uses a lightweight, versioned JSON protocol over stdin/stdout (or gRPC for batch telemetry) to facilitate Inter-Process Communication (IPC). This ensures that agents run as isolated subprocesses, preventing memory leaks or library conflicts from crashing the entire system.
Execution Layer (Worker Subprocesses): Each agent (RL, LLM, VLM, Human) runs in its own isolated process. This allows the integration of diverse third-party frameworks (e.g., CleanRL, XuanCe, RLlib, BALROG) without modifying their source code.
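To make the communication layer more concrete, here is a minimal sketch of what a worker-side loop for a line-delimited JSON protocol could look like. Only the `cmd` and `seed` fields come from the summary; the version tag `v`, the response fields (`type`, `obs`, `reward`, `terminated`), and the toy counter environment are assumptions made for illustration, not the actual MOSAIC schema.

```python
import io
import json
import random

PROTOCOL_VERSION = 1  # hypothetical version tag, not the real MOSAIC schema


def handle_command(msg, state):
    """Return one response dict for one command against a toy counter environment."""
    if not isinstance(msg, dict):
        return {"v": PROTOCOL_VERSION, "type": "error", "message": "command must be a JSON object"}
    cmd = msg.get("cmd")
    if cmd == "reset":
        # The seed fully determines the start position, so resets are reproducible.
        state["pos"] = random.Random(msg.get("seed")).randint(0, 2)
        return {"v": PROTOCOL_VERSION, "type": "reset_ok", "obs": state["pos"]}
    if cmd == "step" and "pos" in state:
        state["pos"] += int(msg.get("action", 0))
        done = state["pos"] >= 3
        return {"v": PROTOCOL_VERSION, "type": "step_ok", "obs": state["pos"],
                "reward": 1.0 if done else 0.0, "terminated": done}
    return {"v": PROTOCOL_VERSION, "type": "error", "message": f"cannot handle cmd: {cmd!r}"}


def serve(inp, out):
    """Read one JSON command per line from `inp`; write one JSON response per line to `out`."""
    state = {}
    for line in inp:
        line = line.strip()
        if not line:
            continue
        try:
            msg = json.loads(line)
        except json.JSONDecodeError as exc:
            resp = {"v": PROTOCOL_VERSION, "type": "error", "message": str(exc)}
        else:
            resp = handle_command(msg, state)
        out.write(json.dumps(resp) + "\n")
        out.flush()  # flush per message so the parent process is never left waiting


# In-memory demonstration; a real worker would call serve(sys.stdin, sys.stdout).
demo_in = io.StringIO(
    '{"cmd": "reset", "seed": 42}\n'
    '{"cmd": "step", "action": 1}\n'
    '{"cmd": "reset", "seed": 42}\n'
    '{"cmd": "bogus"}\n'
)
demo_out = io.StringIO()
serve(demo_in, demo_out)
responses = [json.loads(line) for line in demo_out.getvalue().splitlines()]
```

Because every message is a single line of JSON, the parent can treat each worker as an opaque process: a malformed command produces an error response rather than taking the worker down, and a crashed worker cannot affect its siblings.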

B. Key Technical Components

Worker Protocol: Workers communicate via structured commands (e.g., {"cmd": "reset", "seed": 42}). They return typed responses including environment metadata, actions, rewards, and termination states. A "Telemetry Proxy" sidecar process parses JSONL logs, validates them against schemas, and forwards them to the daemon via gRPC for reproducible logging.
Operator Abstraction: This is the core unifying interface. An Operator maps a worker to an agent slot in an environment.
- Unified Interface: Regardless of whether the backend is an RL policy, an LLM, or a human, the OperatorController exposes a standard method: select_action(observation) or select_actions(observations).
- Paradigm Handling:
  - RL: Wraps frameworks like CleanRL or XuanCe.
  - LLM/VLM: Routes to specific workers (e.g., BALROG for single-agent, native MOSAIC LLM for multi-agent). It handles the serialization of tensor observations to text (for LLMs) or text+images (for VLMs) and parses text outputs back to discrete actions.
  - Human: Connects keyboard/mouse input via a dedicated Human Worker.
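The Operator idea can be illustrated with a small sketch. The method names `select_action` and `select_actions` follow the summary; everything else (the `Operator`, `RLOperator`, and `LLMOperator` classes, the `ACTIONS` list, the prompt wording, and the fallback rule) is hypothetical and only meant to show how very different backends can sit behind one call signature.

```python
from abc import ABC, abstractmethod
from typing import Callable, List, Sequence

ACTIONS = ["left", "right", "forward"]  # hypothetical discrete action set


class Operator(ABC):
    """Common interface: every backend turns an observation into an action index."""

    @abstractmethod
    def select_action(self, observation: Sequence[int]) -> int:
        ...


class RLOperator(Operator):
    """Wraps any policy callable that already maps observations to action indices."""

    def __init__(self, policy: Callable[[Sequence[int]], int]):
        self.policy = policy

    def select_action(self, observation: Sequence[int]) -> int:
        return self.policy(observation)


class LLMOperator(Operator):
    """Serializes the observation to text, then parses the text reply into an action."""

    def __init__(self, complete: Callable[[str], str], fallback: int = 0):
        self.complete = complete  # stand-in for a call to a language model
        self.fallback = fallback  # used when the reply names no known action

    def select_action(self, observation: Sequence[int]) -> int:
        prompt = f"Observation: {list(observation)}. Choose one of: {', '.join(ACTIONS)}."
        reply = self.complete(prompt).lower()
        for index, name in enumerate(ACTIONS):
            if name in reply:
                return index
        return self.fallback


def select_actions(operators: Sequence[Operator],
                   observations: Sequence[Sequence[int]]) -> List[int]:
    """Batch variant: one action per (operator, observation) pair."""
    return [op.select_action(obs) for op, obs in zip(operators, observations)]


team = [RLOperator(lambda obs: 2), LLMOperator(lambda prompt: "I will turn Right.")]
actions = select_actions(team, [[0, 1], [0, 1]])
```

The caller never branches on the backend type; a human-driven operator would implement the same method by blocking on keyboard input.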
Evaluation Modes:
- Manual Mode: Advances $N$ operators in lock-step under shared seeds. The GUI renders side-by-side viewports with color-coded badges (e.g., Purple for RL, Blue for LLM) for fine-grained behavioral inspection.
- Script Mode: Drives automated, long-running evaluations via declarative Python scripts, producing JSONL telemetry for systematic analysis.
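As a rough illustration of lock-step evaluation under shared seeds, the sketch below gives each operator its own copy of a toy environment, resets every copy with the same seed, and advances them one step at a time while collecting JSONL records. `ToyEnv`, `run_lockstep`, and the record fields are invented for this example and are not part of MOSAIC.

```python
import json
import random


class ToyEnv:
    """Stand-in environment: a 1-D walk toward a goal whose distance depends on the seed."""

    def reset(self, seed):
        self.goal = random.Random(seed).randint(3, 8)
        self.pos = 0
        return self.goal - self.pos  # observation: remaining distance

    def step(self, action):  # action: 0 = stay, 1 = advance
        self.pos += action
        done = self.pos >= self.goal
        return self.goal - self.pos, (1.0 if done else 0.0), done


def run_lockstep(policies, seed, max_steps=20):
    """Step one environment copy per operator in lock-step; return JSONL records."""
    envs = {name: ToyEnv() for name in policies}
    obs = {name: env.reset(seed) for name, env in envs.items()}  # same seed for all
    done = {name: False for name in policies}
    records = []
    for t in range(max_steps):
        for name, policy in policies.items():
            if done[name]:
                continue
            action = policy(obs[name])
            obs[name], reward, done[name] = envs[name].step(action)
            records.append(json.dumps({"t": t, "operator": name, "action": action,
                                       "reward": reward, "done": done[name]}))
        if all(done.values()):
            break
    return records


# Two operators with the same behavior must finish on the same step,
# because the shared seed gives them identical starting conditions.
policies = {"rl": lambda o: 1, "llm": lambda o: 1}
log = [json.loads(line) for line in run_lockstep(policies, seed=42)]
finish = {rec["operator"]: rec["t"] for rec in log if rec["done"]}
```

With the environment held fixed in this way, any difference in the finishing step between two operators can be attributed to their policies rather than to the initial state.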

3. Key Contributions

Unified Cross-Paradigm Infrastructure: MOSAIC is presented as the first system to support RL, LLM, VLM, and Human agents simultaneously in the same environment. It enables "Ad-Hoc Teamwork" where teammates operate through fundamentally different decision-making mechanisms ($\pi_{RL}, \lambda_{LLM}, \psi_{VLM}, h_{Human}$).
Zero-Modification Integration: The platform wraps existing frameworks as isolated subprocess workers. Integrating a new framework (e.g., CleanRL) requires only about 50–120 lines of "glue code" with zero modifications to the original library source.
Deterministic Evaluation Framework: By enforcing shared random seeds and a unified execution loop, MOSAIC allows for fair, reproducible comparisons between paradigms. It introduces a specific experimental design where agents are trained solo ($N=1$) and frozen, then deployed in heterogeneous teams to isolate the "paradigm variable" from "partner distribution" variables.
Open-Source Ecosystem: The platform supports 26 environment families (including MiniGrid, MultiGrid, and Chess), provides a comprehensive GUI, and includes extensive documentation and testing suites (28+ test files).
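The zero-modification claim rests on running each framework as a child process and talking to it only through pipes. The following sketch shows the supervising side of such an arrangement using Python's standard `subprocess` module. `WorkerHandle` and the echoing child script are hypothetical stand-ins; a real integration would launch the framework's own entry point behind a thin adapter.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for a wrapped framework: it echoes each command back.
CHILD_SOURCE = r'''
import json, sys
for line in sys.stdin:
    msg = json.loads(line)
    print(json.dumps({"echo": msg["cmd"], "seed": msg.get("seed")}), flush=True)
'''


class WorkerHandle:
    """Owns one worker subprocess and exchanges line-delimited JSON with it."""

    def __init__(self, argv):
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered text pipes
        )

    def request(self, message):
        """Send one command and block until the worker answers with one line."""
        self.proc.stdin.write(json.dumps(message) + "\n")
        self.proc.stdin.flush()
        return json.loads(self.proc.stdout.readline())

    def close(self):
        """Closing stdin signals end-of-input; the child loop then exits on its own."""
        self.proc.stdin.close()
        self.proc.wait(timeout=5)
        self.proc.stdout.close()


worker = WorkerHandle([sys.executable, "-c", CHILD_SOURCE])
reply = worker.request({"cmd": "reset", "seed": 42})
worker.close()
```

The supervising process imports nothing from the wrapped framework, so dependency conflicts between frameworks stay confined to their own processes.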

4. Results and Experimental Design

While the paper primarily presents the platform architecture and design, it outlines a rigorous experimental framework for future evaluation (detailed in the Appendices):

Adversarial Matchups: Configurations to compare homogeneous teams (RL vs. RL) against cross-paradigm teams (RL vs. LLM, RL vs. VLM, LLM vs. VLM) to determine performance ceilings.
Cooperative Heterogeneity: Experiments to test if LLMs/VLMs can effectively cooperate with frozen, solo-trained RL agents (Zero-Shot Coordination across paradigms).
Baseline Comparisons: The design explicitly distinguishes itself from standard Zero-Shot Coordination (ZSC) by allowing heterogeneous observation spaces ($O_{RL} \neq O_{LLM}$) and heterogeneous action parsing, whereas ZSC assumes identical interfaces.
Scope Limitation: The paper notes that LLM/VLM agents are currently scoped to discrete grid-world environments (e.g., MiniGrid) rather than continuous control (e.g., MuJoCo), citing literature that suggests LLMs struggle with the low-level spatial reasoning and reaction times required for continuous domains.
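The heterogeneous observation spaces mentioned above can be pictured as two different encodings of one underlying state. In this sketch a small symbolic grid is rendered once as a flat list of integers (the kind of input an RL policy consumes) and once as a sentence (the kind of input an LLM consumes). The grid symbols, `CELL_IDS`, and both encoder functions are assumptions chosen for illustration and do not reflect MiniGrid's or MOSAIC's actual encodings.

```python
from typing import List, Tuple

Grid = List[List[str]]  # hypothetical symbols: "." empty, "#" wall, "A" agent, "G" goal
CELL_IDS = {".": 0, "#": 1, "A": 2, "G": 3}


def encode_for_rl(grid: Grid) -> List[int]:
    """O_RL: flatten the grid row by row into integer cell codes."""
    return [CELL_IDS[cell] for row in grid for cell in row]


def find(grid: Grid, symbol: str) -> Tuple[int, int]:
    """Return (row, column) of the first cell holding `symbol`."""
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell == symbol:
                return r, c
    raise ValueError(f"symbol {symbol!r} not found in grid")


def encode_for_llm(grid: Grid) -> str:
    """O_LLM: describe the same state in plain language."""
    ar, ac = find(grid, "A")
    gr, gc = find(grid, "G")
    walls = sum(cell == "#" for row in grid for cell in row)
    return (f"You are at row {ar}, column {ac}. "
            f"The goal is at row {gr}, column {gc}. "
            f"The {len(grid)}x{len(grid[0])} grid contains {walls} wall cell(s).")


grid = [list("A.#"), list("..G")]
o_rl = encode_for_rl(grid)
o_llm = encode_for_llm(grid)
```

Both encodings carry the same state, but neither agent could consume the other's input directly, which is the situation that standard ZSC setups do not cover.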

5. Significance

The summary frames MOSAIC's significance for multi-agent research along three lines:

Bridging Communities: It creates a common ground for the RL, NLP (LLM/VLM), and Human-Computer Interaction communities to collaborate and compare agents fairly.
Scientific Rigor: By solving the "apples-to-oranges" comparison problem, it enables the first systematic study of how different decision-making paradigms (symbolic reasoning vs. neural policy vs. human intuition) interact in shared environments.
Future of AI Teams: As real-world AI systems will inevitably be heterogeneous (mixing specialized bots with generalist LLMs and humans), MOSAIC provides the necessary infrastructure to study, debug, and optimize these complex, mixed-paradigm teams before they are deployed in the real world.

Availability:

Source Code: https://github.com/Abdulhamid97Mousa/MOSAIC
Documentation: https://mosaic-platform.readthedocs.io
License: MIT
