Specification-Driven Generation and Evaluation of Discrete-Event World Models via the DEVS Formalism

Imagine you are the captain of a spaceship, and you need to plan a complex journey through an asteroid field. Before you actually fly, you need a World Model: a simulation that lets you test your route, see what happens if you hit an asteroid, and figure out the best path without crashing your real ship.

Currently, scientists have two main ways to build these simulators, and both have big problems:

The "Hand-Crafted" Simulator: This is like a detailed, physical model made by expert engineers. It's incredibly accurate and reliable. But if you want to change the asteroid field or add a new type of ship, you have to call in a whole team of engineers to rebuild the model from scratch. It's too slow and expensive to change on the fly.
The "Black Box" AI: This is like a magic crystal ball. You ask it, "What happens if I turn left?" and it guesses. It's very flexible and fast to ask questions. But it's unreliable. If you ask it to predict a long journey, it might start hallucinating (making things up), forget the rules of physics, or give you a different answer if you ask the same question twice. It's hard to trust because you can't see why it made its prediction.

The Problem: We need a simulator that is as reliable as the hand-crafted one but as flexible as the AI one. We need something that can be built quickly from a simple description, but still follows strict rules so we know it's telling the truth.

The Solution: The "LEGO" Approach (DEVS)

The authors of this paper propose a middle ground. They use a formal system called DEVS (Discrete Event System Specification).

Think of DEVS like LEGO bricks.

Instead of building a giant, solid statue (the hand-crafted way) or guessing the shape (the AI way), you build your world out of small, standardized LEGO pieces.
Each piece (a robot, a traffic light, a bank account) has a specific job and a specific way it connects to others.
When you want to change the world, you don't rebuild the whole thing. You just swap out a few bricks or rearrange the connections.

How They Did It: The "Architect and the Builders"

The paper introduces a new pipeline where an AI (a Large Language Model) acts as a construction crew, but with a very strict manager to keep things organized.

The Architect (Structural Synthesis): First, the AI reads your natural language description (e.g., "I need a warehouse with 5 robots and 2 chargers"). Instead of trying to write the whole code at once, the AI acts as an Architect. It draws a blueprint: "Here is the list of bricks we need, and here is how they connect." It creates a strict contract for every single piece.
The Builders (Behavioral Synthesis): Once the blueprint is ready, the AI acts as a team of specialized builders. Because the blueprint is so clear, each builder only has to focus on one tiny brick (e.g., "Just make the robot move when it gets a signal"). They don't have to worry about the whole building; they just follow the contract.
The Inspector (Trace-Based Evaluation): How do we know the simulation is right? We don't just check the code. We watch the event trace. Imagine the simulation is a movie. The "Inspector" watches the movie and checks if the events happen in the right order.
- Did the robot arrive before it started charging?
- Did the battery drain before the charger arrived?
  If the movie breaks the rules, the Inspector points exactly to which "brick" failed, so we can fix just that piece.

Why This Matters

This approach matters for three reasons:

It's Fast and Cheap: Because the AI breaks the big problem into tiny, parallel tasks, it can build complex simulations with about six times fewer AI tokens than trial-and-error coding agents, and often in less time.
It's Trustworthy: Because the simulation is built from "LEGO bricks" with strict rules, every state change follows an explicit rule instead of a guess. If the robot says it charged, we can verify that claim by watching the event log.
It's Adaptable: If you need to change the simulation while it's running (e.g., "Add a third robot!"), you can just swap in a new brick without crashing the whole system.

The Real-World Analogy: The Restaurant Kitchen

Imagine a busy restaurant kitchen.

Old Way (Hand-Crafted): The head chef writes a 500-page manual for every dish. If you want to add a new ingredient, you have to rewrite the whole manual.
Bad AI Way: You ask a random person, "What happens if we cook this steak?" They guess, "Maybe it burns?" or "Maybe it turns into a cake?" You can't trust the answer.
This Paper's Way: You have a kitchen with specialized stations (Grill, Salad, Dessert). Each station has a strict recipe card (the DEVS model).
- The Architect writes the recipe cards based on your order.
- The Chefs (AI builders) follow their specific cards.
- The Manager (Inspector) watches the tickets coming out of the kitchen. If a steak comes out raw, the Manager knows exactly which station failed and why, without having to fire the whole kitchen.

In short: This paper teaches us how to use AI to build reliable, rule-following simulations that can be changed on the fly, turning the chaotic "black box" of AI into a structured, trustworthy, and adaptable tool for planning our future.

1. Problem Statement

The paper addresses the limitations of current World Models used in agentic systems for planning and evaluation. Existing approaches fall into two extremes:

Hand-engineered Simulators: Highly consistent and reproducible but costly to adapt and difficult to modify during online execution.
Implicit Neural Models (e.g., LLMs): Flexible and adaptable but suffer from unreliability over long horizons, lack of verifiability, difficulty in debugging, and unbounded drift due to latent assumptions about timing and causality.

What is missing is a "middle ground" that combines the reliability and verifiability of explicit simulators with the flexibility to be synthesized and adapted on demand from natural-language specifications. This gap matters most for environments governed by discrete events (e.g., queueing, service operations, network protocols, multi-agent coordination).

2. Methodology

The authors propose a framework that synthesizes executable Discrete-Event System Specification (DEVS) models directly from natural-language specifications. The approach consists of three core components:

A. Formal Representation: DEVS

The world model is formalized using the Parallel DEVS formalism.

Structure: Systems are decomposed into Atomic Models (encapsulating local state, transition logic, and timing) and Coupled Models (defining how components interact and route events).
Semantics: Dynamics are governed by explicit state transitions triggered by discrete events and time advances, ensuring structured evolution and preventing the drift common in implicit models.
Interface: The simulator acts as a black box with a standardized input (configuration arguments $I$ and an optional exogenous input stream $J$) and a standardized output (a structured event trace in JSONL format).
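
To make the atomic-model idea concrete, here is a minimal sketch in plain Python. It does not use the xdevs API. The Charger component, its charge_time parameter, the trace field names (t, event, robot), and the simulate helper are all assumptions made for illustration, and the loop drives a single atomic model with a simple "internal event first" tie-break rather than full Parallel DEVS semantics.

```python
import json
import math
from dataclasses import dataclass
from typing import Optional

INFINITY = math.inf


@dataclass
class Charger:
    """Hypothetical atomic model: charges one robot at a time."""
    charge_time: float = 5.0        # configuration argument (part of I)
    phase: str = "idle"             # local state
    sigma: float = INFINITY         # time until the next internal event
    robot: Optional[str] = None

    def time_advance(self) -> float:
        return self.sigma

    def external_transition(self, elapsed: float, robot_id: str) -> None:
        """React to an input event (a robot arriving)."""
        if self.phase == "idle":
            self.phase, self.robot, self.sigma = "charging", robot_id, self.charge_time
        else:
            # Busy: ignore the arrival but keep the remaining charge time.
            self.sigma -= elapsed

    def output(self) -> dict:
        """Called just before the internal transition fires."""
        return {"event": "charged", "robot": self.robot}

    def internal_transition(self) -> None:
        self.phase, self.robot, self.sigma = "idle", None, INFINITY


def simulate(model, arrivals):
    """Run one atomic model against an exogenous input stream J.

    `arrivals` is a time-sorted list of (time, robot_id) pairs.
    Returns the event trace as a list of JSON lines (JSONL).
    """
    trace, last, pending = [], 0.0, list(arrivals)
    while True:
        t_internal = last + model.time_advance()
        t_external = pending[0][0] if pending else INFINITY
        now = min(t_internal, t_external)
        if now == INFINITY:
            break                   # nothing left to schedule
        if t_internal <= t_external:
            # Simplification: on a tie, the internal event goes first.
            trace.append(json.dumps({"t": now, **model.output()}))
            model.internal_transition()
        else:
            _, robot_id = pending.pop(0)
            trace.append(json.dumps({"t": now, "event": "arrive", "robot": robot_id}))
            model.external_transition(now - last, robot_id)
        last = now
    return trace
```

Calling simulate(Charger(), [(1.0, "r1"), (3.0, "r2"), (10.0, "r3")]) yields one JSON line per event. The point of the sketch is that time only moves when an event is scheduled, so the model's timing assumptions are explicit in sigma rather than hidden in a learned state.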

B. Staged Generation Pipeline (DEVS-Gen)

To handle the complexity of synthesizing a monolithic simulator, the authors introduce a staged, modular generation pipeline driven by Large Language Models (LLMs):

Structural Synthesis:
- An LLM analyzes the natural language specification to infer the system hierarchy, component types (atomic vs. coupled), and interaction graphs.
- It produces a PlanTree, a structured artifact defining component interfaces, port schemas, and coupling rules.
- This stage acts as a "contract" to constrain subsequent generation.
Behavioral Synthesis:
- Parallel Atomic Generation: Atomic components are synthesized independently based on the PlanTree contracts. This allows for parallelization and reduces the context burden on the LLM.
- Adaptive Coupling: A "Summarizer" agent analyzes the generated code of sibling components to extract "ground-truth" interfaces. The parent coupled model is then generated based on these actual interfaces rather than the initial plan, preventing integration failures due to minor semantic drifts.
Output: The pipeline produces an executable Python simulator (using the xdevs toolkit) that emits structured event traces.
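
The summary does not give the PlanTree schema, so the following is only a sketch of what a structural contract and an interface check could look like. AtomicSpec, CoupledSpec, check_couplings, and the port names are hypothetical. The paper's adaptive coupling step generates the parent model from the extracted interfaces; this example covers only the narrower task of detecting where planned couplings and generated ports disagree.

```python
from dataclasses import dataclass, field


@dataclass
class AtomicSpec:
    """Contract for one atomic component (hypothetical schema)."""
    name: str
    in_ports: set[str] = field(default_factory=set)
    out_ports: set[str] = field(default_factory=set)


@dataclass
class CoupledSpec:
    """A parent node: children plus (src, out_port, dst, in_port) couplings."""
    name: str
    children: dict[str, AtomicSpec] = field(default_factory=dict)
    couplings: list[tuple[str, str, str, str]] = field(default_factory=list)


def check_couplings(plan: CoupledSpec, actual: dict[str, AtomicSpec]) -> list[str]:
    """Return one message per planned coupling endpoint that the
    generated ("ground-truth") interfaces cannot satisfy."""
    problems = []
    for src, out_port, dst, in_port in plan.couplings:
        if src not in actual or out_port not in actual[src].out_ports:
            problems.append(f"{src}.{out_port}: no such output port")
        if dst not in actual or in_port not in actual[dst].in_ports:
            problems.append(f"{dst}.{in_port}: no such input port")
    return problems


# Planned: robot.arrived -> charger.robot_in
plan = CoupledSpec("warehouse", couplings=[("robot", "arrived", "charger", "robot_in")])

# Generated code drifted slightly: the robot's port is named "arrive".
generated = {
    "robot": AtomicSpec("robot", out_ports={"arrive"}),
    "charger": AtomicSpec("charger", in_ports={"robot_in"}),
}
mismatches = check_couplings(plan, generated)
```

A small naming drift like this is the kind of integration failure the Summarizer step is meant to absorb, by letting the parent be written against the ports that actually exist.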

C. Trace-Based, Specification-Driven Evaluation

Since there is no unique "ground truth" code for a natural language spec, the authors propose evaluating models based on behavioral conformance:

Operational Success: Checks if the simulator compiles, runs without crashing, and adheres to the I/O contract.
Behavioral Conformance: Validates the emitted event traces against a set of rules derived from the specification.
- Component-level: Checks internal state transitions and timing logic (e.g., correct queue waiting times).
- System-level: Checks causal and temporal relationships across components (e.g., ensuring events do not occur before their causes).
Diagnostics: If a violation occurs, the framework provides localized diagnostics identifying the specific constraint, responsible entity, and state variable, enabling systematic refinement.
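
As an illustration of trace-based checking, the sketch below validates a JSONL trace against two simple rules: timestamps never go backwards, and an effect event must be preceded by its cause for the same entity. The field names (t, event, robot), the event names, and the diagnostic format are assumptions. The paper's actual rule set is derived from each specification and is not reproduced here.

```python
import json


def check_trace(jsonl_lines, cause="arrive", effect="charged", key="robot"):
    """Return a list of diagnostics; an empty list means the trace conforms.

    Rules checked:
      monotonic_time      - event times never decrease
      cause_before_effect - each `effect` follows an unconsumed `cause`
                            for the same entity
    """
    pending_causes = {}           # entity id -> time of its unconsumed cause
    diagnostics = []
    last_t = float("-inf")
    for line_no, line in enumerate(jsonl_lines, start=1):
        event = json.loads(line)
        t, entity = event["t"], event.get(key)
        if t < last_t:
            diagnostics.append(
                {"line": line_no, "rule": "monotonic_time", "entity": entity, "t": t}
            )
        last_t = max(last_t, t)
        if event["event"] == cause:
            pending_causes[entity] = t
        elif event["event"] == effect:
            if entity in pending_causes:
                del pending_causes[entity]    # the cause is consumed
            else:
                diagnostics.append(
                    {"line": line_no, "rule": "cause_before_effect", "entity": entity, "t": t}
                )
    return diagnostics


good = [
    '{"t": 1.0, "event": "arrive", "robot": "r1"}',
    '{"t": 6.0, "event": "charged", "robot": "r1"}',
]
bad = [
    '{"t": 2.0, "event": "charged", "robot": "r2"}',
    '{"t": 1.0, "event": "arrive", "robot": "r2"}',
]
good_report = check_trace(good)
bad_report = check_trace(bad)
```

Each diagnostic names the rule, the trace line, and the entity involved, which is the kind of localization that lets a failing component be regenerated on its own instead of rebuilding the whole simulator.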

3. Key Contributions

DEVS-Gen Framework: A novel pipeline for synthesizing executable, discrete-event world models from natural language using the DEVS formalism, bridging the gap between rigid simulators and flexible neural models.
Modular Generation Strategy: A decomposition strategy that separates structural planning from behavioral implementation, enabling parallel generation and significantly improving stability and scalability compared to monolithic code generation.
Trace-Based Evaluation Benchmark: A rigorous evaluation methodology that validates simulators against specification-derived temporal and semantic constraints rather than code equivalence, accompanied by a curated dataset of 7 scenarios spanning domains that include banking, transport, epidemiology, networking, and logistics.
Adaptive Assembly: A mechanism to resolve interface mismatches between planned structure and generated code, ensuring robust integration of components.

4. Experimental Results

The authors evaluated DEVS-Gen against state-of-the-art iterative software engineering agents (OpenHands, SWE-Agent) across 7 benchmark scenarios, using both large and small LLM backbones.

Effectiveness:
- DEVS-Gen achieved competitive Behavioral Conformance Scores (BCS) and Operational Success Scores (OSS) compared to full iterative agents, despite lacking the ability to execute code and self-correct during generation.
- Crucially, DEVS-Gen significantly outperformed "Lite" (non-iterative) baselines, demonstrating that the structured DEVS decomposition acts as a "correct-by-construction" guide, reducing reliance on trial-and-error loops.
Efficiency:
- DEVS-Gen reduced token consumption by roughly 0.8 orders of magnitude (about a 6x reduction) compared to iterative agents.
- It achieved faster wall-clock times, particularly for smaller models, by avoiding the "doom loops" of failed debugging common in iterative agents.
Scalability:
- The modular design allows for parallel synthesis of atomic components. While the planning phase showed modest speedups due to current benchmark scales, the generation phase achieved a ~4.7x speedup, validating the approach's potential for large-scale systems where complexity grows logarithmically ($O(\log N)$) rather than linearly.
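
The two token-efficiency figures quoted above (0.8 orders of magnitude and roughly 6x) describe the same ratio. As a quick check of the arithmetic, writing $T_{\text{iterative}}$ and $T_{\text{DEVS-Gen}}$ for the token counts of the two approaches (these symbols are introduced here for illustration and are not the paper's notation):

```latex
\[
\frac{T_{\text{iterative}}}{T_{\text{DEVS-Gen}}}
  \approx 10^{0.8}
  = e^{0.8 \ln 10}
  \approx e^{1.842}
  \approx 6.31
\]
```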

5. Significance

This work establishes a principled foundation for specification-driven world modeling. By leveraging the DEVS formalism, it enables the creation of world models that are:

Verifiable: Behavior can be audited against explicit constraints via event traces.
Adaptable: Models can be synthesized and modified on-demand during online execution.
Scalable: The modular generation pipeline supports complex, large-scale systems efficiently.

The framework moves beyond purely rule-based systems by allowing LLMs to act as event-generating or decision-making entities within DEVS components, paving the way for hybrid simulations in domains like social dynamics, organizational behavior, and coordinated multi-agent task solving.
