From Business Events to Auditable Decisions: Ontology-Governed Graph Simulation for Enterprise AI

This paper introduces LOM-action, an ontology-governed graph simulation framework that transforms enterprise AI decision-making by grounding agent responses in event-driven, deterministic sandbox simulations to ensure auditability and achieve significantly higher tool-chain reliability than existing LLM baselines.

Hongyin Zhu, Jinming Liang, Mengjun Hou, Ruifan Tang, Xianbin Zhu, Jingyuan Yang, Yuanman Mao, Feng Wu

Published 2026-04-13

The Big Problem: The "Fluent but Wrong" AI

Imagine you hire a brilliant, fast-talking consultant (a standard Large Language Model or LLM) to manage your company's finances. You ask, "Can we approve this $50,000 expense?"

The consultant answers instantly: "Yes, absolutely! Here is the approval." They sound confident, and the grammar is perfect. But here's the catch: They didn't check the rules. They didn't look at the current budget, they didn't check if the manager has the authority to sign off on that amount, and they didn't see that the company is currently in a "frozen spending" mode due to a merger.

They just guessed based on general knowledge. In the real world, this is dangerous. If the decision is later challenged, you can't reconstruct why they said yes, because they never actually followed a process. They just "felt" it was right.

This is what the paper calls "Illusive Accuracy." The AI looks smart (high accuracy), but it's actually hallucinating a decision because it skipped the necessary steps to check the specific rules of the moment.


The Solution: LOM-action (The "Simulation Sandbox")

The authors propose a new system called LOM-action. Instead of letting the AI guess, they force it to play a "what-if" game before it makes a decision.

Think of it like a Flight Simulator for business decisions.

  1. The Real World (The Enterprise Ontology): This is your company's actual rulebook, database, and org chart. It's huge and complex.
  2. The Event (The Trigger): A business event happens (e.g., "A manager submits an expense report").
  3. The Sandbox (The Simulation): Before the AI says "Yes" or "No," it creates a copy of the company's rulebook in a safe, isolated room (a sandbox).
    • It applies the specific rules for this event (e.g., "Oh, this manager is in the Marketing department, so they have a $5k limit," or "The company is in a freeze, so no new spending").
    • It cuts out the parts of the rulebook that don't apply and adds the new constraints.
    • Crucially: It does this without touching the real company database. It's just a simulation.
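The sandbox idea above can be sketched in a few lines of Python. This is a toy illustration, not the paper's actual data model: the ontology dict, the field names, and the constraints are all hypothetical, chosen to match the expense-report example.

```python
import copy

# Hypothetical toy "enterprise ontology": departments, limits, and flags
# are illustrative stand-ins for the paper's graph-structured rulebook.
ontology = {
    "departments": {"Marketing": {"expense_limit": 5_000}},
    "global_flags": {"spending_freeze": True},
}

def build_sandbox(ontology, event):
    """Prepare a sandbox for one event. The real ontology is never mutated:
    all event-specific pruning and constraints happen on a deep copy."""
    sandbox = copy.deepcopy(ontology)
    dept = sandbox["departments"][event["department"]]
    # Keep only the rules that apply to this event, plus new constraints.
    sandbox["active_constraints"] = {
        "limit": dept["expense_limit"],
        "frozen": sandbox["global_flags"]["spending_freeze"],
    }
    return sandbox

event = {"department": "Marketing", "amount": 50_000}
sandbox = build_sandbox(ontology, event)
# The original rulebook is untouched — the simulation stayed in its box.
assert "active_constraints" not in ontology
```

The key design point the paper emphasizes is the last line: because the simulation only ever touches the copy, a failed or exploratory "what-if" run can never corrupt the real company data.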

How It Works: The Three-Step Dance

The paper describes a strict three-step process that the AI must follow, like a pilot going through a pre-flight checklist:

  1. Phase 1: The Translator (Scenario Parsing)
    The AI reads the messy human request and translates it into strict business rules.

    • Analogy: You tell the pilot, "I want to fly to London." The pilot translates that into: "Check wind speed, check fuel levels, check runway 22 availability."
  2. Phase 2: The Simulator (Sandbox Simulation)
    The AI goes into the "Sandbox." It takes a copy of the company's data and physically removes or changes the parts that don't fit the current situation.

    • Analogy: The pilot runs the flight simulator. The computer simulates the wind, the fuel burn, and the runway conditions. It creates a specific "flight path" that is valid only for this specific trip.
    • The Magic: If the simulation shows the path is blocked (e.g., "No valid path exists because the budget is frozen"), the AI stops. It doesn't guess. It reports the blockage.
  3. Phase 3: The Decision (Derivation)
    The AI looks only at the result of the simulation.

    • Analogy: The pilot looks at the simulator's output. If the simulator says "Go," the pilot says "Go." If the simulator says "Crash," the pilot says "Cancel."
    • The Audit Trail: Because the AI followed the simulation steps, we have a perfect record (a receipt) of exactly why the decision was made. "We said no because the simulation showed the budget was frozen."
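The three phases above can be sketched as a small pipeline. Again, this is a hedged sketch, not the paper's implementation: the parser is a stub, and the constraint names and return shapes are assumptions made for the running expense example.

```python
# Hypothetical sandbox state produced by the earlier copy-and-prune step.
sandbox = {"active_constraints": {"limit": 5_000, "frozen": True}}

def parse_scenario(request):
    # Phase 1 (Translator): turn a messy request into explicit checks.
    return {"amount": request["amount"], "checks": ["freeze", "limit"]}

def simulate(sandbox, scenario):
    # Phase 2 (Simulator): walk the sandbox, recording every rule consulted.
    trail = []
    c = sandbox["active_constraints"]
    if "freeze" in scenario["checks"]:
        trail.append(f"spending_freeze={c['frozen']}")
        if c["frozen"]:  # path blocked — stop, don't guess
            return {"path_found": False, "reason": "budget frozen", "trail": trail}
    if "limit" in scenario["checks"]:
        trail.append(f"limit={c['limit']} vs amount={scenario['amount']}")
        if scenario["amount"] > c["limit"]:
            return {"path_found": False, "reason": "over limit", "trail": trail}
    return {"path_found": True, "reason": "all checks passed", "trail": trail}

def derive(result):
    # Phase 3 (Derivation): the verdict reads ONLY the simulation output.
    verdict = "APPROVE" if result["path_found"] else "REJECT"
    return verdict, result["reason"], result["trail"]

verdict, reason, trail = derive(simulate(sandbox, parse_scenario({"amount": 50_000})))
```

Note that `trail` is the "receipt": every rule the simulator consulted is logged in order, so the final verdict can be replayed and audited step by step.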

The "Dual-Mode" Brain

The system has two ways of thinking, like a human having a fast "reflex" mode and a slower "thoughtful" mode:

  • Skill Mode (The Reflex): If the AI has seen this type of problem before, it uses a pre-approved "tool" (like a calculator or a database query) to get the answer instantly. It's fast and safe.
  • Reasoning Mode (The Thoughtful): If the problem is new and complex, the AI pauses, loads the simulated data into its "working memory," and thinks through the logic step-by-step.
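A minimal dispatch sketch for the dual-mode idea might look like the following. The registry, function names, and the fallback reasoner are all hypothetical; the paper's actual skill library and reasoning loop are not shown here.

```python
# Hypothetical registry of pre-approved "skills" (fast, vetted tools).
KNOWN_SKILLS = {
    "expense_approval": lambda event: event["amount"] <= 5_000,
}

def reason_step_by_step(event):
    # Placeholder for the slow path: load the sandbox into working memory
    # and reason through the logic step by step. Toy rule for illustration.
    return event["amount"] <= 5_000

def decide(event):
    skill = KNOWN_SKILLS.get(event["type"])
    if skill is not None:
        return ("skill", skill(event))            # reflex: run the vetted tool
    return ("reasoning", reason_step_by_step(event))  # deliberate: think it through

mode, approved = decide({"type": "expense_approval", "amount": 2_000})
```

The design choice mirrors the paper's framing: known problem shapes go through cheap, pre-approved tools, and only genuinely novel cases pay the cost of full step-by-step reasoning.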

Why This Matters: The "Illusive Accuracy" Trap

The paper tested this against top-tier AI models (like Doubao and DeepSeek).

  • The Top Models: They got the final answer right 80% of the time. But when you checked how they got there, they skipped the simulation steps. They just guessed. Their "Tool-Chain F1" (a score for following the process) was terrible (around 24-36%).
  • LOM-action: It got the answer right 94% of the time, and it followed the process almost perfectly (a Tool-Chain F1 around 98%).

The Lesson: In a business, being "right by accident" is a liability. If you get sued, you can't say, "The AI guessed right." You need to say, "The AI followed the rules, ran the simulation, and the simulation said yes."

Summary Metaphor: The Traffic Light

  • Standard AI: A driver who sees a red light but thinks, "I'm a good driver, I'll just speed through it because I feel like it." They might make it across safely (Accurate), but they broke the law and have no record of why they thought it was safe.
  • LOM-action: A driver who stops, checks the traffic camera feed (Simulation), sees the light is red, checks the police report (Audit Trail), and waits. If the light turns green, they go. If it stays red, they wait. They have a perfect log of every second they waited.

In short: This paper argues that for AI to be trusted in business, it shouldn't just be a smart talker; it must be a disciplined simulator that proves its work before making a single decision.
