KAIJU: An Executive Kernel for Intent-Gated Execution of LLM Agents

Imagine you are the CEO of a high-stakes research firm. You have a brilliant, fast-talking intern (the LLM) who is incredibly smart but prone to daydreaming, getting overwhelmed by too much paperwork, and sometimes accidentally trying to blow up the office because they misunderstood a command.

Your current way of working (called ReAct) is like this: You give the intern a task. They think, do one thing, write a report, show it to you, wait for your feedback, then think again, do the next thing, write a longer report, show it to you, and so on.

The Problem with the Old Way:

The "Notebook" gets too heavy: Every time the intern shows you a report, they have to carry the entire history of everything they've ever done in this project. After 10 steps, the notebook is so thick (too many "tokens") that the intern can't read it anymore, gets confused, and starts hallucinating nonsense.
The "Give-Up" Factor: If the intern tries to call a plumber and the line is busy, they might just decide, "Eh, I'll just guess the answer," or ask you for help immediately, rather than trying a different plumber.
The "Safety" Flaw: You tell the intern, "Don't touch the red button." But if they get confused or tricked by a weird prompt, they might press it anyway. There's no physical lock on the button; just a verbal warning.

Enter KAIJU: The Executive Kernel
The authors of this paper built a new system called KAIJU. Think of KAIJU as a super-efficient, automated factory floor that sits between you (the user) and the intern.

Here is how KAIJU works, using simple analogies:

1. The Architect vs. The Construction Crew (Separation of Powers)

In the old way, the intern was the Architect, the Foreman, and the Bricklayer all at once.
In KAIJU, you have two distinct roles:

The Planner (The Architect): The intern sits in a quiet room. You give them the blueprints. They draw a complete map of the job (a "Dependency Graph") before anyone picks up a hammer. They don't do the work; they just plan it.
The Executive Kernel (The Factory Manager): This is the KAIJU system. It takes the Architect's map and sends it to the construction crew. The crew works in waves. They don't wait for the Architect to come back and check every single brick. They just follow the map.

2. The "Intent-Gated" Security Checkpoint (IGX)

This is the coolest part. Imagine a high-security airport.

Old Way: The intern is told, "Don't fly to dangerous countries." But if they are tricked, they might fly anyway.
KAIJU Way: The intern draws the flight plan. But before the plane can take off, it hits a Security Gate.
- Scope: Is this plane allowed to fly at all?
- Intent: Did the CEO (you) authorize this specific trip?
- Impact: Is this a passenger flight (safe) or a bomb drop (dangerous)?
- Clearance: Has the destination country actually granted us a visa?

The gate is a robot. It checks these four things. If the answer is "No," the plane is grounded. Crucially, the intern never sees the gate. They just see "Flight Failed." They can't trick the gate because they aren't even in the room with it.

3. The "Wave" System (Parallel Execution)

Instead of the intern doing one thing at a time and waiting for you to say "Good job, now do the next thing," KAIJU sends out waves of workers.

Wave 1: Send four workers to check the weather, call the bank, check the traffic, and order lunch. They all go at the same time.
Wave 2: Once the weather report comes back, a "Reflector" (a smart supervisor) looks at it. If the weather is bad, the supervisor instantly changes the plan for the next wave without asking you.

Because they work in parallel, the job gets done much faster, especially for big, complex tasks.

4. The "Bounded Context" (No More Heavy Notebooks)

In the old way, the intern carried a notebook that grew bigger with every step.
In KAIJU, the Factory Manager keeps the big notebook. The intern (the Planner) only sees the specific task for this wave.

Wave 1: Intern sees "Check Weather."
Wave 2: Intern sees "Check Traffic."
They never have to read the whole history. This prevents them from getting overwhelmed and making mistakes.

5. What Happens When Things Go Wrong?

Old Way: If a tool fails, the intern panics, stops, and asks you, "What should I do?" or just guesses.
KAIJU Way: If a worker drops a brick, the Micro-Planner (a specialized robot) instantly swaps in a different tool or tries a different angle. It keeps trying until the job is done. It doesn't ask you for permission; it just fixes the problem and keeps moving.

The Results: Why does this matter?

The paper tested KAIJU against the old method on hard tasks (like calculating planetary positions or finding complex data).

Simple tasks: The old way was slightly faster (because KAIJU has to draw the map first).
Complex tasks: KAIJU was roughly 1.7x to 3x faster and much more reliable.
Safety: KAIJU never let a "dangerous" command slip through, whereas the old way sometimes did.
Quality: KAIJU didn't give up when things got hard. It kept digging until it found the answer, whereas the old way often gave up and guessed.

In Summary:
KAIJU turns a chaotic, chatty conversation into a military-grade operation. It separates the "thinking" from the "doing," puts a robotic security guard at the door, and ensures that even if the plan goes wrong, the system fixes itself without bothering the human boss. It makes AI agents safer, faster, and more reliable for serious work.

1. Problem Statement

Current Large Language Model (LLM) agents, particularly those using the ReAct (Reasoning + Acting) paradigm, face three critical limitations as task complexity increases:

Quadratic Context Growth: In ReAct, every reasoning turn re-reads the full conversation history and all earlier tool results. For $n$ turns with average tool result size $k$, the total token cost is $O(n^2k)$. This leads to context window exhaustion, degraded attention, and empty outputs on multi-step tasks.
Unilateral Authority & Reliability: The LLM retains full control over whether to persist, retry, or abandon a task. If a tool fails, the model may rationally decide to "give up" and rely on parametric knowledge or ask the user, undermining the reliability of autonomous execution.
Vulnerability to Safety Breaches: Safety is enforced via prompt instructions (e.g., "do not delete files"). These are easily bypassed via prompt injection, hallucination, or context overflow. Adversaries can iteratively probe the model to discover policy boundaries because the model observes rejection reasons.
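
The quadratic growth can be checked with a little arithmetic. The following is a minimal sketch, assuming that every turn adds exactly $k$ tokens of tool output and that prompt overhead is negligible; the function names are illustrative and do not come from the paper.

```python
def react_total_tokens(n: int, k: int) -> int:
    """Tokens read over n ReAct turns, assuming turn i's context holds i results of size k."""
    return sum(i * k for i in range(1, n + 1))  # k * n(n+1)/2, i.e. O(n^2 k)


def bounded_total_tokens(n: int, k: int) -> int:
    """Tokens read over n calls when each call sees only one result of size k."""
    return n * k  # O(nk)
```

Under these assumptions, 10 turns with 500-token results cost 27,500 tokens in the accumulating case and 5,000 tokens in the bounded case.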

Existing solutions like LLM Compiler (parallel DAG execution) lack mid-execution adaptation and safety gating, while multi-agent frameworks introduce coordination overhead and often fail on complex enterprise tasks.

2. Methodology: The KAIJU Architecture

KAIJU introduces a system-level abstraction that strictly decouples the Reasoning Layer (LLM) from the Execution Layer (Kernel).

Core Components

Executive Kernel: Manages scheduling, tool dispatch, dependency resolution, failure recovery, and security. The LLM is a stateless resource invoked only at discrete points (planning, reflection, aggregation) and has no visibility into execution mechanics.
Intent-Gated Execution (IGX): A security paradigm that enforces authorization via four independent variables before any tool executes. The gate decisions are structural and do not feed back into the LLM's context.
- Scope: Which tools are allowed (Allowlist).
- Intent: The operational ceiling set by the trigger source (e.g., observer vs. operator), not the LLM.
- Impact: The declared impact level of the tool (e.g., read=0, write=1, delete=2).
- Clearance: External authorization via HTTP endpoints (e.g., checking geofences or AD groups).
Graph-Based Execution: The LLM produces a Directed Acyclic Graph (DAG) of tool calls.
- Dependency Injection: Parameters are resolved structurally at execution time from upstream outputs, removing the need for sequential data passing.
- Failure Recovery: If a node fails, a "Micro-Planner" grafts replacement nodes (retry, alternative tool, or skip) without asking the user.
- Preemption: Human operators can inject messages that act as reflection checkpoints, blocking pending nodes until a decision is made.
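
The four IGX variables described above can be sketched as a small gate function. This is an illustration only, not the authors' implementation: the tool names, the `IMPACT` registry, the `GatePolicy` fields, and the rule that a tool's impact must not exceed the intent ceiling are all assumptions, and the clearance callback stands in for the external HTTP check.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

# Hypothetical tool registry: declared impact per tool (read=0, write=1, delete=2).
IMPACT = {"read_file": 0, "write_file": 1, "delete_file": 2}


@dataclass(frozen=True)
class ToolCall:
    tool: str
    resource: str


@dataclass(frozen=True)
class GatePolicy:
    allowlist: frozenset[str]              # Scope: which tools may run at all
    intent_ceiling: int                    # Intent: ceiling set by the trigger source
    clearance: Callable[[ToolCall], bool]  # Clearance: stand-in for an external HTTP check


def gate(call: ToolCall, policy: GatePolicy) -> bool:
    """Return True only if all four checks pass."""
    if call.tool not in policy.allowlist:                 # Scope
        return False
    impact = IMPACT.get(call.tool)                        # Impact: declared, never LLM-supplied
    if impact is None or impact > policy.intent_ceiling:  # Intent (assumed comparison rule)
        return False
    return policy.clearance(call)                         # Clearance


def run_gated(call: ToolCall, policy: GatePolicy, execute: Callable[[ToolCall], object]) -> dict:
    """Execute the call if the gate allows it; otherwise report an opaque failure."""
    if not gate(call, policy):
        return {"ok": False, "error": "failed"}  # no rejection reason is exposed
    return {"ok": True, "result": execute(call)}
```

The point of the sketch is the return value of `run_gated`: a blocked call looks like any other failure, so nothing about which check rejected it reaches the LLM's context.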

Adaptive Execution Modes

KAIJU supports three modes to balance speed, cost, and oversight:

Reflect: Reflection checkpoints occur between dependency waves. The LLM evaluates the current wave's results and replans if necessary.
nReflect: Reflection fires after every $N$ node completions, decoupling evaluation frequency from graph depth.
Orchestrator: A lightweight observer evaluates every completed node, allowing immediate injection of follow-up nodes or cancellation of pending work.
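
The difference between the three modes comes down to when the kernel pauses to consult the LLM. A minimal sketch of that decision, assuming it is made once after each node completes; the `Mode` enum, the function name, and the default interval are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    REFLECT = "reflect"
    N_REFLECT = "nreflect"
    ORCHESTRATOR = "orchestrator"


def should_checkpoint(mode: Mode, completed: int, wave_done: bool, n: int = 3) -> bool:
    """Decide, after a node completes, whether to invoke the LLM for evaluation.

    completed: total nodes finished so far, including this one (>= 1)
    wave_done: True if this node was the last of its dependency wave
    n:         checkpoint interval used by nReflect
    """
    if mode is Mode.REFLECT:
        return wave_done           # only between dependency waves
    if mode is Mode.N_REFLECT:
        return completed % n == 0  # every n-th completion, regardless of graph depth
    return True                    # Orchestrator: after every completed node
```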

3. Key Contributions

Structural Separation of Concerns: By separating planning from execution, KAIJU reduces token complexity from $O(n^2k)$ to $O(nk)$ (Orchestrator) or $O(nkd)$ (Reflect), where $d$ is the dependency depth of the graph, preventing context exhaustion.
Deterministic Safety (IGX): Safety is enforced in compiled code via a four-variable gate. Since the LLM never sees the gate's decision criteria or rejection reasons, it cannot adaptively probe or bypass the policy.
Parallelism with Adaptation: Unlike LLM Compiler (which waits for the whole graph to finish), KAIJU executes in "waves" and allows mid-execution adaptation (replanning) via reflection, achieving $O(d)$ latency where $d$ is dependency depth.
Delegated Clearance: Resource-level authorization is offloaded to external endpoints, making the agent framework domain-agnostic while supporting complex enterprise policies.
Behavioral Guarantees: The system enforces persistence structurally. If a tool fails, the kernel retries or replans; it cannot "give up" or defer to the user unless explicitly gated by an operator.
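
The wave-based parallelism and structural dependency injection listed above can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the KAIJU kernel: it uses `graphlib.TopologicalSorter` to group nodes into waves and a thread pool to run each wave concurrently, and it omits reflection, gating, and failure recovery. The node names and the `run_in_waves` helper are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def run_in_waves(nodes, deps):
    """Run a DAG wave by wave.

    nodes: name -> function taking a dict of upstream results
    deps:  name -> set of upstream node names
    Returns (results, waves).
    """
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in nodes})
    sorter.prepare()  # raises CycleError if the plan is not a DAG
    results, waves = {}, []
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            wave = sorted(sorter.get_ready())  # every node whose dependencies are done
            waves.append(wave)
            # Dependency injection: each node receives only its upstream outputs.
            futures = {
                name: pool.submit(nodes[name], {d: results[d] for d in deps.get(name, ())})
                for name in wave
            }
            for name, future in futures.items():
                results[name] = future.result()
                sorter.done(name)
    return results, waves


# Illustrative plan: two independent lookups feed one dependent step.
nodes = {
    "weather": lambda up: "rain",
    "traffic": lambda up: "heavy",
    "plan": lambda up: f"leave early ({up['weather']}, {up['traffic']})",
}
deps = {"plan": {"weather", "traffic"}}
results, waves = run_in_waves(nodes, deps)
```

Here the three nodes finish in two waves, matching the dependency depth of the graph rather than the node count.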

4. Experimental Results

The authors evaluated KAIJU against a ReAct baseline using GPT-4.1 across 40 queries (Simple, Targeted, Complex, Computational) and the GAIA benchmark.

Latency & Complexity:
- Simple Queries: ReAct is marginally faster (3.6s vs 3.9s) due to lack of planner overhead.
- Complex/Computational Queries: KAIJU significantly outperforms ReAct.
  - Complex: nReflect (9.5s) vs ReAct (28.9s).
  - Computational: nReflect (25.2s) vs ReAct (43.7s).
- Reason: KAIJU executes independent tools in parallel waves with bounded context, while ReAct accumulates all history, leading to context exhaustion.
Reliability:
- On a 10-question computational astronomy benchmark, KAIJU completed 10/10 queries. ReAct failed 2/10 (empty verdicts due to context limits).
- On the GAIA benchmark (Level 3, the hardest questions), KAIJU achieved 21.1% accuracy vs 0.0% for ReAct.
Output Quality: KAIJU produces more thorough, structured results (e.g., step-by-step calculations) because the reflection loop forces evidence evaluation before concluding, preventing the model from "hallucinating" an answer from parametric knowledge when tools fail.
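
The latency figures reported above translate into speedup ratios as follows. The snippet only restates the numbers from this section; the dictionary layout is illustrative, and the Complex and Computational entries use the nReflect timings.

```python
# Latencies in seconds, as reported in the results above.
latency = {
    "Simple": {"ReAct": 3.6, "KAIJU": 3.9},
    "Complex": {"ReAct": 28.9, "KAIJU": 9.5},
    "Computational": {"ReAct": 43.7, "KAIJU": 25.2},
}

# Speedup of KAIJU over ReAct; values below 1 mean ReAct was faster.
speedup = {name: round(t["ReAct"] / t["KAIJU"], 2) for name, t in latency.items()}
# speedup == {"Simple": 0.92, "Complex": 3.04, "Computational": 1.73}
```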

5. Significance

KAIJU represents a paradigm shift from conversational agents to executive kernels.

Security: It solves the "adaptive adversary" problem by removing the feedback loop between safety enforcement and the LLM. The model cannot learn the policy boundary because it never sees the boundary.
Scalability: It enables agents to handle complex, multi-step workflows that exceed the context windows of current LLMs by breaking tasks into bounded, parallel waves.
Enterprise Readiness: The separation of execution mechanics from reasoning allows for the integration of strict safety, compliance, and domain-specific authorization without modifying the LLM itself.

In summary, KAIJU demonstrates that structural guarantees (via DAGs and intent gates) are superior to prompt-based guarantees for autonomous agents, offering a robust foundation for deploying LLM agents in high-stakes, complex environments.