Original authors: George Andronchik, Pavel Lokhmakov
Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.
Technical Summary: AI Code Sandboxes: A Comparative Security Study (Part 1)
Problem Statement
The paper addresses the critical challenge of engine-level isolation for AI agents executing untrusted code. While "sandboxing" is a standard defense family in agentic AI security literature, there is a lack of deep, engine-level measurement comparing how different products isolate guest code from the host kernel. The urgency is driven by "dangerous capability" research indicating that AI agents are already capable of performing multi-step cyber-attacks (e.g., completing 22 of 32 corporate network attack steps) within current compute budgets. The study focuses on the T0.H2.N2 threat model: a single-tenant operator running untrusted code on their own infrastructure, where the operator trusts the infrastructure but not the code. The goal is to measure how five specific AI-sandbox products (arrakis, e2b, microsandbox, gvisor, daytona) prevent host kernel escape and information leakage.
Methodology
The study employs a six-axis, cross-class comparative framework measuring properties determined by the underlying engine (microVM, userspace kernel, or OCI container). The methodology explicitly forbids composite scoring or overall ranking, instead providing per-axis orderings and a threat-model qualification matrix.
The Six Axes:
- Host Attack Surface (1.1): Measures the footprint of the runtime/mediator (L2) on the host kernel (L1) via strace syscall counts, seccomp filter ceilings, and primitive reachability (14 specific kernel-LPE/container-escape primitives).
- Information Leakage (1.2): Measures what host-identifying data (CPU, RAM, kernel version, disk serials) is exposed to the guest via /proc, /sys, and /dev reads.
- Defense-in-Depth Stackability (1.3): Evaluates whether an operator can layer additional Linux hardening (seccomp, AppArmor, user namespaces, etc.) on top of the engine defaults.
- Public CVE History (1.4): Analyzes the last 24 months of CVEs for each engine, classifying them by impact (Escape, HostLeak, HostDoS).
- Patch Cadence (1.5): Measures the time lag between upstream patching and product-level availability, distinguishing between coordinated disclosures and "silent-fix-first" models.
- Upstream Fuzzing Posture (1.6): Assesses the presence of continuous public fuzzing, in-tree harnesses, and per-CVE attribution to fuzzer discovery.
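To make the syscall-count part of axis 1.1 concrete, here is a minimal Python sketch that tallies syscall names from raw strace output lines. It is an illustration only, not the paper's tooling: the function name `count_syscalls` and the line formats it accepts (plain lines, `[pid N]` prefixes, and bare PID prefixes as written by `strace -f -o`) are assumptions.

```python
import re
from collections import Counter

# Matches an optional "[pid 123] " or "123 " prefix, then "name(".
# Resumed lines ("<... read resumed>"), signals ("--- SIG... ---") and
# exit markers ("+++ exited ... +++") do not match and are skipped.
_SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def count_syscalls(lines):
    """Return a Counter mapping syscall name -> number of invocations."""
    counts = Counter()
    for line in lines:
        match = _SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): strace -f -o trace.txt <runtime command>
    #                       python count_syscalls.py < trace.txt
    totals = count_syscalls(sys.stdin)
    print(f"distinct syscalls: {len(totals)}")
    for name, n in totals.most_common():
        print(f"{n:8d}  {name}")
```

The number of distinct names is the figure most comparable across runtimes. Because an unfinished call is counted once at its start and the matching resumed line is skipped, interleaved multi-process traces are not double-counted.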
Experimental Setup:
- Host: Single Hetzner bare-metal node (Ubuntu 24.04, Kernel 6.8.0).
- Products: Five products mapped to three engine classes:
- MicroVMs: arrakis (Cloud Hypervisor), e2b (Firecracker), microsandbox (libkrun).
- Userspace Kernel: gvisor (runsc).
- OCI Container: daytona (runc via Docker-CE).
- Verification: Uses "probe" tests (pass/fail), "measurement" (syscall counts), and "desk research" (CVE/Fuzzing analysis).
Key Contributions and Findings
1. Engine Classes vs. Product Variance
While engine classes (microVM vs. userspace kernel vs. container) separate cleanly on architectural axes (attack surface, leakage), products within the same class do not. Product-level configuration and pin policies are often more significant differentiators than the engine class itself.
- Example: arrakis (microVM) has a "frozen" patch policy (471+ days), while daytona (container) is "current" on patches, reversing the expected security hierarchy based on isolation class alone.
2. Attack Surface and Primitive Reachability
- gVisor has the tightest attack surface (5/14 primitives reachable) due to its userspace kernel intercepting syscalls.
- Firecracker (e2b) has the tightest seccomp ceiling (55 syscalls) but still had 2 new Escape-class CVEs in the 2026 window, showing that a small surface does not guarantee zero bugs in the exercised paths.
- arrakis exposes a live /dev/kvm interface to the guest, allowing nested virtualization without privilege escalation, significantly expanding its kernel-LPE surface compared to other microVMs.
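A pass/fail probe for the nested-KVM exposure described above could look like the following Python sketch, run from inside the guest. The function name `kvm_exposed` is hypothetical, and the check is deliberately shallow: it confirms only that the path is a character device the current user can open read-write, not that KVM ioctls succeed.

```python
import os
import stat


def kvm_exposed(path="/dev/kvm"):
    """Return True if `path` is a character device openable read-write."""
    try:
        st = os.stat(path)
    except OSError:
        return False  # node absent or not visible to this process
    if not stat.S_ISCHR(st.st_mode):
        return False  # present, but not a device node
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError:
        return False  # device exists but access is denied
    os.close(fd)
    return True


if __name__ == "__main__":
    print("FAIL: /dev/kvm reachable" if kvm_exposed() else "PASS: /dev/kvm not reachable")
```

A stricter probe would issue a KVM ioctl after opening the device. That step is omitted here to keep the sketch free of platform-specific constants.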
3. Patch Propagation Dominance
The study finds that product-side pin policy is the dominant operator-facing variable: for coordinated disclosures, upstream lag is ≈0 days, while downstream product lag spans 0 to 471+ days.
- arrakis and e2b (self-hosted) are "frozen" on older engine versions, leaving them unpatched against recent critical CVEs (e.g., CVE-2026-45782 for arrakis, CVE-2026-5747 for e2b).
- gVisor follows a "silent-fix-first" model where fixes ship months before CVE assignment, resulting in negative lag (operators receive fixes before public disclosure).
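The lag arithmetic behind these figures is a date difference whose sign depends on which event is taken as the reference. The Python sketch below is one way to express it; the function names and all dates are illustrative assumptions, not values from the paper.

```python
from datetime import date


def lag_days(reference: date, available: date) -> int:
    """Days from the reference event to the day operators can get the fix.

    Coordinated disclosure: reference = upstream patch date.
    Silent-fix-first:       reference = CVE publication date, so a fix
                            that shipped earlier yields a negative lag.
    """
    return (available - reference).days


def classify(lag: int) -> str:
    """Coarse label for a lag value (labels are hypothetical)."""
    if lag < 0:
        return "fix-before-disclosure"
    if lag == 0:
        return "same-day"
    return "lagging"


if __name__ == "__main__":
    # Illustrative dates only.
    coordinated = lag_days(date(2025, 1, 10), date(2025, 1, 10))
    silent_fix = lag_days(date(2025, 6, 1), date(2025, 3, 1))
    print(coordinated, classify(coordinated))
    print(silent_fix, classify(silent_fix))
```

For a frozen pin there is no availability date yet, so the lag can only be reported as a lower bound measured up to the observation date. That is presumably why the figure above is written as 471+.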
4. Fuzzing Posture and "Unmeasured" Risks
- gVisor is the only engine with a continuous public fuzzer (syzkaller) and in-tree harnesses.
- Firecracker and libkrun have no upstream fuzzing infrastructure.
- Critical Finding: The combination of "MicroVM class" (strong isolation) and "Continuous Public Fuzzer" (strong residual-bug detection) is unoccupied in this set.
- libkrun (microsandbox) is structurally unmeasured: it has 0 published CVEs and no upstream fuzzer. The paper argues that "0 CVEs" here is an absence of signal, not proof of soundness.
5. Information Leakage
- MicroVMs generally leak 0–1 host identifiers (configurable CPU strings).
- gVisor leaks 2 identifiers (RAM total, BIOS product) due to implementation gaps in its synthetic /proc.
- daytona leaks 10 identifiers, including disk serials and full kernel signatures, due to the shared-kernel architecture.
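A guest-side leakage probe of the kind these counts imply can be sketched in Python as follows. The probe list is an assumption chosen to mirror identifiers named above (kernel version, RAM total, CPU model, BIOS product), and it is not the paper's probe set. Reading a value does not by itself prove a leak: each value still has to be compared against the real host's value to tell genuine identifiers from synthetic ones.

```python
import os

# Hypothetical probe list: (label, path, line prefix or None for first line).
# "model name" is the x86 field in /proc/cpuinfo; other architectures differ.
PROBES = [
    ("kernel_version", "/proc/version", None),
    ("ram_total", "/proc/meminfo", "MemTotal"),
    ("cpu_model", "/proc/cpuinfo", "model name"),
    ("bios_product", "/sys/class/dmi/id/product_name", None),
]


def read_field(path, prefix=None):
    """Return the first non-empty line (optionally starting with prefix), or None."""
    try:
        with open(path, "r", errors="replace") as fh:
            for line in fh:
                line = line.strip()
                if line and (prefix is None or line.startswith(prefix)):
                    return line
    except OSError:
        return None  # missing, unreadable, or not a regular file
    return None


def collect_identifiers(root="/", probes=PROBES):
    """Map label -> observed value for every probe readable under `root`."""
    found = {}
    for label, path, prefix in probes:
        value = read_field(os.path.join(root, path.lstrip("/")), prefix)
        if value is not None:
            found[label] = value
    return found


if __name__ == "__main__":
    for label, value in collect_identifiers().items():
        print(f"{label}: {value}")
```

The `root` parameter exists only so the probe can be exercised against a fixture directory. Inside a sandbox it would be left at its default.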
Significance and Claims
The paper claims no overall ranking is possible or proposed. Instead, it provides a threat-model qualification matrix that allows operators to answer four specific sub-questions:
- Escape Resistance: Can the code escape to the host kernel?
- Reconnaissance Resistance: What can the code learn about the host?
- Hardening Compatibility: Can the operator add Linux hardening layers?
- Patch Propagation: Does the operator receive fixes promptly?
Key Conclusions:
- Trade-offs are unavoidable: The strongest isolation class (microVM) does not automatically correlate with the strongest residual-bug posture (fuzzing). Operators must choose between "strongest isolation" (microVMs) and "shallowest residuals" (gVisor).
- Product defaults matter: Engine-level strengths (e.g., Cloud Hypervisor's per-thread seccomp) can be negated by product-level defaults (e.g., arrakis's nested-KVM exposure or e2b's frozen pin).
- The "Unmeasured" Gap: The absence of CVEs and fuzzing in libkrun creates a risk profile that cannot be inferred as "safe" or "unsafe," only "unmeasured."
- Methodological Shift: The study moves beyond simple "replay" of CVEs to a meta-analysis of architectural properties, patch cadence, and fuzzing investment to describe the current state of AI sandbox security.
The paper serves as a baseline for engine-level measurement, identifying specific product-level configuration gaps (like arrakis's nested-KVM and daytona's Privileged: true hardcoding) that require immediate operator attention or upstream remediation.
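For the privileged-container gap, an operator-side check can be sketched in Python as below. It assumes the operator has captured the JSON array printed by `docker inspect` and relies on that output's `HostConfig.Privileged` and `Name` fields. The function name `privileged_containers` is hypothetical, and nothing here reflects how daytona itself sets the flag.

```python
import json


def privileged_containers(inspect_json: str):
    """Return names of containers whose HostConfig.Privileged is true.

    `inspect_json` is the JSON array printed by `docker inspect <ids...>`.
    Entries without a HostConfig section are treated as not privileged.
    """
    flagged = []
    for entry in json.loads(inspect_json):
        host_config = entry.get("HostConfig") or {}
        if host_config.get("Privileged") is True:
            flagged.append(entry.get("Name", "<unnamed>"))
    return flagged


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): docker inspect $(docker ps -q) | python check_privileged.py
    names = privileged_containers(sys.stdin.read())
    for name in names:
        print(f"privileged: {name}")
    sys.exit(1 if names else 0)
```

The non-zero exit status lets the check gate a deployment script without any output parsing.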