AgentCgroup: Understanding and Controlling OS Resources… — Plain-Language Explanation

Original authors: Yusheng Zheng, Jiakun Fan, Quanzhi Fu, Yiwei Yang, Wei Zhang, Andi Quinn

Published 2026-02-24

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Picture: AI Agents are "Chaotic Roommates"

Imagine you have a very smart, but slightly unpredictable roommate (the AI Agent) who lives in a shared apartment building (the Cloud Server). This roommate is hired to do complex chores like fixing a car engine or writing a novel.

To do their job, the roommate doesn't just sit and think; they constantly run out to the garage, the library, and the hardware store to grab tools, read manuals, and test parts. These trips are called "tool calls."

The problem? The apartment building manager (the Operating System) is trying to manage electricity and water for hundreds of these roommates at once. The manager assumes everyone uses resources steadily, like a lightbulb that stays on. But these AI roommates are wild: they might use almost nothing for 10 minutes, then suddenly turn on a massive industrial oven for 2 seconds, then go back to sleeping.

The paper argues that the current rules for managing these resources are broken, and the authors built a new system called AgentCgroup to fix it.

Part 1: What They Discovered (The "Aha!" Moment)

The researchers watched 144 different AI tasks and found four surprising things:

Most time is spent "getting ready," not thinking.
- Analogy: Imagine a chef who spends 10 minutes sharpening knives and preheating the oven, but only 2 minutes actually cooking the steak.
- Reality: 56% to 74% of the time an AI spends on a task is just setting up the environment or running tools. The actual "thinking" (LLM reasoning) is only a small chunk.
Memory is the bottleneck, not CPU.
- Analogy: It's not that the chef is too slow to chop vegetables (CPU); it's that the kitchen runs out of counter space (Memory) when they pull out a giant cutting board.
- Reality: AI agents don't need massive processing power, but they need huge amounts of temporary memory (RAM) in short bursts.
The "Spike" is wild.
- Analogy: Imagine a water pipe that usually drips a cup of water a day, but once a week, it suddenly gushes out 15 cups in one second, then stops.
- Reality: When an AI runs a specific tool (like testing code), its memory usage can jump 15.4 times higher than its average usage in just a second or two.
It's impossible to predict.
- Analogy: If you ask the chef to make a sandwich, sometimes they use a knife, sometimes they use a laser cutter, and sometimes they use a chainsaw. You can't guess which one they'll pick until they actually start.
- Reality: Even if you run the exact same task twice, the AI might take a completely different path, using totally different resources.

Part 2: Why Current Systems Fail (The "Mismatch")

The researchers compared AI agents to three other types of digital workers: Serverless (short, quick tasks), Microservices (steady, long-running tasks), and Batch Jobs (predictable, heavy lifting).

They found three major mismatches:

The Granularity Mismatch (The "Swing Door" Problem)
- Current System: The building manager sets a rule for the whole apartment: "You can use 100 gallons of water."
- The Problem: The AI needs 10 gallons 99% of the time, but 100 gallons for 1 second. If the manager sets the limit to 100, 90 gallons sit unused almost all the time. If the manager sets it to 10, the AI crashes (runs out of water) the moment it needs the big burst.
- Need: We need to control water usage for every single trip to the sink, not just the whole apartment.
The Responsiveness Mismatch (The "Slow Manager" Problem)
- Current System: The manager sees a water spike, runs to the control room, checks a log, and then turns off the valve. This takes seconds.
- The Problem: The AI's spike comes and goes within a second or two. By the time the manager reacts, the damage is done, or the spike is already over.
- Need: The manager needs to react instantly, like a reflex.
The Adaptability Mismatch (The "History Book" Problem)
- Current System: The manager looks at last week's data to guess this week's needs. "Last time you made a sandwich, you used a knife, so I'll give you a knife."
- The Problem: AI agents are non-deterministic. They might decide to use a chainsaw this time. Also, if the manager kills the agent for using too much water, the agent loses all its notes and has to start over from scratch, which is expensive and slow.
- Need: The system needs to talk to the agent while it's working, not just guess based on the past.

Part 3: The Solution: AgentCgroup

The authors built AgentCgroup, a new system that acts like a super-intelligent, real-time bouncer inside the computer's kernel (the core of the OS).

Here is how it works:

Micro-Management (Granularity):
Instead of giving the whole AI agent one big bucket of memory, AgentCgroup creates a tiny, temporary bucket for every single tool call.
- Analogy: Instead of giving the chef a whole warehouse of water, it gives them a cup for the sink, a bottle for the fridge, and a hose for the garden, managing each one separately.
Instant Reflexes (Responsiveness):
It uses a technology called eBPF (think of it as a super-fast, programmable security guard living inside the building's walls).
- Analogy: If the water pressure spikes, the guard doesn't run to the control room. They instantly clamp the pipe in microseconds, preventing a flood before the chef even notices.
Two-Way Conversation (Adaptability):
This is the cleverest part. AgentCgroup lets the AI "speak up" before it starts a task.
- Upward: The AI can say, "Hey, I'm about to run a heavy test, I need extra memory." The system grants it.
- Downward: If the AI tries to use too much, instead of killing it (which wipes its memory), the system gently slows it down and whispers, "Whoa, that's too heavy. Try a lighter approach." The AI can then adjust its strategy on the fly.

The Result

In a preliminary test, three agent tasks competed for less memory than they collectively needed. Without AgentCgroup, one low-priority task was killed for running out of memory ("OOM"), which would force it to restart and lose its progress. With AgentCgroup, all three tasks finished, and the worst-case (P95) memory allocation delay for the high-priority task dropped by 29%.

Summary

AgentCgroup is a new way to manage AI agents that treats them like dynamic, unpredictable workers rather than static machines. By managing resources at the level of individual "tool calls," reacting instantly inside the computer's core, and allowing the AI to adapt its behavior, it makes running AI agents in the cloud much more efficient and stable.

1. Problem Statement

AI coding agents (e.g., Claude Code, OpenHands, SWE-agent) are increasingly deployed in multi-tenant cloud environments. These agents operate in a "reason-then-act" loop, executing diverse tool calls (compilers, test runners, file editors) within sandboxed containers.

Current resource management strategies (designed for serverless, microservices, or batch workloads) fail to handle the unique dynamics of AI agents, leading to three critical mismatches:

Granularity Mismatch: Existing controls (like Kubernetes QoS or cgroup limits) operate at the container level, but agent resource demands fluctuate wildly at the tool-call level. Setting limits to the peak wastes massive resources (>90%), while setting them to the average causes Out-Of-Memory (OOM) kills during tool bursts.
Responsiveness Mismatch: Agent resource spikes are extremely fast (1–2 seconds) and unpredictable. User-space controllers (like systemd-oomd or Kubernetes VPA) react too slowly (milliseconds to minutes) to prevent OOMs or latency spikes.
Adaptability Mismatch: Traditional workloads are deterministic, allowing history-based prediction. AI agents are non-deterministic; the same task can yield different execution paths and resource demands across runs. Furthermore, killing an agent container to recover from an OOM is catastrophic because it destroys the accumulated LLM context, forcing a costly restart (31–48% of task time) with no guarantee of converging to the same solution.
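The granularity trade-off can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `waste_fraction` is made up for this example, and the two input figures (a 4 GB peak against a 264 MB average) are the ones quoted in the characterization findings later in this summary, with 4 GB taken as 4096 MB.

```python
def waste_fraction(limit_mb: float, average_mb: float) -> float:
    """Fraction of a static memory limit that sits idle on average."""
    if limit_mb <= 0:
        raise ValueError("limit_mb must be positive")
    return max(0.0, 1.0 - average_mb / limit_mb)


# Figures quoted in the findings: 4 GB peak vs. 264 MB average.
# Treating 4 GB as 4096 MB is an assumption of this sketch.
PEAK_MB = 4 * 1024
AVERAGE_MB = 264

# Option 1: set the container limit to the peak.
peak_waste = waste_fraction(PEAK_MB, AVERAGE_MB)
print(f"limit = peak:    {peak_waste:.1%} of the reservation is idle on average")

# Option 2: set the container limit to the average.
burst_exceeds_limit = PEAK_MB > AVERAGE_MB
print(f"limit = average: burst exceeds the limit -> {burst_exceeds_limit}")
```

Under these inputs, provisioning for the peak leaves well over 90% of the reservation idle, which matches the ">90%" figure above, while provisioning for the average guarantees that a full burst breaches the limit.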

2. Methodology

The authors conducted a systematic characterization of OS-level resource dynamics and designed a new kernel-level controller.

A. Workload Characterization

Setup: Analyzed 144 software engineering tasks from the SWE-rebench benchmark across two LLM backends: Claude Haiku 4.5 (cloud API) and GLM-4.7-Flash (local GPU).
Environment: Ran in isolated Podman containers on a 24-core, 128 GB RAM server (Linux 6.15.11).
Metrics: Collected 1-second interval data on CPU, memory, tool call types, and timestamps.

B. System Design: AgentCgroup

To address the identified mismatches, the authors proposed AgentCgroup, an intent-driven, eBPF-based resource controller.

Fine-Grained Domains: Uses a hierarchical cgroup v2 structure where the agent workload is the parent, and each tool call is a child cgroup. A transparent bash wrapper intercepts tool invocations to spawn ephemeral child cgroups dynamically.
In-Kernel Enforcement: Leverages eBPF for microsecond-level reaction times:
- CPU: Uses sched_ext to prioritize latency-sensitive tool calls.
- Memory: Uses memcg_bpf_ops to apply custom throttling delays when limits are breached, avoiding immediate OOM kills.
Intent-Driven Adaptation: Establishes a bidirectional protocol:
- Agent → System: The agent declares expected resource needs (e.g., memory:high for tests) via environment variables before a tool call.
- System → Agent: If a tool is throttled or killed, the system injects natural language feedback into stderr, allowing the agent to retry with a less resource-intensive strategy.
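To make the wrapper and the two-way protocol more concrete, here is a minimal user-space sketch in Python. It is not the authors' implementation (which uses a bash wrapper and eBPF programs). The environment variable name `AGENT_RESOURCE_INTENT`, the budget table, the cgroup naming scheme, and the wording of the feedback message are all assumptions made for illustration. The only real interface it refers to is the cgroup v2 `memory.high` file, and the sketch only plans the value to write rather than touching the filesystem.

```python
import re
import uuid
from pathlib import PurePosixPath
from typing import Mapping

# Hypothetical variable name; the source only says "environment variables".
INTENT_ENV_VAR = "AGENT_RESOURCE_INTENT"

# Hypothetical soft limits per intent level, in bytes (illustrative values only).
MIB = 1024 * 1024
MEMORY_HIGH_BYTES = {"low": 64 * MIB, "normal": 256 * MIB, "high": 1024 * MIB}


def parse_intent(env: Mapping[str, str]) -> str:
    """Return the declared memory level, falling back to "normal"."""
    resource, _, level = env.get(INTENT_ENV_VAR, "").partition(":")
    if resource == "memory" and level in MEMORY_HIGH_BYTES:
        return level
    return "normal"


def plan_child_cgroup(agent_cgroup: str, tool_name: str, env: Mapping[str, str]) -> dict:
    """Describe the ephemeral child cgroup a wrapper could create for one tool call."""
    safe_name = re.sub(r"[^A-Za-z0-9_.-]", "_", tool_name)
    child = PurePosixPath(agent_cgroup) / f"tool-{safe_name}-{uuid.uuid4().hex[:8]}"
    return {"path": str(child), "memory.high": MEMORY_HIGH_BYTES[parse_intent(env)]}


def feedback_message(tool_name: str, event: str, limit_bytes: int) -> str:
    """Format a natural-language hint to append to the tool's stderr."""
    return (
        f"[resource-controller] '{tool_name}' was {event} at a memory limit of "
        f"{limit_bytes // MIB} MiB. Consider a lighter strategy, such as running "
        f"a smaller subset of tests."
    )


# Agent -> System: the agent declares intent before invoking the tool.
plan = plan_child_cgroup("/sys/fs/cgroup/agent-1", "pytest", {INTENT_ENV_VAR: "memory:high"})
print(plan)

# System -> Agent: feedback the agent would read from stderr after throttling.
print(feedback_message("pytest", "throttled", plan["memory.high"]))
```

A real wrapper would additionally create the directory at `plan["path"]`, write the limit into its `memory.high` file, move the tool process into the child cgroup, and remove the cgroup when the tool exits. Those steps need appropriate privileges and are left out here.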

3. Key Contributions & Findings

A. Characterization Findings

OS Overhead Dominance: OS-level execution (tool calls + initialization) accounts for 56–74% of end-to-end task latency, while LLM reasoning accounts for only 26–44%.
Memory is the Bottleneck: Memory, not CPU, limits multi-tenant concurrency. While CPU utilization is low (<13% avg), memory peaks can reach 2–4 GB.
Two-Layer Structure: Memory usage consists of a stable ~185 MB framework baseline plus tool-call-driven bursts.
Extreme Volatility:
- Peak-to-Average Ratio: Up to 15.4× (e.g., 4 GB peak vs. 264 MB average).
- Unpredictability: Resource demands vary 20× across different tasks and 1.8× across runs of the same task.
- Burst-Silence Pattern: 98.5% of memory bursts occur during tool calls, which occupy only ~50% of the time.
- Retry Loops: 85–97% of tasks contain retry loops that cause progressive memory accumulation without cleanup.
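The volatility metrics above can be computed from the kind of 1-second samples the authors collected. The sketch below shows one plausible way to do so. The trace is synthetic and made up for illustration, and the burst definition (a sample above twice the framework baseline) is an assumption of this example, not the paper's definition.

```python
from statistics import mean
from typing import Sequence


def peak_to_average(samples_mb: Sequence[float]) -> float:
    """Ratio of the highest memory sample to the mean of all samples."""
    if not samples_mb:
        raise ValueError("need at least one sample")
    return max(samples_mb) / mean(samples_mb)


def burst_fraction_in_tools(
    samples_mb: Sequence[float],
    in_tool_call: Sequence[bool],
    baseline_mb: float,
    burst_factor: float = 2.0,  # assumed threshold: a burst is > 2x the baseline
) -> float:
    """Share of burst samples that fall inside a tool call."""
    bursts = [
        in_tool
        for mb, in_tool in zip(samples_mb, in_tool_call)
        if mb > burst_factor * baseline_mb
    ]
    if not bursts:
        return 0.0
    return sum(bursts) / len(bursts)


# Synthetic 1-second trace (MB): a ~185 MB baseline with one short tool-call burst.
trace_mb = [185, 185, 185, 185, 1500, 2900, 185, 185, 185, 185]
tool_active = [False, False, False, False, True, True, False, False, False, False]

ratio = peak_to_average(trace_mb)
share = burst_fraction_in_tools(trace_mb, tool_active, baseline_mb=185)
print(f"peak-to-average ratio: {ratio:.2f}")
print(f"share of bursts inside tool calls: {share:.0%}")
```

Even this short synthetic trace shows the two-layer shape described above: a flat baseline that dominates the average, and a brief tool-driven spike that dominates the peak.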

B. Mismatch Analysis

The paper quantitatively demonstrates that existing solutions (Serverless, Microservices, Batch) cannot handle AI agents due to:

Granularity: Container-level policies cannot distinguish between a lightweight git status (13.5 MB) and a heavy pytest run (518 MB).
Responsiveness: User-space reaction times are too slow for 1–2 second bursts.
Adaptability: History-based prediction fails due to non-determinism, and "kill-and-restart" is too expensive due to context loss and long cold starts.

4. Results (Preliminary Evaluation)

The authors evaluated AgentCgroup by replaying real agent traces in a multi-tenant setting with 50× speed acceleration.

Scenario: 1 High-priority task vs. 2 Low-priority tasks under tight memory constraints (1100 MB total capacity for ~1233 MB demand).
Baseline (No Isolation): The system triggered an OOM kill on one Low-priority process, resulting in a 66% survival rate (2 of 3 processes).
AgentCgroup (BPF):
- Survival Rate: 100% (all processes completed).
- Mechanism: The system throttled Low-priority allocations (239 delay triggers) while protecting the High-priority task.
- Latency: Reduced High-priority P95 allocation latency by 29% (71.0 ms → 50.1 ms).
- Overhead: Negligible; P50 latency increased by only 0.3%, and total completion time decreased by 1.1%.
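The headline numbers in this evaluation follow directly from the raw figures reported above. The short check below recomputes them. The helper name `percent_change` is made up for this example, and all inputs are taken from the scenario description and results list.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# High-priority P95 allocation latency: 71.0 ms -> 50.1 ms.
p95_change = percent_change(71.0, 50.1)
print(f"P95 latency change: {p95_change:.1f}%")

# Memory pressure: ~1233 MB of demand against 1100 MB of capacity.
overcommit_pct = (1233 / 1100 - 1.0) * 100.0
print(f"demand exceeds capacity by {overcommit_pct:.1f}%")

# Baseline survival: one of three processes was OOM-killed.
baseline_survival = 2 / 3
print(f"baseline survival: {baseline_survival:.1%}")
```

The latency change works out to roughly a 29% reduction, and the scenario oversubscribes memory by about 12%, which is enough to force an OOM kill in the baseline but can be absorbed by throttling low-priority allocations.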

5. Significance

New Workload Class: This paper establishes the first systematic characterization of AI agent workloads, revealing they are fundamentally different from traditional cloud workloads.
Kernel-Level Solution: It demonstrates that eBPF and sched_ext/memcg_bpf_ops are viable for managing highly dynamic, non-deterministic AI workloads, moving control from user-space daemons to the kernel.
Intent-Driven Paradigm: It introduces a novel control loop where the AI agent actively participates in resource management by declaring intent and adapting to feedback, rather than being a passive consumer of resources.
Open Source: The prototype is open-sourced, providing a foundation for future research in OS-level AI agent orchestration.

Conclusion: AgentCgroup addresses the resource management challenges of AI agents by aligning control granularity with tool calls, reacting at kernel speeds, and leveraging the agent's ability to adapt its own execution strategy. The preliminary evaluation suggests that this improves multi-tenant isolation and reduces resource waste.

AgentCgroup: Understanding and Controlling OS Resources of AI Agents