Imagine you are running a massive, high-speed restaurant called "The AI Chef."
In this restaurant, the "Head Chef" is a super-smart AI (a Large Language Model) that tries to solve complex problems, like writing code or finding deep answers on the internet. To do this, the Chef doesn't just think; it needs to act. It needs to open a window to check the weather, use a calculator, or call a delivery service. These are the "external resources" (CPUs, GPUs, APIs).
The Problem: The "All-You-Can-Eat" Buffet Disaster
In the old way of running this restaurant (the "Existing Frameworks"), the management was incredibly wasteful.
Imagine that every time the Chef decided to cook a single dish (a "trajectory"), the kitchen manager would immediately reserve a whole private dining room, a dedicated team of 10 sous-chefs, and a full stock of ingredients just for that one dish.
- The Reality: The Chef only actually uses that private room for 10 minutes out of an hour. For the other 50 minutes, the room sits empty, the sous-chefs stand around doing nothing, and the ingredients sit on the shelf.
- The Result: The restaurant runs out of space and money very quickly. If a new order comes in, there's no room for it, so the new orders sit in a long line (queue), and the whole kitchen slows down. This is called "Over-provisioning."
The Solution: ARL-Tangram (The "Smart Kitchen Manager")
The paper introduces a new system called ARL-Tangram. Think of it as a genius kitchen manager who changes the rules from "Reserve a whole room" to "Reserve a single tool for exactly as long as you need it."
Here is how it works, using simple analogies:
1. The "Action-Level" Switch (Tangram Pieces)
Instead of thinking in terms of "whole meals" (trajectories), ARL-Tangram breaks everything down into tiny, atomic actions.
- Old Way: "I need a kitchen for the whole hour."
- ARL-Tangram Way: "I need a knife for 5 seconds, then a stove for 10 seconds, then a phone for 2 seconds."
It treats every tiny step as a separate request. This allows the system to take a tool back the second the Chef is done with it and hand it to the next person immediately. It's like a Tangram puzzle: you can rearrange the pieces (resources) instantly to fit whatever shape (task) is needed right now, rather than having fixed, rigid boxes.
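The paper itself doesn't include code, but the difference between "reserve for the whole meal" and "reserve per action" can be sketched in a few lines of Python. Everything here (the `Action` class, the cost functions, the numbers) is illustrative, not taken from ARL-Tangram's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Action:
    resource: str    # e.g. "gpu", "cpu", "api"
    duration: float  # seconds the resource is actually in use

def trajectory_level_cost(actions, trajectory_span):
    # Old way: every resource the trajectory ever touches is held
    # for the entire trajectory, whether it is in use or not.
    resources = {a.resource for a in actions}
    return len(resources) * trajectory_span

def action_level_cost(actions):
    # Action-level way: each resource is held only while its action runs.
    return sum(a.duration for a in actions)

actions = [Action("gpu", 10), Action("api", 2), Action("cpu", 5)]
span = 60  # the whole trajectory takes a minute end to end

print(trajectory_level_cost(actions, span))  # 3 resources * 60s = 180 resource-seconds
print(action_level_cost(actions))            # 10 + 2 + 5 = 17 resource-seconds
```

Same work gets done in both cases; the action-level accounting simply stops billing for the empty private room.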
2. Elastic Scheduling (The "Dynamic Bus")
Imagine a bus system.
- Old Way: You run a bus with 50 seats every 10 minutes, even if only 2 people show up. You waste fuel (money) and space.
- ARL-Tangram: It watches the crowd. If 2 people show up, it sends a small van. If 50 people show up, it instantly adds more buses.
- The Magic: If a specific task (like checking a reward) can be done faster by using more power, ARL-Tangram says, "Okay, let's give that task 4 GPUs instead of 1," to finish it in half the time. If the task is done, it takes the GPUs away immediately. This is called Elasticity.
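A toy version of that "dynamic bus" policy fits in one function. This is a deliberately simplified sketch of the idea of elasticity, with made-up names and a made-up cap; the real scheduler's policy is surely more sophisticated:

```python
def elastic_allocation(pending_tasks, pool_size, gpus_per_task=1):
    """Toy elastic policy: match the allocation to current demand,
    capped by what the shared pool actually has. Two riders get a
    van; a crowd gets every bus we own."""
    wanted = pending_tasks * gpus_per_task
    return min(wanted, pool_size)

print(elastic_allocation(2, pool_size=8))   # light load  -> 2 GPUs
print(elastic_allocation(50, pool_size=8))  # heavy load  -> all 8 GPUs
```

The key property is that the allocation is recomputed as demand changes, and anything granted above the current need is handed back the moment the task finishes.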
3. The "Breakdown & Pool" Strategy
The system has two main tricks:
- Breakdown: It stops locking resources for the whole "meal." It unlocks them the moment the "action" is done.
- Pool: It keeps a giant, shared pool of all resources (CPUs, GPUs, APIs). When a Chef needs a tool, it grabs it from the pool. When done, it puts it back. This means resources are never sitting idle; they are constantly being reused by different chefs.
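The "grab it, use it, put it back" cycle is the classic shared-pool pattern. Here is a minimal, hypothetical sketch of it in Python using a condition variable; the class name and API are invented for illustration and are not ARL-Tangram's interface:

```python
import threading
from contextlib import contextmanager

class ResourcePool:
    """Toy shared pool: borrow a tool for one action, return it immediately."""

    def __init__(self, resources):
        self._free = list(resources)
        self._available = threading.Condition()

    @contextmanager
    def borrow(self):
        with self._available:
            while not self._free:
                self._available.wait()   # block until another chef returns a tool
            res = self._free.pop()
        try:
            yield res                    # the "action" runs while we hold the tool
        finally:
            with self._available:
                self._free.append(res)   # back in the pool the instant we're done
                self._available.notify()

pool = ResourcePool(["gpu-0", "gpu-1"])
with pool.borrow() as gpu:
    print(f"running one action on {gpu}")
# here the GPU is already back in the pool, ready for the next chef
```

Because the resource is released at the end of each action rather than at the end of the whole trajectory, nothing sits idle between actions.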
The Results: Why It Matters
The paper tested this system on real-world AI tasks (like coding and deep searching) and found amazing results:
- Speed: The AI finished its training steps 1.5 times faster. It's like the restaurant serving meals 50% faster without hiring more staff.
- Efficiency: The time to complete a single action was cut by a factor of 4.3.
- Cost Savings: They saved 71% of the external resources. Imagine running your entire restaurant on less than a third of the electricity and staff you used before, while still serving more customers.
The Bottom Line
ARL-Tangram is a smart resource manager that stops AI systems from hoarding expensive computer power. Instead of letting resources sit idle in empty rooms, it treats them like a shared pool of tools, handing them out and taking them back in the blink of an eye. This makes AI training faster, cheaper, and much more efficient, allowing companies to build smarter AI without breaking the bank.