CUCo: An Agentic Framework for Compute and Communication Co-design

Imagine you are the conductor of a massive, high-speed orchestra playing in a stadium with thousands of musicians (GPUs) spread across different buildings (data centers). The goal is to perform a complex symphony (training a giant AI model) as fast as possible.

The Problem: The "Conductor" Bottleneck
In the old days, the conductor (the CPU) had to stand in the middle, shouting instructions to every section.

He would tell the strings (Computation) to play a note.
Then, he would stop, run over to the brass section (Communication), tell them to pass a sheet of music to the next building, wait for them to finish, and then tell the strings to play the next note.

This "stop-and-go" approach is slow. The strings sit idle while the conductor runs back and forth. Even though the musicians are incredibly fast, the whole orchestra is held back by the conductor's running time.

Recently, technology improved: the musicians were given walkie-talkies (device-side APIs such as NVSHMEM and NCCL). Now the strings can talk directly to the brass without the conductor running back and forth. But here's the catch: writing the script for how the musicians should talk to each other while playing is incredibly hard. It's like asking a composer to write a score in which the violinists must simultaneously play a solo, pass a note to a friend in a different building, and listen for a signal to start the next measure, all without getting confused or crashing the performance. Doing this by hand means endless trial and error, and mistakes are easy to make.

The Solution: CUCo (The AI Composer)
Enter CUCo. Think of CUCo as a super-smart, tireless AI composer that doesn't just write the music; it learns how to write the perfect script for this specific orchestra, in this specific stadium, with these specific walkie-talkies.

CUCo works in two distinct phases, like a student first learning the rules and then practicing to become a virtuoso:

Phase 1: The "Fast-Path" (The Safety First Student)

Imagine a student who is told, "Write a script that works, even if it's not the fastest."

The Goal: Correctness.
How it works: The AI writes a script where the musicians play, then stop, then pass notes, then play again. It's a bit clunky, but it guarantees that no one crashes the show. It creates a "safe baseline."
Why it matters: If you try to jump straight to the most complex, high-speed script, you'll likely get a syntax error or a deadlock (everyone waiting for someone who never speaks). This phase ensures the foundation is solid.

Phase 2: The "Slow-Path" (The Evolutionary Virtuoso)

Now, the AI takes that "safe but slow" script and starts an evolutionary experiment.

The Goal: Speed and efficiency.
How it works: The AI creates hundreds of slightly different versions of the script.
- Version A: "Let's pass the notes while playing the first half of the song."
- Version B: "Let's have the violins pass notes to the building next door while the cellos play."
- Version C: "Let's try a different way of signaling."
The Test: It runs these scripts on the actual hardware.
- If a script crashes? It's thrown in the trash (but the AI learns why it crashed so it doesn't make that mistake again).
- If a script is slow? It gets a low score.
- If a script is fast? It gets to "breed" with other good scripts to make even better ones.
The Result: After many generations of this "survival of the fittest," the AI converges on a script that is tuned for that specific hardware and network. It can find tricks that human engineers might not think to try, like hiding the time it takes to pass notes inside the time it takes to play the music.

The Magic Analogy: The Relay Race

Old Way: The runner (GPU) runs a lap, stops at the wall, waits for the coach (CPU) to hand them the baton, then runs again.
CUCo Way: The runner is handed a baton that they can grab while they are still running. They don't stop. They pass the baton to the next runner in the next lane while they are sprinting. CUCo figures out the exact angle, speed, and timing to make this happen without dropping the baton.

Why This Matters

The paper shows that CUCo can make these distributed AI workloads up to 1.57 times faster than baseline implementations that do not fuse computation and communication.

No Training Required: The AI doesn't need to be "taught" by thousands of examples. It reasons through the problem using the rules of the hardware.
Adaptable: If you change the stadium (different GPUs) or the walkie-talkies (different network), CUCo re-learns the best strategy automatically.
Open Source: The creators are giving this "AI Composer" to the world so everyone can build faster AI.

In short, CUCo is an automated system that takes the error-prone task of writing fast, complex code for AI clusters and turns it into a structured, evolutionary search, finding solutions that are faster than the host-driven baselines it was tested against.

1. Problem Statement

In large-scale distributed LLM training and inference, maximizing GPU utilization requires tightly overlapping computation and communication. While modern hardware (e.g., NVLink, InfiniBand) and APIs (e.g., NCCL Device-Side, NVSHMEM) allow GPUs to initiate communication directly without CPU intervention, manually writing high-performance kernels that fuse these two domains remains labor-intensive, error-prone, and intractable.

Key challenges include:

Complex Design Space: Decisions regarding communication placement, synchronization scope, chunking granularity, and backend selection (GIN for cross-node RDMA vs. LSA for intra-node direct memory access) interact non-trivially. Optimal configurations are highly sensitive to specific hardware topologies and workloads.
Hardware Heterogeneity: Variations in network bandwidth, latency, and topology (e.g., Fat-Tree vs. Torus) mean a kernel optimized for one setup may fail or underperform on another.
Limitations of Existing Tools: Traditional ML compilers assume communication is a host-side side effect, preventing them from optimizing within the kernel. Existing LLM-based kernel generators lack support for multi-GPU collective semantics and often fail to produce correct distributed code without extensive manual tuning.

2. Methodology: The CUCo Framework

CUCo is a training-free, agent-driven workflow designed to automatically generate fused compute-communication CUDA kernels. It employs a "two-path" agentic architecture to balance correctness and performance.

A. Structured Design Space Specification

To ground the agents' reasoning, CUCo defines a formal optimization configuration space ( $C$ ) comprising five dimensions:

Backend ( $B$ ): Choice of communication primitive (e.g., GIN for cross-node RDMA, LSA for intra-node direct memory access).
Issuer ( $I$ ): Granularity of thread participation (Thread, Warp, or CTA).
Placement ( $P$ ): Aggressiveness of overlap vs. ordering complexity.
Sync Scope ( $S$ ): Tightness of synchronization vs. overlap potential.
Granularity ( $G$ ): Transfer chunk size vs. synchronization overhead.

Agents must emit an explicit Optimization Directive based on these dimensions before generating code, ensuring decisions are traceable and valid.
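To make the five dimensions concrete, here is a minimal Python sketch of what an Optimization Directive record and a validity check might look like. It is an illustration, not CUCo's actual data structure: the `Directive` class, the `problems` method, and the value sets for placement and sync scope are hypothetical. Only the backend names (GIN, LSA) and issuer levels (Thread, Warp, CTA) come from the description above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Backend(Enum):
    GIN = "gin"  # described above as the cross-node RDMA primitive
    LSA = "lsa"  # described above as intra-node direct memory access


class Issuer(Enum):
    THREAD = "thread"
    WARP = "warp"
    CTA = "cta"


# Hypothetical value sets: the text names these dimensions but not their values.
PLACEMENTS = {"after_compute", "per_tile", "fully_overlapped"}
SYNC_SCOPES = {"warp", "cta", "grid"}


@dataclass(frozen=True)
class Directive:
    backend: Backend
    issuer: Issuer
    placement: str
    sync_scope: str
    chunk_bytes: int

    def problems(self, cross_node: bool) -> List[str]:
        """Return human-readable reasons this directive is invalid (empty if fine)."""
        issues = []
        if cross_node and self.backend is Backend.LSA:
            issues.append("LSA is described as intra-node only")
        if self.placement not in PLACEMENTS:
            issues.append(f"unknown placement: {self.placement}")
        if self.sync_scope not in SYNC_SCOPES:
            issues.append(f"unknown sync scope: {self.sync_scope}")
        if self.chunk_bytes <= 0:
            issues.append("chunk size must be positive")
        return issues


ok = Directive(Backend.GIN, Issuer.WARP, "per_tile", "cta", 65536)
bad = Directive(Backend.LSA, Issuer.CTA, "per_tile", "cta", 0)
print(ok.problems(cross_node=True))   # []
print(bad.problems(cross_node=True))  # two issues: backend and chunk size
```

Checking the directive before any code is generated is what would make a decision traceable: a rejected candidate can be tied back to the specific dimension that was invalid.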

B. The Fast-Path Agent (Correctness-First)

The fast-path agent's sole goal is to produce a functionally correct device-initiated kernel from a host-driven baseline.

Workflow:
1. Analysis: Extracts communication dependency graphs from user code.
2. Transformation: Rewrites host-driven NCCL calls into device-initiated equivalents (e.g., replacing ncclSend with gin.put). It uses an LLM-Judge loop to iteratively verify compilation and correctness.
3. Annotation: Marks specific code regions as EVOLVE-BLOCK to define safe mutation zones for the next stage.
Strategy: It prioritizes conservative, barrier-delimited phases to ensure distributed safety (preventing deadlocks or inconsistent states), creating a reliable "seed" for optimization.
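The rewrite-and-verify loop in the workflow above can be sketched as follows. This is a toy illustration under stated assumptions: `fast_path`, `annotate`, `toy_rewrite`, and `toy_judge` are hypothetical names, the rewriter and judge are plain functions standing in for LLM calls and real compile/run checks, the kernel is a one-line string, and the comment-style EVOLVE-BLOCK markers are a guessed syntax.

```python
from typing import Callable, Optional, Tuple


def annotate(kernel_src: str) -> str:
    """Wrap the kernel in EVOLVE-BLOCK markers (marker syntax is an assumption)."""
    return f"// EVOLVE-BLOCK-START\n{kernel_src}\n// EVOLVE-BLOCK-END"


def fast_path(
    baseline: str,
    rewrite: Callable[[str, str], str],
    judge: Callable[[str], Tuple[bool, str]],
    max_iters: int = 5,
) -> Optional[str]:
    """Rewrite the baseline until the judge accepts it, feeding failures back."""
    feedback = ""
    for _ in range(max_iters):
        candidate = rewrite(baseline, feedback)
        ok, feedback = judge(candidate)
        if ok:
            return annotate(candidate)
    return None  # no correct kernel within the iteration budget


def toy_rewrite(src: str, feedback: str) -> str:
    # Stand-in for the LLM: it only fixes the code once it has seen feedback.
    return src.replace("ncclSend", "gin.put") if feedback else src


def toy_judge(src: str) -> Tuple[bool, str]:
    # Stand-in for compile and correctness checks.
    if "ncclSend" in src:
        return False, "host-side ncclSend still present"
    return True, ""


seed = fast_path("ncclSend(buf);", toy_rewrite, toy_judge)
print(seed)
```

In this toy run the first attempt is rejected, the judge's message is passed back, and the second attempt is accepted and annotated as the seed for the next stage.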

C. The Slow-Path Agent (Performance-Driven)

The slow-path agent performs an LLM-driven evolutionary search to optimize the seed kernel for specific workloads and hardware.

Multi-Island Population: Maintains $K$ independent evolutionary islands to prevent premature convergence. Islands periodically migrate top-performing candidates to cross-pollinate strategies.
LLM Mutation Operator: Instead of random code edits, the LLM acts as a structured variation operator. It proposes mutations (diffs, full rewrites, or cross-pollination) strictly within the annotated EVOLVE-BLOCK regions, guided by:
- Parent code and metrics.
- Archive of diverse high-performing solutions (MAP-Elites).
- Meta-summarized trends from previous generations.
Two-Phase Search Policy:
- Explore Phase: High-temperature mutations to discover diverse structural strategies (e.g., split put/wait pipelines, fused kernels).
- Exploit Phase: Low-temperature localized diffs to refine the best candidates.
Cascade Evaluation: A three-level filter ( $\ell_1 \to \ell_2 \to \ell_3$ ) screens candidates:
1. Compilation check.
2. Numerical correctness verification (via mpirun).
3. Performance benchmarking (latency/throughput).
- Feedback from failures is fed back to the LLM to guide future mutations.
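The pieces of the slow path (islands with migration, an explore phase followed by an exploit phase, and a cascade of increasingly expensive checks) can be sketched in a few dozen lines of Python. Everything here is a simplified stand-in, not CUCo's implementation: a candidate is reduced to a single hypothetical `chunk` value, `mutate` replaces the LLM, the three cascade levels are toy predicates with a synthetic cost model instead of nvcc, mpirun, and real benchmarks, and the MAP-Elites archive and meta-summaries are omitted.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    chunk: int  # toy stand-in for a whole kernel variant


def latency_ms(c: Candidate) -> float:
    # Level 3 stand-in: synthetic cost model, minimised at chunk == 64.
    return 512 / c.chunk + c.chunk / 8


def cascade(c: Candidate) -> Tuple[Optional[float], str]:
    """Run cheap checks first; only survivors reach the benchmark."""
    if c.chunk <= 0:              # level 1: stand-in for compilation
        return None, "compile error"
    if 1024 % c.chunk != 0:       # level 2: stand-in for numerical correctness
        return None, "wrong result"
    return latency_ms(c), "ok"    # level 3: performance


def mutate(parent: Candidate, explore: bool, rng: random.Random) -> Candidate:
    if explore:  # high temperature: large jumps, some of them invalid
        return Candidate(rng.choice([0, 3, 8, 16, 48, 256, 512, 1024]))
    # low temperature: small local edit around the parent
    return Candidate(parent.chunk * 2 if rng.random() < 0.5 else parent.chunk // 2)


def evolve(seed: Candidate, islands: int = 3, generations: int = 12,
           explore_gens: int = 6, migrate_every: int = 4, rng_seed: int = 0):
    rng = random.Random(rng_seed)
    best: List[Candidate] = [seed] * islands   # one elite per island
    failures: List[Tuple[Candidate, str]] = []  # feedback for future mutations
    for gen in range(generations):
        explore = gen < explore_gens
        for i in range(islands):
            child = mutate(best[i], explore, rng)
            score, status = cascade(child)
            if score is None:
                failures.append((child, status))
            elif score < latency_ms(best[i]):
                best[i] = child
        if (gen + 1) % migrate_every == 0:
            # Migration: copy the overall champion onto the weakest island.
            champion = min(best, key=latency_ms)
            worst = max(range(islands), key=lambda i: latency_ms(best[i]))
            best[worst] = champion
    return min(best, key=latency_ms), failures


seed = Candidate(1024)  # a correct but slow starting point
winner, failures = evolve(seed)
print(winner, latency_ms(winner), len(failures))
```

The ordering of the cascade is the main design point: compile and correctness failures are rejected before any benchmark time is spent, and the recorded failure reasons are what a real system would hand back to the LLM.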

3. Key Contributions

CUCo Framework: The first agentic system capable of co-designing compute and communication kernels for distributed ML without requiring model training.
Structured Design Space: A declarative interface that grounds agent reasoning in valid collective semantics, reducing hallucinations of invalid synchronization patterns.
Two-Path Architecture: A novel separation of concerns where a "Fast-Path" guarantees correctness and a "Slow-Path" drives performance, solving the "correctness vs. optimization" trade-off in distributed code generation.
Hardware-Aware Evolution: The system dynamically adapts to specific hardware topologies (e.g., NVLink vs. RoCE) and workload characteristics (e.g., token imbalance in MoE) through empirical feedback loops.

4. Experimental Results

CUCo was evaluated on four representative multi-GPU workloads:

Flash Attention with Context Parallelism (4 GPUs, NVLink)
DeepSeek-V3 MoE Dispatch/Combine (2 GPUs, inter-node RoCE)
KV-Cache Transfer (2 GPUs, NVLink)
GEMM + AllGather (4 GPUs, mixed links)

Performance Gains:

Latency Reduction: CUCo consistently reduced end-to-end latency by 5.3% to 26.2% compared to host-driven baselines.
Speedup: Achieved up to 1.57× speedup over non-fused baselines.
Case Study (Flash Attention): The fused kernel eliminated pipeline bubbles and host-side API overhead, reducing latency from 1230.7ms to 1091.7ms (11.3% reduction). Key gains came from per-tile pipelining, which was impossible in the host baseline due to SM saturation.
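Latency reduction and speedup are two views of the same measurement, and the Flash Attention figures above can be checked with a few lines of arithmetic. The helper names below are illustrative only.

```python
def latency_reduction(before_ms: float, after_ms: float) -> float:
    """Fraction of the baseline latency that was removed."""
    return (before_ms - after_ms) / before_ms


def speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster the new kernel is than the baseline."""
    return before_ms / after_ms


red = latency_reduction(1230.7, 1091.7)
spd = speedup(1230.7, 1091.7)
print(f"{red:.1%} reduction, {spd:.2f}x speedup")  # 11.3% reduction, 1.13x speedup
```

The two are related by speedup = 1 / (1 - reduction), so an 11.3% latency reduction corresponds to roughly a 1.13× speedup.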

Ablation Studies:

Fast-Path Necessity: Skipping the fast-path agent lowered peak performance by 2.5% and required 5 generations to produce a correct kernel (vs. 2 iterations with the fast path), wasting 25% of the search budget on compilation errors.
Two-Phase Search: The Explore-then-Exploit strategy outperformed single-phase exploitation, achieving a 4.3% higher final score and reaching near-optimal performance 3x faster.

5. Significance

CUCo represents a fundamental shift in how distributed machine learning systems are optimized. By automating the co-design of computation and communication, it unlocks performance opportunities that are currently inaccessible to manual engineering or existing compiler tools.

Scalability: It addresses the growing complexity of distributed training where communication overhead is becoming a bottleneck.
Automation: It removes the need for domain experts to manually tune kernels for every new hardware generation or network topology.
Future-Proofing: As device-initiated communication becomes the standard for distributed AI, frameworks like CUCo provide the necessary infrastructure to harness these capabilities efficiently.

The authors plan to fully open-source CUCo to accelerate research in compute-communication co-design.

