ParEVO: Synthesizing Code for Irregular Data: High-Performance Parallelism through Agentic Evolution

The Big Problem: The "Traffic Jam" of Computers

Imagine you have a massive library of books (data) that needs to be organized.

The Old Way (Sequential): One librarian stands at a desk and sorts the books one by one. It's slow, but they never make mistakes because they are the only one touching the books.
The New Way (Parallel): You hire 64 librarians to sort the books at the same time. This should be 64 times faster, right?

The Catch: If the books are arranged in a messy, unpredictable pile (what the paper calls "Irregular Data"), chaos ensues.

Librarian A tries to grab a book while Librarian B is holding it.
They argue over who gets to write on the catalog card.
They accidentally delete each other's work.
The whole team freezes, each librarian waiting for someone else to let go of a book (a "deadlock"), and the job takes longer than if one person did it alone.

This is the nightmare of Parallel Computing. It's easy when the data is neat (like a grid of numbers), but when the data is messy (like a social network graph or a complex map), it's incredibly hard to get a team of processor cores to work together without corrupting results or grinding to a halt.
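To make "deleting each other's work" concrete, here is a minimal sketch of a lost update. It is not taken from the paper: the two workers and the step names are made up for illustration, and the interleaving is replayed by hand so the outcome is the same on every run.

```python
# Deterministic simulation of a lost update: two workers each perform
# read -> add 1 -> write on a shared counter, with the interleaving spelled out.

def run(schedule):
    """Replay a schedule of (worker, step) pairs against a shared counter."""
    shared = {"count": 0}
    local = {}  # each worker's private copy of the value it last read
    for worker, step in schedule:
        if step == "read":
            local[worker] = shared["count"]
        elif step == "write":
            shared["count"] = local[worker] + 1
    return shared["count"]

# Safe: worker A finishes before worker B starts.
serial = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]

# Racy: both workers read the old value before either one writes.
interleaved = [("A", "read"), ("B", "read"), ("A", "write"), ("B", "write")]

serial_result = run(serial)            # both increments survive
interleaved_result = run(interleaved)  # one increment is overwritten
print(serial_result, interleaved_result)
```

In the racy schedule both workers write back "old value plus one", so one increment silently disappears. Real races are harder to spot because the bad interleaving only happens occasionally.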

The Current AI Failure: The "Overconfident Intern"

Recently, we've had AI models (Large Language Models) that are great at writing code. But when you ask them to manage this chaotic team of 64 librarians, they fail spectacularly.

They write code that looks correct but has hidden traps (race conditions).
They try to use simple tools that don't work for complex jobs.
It's like asking a brilliant intern to manage a construction site; they might know the theory, but they don't know how to handle the real-world chaos of workers bumping into each other.

The Solution: ParEVO (The "Evolutionary Coach")

The authors created ParEVO, a system that doesn't just ask the AI to "write code." Instead, it treats code generation like evolution and coaching.

Think of ParEVO as a three-step process to train a champion coder:

1. The Training Camp (The "Parlay-Instruct" Corpus)

Before the AI can compete, it needs to learn the rules of the game.

The Problem: Most AI is trained on messy, broken code found on the internet.
The Fix: The researchers built a special "textbook" called Parlay-Instruct. They didn't just scrape the web; they used a "Teacher-Student-Critic" system to generate 13,820 verified examples of parallel code, each of which had to compile and pass its tests.
The Analogy: Imagine a driving school where every student is tested on a real track with a safety car. If they crash, they are immediately ejected from the class. Only the drivers who pass the test get to graduate. This ensures the AI learns only safe, working patterns.

2. The Specialized Coaches (Fine-Tuned Models)

The researchers took powerful AI models (like DeepSeek and Qwen) and gave them a crash course in ParlayLib.

What is ParlayLib? It is a library of high-level building blocks for parallel computing. Instead of telling the AI, "Go grab a hammer and nail this board yourself" (low-level threading, which is dangerous), ParlayLib says, "Use this pre-made 'Nail-Gun' tool."
The Result: The AI learns to use these safe, pre-built tools. It stops trying to reinvent the wheel and starts using the wheel that's already been engineered to not fall off.
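As a rough picture of what such pre-built tools look like, the sketch below shows three primitives that the technical summary names later (reduce, scan, filter). ParlayLib itself is not a Python library, so these are sequential Python stand-ins that only illustrate what each tool computes, not how ParlayLib runs it in parallel. The input list is made up.

```python
from functools import reduce
from itertools import accumulate
import operator

values = [3, 1, 4, 1, 5, 9, 2, 6]

# reduce: combine every element into a single value.
total = reduce(operator.add, values, 0)

# scan: running totals. This version is inclusive; a library may instead
# define scan as exclusive (each slot holds the sum of the elements before it).
running = list(accumulate(values, operator.add))

# filter: keep only the elements that pass a test.
evens = list(filter(lambda x: x % 2 == 0, values))

print(total, running, evens)
```

The point of the "Nail-Gun" is that the caller states what to compute and the library decides how to split the work safely across cores.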

3. The "Survival of the Fittest" Agent (The Evolutionary Coding Agent)

This is the secret sauce. The AI doesn't just write code once and hope for the best. It uses an Evolutionary Agent.

How it works:
1. The AI generates 10 different versions of a solution.
2. It runs them on a real computer.
3. The Judge: A compiler and a "race detector" (a tool that finds cases where two threads touch the same data unsafely) act as the judges. If a candidate fails to compile, crashes, or has a race condition, it gets a score of 0.
4. The Mutation: The AI takes the best-performing code, mixes it with other ideas, and tries again.
5. The Loop: It repeats this 30 times, slowly "evolving" the code until it is not only correct but incredibly fast.
The Analogy: Imagine a chef trying to make the perfect soup.
- Normal AI: Writes a recipe, cooks it, serves it. If it's salty, they try again with a different recipe.
- ParEVO: Writes 10 recipes. Tastes them. Discards the burnt ones. Takes the best one, adds a pinch of salt, tastes it again. Takes that one, adds a pinch of pepper. After 30 rounds of tasting and tweaking, the soup is perfect.
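The generate, judge, mutate, repeat cycle can be sketched on a toy problem. This is only an illustration under loud assumptions: the "recipe" is a list of integers, the judge is a simple distance to a made-up target, and a random tweak stands in for the LLM that writes the next generation in ParEVO. Only the population size of 10 and the 30 rounds come from the description above.

```python
import random

TARGET = [3, 1, 4, 1, 5]  # hypothetical "perfect recipe"

def fitness(candidate):
    """Higher is better; a recipe with a negative amount is 'burnt' and scores 0."""
    if any(x < 0 for x in candidate):
        return 0.0
    error = sum(abs(a - b) for a, b in zip(candidate, TARGET))
    return 1.0 / (1.0 + error)

def mutate(candidate, rng):
    """Tweak one ingredient by -1 or +1."""
    child = list(candidate)
    i = rng.randrange(len(child))
    child[i] += rng.choice([-1, 1])
    return child

def evolve(pop_size=10, generations=30, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 9) for _ in TARGET] for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each round
    for _ in range(generations):
        best = max(population, key=fitness)
        history.append(fitness(best))
        # Keep the best unchanged (elitism) and fill the rest with its mutations.
        population = [best] + [mutate(best, rng) for _ in range(pop_size - 1)]
    best = max(population, key=fitness)
    return best, history + [fitness(best)]

best, history = evolve()
print(best, history[0], history[-1])
```

Because the best candidate is always carried over, the best score can never get worse from one round to the next, which is the property that makes repeated tasting and tweaking pay off.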

The Results: Speeding Up the World

The results are staggering.

Speed: Across the ParEval benchmark suite, ParEVO's code achieved an average speedup of 106×, with a maximum of 1,103×. On the hardest irregular graph problems the speedup was 13.6×, and it outperformed commercial models such as GPT-5-Thinking and Gemini-3-Pro.
Reliability: It sharply reduced the "crashing" problem. After fine-tuning, the share of first attempts that compile and pass their tests (Pass@1) rose from 42% to 76%, so far fewer runs end with the librarians fighting each other.
Beating Humans: In some specific tasks (like finding the "Maximal Independent Set" in a graph), ParEVO's code was about 4 times faster (4.1×) than code written by expert human engineers.

The Trade-off: Safety vs. Speed

The paper notes a funny trade-off. Because the AI was trained to be "safe" (using the high-level tools), it sometimes avoids the absolute fastest, riskiest tricks that a human expert might use.

Analogy: A human driver might take a risky shortcut through a narrow alley to save 10 seconds. ParEVO takes the main highway. It can be slower than the risky shortcut, but it is far more likely to get you there without a crash. In the world of high-performance computing, reliability is often more valuable than a bit of extra speed.

Summary

ParEVO is a system that teaches AI how to manage chaotic, messy data by:

Training it on a dataset of perfect, verified examples.
Teaching it to use safe, high-level tools instead of dangerous low-level tricks.
Using evolutionary testing to iteratively fix errors and optimize speed until the code is both correct and fast.

It turns the AI from an "overconfident intern" into a "seasoned project manager" who knows how to get a team of 64 workers to cooperate without an argument.

1. Problem Statement

The transition from sequential to parallel computing is critical for modern high-performance computing (HPC) applications, yet it faces a steep learning curve, particularly for irregular data structures (e.g., sparse graphs, unbalanced trees, non-uniform meshes).

The Challenge: Irregular algorithms suffer from unpredictable memory access patterns and dynamic work distribution, rendering static load balancing ineffective. They require sophisticated techniques like work-stealing and lock-free synchronization.
LLM Limitations: Current Large Language Models (LLMs) exhibit a strong "sequential bias." When tasked with parallelizing irregular code, they often generate solutions plagued by race conditions, deadlocks, or sub-optimal scaling (e.g., using naive #pragma omp parallel for on graph traversals). They struggle to capture the semantic nuances of synchronization and often fail to utilize high-level parallel primitives effectively.
The Gap: There is a lack of systems that can synthesize correct and high-performance parallel code for irregular data, bridging the gap between generative AI and rigorous HPC requirements.
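Why static load balancing breaks down on irregular data can be seen in a small scheduling calculation. The sketch below is illustrative only: the task costs are invented to mimic a graph with a few very expensive "hub" vertices, and a greedy shared queue stands in for work-stealing, which is a more decentralized way of reaching a similar balance.

```python
import heapq

def static_makespan(costs, workers):
    """Contiguous equal-sized chunks, one per worker; finish time is the slowest chunk."""
    chunk = -(-len(costs) // workers)  # ceiling division
    return max(sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk))

def dynamic_makespan(costs, workers):
    """Each task is handed, in order, to whichever worker becomes free first."""
    free_at = [0] * workers  # time at which each worker is next idle
    heapq.heapify(free_at)
    for c in costs:
        t = heapq.heappop(free_at)
        heapq.heappush(free_at, t + c)
    return max(free_at)

# Hypothetical skewed workload: four expensive tasks followed by sixty cheap ones.
costs = [100, 90, 80, 70] + [1] * 60
workers = 4

static_time = static_makespan(costs, workers)
dynamic_time = dynamic_makespan(costs, workers)
print(static_time, dynamic_time)
```

With static chunks, one worker receives all four expensive tasks while the other three finish almost immediately. Handing out work on demand spreads the same total cost evenly, which is the behavior irregular algorithms depend on.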

2. Methodology: The ParEVO Framework

ParEVO is an end-to-end system designed to synthesize high-performance parallel algorithms. It operates through three distinct stages:

Stage 1: Data-Centric Synthesis (Parlay-Instruct Corpus)

To address the scarcity of high-quality parallel C++ and Rust code, the authors created a synthetic dataset via a "Teacher-Student-Critic" pipeline:

Seed Generation: 593 "golden" examples of ParlayLib primitives and 20 DMOJ competitive programming problems were manually authored.
Mutation: A "Teacher" model (Gemini-3-Pro) mutated these seeds using three operators: Type Mutation (changing data types), Constraint Mutation (adding logical predicates), and Algorithmic Mutation (transforming problem structures, e.g., reduce to scan).
Critic Loop (Rejection Sampling): Generated code was subjected to a strict verification pipeline. It was compiled against ParlayLib headers and executed against synthesized unit tests. Only code that compiled and passed tests was accepted.
Performance Optimization: A subset of data included "slow-fast" pairs where an evolutionary agent optimized a baseline solution to achieve at least a 1.2× speedup, teaching the model to reason about runtime efficiency.
Result: A curated corpus of 13,820 verified instruction-tuning pairs.
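The critic loop is a form of rejection sampling: generate many candidates, keep only those that survive external checks. The sketch below shows the idea in miniature. It is not the authors' pipeline: Python's built-in compile() stands in for compiling against ParlayLib headers, and the task, the tests, and the four candidate strings are invented. A real pipeline would also run candidates in a sandbox rather than calling exec() directly.

```python
def verify(source, tests, entry="solution"):
    """Return True only if the candidate compiles, defines `entry`, and passes every test."""
    try:
        code = compile(source, "<candidate>", "exec")  # stand-in for the real compiler
        namespace = {}
        exec(code, namespace)
        func = namespace[entry]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, crashes, and missing functions are all rejections.
        return False

# Hypothetical task: sum the even numbers in a list.
tests = [(([1, 2, 3, 4],), 6), (([],), 0), (([5, 7],), 0)]

candidates = [
    "def solution(xs): return sum(x for x in xs if x % 2 == 0)",  # correct
    "def solution(xs): return sum(xs)",                            # wrong answer
    "def solution(xs) return 0",                                   # syntax error
    "def solution(xs): return sum(x for x in xs if not x & 1)",    # correct, different style
]

accepted = [src for src in candidates if verify(src, tests)]
print(len(accepted), "of", len(candidates), "candidates accepted")
```

The generator never has to be right every time. The dataset only has to contain the outputs that an independent checker accepted.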

Stage 2: Specialized Model Fine-Tuning

The authors fine-tuned several open-source and commercial models to align with the rigorous semantics of ParlayLib (a high-level parallel library) and safe Rust patterns:

Models: DeepSeek-6.7B, Qwen3-30B (C++ and Rust variants), and Gemini-2.5-Pro.
Technique: Low-Rank Adaptation (LoRA) was used to minimize compute costs.
Alignment Strategy:
- Supervised Fine-Tuning (SFT): To learn ParlayLib syntax and primitives.
- Direct Preference Optimization (DPO): Applied to the Qwen3 model to explicitly suppress failure modes by training on contrastive triplets (passing vs. failing/inefficient solutions).
Goal: To internalize the "Work-Span" cost model and high-level primitives (e.g., filter, scan, reduce) rather than low-level threading primitives.
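For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single triplet (prompt, preferred solution, rejected solution). It shows the general published objective, not the authors' training code. The value of beta and the example log-probabilities are assumptions chosen for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective for one (prompt, chosen, rejected) triplet.

    All arguments are log-probabilities of a full response given the prompt:
    the first two under the model being trained, the last two under the
    frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Model identical to the reference: no preference learned yet.
neutral = dpo_loss(-12.0, -12.0, -12.0, -12.0)

# Model has shifted probability toward the passing solution: lower loss.
aligned = dpo_loss(-10.0, -15.0, -12.0, -12.0)

# Model has shifted toward the failing solution: higher loss.
misaligned = dpo_loss(-15.0, -10.0, -12.0, -12.0)

print(neutral, aligned, misaligned)
```

The loss falls only when the model raises the passing solution relative to the failing one, measured against the reference model, which is how contrastive pairs suppress specific failure modes.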

Stage 3: Evolutionary Coding Agent (ECA)

To overcome the stochastic limitations of single-shot generation, ParEVO employs an Evolutionary Coding Agent that iteratively refines code:

Process: The agent maintains a population of candidate solutions.
Fitness Function: $f(x)$ is determined by compilation success, unit test passing, and runtime performance. Crucially, solutions triggering data races (detected by dynamic sanitizers) are assigned a fitness of 0.
Selection: The agent uses MAP-Elites to select diverse survivors based on features like code length, cyclomatic complexity, and synchronization primitive frequency, alongside top-performing solutions.
Feedback Loop: The LLM synthesizes the next generation of code using the selected candidates and diagnostic artifacts (compiler logs, race detector outputs) as context. This treats the compiler and profiler as "adversarial critics" rather than relying on the LLM to self-verify.
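The fitness rule and the MAP-Elites style selection can be sketched as follows. This is a simplified illustration, not the authors' implementation: the Candidate fields, the bin widths, and the sample pool are hypothetical, fitness is taken to be speedup over a baseline runtime, and only two of the listed features (code length and synchronization primitive count) are used.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    compiles: bool
    tests_passed: bool
    has_data_race: bool
    runtime_s: float   # measured wall-clock time in seconds
    code_length: int   # lines of code, first diversity feature
    sync_ops: int      # count of synchronization primitives, second feature

def fitness(c, baseline_s):
    """Zero for anything broken or racy; otherwise speedup over the baseline."""
    if not (c.compiles and c.tests_passed) or c.has_data_race:
        return 0.0
    return baseline_s / c.runtime_s

def cell(c):
    """Map a candidate to a coarse feature bucket (hypothetical bin widths)."""
    return (c.code_length // 50, min(c.sync_ops, 3))

def map_elites(candidates, baseline_s):
    """Keep the fittest candidate per feature cell, dropping zero-fitness ones."""
    archive = {}
    for c in candidates:
        f = fitness(c, baseline_s)
        key = cell(c)
        if f > 0 and (key not in archive or f > fitness(archive[key], baseline_s)):
            archive[key] = c
    return archive

pool = [
    Candidate("short_fast", True, True, False, 0.5, 40, 0),
    Candidate("short_slow", True, True, False, 2.0, 45, 0),
    Candidate("racy_fastest", True, True, True, 0.1, 30, 2),
    Candidate("long_locked", True, True, False, 1.0, 120, 3),
    Candidate("broken", False, False, False, 0.0, 10, 0),
]
archive = map_elites(pool, baseline_s=4.0)
survivors = sorted(c.name for c in archive.values())
print(survivors)
```

The racy candidate has the shortest runtime but is discarded outright, and a slower, structurally different candidate survives because it is the best in its own cell. Keeping one elite per cell preserves diverse starting points for the next generation.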

3. Key Contributions

Parlay-Instruct Corpus: A novel dataset of 13,820 verified parallel coding tasks, generated via a "Critic-Refine" pipeline that filters for empirically performant algorithms.
Specialized Fine-Tuned Models: Release of DeepSeek-Parlay, Qwen-Parlay, and Gemini-2.5-Parlay models that outperform general-purpose LLMs by internalizing ParlayLib semantics and safe Rust patterns.
Evolutionary Coding Agent (ECA): A framework that integrates deterministic external tools (compilers, race detectors) into the generation loop to iteratively repair code and optimize for performance.
Insight on the "Alignment Tax": The paper identifies a trade-off where fine-tuning significantly improves code correctness (safety) but can reduce peak speedup because models learn to prefer stable, high-level primitives over risky, fine-grained atomic operations.

4. Experimental Results

The framework was evaluated on the ParEval benchmark, PBBS/RPB (expert human baselines), and DMOJ competitive programming problems.

Speedup Performance:
- ParEVO achieved an average 106× speedup across the ParEval suite (max 1103×).
- On highly complex irregular graph problems, it achieved a robust 13.6× speedup.
- It outperformed state-of-the-art commercial models (e.g., GPT-5-Thinking, Gemini-3-Pro) and open-source baselines.
Correctness vs. Performance:
- Fine-tuning increased Pass@1 from 0.42 to 0.76, an absolute gain of 0.34 in first-attempt correctness.
- Peak speedup dropped from 21.7× to 13.6× (about 37%) due to the preference for safe primitives; in exchange, the paper reports that the generated code was mathematically provable to scale and free of race conditions.
Comparison to Humans:
- ParEVO matched or exceeded expert human-written baselines. For the Maximal Independent Set problem, the generated Rust solution achieved a 4.1× speedup over the human baseline by identifying a superior parallel strategy.
Scalability: Strong scaling tests showed near-linear scaling (e.g., 40× on 64 cores for FFT) for regular problems and effective handling of load imbalance for irregular graphs.
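A few of the figures above are easier to compare once converted into ratios. The sketch below does that arithmetic. The input numbers come from the results listed above; the Karp-Flatt metric (an estimate of the serial fraction implied by a measured speedup) is an added lens that the summary itself does not mention.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores

def karp_flatt(speedup, cores):
    """Experimentally determined serial fraction implied by a measured speedup."""
    return (1 / speedup - 1 / cores) / (1 - 1 / cores)

def relative_drop(before, after):
    """Relative decrease from `before` to `after`."""
    return (before - after) / before

fft_efficiency = parallel_efficiency(40, 64)   # FFT: 40x on 64 cores
fft_serial_fraction = karp_flatt(40, 64)
speedup_drop = relative_drop(21.7, 13.6)       # peak speedup before vs. after fine-tuning
pass1_gain = 0.76 - 0.42                       # absolute Pass@1 improvement

print(f"FFT efficiency: {fft_efficiency:.1%}")
print(f"Implied serial fraction: {fft_serial_fraction:.2%}")
print(f"Peak speedup drop: {speedup_drop:.1%}")
print(f"Pass@1 gain: {pass1_gain:.2f}")
```

Read this way, the FFT result corresponds to a little under two-thirds of ideal linear scaling, and the cost of the safer primitives is a bit over a third of peak speedup in return for 0.34 more Pass@1.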

5. Significance and Conclusion

ParEVO demonstrates that AI-driven agents can effectively navigate the complex landscape of high-performance computing for irregular data.

Paradigm Shift: It moves beyond simple code completion to AI-Driven Performance Engineering, where the system actively reasons about scalability, correctness, and the interplay between algorithms and hardware.
Abstraction Alignment: The success of ParEVO highlights that aligning LLMs with high-level, composable abstractions (like ParlayLib) is superior to forcing them to manage low-level threading primitives. This reduces the "state-tracking" burden on the model and minimizes race conditions.
Future Impact: By democratizing access to high-performance parallel code, ParEVO lowers the barrier to entry for HPC, enabling the synthesis of efficient, correct, and scalable algorithms for irregular data structures that were previously too difficult to implement manually or via standard LLMs.

Source Code & Data: Available at https://github.com/WildAlg/ParEVO.