AscendOptimizer: Episodic Agent for Ascend NPU Operator Optimization

Imagine you are trying to bake the perfect cake, but you are working in a brand-new, high-tech kitchen (the Huawei Ascend NPU) that no one has ever used before.

In the old, popular kitchen (the NVIDIA GPU), there are thousands of recipe books, cooking shows, and expert chefs to help you. If you want to bake a cake, you can just look up a recipe, tweak it, and you're good to go.

But in this new kitchen, there are no recipe books. The instructions are written in a strange, complex language, and if you get the timing or the ingredient placement even slightly wrong, the oven explodes (the code fails to compile). This is the "knowledge bottleneck" the paper talks about.

Enter AscendOptimizer, a smart, self-teaching robot chef designed to solve this problem without needing a human expert or a massive library of old recipes.

Here is how it works, broken down into simple steps:

1. The Two-Part Cake (The Problem)

Making a high-performance operator on this chip isn't just about the cooking (the math); it's about two things working together:

The Delivery Driver (Host-side Tiling): This decides how to chop the ingredients into small, manageable chunks and move them from the pantry to the counter. If the chunks are too big, the counter gets cluttered. If they are too small, the driver wastes time walking back and forth.
The Chef (Device-side Kernel): This is the actual cooking. It decides how to mix and bake the ingredients efficiently, that is, how the computation is scheduled and pipelined.

The problem is that these two are coupled. You can't just fix the delivery driver without checking if the chef can handle the new chunks, and vice versa.

2. The Robot's Strategy: A Two-Step Dance

Since the robot can't look up a recipe, it has to invent one through trial and error, but it does so very smartly using two different tricks.

Step 1: The "Evolutionary" Guessing Game (Optimizing the Delivery)

The robot starts with a basic plan for moving ingredients. It then tries many small variations: "What if I move 10% more? What if I move them in a zig-zag?"

The Magic: It doesn't just guess; it builds every guess and times the ones that work on the real hardware right away.
The Filter: If a guess causes the oven to explode (compile error) or the cake to burn (wrong math), it instantly throws that idea away.
The Result: Over time, the robot "evolves" a delivery plan that is perfectly tuned to the physical limits of this specific kitchen, finding the fastest way to move data without crashing the system.

Step 2: The "Rewind" Trick (Optimizing the Cooking)

This is the cleverest part. The robot needs to learn how to cook better, but it has no "Good vs. Bad" examples to study. So, it creates its own examples!

The Rewind: The robot takes a "good" piece of code it has already found and deliberately breaks it. It removes a shortcut, slows down a process, or makes the code messy.
The Lesson: Now it has a "Bad" version and a "Good" version. It asks its AI brain: "What exactly did I change to make this slow? How do I fix it?"
The Library: It writes down these "Bad-to-Good" fixes in a notebook (an Experience Bank).
The Application: When it encounters a new, slow operator, it looks at the problem, checks its notebook, and says, "Ah, this looks like that time I broke the mixing speed. I know how to fix it!" It then applies that fix.

3. The Loop: Dancing Together

The robot doesn't just do Step 1 then Step 2 once. It dances between them:

It tweaks the Delivery to make the ingredients arrive faster.
It tweaks the Cooking to make the processing faster.
It checks the results. If the new cooking style needs different delivery chunks, it goes back to Step 1.
It keeps switching back and forth, slowly refining the whole process until it hits the speed limit of the hardware.

The Results

The researchers tested this robot on 127 real-world tasks.

The Baseline: The standard, open-source code (the "average chef").
The Result: AscendOptimizer made the code run 1.19 times faster on average (a geometric mean across all tasks).
The Wow Factor: On nearly half of the tasks (49.61%), it beat the reference code, and on the hardest tasks more than a quarter ran over 2x faster.

Why This Matters

Before this, if you wanted to write fast code for Huawei's chips, you needed a rare, expensive human expert who knew all the secrets. Now, this "episodic agent" (the robot) can bootstrap its own expertise. It learns by doing, by breaking things on purpose to understand how to fix them, and by constantly testing on the real hardware.

It's like teaching a robot to drive a Formula 1 car on a track it's never seen before, not by giving it a manual, but by letting it crash a few times, learn from the crashes, and then drive faster than any human could without a manual.

1. Problem Statement

The paper addresses the critical bottleneck in optimizing operators for Huawei's Ascend Neural Processing Units (NPUs), specifically using the AscendC programming model. Unlike the mature NVIDIA CUDA ecosystem, the Ascend ecosystem suffers from severe knowledge scarcity:

Lack of Reference Data: There are few public, high-quality reference implementations for learning optimization patterns.
Complex Architecture: Ascend uses the Da Vinci architecture with an explicitly managed memory hierarchy (including the Unified Buffer, UB). Developers must manually orchestrate data movement and synchronization, unlike on GPUs, where caching is largely implicit.
Dual-Part Artifact: An AscendC operator consists of two coupled components:
1. Host-side Tiling Program: Decides data partitioning and movement.
2. Device-side Kernel Program: Decides instruction scheduling and pipelining.
Generalization Gap: Large Language Models (LLMs) trained on CUDA data fail to generate valid AscendC code. Benchmarks show a "Pass@1" generation rate of ~50% for CUDA but <2.1% for AscendC, due to buffer overflows and API misuse.
Limitations of Existing Methods: Traditional auto-tuning (e.g., TVM/Ansor) struggles with the discontinuous, non-convex search space of Ascend. Existing LLM agents lack the specific hardware feedback loops and domain knowledge to optimize effectively without extensive training data.

2. Methodology: AscendOptimizer

The authors propose AscendOptimizer, an episodic agent framework that bootstraps optimization expertise internally, with no additional model training (it is training-free) and no manual rule engineering. It treats optimization as a block coordinate descent problem, alternating between two stages:

Stage I: Evolutionary-Guided Program Search (Host Tiling)

Goal: Optimize the host-side tiling function ($T$), which determines data layout and movement.
Challenge: The tiling space is highly discontinuous; small changes can cause compilation failures.
Approach:
- Evolvable Template Synthesis: An LLM analyzes the operator to create a base tiling function with "evolution markers" (placeholders for logic/parameters).
- LLM-based Mutation: The LLM acts as a mutation operator, generating offspring tiling programs based on parent code and historical feedback.
- Hardware-in-the-Loop (HIL) Feedback: The system uses a zero-tolerance strategy. Any candidate that fails to compile or produces precision errors is immediately discarded. Valid candidates are evaluated on real NPU hardware to measure latency.
- Result: The evolutionary search rapidly converges to valid, high-performance tiling configurations within the implicit feasible region of the hardware.
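
Stage I can be pictured with a toy loop like the one below. It is a minimal sketch under stated assumptions, not the paper's implementation: `evaluate` stands in for compile, precision check, and on-device timing; `mutate` stands in for the LLM mutation operator; and the single `tile_elems` parameter, the buffer size, and the cost model are all hypothetical. The search is a simple keep-the-best loop rather than a full population-based method.

```python
import random

BUFFER_BYTES = 64 * 1024    # hypothetical on-chip buffer capacity
TOTAL_ELEMS = 1 << 20       # hypothetical workload size (float32 elements)
PER_TILE_OVERHEAD = 500.0   # hypothetical fixed cost per tile (arbitrary units)


def evaluate(tile_elems):
    """Stand-in for compile + precision check + on-device timing.

    Returns a latency in arbitrary units, or None when the candidate is
    invalid (mirroring a compile failure or precision error).
    """
    if tile_elems <= 0 or tile_elems * 4 > BUFFER_BYTES:
        return None  # tile does not fit the buffer: treated as a hard failure
    num_tiles = -(-TOTAL_ELEMS // tile_elems)  # ceiling division
    return num_tiles * (PER_TILE_OVERHEAD + tile_elems)


def mutate(tile_elems, rng):
    """Stand-in for the LLM mutation operator: perturb the parent."""
    factor = rng.choice([0.5, 0.75, 1.25, 1.5, 2.0])
    return max(1, int(tile_elems * factor))


def evolve(seed_tile, generations=30, offspring=8, rng=None):
    """Keep the best valid candidate seen so far; discard invalid ones."""
    rng = rng or random.Random(0)
    best_tile, best_latency = seed_tile, evaluate(seed_tile)
    if best_latency is None:
        raise ValueError("seed candidate must be valid")
    for _ in range(generations):
        for _ in range(offspring):
            child = mutate(best_tile, rng)
            latency = evaluate(child)
            if latency is None:
                continue  # zero tolerance: invalid candidates are discarded
            if latency < best_latency:
                best_tile, best_latency = child, latency
    return best_tile, best_latency
```

With these made-up constants, `evolve(256)` tends toward larger tiles that still fit the buffer, and any candidate past that capacity cliff is dropped without being scored, which mirrors the zero-tolerance filter described above.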

Stage II: Optimization-Rewind based Experience Bootstrapping (Device Kernel)

Goal: Optimize the device-side kernel code ($K$) to fix structural bottlenecks (e.g., pipeline stalls, lack of double buffering).
Challenge: Lack of "bad-to-good" training pairs for supervised learning.
Approach (Optimization Rewind):
- Inverse Distillation: Starting from a small set of high-performance "seed" kernels, an LLM acts as an "inverse agent" to systematically de-optimize them (e.g., removing vectorization, breaking pipelines).
- Trajectory Construction: This creates a trajectory of "Good $\to$ Bad" code. The system validates that the de-optimized versions are indeed slower on hardware.
- Experience Bank: The LLM analyzes the code differences and hardware profiling signals to distill structured Optimization Tuples ($M$): <Title, Description, Bottleneck, Code Diff>. These form a retrievable pattern library.
- Retrieval-Augmented Refinement: During online optimization, the agent diagnoses bottlenecks in the current kernel, retrieves relevant patterns from the Experience Bank, and applies them as structured rewrites to generate new candidates.
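
One way to picture the Experience Bank is as a list of records with the four fields named above, filtered by the "de-optimized version must be slower" check and queried by bottleneck description. The sketch below is illustrative only: the class and function names, the example entries, the latency numbers, and the Jaccard word-overlap ranking are all assumptions, since this summary does not say how the paper stores or retrieves patterns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OptimizationTuple:
    title: str
    description: str
    bottleneck: str
    code_diff: str


def build_bank(trajectories):
    """Keep only rewind steps whose de-optimized variant was measurably slower.

    Each item is (entry, good_latency, bad_latency); latencies are assumed
    to come from on-device measurement.
    """
    return [entry for entry, good, bad in trajectories if bad > good]


def _tokens(text):
    return set(text.lower().replace("-", " ").split())


def retrieve(bank, diagnosis, top_k=1):
    """Rank entries by Jaccard overlap between diagnosis and bottleneck text."""
    query = _tokens(diagnosis)

    def score(entry):
        words = _tokens(entry.bottleneck)
        union = query | words
        return len(query & words) / len(union) if union else 0.0

    return sorted(bank, key=score, reverse=True)[:top_k]


# Hypothetical entries; the diffs are placeholders, not real AscendC code.
bank = build_bank([
    (OptimizationTuple("Enable double buffering",
                       "Overlap data copy with compute",
                       "pipeline stall waiting on data copy",
                       "<diff elided>"), 1.0, 1.6),
    (OptimizationTuple("Vectorize inner loop",
                       "Replace scalar loop with vector instructions",
                       "scalar loop underuses vector unit",
                       "<diff elided>"), 1.0, 1.4),
    (OptimizationTuple("Cosmetic rename",
                       "No measurable effect",
                       "none",
                       "<diff elided>"), 1.0, 1.0),  # rejected: not slower
])

match = retrieve(bank, "kernel shows a pipeline stall on data copy")[0]
```

Here the third entry never enters the bank because its "bad" variant was not slower, and the query about a pipeline stall retrieves the double-buffering pattern. A real system would likely use a stronger similarity measure than word overlap.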

Alternating Optimization Loop

The framework alternates between Stage I and Stage II:

Fix kernel $K$, optimize tiling $T$ (Stage I).
Fix tiling $T$, optimize kernel $K$ (Stage II).
This iterative handoff ensures that improvements in one domain (e.g., better data movement) reshape the feasible region for the other (e.g., allowing more aggressive pipelining).
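
In symbols, one round of this alternation can be written as below. The latency function $\mathcal{L}$, the feasible set $\mathcal{F}$ (tiling/kernel pairs that compile and pass the precision check), and the round index $t$ are notation assumed here for illustration; they are not taken from the paper.

$$
T^{(t+1)} \in \arg\min_{T \,:\, (T,\, K^{(t)}) \in \mathcal{F}} \mathcal{L}\big(T, K^{(t)}\big),
\qquad
K^{(t+1)} \in \arg\min_{K \,:\, (T^{(t+1)},\, K) \in \mathcal{F}} \mathcal{L}\big(T^{(t+1)}, K\big).
$$

Each minimization is only approximated in practice (by evolutionary search in Stage I and retrieval-guided rewriting in Stage II). Provided each stage keeps the incumbent whenever no candidate beats it, latency cannot get worse from one round to the next: $\mathcal{L}\big(T^{(t+1)}, K^{(t+1)}\big) \le \mathcal{L}\big(T^{(t+1)}, K^{(t)}\big) \le \mathcal{L}\big(T^{(t)}, K^{(t)}\big)$.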

3. Key Contributions

AscendOptimizer Framework: A novel two-stage episodic agent that solves the coupled host-tiling and device-kernel optimization problem for Ascend NPUs without manual rules or model fine-tuning.
Optimization Rewind Mechanism: A practical method to bootstrap kernel optimization experience under data scarcity by synthesizing "bad-to-good" trajectories from seed kernels, creating a retrievable pattern library for RAG-based rewriting.
Comprehensive Benchmark: A curated, standardized benchmark of 127 real-world AscendC operators, on which AscendOptimizer shows consistent performance gains over open-source baselines and strong agent/search baselines.

4. Experimental Results

The system was evaluated on 127 AscendC operators using Huawei Ascend 910B4 NPUs.

Overall Performance:
- Achieved a 1.19× geometric-mean speedup over the open-source baseline (cann-ops).
- 49.61% of operators outperformed their references.
- Outperformed strong baselines like OpenEvolve and BoN (Best-of-N) sampling.
Level-wise Breakdown:
- Level 3 (Hardest operators): Achieved a 1.81× geometric mean speedup, with 28.57% of operators achieving >2.0× speedup.
- Level 1 & 2: Consistent improvements with geometric means of 1.08× and 1.21× respectively.
Ablation Study:
- Stage I only: GM = 1.09 (improves robustness but hits a ceiling).
- Stage II only: GM = 1.12 (good semantic rewriting but lacks optimal data movement).
- Combined (AscendOptimizer): GM = 1.19, demonstrating the synergy of alternating tiling and kernel optimization.
Case Studies:
- Visualized the "Experience Bank" showing distinct clusters of optimization strategies (some aligning with docs, others discovering novel patterns like fine-grained synchronization).
- Demonstrated a specific operator ("foreach pow scalar and tensor") where Stage I plateaued at 1.09×, but Stage II introduced a structural rewrite (block-level load balancing) to reach 2.31×.
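
The headline metrics above (geometric-mean speedup and the share of operators that beat their reference) can be computed as in the following sketch. The function names and the four latency values are made up for illustration; only the definitions of the two metrics are standard.

```python
import math


def geometric_mean(values):
    """Geometric mean via the mean of logs (requires positive values)."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("need a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def summarize(baseline_latencies, optimized_latencies):
    """Return (geometric-mean speedup, fraction of operators with speedup > 1)."""
    speedups = [b / o for b, o in zip(baseline_latencies, optimized_latencies)]
    faster = sum(1 for s in speedups if s > 1.0)
    return geometric_mean(speedups), faster / len(speedups)


# Made-up latencies (ms) for four hypothetical operators.
# Per-operator speedups are 2.0, 1.0, 1.0, and 0.5.
gm, frac = summarize([10.0, 8.0, 6.0, 4.0], [5.0, 8.0, 6.0, 8.0])
```

In this toy data a 2× gain and a 2× regression cancel, so the geometric mean is 1.0, whereas the arithmetic mean of the same speedups would be 1.125. That is why the geometric mean is the usual choice for averaging ratios. If "outperformed" is read as a speedup above 1, the reported 49.61% corresponds to 63 of the 127 operators (63/127 ≈ 0.4961).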

5. Significance

Bridging the Knowledge Gap: Provides a viable path to high-performance operator development on Ascend NPUs without relying on scarce human experts or massive training datasets.
Training-Free Paradigm: Shows that effective optimization can be achieved via episodic reasoning and retrieval rather than Reinforcement Learning (RL) or Supervised Fine-Tuning (SFT), which can be cost-prohibitive for niche hardware.
Generalizability: The "Optimization Rewind" and "Alternating Search" strategies offer a blueprint for optimizing other domain-specific accelerators (DSAs) where training data is scarce and architectural constraints are complex.
Practical Impact: Helps ease the compute supply for Large Language Model (LLM) workloads in regions where GPU access is constrained, by making AI deployment on Huawei hardware more efficient.