Here is an explanation of the paper "EvoKernel" using simple language and creative analogies.
The Big Problem: The "Data Wall" in New Hardware
Imagine you are a master chef (a Large Language Model) who has spent years cooking in a massive, well-stocked kitchen (like NVIDIA's CUDA ecosystem). You have millions of recipes, every spice imaginable, and a library of expert techniques. You can cook a perfect steak in your sleep.
Now, imagine you are suddenly dropped into a tiny, remote cabin in the woods (a new, specialized chip called an NPU).
- The Problem: There are no cookbooks here. The ingredients are weird and unfamiliar. The tools are different. Even though you are a world-class chef, you have no idea how to cook a meal here because you've never seen this specific kitchen before.
- The Result: If you try to cook immediately, you'll likely burn the food or serve raw ingredients. This is the "Cold-Start" problem. The AI is smart, but it has no data to learn from for this specific new hardware.
The Solution: EvoKernel (The Self-Evolving Apprentice)
The authors created a system called EvoKernel. Instead of trying to force the chef to memorize a new cookbook (which is expensive and hard), they gave the chef a smart, self-updating notebook and a strict taste-tester.
Here is how it works in two main stages:
Stage 1: The "Cold-Start Draft" (Finding a Recipe)
- The Goal: Just get something edible on the plate. It doesn't have to be Michelin-star quality yet; it just has to be cooked.
- The Process:
- The AI tries to write code (a recipe) for the NPU.
- It checks its Memory Notebook. Since it's the first time, the notebook is empty, so it guesses.
- The Taste-Tester (Verifier): The code is run. If it crashes or tastes bad (fails to compile or run), the tester says, "Nope, that's raw."
- The Lesson: The AI writes down why it failed in the notebook. It doesn't just throw the recipe away; it learns, "Oh, I can't use that ingredient here."
- It tries again, using the lesson from the notebook. It keeps trying until it finally serves a dish that is edible (functionally correct).
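The draft-verify-learn loop above can be sketched as a toy in Python. Everything here (the `Memory` class, `toy_generate`, `toy_verify`) is a hypothetical stand-in for the components the paper describes, not its actual code:

```python
class Memory:
    """Toy notebook: a growing list of lessons learned from failures."""
    def __init__(self):
        self.notes = []

    def add_lesson(self, lesson):
        self.notes.append(lesson)

    def retrieve(self):
        return list(self.notes)


def toy_generate(task, lessons):
    # Stand-in for the LLM: skips any "ingredient" a past lesson warns about.
    banned = {lesson.split()[-1] for lesson in lessons}
    return [step for step in task["steps"] if step not in banned]


def toy_verify(code):
    # Stand-in verifier: rejects code that still contains a broken step.
    for step in code:
        if step.startswith("bad"):
            return False, f"crash caused by {step}"
    return True, "ok"


def cold_start_draft(task, memory, max_attempts=5):
    """Retry until the verifier accepts a draft, learning from each failure."""
    for _ in range(max_attempts):
        code = toy_generate(task, memory.retrieve())  # empty notebook on try 1
        ok, feedback = toy_verify(code)
        if ok:
            return code                               # "edible": functionally correct
        memory.add_lesson(feedback)                   # write down WHY it failed
    return None


mem = Memory()
task = {"steps": ["load", "bad_intrinsic", "compute", "store"]}
print(cold_start_draft(task, mem))  # → ['load', 'compute', 'store']
```

The first attempt fails (the notebook is empty, so the draft includes the bad step), a lesson is recorded, and the second attempt succeeds by avoiding it.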
Stage 2: "Continual Refining" (Making it a Masterpiece)
- The Goal: Now that the food is edible, let's make it delicious and fast.
- The Process:
- The AI looks at the "edible" dish it just made.
- It checks the notebook for Value-Driven Memories. This is the secret sauce.
- Old Way: "Let's look at recipes that look similar to this one." (This often fails because the new kitchen is too different).
- EvoKernel Way: "Let's look at the recipes that actually worked, the ones that provably made a past dish cook faster."
- The system learns a "Value Score" for every note in the notebook. It asks: "Does this old note help me solve the current problem?"
- It tweaks the recipe to make it faster (lower latency).
- If the new version is faster, it gets a high score and is added to the "Best Practices" section of the notebook.
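The refinement loop above is essentially hill-climbing on latency: propose a tweak, measure it, and keep it only if it is faster. A minimal sketch, where `measure` and `mutate` are hypothetical stand-ins for running the kernel on real hardware and for an LLM-proposed rewrite:

```python
import random

def refine(kernel, measure, mutate, memory, rounds=30):
    """Accept a tweak only if it is measurably faster than the current best."""
    best, best_lat = kernel, measure(kernel)
    for _ in range(rounds):
        cand = mutate(best)
        lat = measure(cand)
        if lat < best_lat:                 # faster: promote to "best practices"
            memory.append((cand, lat))
            best, best_lat = cand, lat
    return best, best_lat

# Toy setup: the "kernel" is just a tile size, and latency is lowest at 64.
random.seed(0)
latency = lambda tile: abs(tile - 64) + 1
mutate = lambda tile: max(1, tile + random.choice([-16, -8, 8, 16]))
notebook = []
best, lat = refine(8, latency, mutate, notebook, rounds=30)
print(best, lat)
```

Every entry that lands in `notebook` is, by construction, an improvement over the previous best, which is exactly the "only winners get written down" behavior described above.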
The Magic: "Value-Driven Memory"
Think of the Memory Notebook not as a static library, but as a living, breathing mentor.
- Traditional AI: Like a student who reads a textbook once and then forgets it. If they fail a test, they just try again with the same textbook.
- EvoKernel: Like a student who keeps a journal of mistakes and wins.
- When they face a hard math problem, they don't just look for "similar problems." They look for the specific note in their journal that says, "Hey, when I was stuck on a problem like this, the trick was to use a specific formula."
- The system learns which notes are valuable for the current stage.
- Drafting Stage: "I need notes that help me avoid crashing."
- Refining Stage: "I need notes that help me speed things up."
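The stage-dependent lookup above can be illustrated with a toy value-scored retrieval. The field names and the update rule here are illustrative assumptions, not the paper's actual formulation:

```python
def retrieve_by_value(notes, stage, k=2):
    """Return the k notes with the highest learned value for this stage."""
    ranked = sorted(notes, key=lambda n: n["value"].get(stage, 0.0), reverse=True)
    return ranked[:k]

def update_value(note, stage, helped, lr=0.5):
    """Nudge a note's score toward 1 if it helped, toward -1 if it did not."""
    old = note["value"].get(stage, 0.0)
    note["value"][stage] = old + lr * ((1.0 if helped else -1.0) - old)

notes = [
    {"text": "avoid intrinsic X", "value": {"draft": 0.9, "refine": 0.1}},
    {"text": "tile loops by 64",  "value": {"draft": 0.2, "refine": 0.8}},
]
print([n["text"] for n in retrieve_by_value(notes, "refine", k=1)])  # → ['tile loops by 64']
```

The same notebook yields different top notes depending on the stage: the crash-avoidance note wins during drafting, the tiling note wins during refining.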
The Results: From Novice to Pro
The paper tested this on Ascend C, a language for Huawei's NPUs (a very data-scarce environment).
- Before EvoKernel: The best AI models could only get about 11% of the tasks right. They were mostly guessing and failing.
- With EvoKernel: The success rate jumped to 83%.
- Speed: The AI didn't just get it right; it got it fast. On average, the refined code was 3.6 times faster than the first draft.
The "Cross-Pollination" Analogy
One of the coolest parts is Cross-Task Transfer.
Imagine the AI is learning to cook a Steak (a simple task). Once it masters the steak, it writes a note in the book: "High heat works well for meat."
Later, it tries to cook a Fish (a harder, different task). Instead of starting from zero, it looks at the book, sees the note about "High heat," and realizes, "Oh! I can use that high-heat technique for the fish too, but maybe adjust the time."
The AI learns from simple tasks to solve hard ones, and it even learns from one type of chip to help another.
Summary
EvoKernel is a framework that teaches AI how to learn new, difficult programming languages (for specialized computer chips) without needing a massive library of existing examples.
It does this by:
- Drafting: Trying until it gets a working solution.
- Refining: Iteratively making that solution faster.
- Remembering: Keeping a "smart notebook" that learns which past experiences are actually useful for the current problem, allowing the AI to get smarter with every single attempt, even in a data-scarce world.
It turns a "cold start" (starting with nothing) into a "warm start" (starting with a growing library of wisdom).