Resource-Efficient Iterative LLM-Based NAS with Feedback Memory

Imagine you are trying to design the perfect recipe for a cake, but you have a very strict budget: you can only bake one cake at a time, and you only have one oven (a single computer graphics card).

In the past, designing the perfect AI "brain" (a neural network) was like hiring a massive team of chefs who burned through thousands of ovens and ingredients to find the best recipe. This paper introduces a new, frugal way to do it using a smart, talking AI assistant (a Large Language Model or LLM) that learns from its own mistakes, all on a single home computer.

Here is how their system works, broken down into simple analogies:

1. The Problem: The "Throwaway" Chef

Usually, when an AI tries to design a new network, it guesses a design, tests it, and if it fails, it just throws that failure away and tries a completely new guess. It's like a chef baking a cake, burning it, and then immediately trying to bake a totally different cake without ever asking, "Why did the first one burn?"

2. The Solution: The "Memory Loop"

The authors created a closed-loop system (a cycle) where the AI doesn't just guess; it learns. Think of it as a Master Chef and a Critic working together in a tiny kitchen.

The Code Generator (The Chef): This is the AI that writes the "recipe" (the computer code for the neural network). It tries to bake a new cake based on what it knows.
The Evaluator (The Taster): This part bakes the cake for just one minute (one epoch of training) to see if it's edible. It doesn't wait for the cake to be perfect; it just checks if the basic structure holds together.
The Prompt Improver (The Critic): This is the most important part. If the cake burns or tastes bad, the Critic doesn't just say "Fail." It looks at why it failed, writes down a note, and tells the Chef exactly how to fix it for the next try.

3. The Secret Sauce: "Historical Feedback Memory"

This is the paper's biggest innovation. Imagine the Critic has a small notepad that can only hold the last 5 attempts.

The Markov Chain (The Sliding Window): Instead of remembering every cake the chef ever made (which would fill up the notepad and confuse the AI), the Critic only remembers the last 5 tries.
The Diagnostic Triple: Every time the Critic writes a note, it writes three things:
1. The Problem: "The cake collapsed because the flour ratio was wrong."
2. The Fix: "Next time, use less flour."
3. The Result: "We tried this, and it worked!" (or "It still failed, but here's why.")

By treating failures as valuable lessons rather than trash, the AI learns a pattern. It stops making the same mistakes and starts building on the few successes it has.

4. The "Dual-Brain" Strategy

To save mental energy (and computer memory), the system splits the work between two roles:

Role A (The Builder): Focuses only on writing the code.
Role B (The Doctor): Focuses only on analyzing why the code failed and suggesting fixes.

This is like having a specialist builder and a specialist doctor. Neither tries to do everything at once, which keeps each "brain" from getting overwhelmed. And because the AI assistant and the cakes it bakes have to share the same small oven (the graphics card's 24GB of memory), the search naturally drifts toward small, efficient cakes (compact models) rather than giant, impossible ones.

5. The Results: Small Budget, Big Wins

The researchers tested this on three different "taste tests" (datasets: CIFAR-10, CIFAR-100, and ImageNette) using three different AI assistants (DeepSeek, Qwen, and GLM).

The Cost: A full search of 2,000 attempts ran on a single consumer-grade graphics card (an RTX 4090) and took about 18 hours.
The Outcome (accuracy on CIFAR-10 after one epoch of training):
- DeepSeek started at 28% accuracy and climbed to 69%.
- Qwen started at 50% and reached 71.5%.
- GLM started at 43% and reached 62%.

Even though these models are small (7 billion parameters or fewer) and were never "re-trained" or fine-tuned, the iterative feedback loop let them discover much better designs than a single guess would have produced.

The Takeaway

This paper shows you don't need a supercomputer or a massive team to design good AI. You need a smart, iterative process that remembers its mistakes, learns from them, and keeps trying. It turns the expensive, wasteful process of AI design into a low-budget loop that runs on a single consumer graphics card.

In short: It's the difference between a chef who throws away a burnt cake and gives up, versus a chef who writes down why it burned, fixes the recipe, and bakes a better one next time.

1. Problem Statement

Neural Architecture Search (NAS) automates the design of deep neural networks but traditionally suffers from prohibitive computational costs (requiring thousands of GPU days) and reliance on constrained, predefined search spaces (e.g., cell-based structures). While recent Large Language Models (LLMs) have shown promise in generating neural network code, existing approaches typically rely on single-shot generation (one-pass prediction) or require fine-tuning massive models. These methods often discard failure trajectories, losing valuable learning signals, and fail to operate effectively in resource-constrained environments (e.g., single consumer-grade GPUs) without cloud infrastructure.

2. Methodology

The authors propose a closed-loop, iterative NAS pipeline that leverages frozen, instruction-tuned LLMs (≤7B parameters) to generate, evaluate, and refine convolutional neural network (CNN) architectures for image classification. The system operates without LLM fine-tuning and runs entirely on a single consumer GPU (RTX 4090).

Core Components

The pipeline consists of three interacting modules:

Code Generator: An LLM that produces executable PyTorch code (nn.Module) based on a prompt containing the task specification, the current best architecture, and improvement suggestions.
Evaluator: A module that validates the generated code (checking input/output shapes) and trains the model for one epoch on the target dataset (CIFAR-10, CIFAR-100, or ImageNette) using SGD. The resulting top-1 accuracy serves as a fast proxy metric for architecture quality.
Prompt Improver: An LLM that analyzes the evaluation results alongside a Historical Feedback Memory to generate targeted suggestions for the next iteration.
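The interaction between the three modules can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the function names (`search`, `toy_generate`, `toy_evaluate`, `toy_improve`), their signatures, and the rule for updating the reference architecture are assumptions, and the toy stand-ins replace the LLM calls and the one-epoch training run. The feedback memory is left out here to keep the loop itself visible.

```python
def search(task, generate, evaluate, improve, iterations):
    """Closed loop: generate code, evaluate it, turn the result into a suggestion."""
    best_code, best_acc = None, 0.0
    suggestion = ""
    for _ in range(iterations):
        code = generate(task, best_code, suggestion)
        ok, acc, log = evaluate(code)
        # Assumption: the reference architecture is replaced only on strict improvement.
        if ok and acc > best_acc:
            best_code, best_acc = code, acc
        # Failures are passed to the improver too, not thrown away.
        suggestion = improve(best_code, ok, acc, log)
    return best_code, best_acc


# Toy stand-ins (hypothetical): the "architecture" is a single width setting.
def toy_generate(task, best_code, suggestion):
    return suggestion or "width=8"


def toy_evaluate(code):
    width = int(code.split("=")[1])
    if width > 64:
        return False, 0.0, "simulated out-of-memory error"
    return True, width / 100, "ok"


def toy_improve(best_code, ok, acc, log):
    width = int(best_code.split("=")[1]) if best_code else 8
    # After a success, try a wider model; after a failure, fall back to the best one.
    return f"width={width * 2}" if ok else f"width={width}"


result = search("toy task", toy_generate, toy_evaluate, toy_improve, iterations=6)
print(result)
```

In this toy run the fifth attempt (`width=128`) fails, and the loop falls back to the best known setting instead of stopping.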

Key Mechanisms

Historical Feedback Memory (Markovian Sliding Window):
- Instead of retaining the entire search history (which causes context overflow), the system maintains a sliding window of the last $K=5$ improvement attempts.
- Each entry is a structured diagnostic triple: (identified problem, suggested modification, resulting outcome).
- Crucially, code execution failures are treated as first-class learning signals rather than being discarded. This allows the LLM to learn causal patterns between design decisions and outcomes.
- This design adheres to the Markov property, where the next suggestion depends only on the current best architecture and the bounded recent history, ensuring constant context size.
Dual-LLM Specialization:
- The system assigns two roles (potentially served by the same model instance): a Code Generator focused on synthesis and a Prompt Improver focused on diagnostic reasoning. Keeping each call to a single task reduces the per-call cognitive load.
Hardware-Aware Search:
- Since the LLM inference and architecture training share the same limited VRAM (24GB), the search implicitly favors compact, memory-efficient models suitable for edge deployment.
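The bounded feedback memory described above maps naturally onto a fixed-length queue. The sketch below is one possible realization, assuming Python's `collections.deque` with `maxlen=5`; the class and field names (`FeedbackMemory`, `Attempt`, `record`, `render`) and the text format passed to the prompt are hypothetical, not taken from the paper.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One diagnostic triple."""
    problem: str     # identified problem
    suggestion: str  # suggested modification
    outcome: str     # resulting outcome, including execution errors


class FeedbackMemory:
    """Sliding window over the last k improvement attempts."""

    def __init__(self, k=5):
        # A deque with maxlen drops the oldest entry automatically on append.
        self._window = deque(maxlen=k)

    def record(self, problem, suggestion, outcome):
        self._window.append(Attempt(problem, suggestion, outcome))

    def render(self):
        """Format the window as text for the Prompt Improver's prompt (assumed format)."""
        return "\n".join(
            f"{i}. problem: {a.problem} | change: {a.suggestion} | outcome: {a.outcome}"
            for i, a in enumerate(self._window, start=1)
        )

    def __len__(self):
        return len(self._window)


memory = FeedbackMemory(k=5)
for step in range(1, 8):
    memory.record(f"problem {step}", f"change {step}", f"outcome {step}")

print(len(memory))
print(memory.render())
```

After seven recorded attempts only attempts 3 through 7 remain, so the rendered text has at most five entries no matter how long the search runs.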

3. Key Contributions

Closed-Loop Iterative NAS: A novel pipeline that enables progressive architectural discovery through code generation, evaluation, and prompt refinement, moving beyond single-shot generation.
Structured Failure Modeling: The introduction of a historical feedback memory that explicitly records failure diagnostics (problem, suggestion, outcome) in a sliding window, enabling the LLM to learn from errors rather than discarding them.
Resource-Efficient Paradigm: A demonstration that small, frozen LLMs (≤7B parameters) can perform effective NAS on a single consumer GPU (RTX 4090) in ~18 hours for 2,000 iterations, without any LLM fine-tuning or cloud infrastructure.
Open Code Space Exploration: Unlike traditional NAS constrained to cell-based encodings, this method operates in an unconstrained open code space, allowing for the invention of novel architectural patterns.

4. Experimental Results

The pipeline was evaluated on CIFAR-10, CIFAR-100, and ImageNette using three distinct frozen LLMs: DeepSeek-Coder-6.7B, Qwen2.5-7B, and GLM-5.

Performance Gains (CIFAR-10 One-Epoch Proxy Accuracy):
- DeepSeek-Coder-6.7B: Improved from 28.2% (single-shot baseline) to 69.2% (Spearman $\rho = 0.75$).
- Qwen2.5-7B: Improved from 50.0% to 71.5% (Spearman $\rho = 0.56$). Despite a low success rate (18.8%), it achieved the highest peak accuracy by exploring ambitious architectures.
- GLM-5: Improved from 43.2% to 62.0% (Spearman $\rho = 0.42$) with the highest success rate (91.0%) in only 100 iterations.
Search Efficiency: A full 2,000-iteration search completed in approximately 18 GPU hours on a single RTX 4090.
Ablation Studies: Removing the historical feedback memory or the reference architecture caused the search to stagnate or degrade, confirming that modeling failure causality is critical for iterative improvement.
Cross-Dataset Generalization: The method showed consistent upward trends across all datasets, though performance varied based on input resolution and model specialization (e.g., DeepSeek struggled with high-resolution ImageNette due to context retention issues).
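The summary does not state which two quantities the reported Spearman $\rho$ relates. Assuming it is the rank correlation between iteration index and proxy accuracy (so a positive value means accuracy tends to rise as the search proceeds), it could be computed as in the sketch below. The helper names and the five accuracy values are made up for illustration and are not results from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the ranks (undefined for constant input)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Illustrative data only: accuracy rises overall but with two local dips.
iterations = [1, 2, 3, 4, 5]
accuracy = [0.30, 0.28, 0.45, 0.41, 0.60]
rho = spearman_rho(iterations, accuracy)
print(round(rho, 3))
```

Under this reading, a value near 1 indicates steady improvement over the search, while a value near 0 indicates that later iterations are no better than earlier ones on average.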

5. Significance and Impact

Democratization of NAS: This work establishes a low-budget, reproducible paradigm for NAS, making advanced architecture search accessible to researchers without access to massive compute clusters or proprietary frontier models.
Learning from Failure: By treating code execution errors as structured learning signals, the method overcomes a major limitation of previous LLM-based optimizers that discard failure trajectories.
Hardware Efficiency: The implicit bias toward compact models due to shared VRAM constraints makes this approach particularly relevant for edge AI and deployment on resource-limited devices.
Scalability: The use of bounded Markovian memory allows the system to scale to thousands of iterations without context window overflow, a common bottleneck in iterative LLM optimization.

In conclusion, the paper demonstrates that combining structured iterative feedback with small, frozen LLMs creates a powerful, efficient, and hardware-aware framework for automating neural network design.