Imagine you are trying to teach a very smart, but slightly stubborn, chef (the Large Language Model or LLM) how to cook a perfect dish using a specific set of ingredients (your data).
The goal is Feature Transformation: taking raw ingredients (like flour, eggs, and sugar) and combining them in clever ways (like making a batter, caramelizing sugar, or emulsifying butter) to create a new, tastier dish (new features that improve predictive performance).
Here is the problem: The chef is smart, but if you just give them a static recipe card with a few examples, they might get bored, repeat the same old tricks, or try to mix salt with chocolate because they don't understand the goal of the dish. They need better guidance.
This paper proposes a new way to train the chef called Evolving Demonstration Optimization. Instead of giving the chef a static recipe, you build a living, breathing cookbook that gets smarter every time the chef cooks.
Here is how the process works, broken down into simple steps:
1. The Problem with Old Methods
- The "Blind Search" (Old AI): Imagine a robot trying to cook by randomly throwing ingredients into a pot. It tries millions of combinations. Most are inedible (invalid), and it takes forever to find a good one. It's inefficient and wasteful.
- The "Static Prompt" (Current LLMs): Imagine giving the chef a single, unchanging recipe card with three examples. The chef follows it, but if the card is boring or repetitive, the chef just copies those three examples over and over. They don't learn how to improve; they just memorize the examples.
2. The New Solution: A "Living Cookbook"
The authors propose a three-stage loop that turns the chef's experience into a better teacher.
Stage 1: The "Taste Test" (RL Exploration)
First, we don't ask the chef to cook yet. We send a robot assistant (Reinforcement Learning) into the kitchen to experiment wildly.
- The robot tries thousands of weird ingredient combinations.
- It tastes the result immediately. If a combination tastes bad, it throws it away. If it tastes good, it saves the recipe.
- Result: We now have a pile of "verified winners"—recipes that we know actually work. This is our Experience Library.
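The "taste test" loop above can be sketched as a toy search. Everything here is illustrative, not the paper's actual method: features are short lists of numbers, a "recipe" combines two features with an arithmetic op, and the scoring function is a crude covariance with the target standing in for real model feedback.

```python
import random

def apply_op(a, b, op):
    """Combine two toy features (lists of numbers) with a simple op."""
    if op == "add":
        return [x + y for x, y in zip(a, b)]
    if op == "mul":
        return [x * y for x, y in zip(a, b)]
    return [x - y for x, y in zip(a, b)]  # "sub"

def utility(feature, target):
    """Toy taste test: absolute covariance with the target (higher = tastier)."""
    n = len(feature)
    mf, mt = sum(feature) / n, sum(target) / n
    return abs(sum((f - mf) * (t - mt) for f, t in zip(feature, target)) / n)

def explore(features, target, budget=200, seed=0):
    """Randomly try recipes; save only the ones that beat the raw ingredients."""
    rng = random.Random(seed)
    baseline = max(utility(f, target) for f in features.values())
    winners = {}
    names = list(features)
    for _ in range(budget):
        i, j = rng.sample(names, 2)
        op = rng.choice(["add", "mul", "sub"])
        score = utility(apply_op(features[i], features[j], op), target)
        if score > baseline:                    # tastes good -> keep the recipe
            winners[(i, op, j)] = round(score, 3)
    # the Experience Library: verified winners, best first
    return sorted(winners.items(), key=lambda kv: kv[1], reverse=True)

features = {"x1": [1, 2, 3, 4], "x2": [2, 1, 4, 3], "x3": [1, 1, 2, 2]}
target = [3, 3, 7, 7]  # roughly x1 + x2, so combinations can beat any single feature
library = explore(features, target)
```

The key property mirrored here is that every entry in `library` has already been tasted and verified, so nothing downstream has to guess whether a recipe works.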
Stage 2: The "Cookbook Editor" (Refinement)
Now we take that pile of winning recipes and organize them for the chef. This is the most creative part:
- Cleaning: We throw out any recipe that looks good on paper but would explode in the oven (checking for math errors or invalid data).
- Storytelling (Chain-of-Thought): Instead of just listing recipes, we arrange them in a story. We show the chef: "First, we tried mixing A and B. That was okay. Then we added C. That was better. Finally, we heated it, and it was perfect." This shows the chef the path to improvement, not just the final result.
- Diversity Check: We make sure the cookbook isn't just 100 variations of "Spaghetti." We ensure there are soups, salads, and desserts too. We use a "variety meter" (Entropy) to make sure the chef sees many different types of cooking styles.
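The three editing passes above (cleaning, storytelling, diversity check) can be sketched on a hand-made library. The entry fields, the operator families, and the use of Shannon entropy over those families are illustrative assumptions, not lifted from the paper:

```python
import math
from collections import Counter

# A hypothetical experience library: each winning recipe carries a score
# and an operator family ("op"). One entry is deliberately broken.
library = [
    {"recipe": "x1+x2", "op": "add", "score": 0.92},
    {"recipe": "x1*x3", "op": "mul", "score": 0.61},
    {"recipe": "x2-x1", "op": "sub", "score": float("nan")},  # exploded in the oven
    {"recipe": "x3*x3", "op": "mul", "score": 0.74},
    {"recipe": "x1+x3", "op": "add", "score": 0.55},
]

# 1) Cleaning: drop anything with an invalid score (NaN, inf).
clean = [e for e in library if math.isfinite(e["score"])]

# 2) Storytelling: order weakest-to-strongest so the prompt reads as a
#    path of improvement ("that was okay... that was better... perfect").
trajectory = sorted(clean, key=lambda e: e["score"])

# 3) Diversity check: Shannon entropy over operator families. A value
#    near 0 means the cookbook is 100 variations of spaghetti.
counts = Counter(e["op"] for e in trajectory)
total = sum(counts.values())
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
```

On this toy library the broken `sub` entry is discarded, the four survivors are arranged into an improving trajectory, and the entropy over `{add, mul}` comes out at its maximum for two families, signalling a balanced cookbook.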
Stage 3: The "Master Class" (Generation & Feedback)
Now, we hand this evolved, organized, diverse cookbook to the chef (the LLM).
- The chef reads the stories and the progression of flavors.
- The chef creates a new dish based on what they learned.
- The Magic Loop: We taste the new dish. If it's delicious, we add it to the cookbook! If it's bad, we discard it.
- Next time, the chef reads a better cookbook because it now includes the new, successful dish. The library evolves.
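The closed loop in Stage 3 can be sketched end to end. The `propose` function below is a stand-in for the LLM call (a loud assumption: the real system prompts a model with the formatted library; here it just emits a random candidate so the loop is runnable), but the loop structure (build prompt, generate, taste, append winners) matches the stage as described:

```python
import random

def build_prompt(library):
    """Format the evolved cookbook as context for the chef."""
    lines = [f"Tried {r['recipe']} -> score {r['score']:.2f}" for r in library]
    return "Improve on these attempts:\n" + "\n".join(lines)

def propose(prompt, rng):
    """Placeholder for the LLM: returns a random candidate dish."""
    return {"recipe": f"candidate-{rng.randint(0, 999)}", "score": rng.random()}

def evolve(library, rounds=5, seed=1):
    rng = random.Random(seed)
    for _ in range(rounds):
        prompt = build_prompt(library)       # chef reads the current cookbook
        dish = propose(prompt, rng)          # chef cooks a new dish
        best = max(r["score"] for r in library)
        if dish["score"] > best:             # the taste test
            library.append(dish)             # the cookbook evolves
        # bad dishes are simply discarded
    return library

lib = [{"recipe": "x1+x2", "score": 0.40}]
lib = evolve(lib)
```

Because a dish is appended only when it beats the current best, the library's scores can only ratchet upward, which is the "magic loop" stability property the section describes.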
Why This is a Big Deal
- It's Self-Improving: The system doesn't need to reprogram the chef. It just updates the context (the examples) the chef sees. It's like upgrading the chef's library of reference books rather than trying to rewrite the chef's brain.
- It's Stable: Unlike the "Blind Search", which is chaotic, or the "Static Prompt", which gets stuck, this method consistently gets better over time.
- It Works for Everyone: Whether you use a tiny, open-source chef or a massive, expensive commercial chef, this method works because it focuses on the examples, not the chef's internal code (the model's weights).
The Takeaway
Think of this paper as a smart mentorship program. Instead of telling a student (the AI) "Here is the answer," the system says, "Here is a story of how we got to the answer, here are the mistakes we avoided, and here is how we improved step-by-step."
By constantly updating this story with real-world success stories, the AI learns to cook better dishes (transform data better) without needing to be retrained from scratch. It turns the "prompt" from a static instruction into a dynamic, evolving experience.