LeJOT-AutoML: LLM-Driven Feature Engineering for Job Execution Time Prediction in Databricks Cost Optimization

Imagine you are the manager of a massive, high-speed delivery network (like a giant pizza chain, but for data). Your goal is to get thousands of orders delivered as cheaply as possible without them getting cold (latency).

To do this, you need to know exactly how long each delivery will take. If you guess wrong, you either send a tiny scooter for a huge order (it arrives late) or a massive truck for a single slice of pizza (you waste money on fuel).

This is the problem LeJOT tries to solve for Databricks (a giant cloud data platform). But here's the catch: predicting delivery time is incredibly hard because every order is different, and the traffic changes every second.

The Old Way: The "Senior Chef"

Traditionally, companies hired a team of expert chefs (data engineers) to write a manual recipe for predicting time.

The Problem: These chefs had to guess based on static rules. They looked at the menu (the code) and the size of the kitchen (the server), but they couldn't see the actual traffic jams or the burnt crusts happening in real-time.
The Result: It took them a month to write a new recipe, and even then, it often missed the subtle details that make a delivery slow or fast.

The New Way: LeJOT-AutoML (The "AI Super-Intern")

The authors built LeJOT-AutoML, which is like hiring a team of super-smart AI interns who never sleep, never get tired, and can read the entire history of the kitchen in seconds.

Here is how it works, broken down into simple steps:

1. The Detective (Feature Analyzer Agent)

Instead of guessing, this AI detective reads the "kitchen logs," the "menu," and the "weather reports" (historical data). It uses a special library of knowledge (RAG) to ask: "What actually makes a pizza take 20 minutes instead of 10?"

The Magic: It doesn't just look at the size of the pizza; it notices that "if the cheese is heavy and the oven is hot, the crust burns faster." It finds 200+ clues (features) that the human chefs missed.

2. The Builder (Feature Extraction Agent)

Once the detective has a list of clues, the Builder goes to work. It has a special set of tools (the MCP Toolchain) that let it peek into the real-time kitchen without breaking anything.

It checks the logs: "Did the oven jam?"
It checks the data: "Is the dough stuck in a corner?"
Safety First: Before it writes anything down, it runs the plan through a "Safety Gate." It asks: "Are we using information we shouldn't have yet?" (Like checking the delivery time before the pizza is even baked). If the answer is yes, it throws the plan away.

3. The Judge (Feature Evaluation Agent)

This agent tastes the soup. It looks at the clues the Builder found and asks: "Is this clue actually useful, or is it just noise?"

If a clue is unreliable, the Judge tells the Detective to try again.
This happens in a loop, refining the recipe over several rounds until the clues are reliable enough to use.

4. The Speed Run

The best part? While the human chefs took one month to write a new recipe, this AI team does it in 20 to 30 minutes.

The Results: Why It Matters

When the researchers tested this on real enterprise data:

More Clues: The AI found over 200 features to predict time, compared with roughly 40 from the human experts.
Cost Savings: Because the AI's predictions are good enough to match each job to a cheaper server that still meets its deadline, the system cut the company's cloud bill by about 19%.
Adaptability: If the "kitchen" changes (new servers, new software), the AI can rebuild its recipe in minutes, while humans take weeks to adapt.

The Bottom Line

Think of LeJOT-AutoML as upgrading from a paper map to a live GPS with traffic cameras.

Old Way: "I think this road takes 20 minutes because it's usually 20 minutes."
New Way: "I see a traffic jam, a broken traffic light, and a detour. I'm rerouting you to save 15% on gas and get there faster."

It turns a slow, manual, error-prone process into a fast, self-improving one that cut cloud costs by about 19% in the reported deployment by capturing the "hidden" details of how data moves.

1. Problem Statement

In enterprise cloud environments like Databricks, job orchestration systems (e.g., LeJOT) aim to minimize cloud costs by selecting the most cost-effective compute configurations while adhering to latency and dependency constraints. The core challenge lies in accurately predicting job execution times under heterogeneous instance types and non-stationary runtime conditions.
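
The selection step described above can be sketched as follows. This is a minimal illustration, not LeJOT's actual orchestration logic: the `Candidate` class, the instance names, the prices, and the predicted runtimes are all hypothetical, and dependency constraints are ignored. It assumes cost is simply hourly price times predicted runtime.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str                 # hypothetical instance label
    hourly_price: float       # assumed price per hour (illustrative units)
    predicted_seconds: float  # runtime predicted for this job on this instance


def job_cost(c: Candidate) -> float:
    """Estimated cost: hourly price times predicted runtime in hours."""
    return c.hourly_price * c.predicted_seconds / 3600.0


def cheapest_within_deadline(candidates, deadline_seconds) -> Optional[Candidate]:
    """Pick the lowest-cost candidate whose predicted runtime meets the deadline."""
    feasible = [c for c in candidates if c.predicted_seconds <= deadline_seconds]
    return min(feasible, key=job_cost) if feasible else None


candidates = [
    Candidate("small", 1.0, 7200.0),   # cheapest per hour, but misses a 1-hour deadline
    Candidate("medium", 2.0, 3000.0),
    Candidate("large", 5.0, 1500.0),
]
choice = cheapest_within_deadline(candidates, deadline_seconds=3600.0)
print(choice.name if choice else "no feasible configuration")
```

The sketch makes the dependence on prediction quality visible: an underestimated `predicted_seconds` can admit a configuration that misses the deadline, and an overestimate can push the choice toward a more expensive instance.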

Current pipelines face four critical obstacles:

Runtime-Dependent Signals: High-impact performance factors (e.g., partition pruning effectiveness, data skew, shuffle amplification, and executor scheduling) only emerge at runtime and are invisible to static analysis.
Data Fragmentation: Relevant signals are scattered across disparate sources: execution logs, metadata stores, job scripts, and configuration histories.
Manual Engineering Bottlenecks: Traditional feature engineering requires deep domain expertise in Spark SQL and platform internals. It is slow, brittle, and often lags behind evolving workloads.
Stale Predictors: Slow retraining cycles lead to outdated models when workload drift occurs, degrading orchestration quality and increasing costs.
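
To make the "runtime-dependent signals" concrete, here is a minimal sketch of how two of them might be computed from logs of a previous run. The summary names these signals but does not define them, so both formulas are assumptions: skew severity is taken as the slowest task's duration over the median task duration, and shuffle amplification as shuffled bytes over input bytes.

```python
from statistics import median


def skew_severity(task_seconds):
    """Slowest task duration divided by the median; 1.0 means no skew (assumed definition)."""
    if not task_seconds:
        raise ValueError("need at least one task duration")
    med = median(task_seconds)
    return max(task_seconds) / med if med > 0 else float("inf")


def shuffle_amplification(input_bytes, shuffle_bytes):
    """Bytes shuffled per byte read (assumed definition)."""
    if input_bytes <= 0:
        raise ValueError("input_bytes must be positive")
    return shuffle_bytes / input_bytes


# Hypothetical task durations (seconds) for one stage of a past run.
stage_tasks = [10.0, 12.0, 11.0, 95.0]
print(skew_severity(stage_tasks))
print(shuffle_amplification(input_bytes=1000, shuffle_bytes=2500))
```

Neither value can be read off the job script alone, since both depend on how the data was actually distributed when the job ran.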

2. Methodology: LeJOT-AutoML Framework

The authors propose LeJOT-AutoML, an agent-driven AutoML framework that embeds Large Language Model (LLM) agents throughout the machine learning lifecycle. It utilizes a Model Context Protocol (MCP) toolchain to bridge the gap between static artifacts and dynamic runtime data.

Core Architecture

The system operates in two phases: Automated Training and Online Inference.

A. The Agent Ecosystem

Feature Analyzer Agent (FAA):
- Uses Retrieval-Augmented Generation (RAG) to query a domain knowledge base (Spark SQL practices, platform rules).
- Analyzes heterogeneous inputs (logs, metadata, scripts) to propose a structured list of candidate feature templates.
Feature Extraction Agent (FExA):
- Translates FAA's proposals into executable code.
- Invokes the MCP toolchain to materialize features:
  - Metadata Tools: Schema, partitions, table statistics.
  - Log/Trace Tools: Stage/task timing, shuffle volumes.
  - Sandbox Tools: Read-only SQL queries to inspect execution plans without modifying data.
- Handles normalization, encoding, and data quality checks.
Feature Evaluation Agent (FEvA):
- Evaluates feature quality (coverage, stability, distribution shifts) and utility (importance, redundancy).
- Provides feedback to FAA/FExA to iteratively refine the pipeline.
Model Selector:
- Trains and selects the best predictor (e.g., XGBoost, LightGBM) based on performance metrics.
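
The kind of check FEvA performs can be sketched without any LLM involved. The example below is an assumption-laden illustration, not the authors' implementation: it covers only coverage and redundancy, the thresholds and feature names are hypothetical, and redundancy is approximated by Pearson correlation against features already kept.

```python
import math


def coverage(values):
    """Fraction of rows where the feature is present (not None)."""
    return sum(v is not None for v in values) / len(values)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def evaluate(features, min_coverage=0.8, max_abs_corr=0.95):
    """Return (kept, feedback); feedback maps each rejected feature to a reason."""
    kept, feedback = {}, {}
    for name, values in features.items():
        if coverage(values) < min_coverage:
            feedback[name] = "low coverage"
            continue
        redundant_with = None
        for kept_name, kept_values in kept.items():
            # Compare only rows where both features are present.
            pairs = [(a, b) for a, b in zip(values, kept_values)
                     if a is not None and b is not None]
            if len(pairs) >= 2 and abs(pearson(*zip(*pairs))) > max_abs_corr:
                redundant_with = kept_name
                break
        if redundant_with is not None:
            feedback[name] = f"redundant with {redundant_with}"
        else:
            kept[name] = values
    return kept, feedback


# Hypothetical candidate features over five historical jobs.
features = {
    "input_gb": [1.0, 2.0, 3.0, 4.0, 5.0],
    "input_mb": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0],
    "sparse_metric": [None, None, None, 1.0, None],
    "num_joins": [2.0, 0.0, 3.0, 1.0, 0.0],
}
kept, feedback = evaluate(features)
print(sorted(kept), feedback)
```

In the framework described above, the `feedback` dictionary plays the role of the message sent back to FAA/FExA so that the next iteration can drop or rework the rejected features.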

B. Safety Mechanisms

To ensure reliability in an enterprise setting, two strict Safety Gates filter all generated code before execution:

Code-Completion Checker: Verifies syntactic completeness, valid imports, and defined variables.
Data-Leakage Checker: Ensures features are computable only from information available before the scheduling decision (preventing the model from "cheating" by using post-run execution times).
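
A rough idea of what such gates could look like is sketched below. This is not the authors' implementation. The code-completion check uses Python's standard `ast` and `importlib` modules and is deliberately simple: it is flow-insensitive and does not resolve relative imports. The leakage check assumes each feature declares the source fields it reads, and the field names in `POST_RUN_FIELDS` are hypothetical.

```python
import ast
import builtins
import importlib.util


def check_code_completeness(source: str) -> list:
    """Return a list of problems; an empty list means the gate passes."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems, loaded = [], []
    defined = set(dir(builtins))
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            is_from = isinstance(node, ast.ImportFrom)
            for alias in node.names:
                if not (is_from and node.level):  # skip relative imports
                    target = node.module if is_from else alias.name
                    top_level = target.split(".")[0]
                    if importlib.util.find_spec(top_level) is None:
                        problems.append(f"unresolvable import: {top_level}")
                defined.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)
        elif isinstance(node, ast.ExceptHandler) and node.name:
            defined.add(node.name)
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                loaded.append(node.id)
            else:
                defined.add(node.id)
    # Flow-insensitive: a name counts as defined if it is bound anywhere.
    for name in dict.fromkeys(loaded):
        if name not in defined:
            problems.append(f"undefined name: {name}")
    return problems


# Hypothetical fields that only exist after a job has finished.
POST_RUN_FIELDS = {"execution_seconds", "end_time", "final_status"}


def check_data_leakage(feature_sources: dict) -> dict:
    """Map each leaking feature to the post-run fields it depends on."""
    return {
        name: set(fields) & POST_RUN_FIELDS
        for name, fields in feature_sources.items()
        if set(fields) & POST_RUN_FIELDS
    }


generated = "import math\n\ndef log_rows(row_count):\n    return math.log1p(row_count)\n"
print(check_code_completeness(generated))
print(check_data_leakage({
    "log_rows": {"table_row_count"},
    "speed_ratio": {"table_row_count", "execution_seconds"},
}))
```

A feature plan would pass only if both checks come back empty. Here `speed_ratio` would be rejected, because it reads a value that is not available before the scheduling decision.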

C. Mathematical Formulation

The system trades prediction accuracy against extraction latency. It selects a feature set $F$ that minimizes the expected loss $\ell$ of the predictor $M$ plus a $\lambda$-weighted extraction cost, subject to a runtime budget $B$:
$\min_{F} \mathbb{E}[\ell(M(x_F), y)] + \lambda \sum_{f \in F} c(f, d)$
subject to $\sum_{f \in F} c(f, d) \leq B$, where $c(f, d)$ is the extraction cost of feature $f$.
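
The summary gives the objective but not how it is solved. One common heuristic for budgeted selection problems of this shape is a greedy pass by benefit-per-cost, sketched below as a stand-in rather than the authors' method. It assumes each feature has a known, additive estimated loss reduction (a strong simplification), and all names and numbers are hypothetical. Greedy selection is not guaranteed to find the optimal set.

```python
def select_features(candidates, budget, lam):
    """Greedy budgeted selection.

    candidates: dict mapping feature name -> (estimated_loss_reduction, extraction_cost),
                with extraction_cost > 0.
    budget:     the runtime budget B.
    lam:        the cost weight lambda in the penalized objective.
    """
    ranked = sorted(candidates.items(),
                    key=lambda kv: kv[1][0] / kv[1][1],
                    reverse=True)
    chosen, spent = [], 0.0
    for name, (gain, cost) in ranked:
        if gain <= lam * cost:
            continue  # the lambda-weighted cost outweighs the estimated benefit
        if spent + cost > budget:
            continue  # adding this feature would break the budget constraint
        chosen.append(name)
        spent += cost
    return chosen, spent


# Hypothetical (loss reduction, extraction cost) estimates.
candidates = {
    "table_row_count": (5.0, 1.0),
    "shuffle_amplification": (8.0, 4.0),
    "plan_depth": (1.0, 2.0),
    "skew_severity": (6.0, 5.0),
}
chosen, spent = select_features(candidates, budget=6.0, lam=0.6)
print(chosen, spent)
```

With these numbers, `skew_severity` is dropped because it would exceed the budget, and `plan_depth` is dropped because its estimated benefit does not cover its weighted cost, which mirrors the two ways the formulation penalizes expensive features.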

3. Key Contributions

LLM-Powered AutoML Pipeline: Presented by the authors as the first framework to embed LLM agents across analysis, tool invocation, feature extraction, validation, and model selection for enterprise job runtime prediction.
Agent-Tool Collaboration via MCP: Successfully combines LLM planning with tool-based execution to extract dynamic, runtime-derived features (e.g., shuffle amplification, skew severity) that are inaccessible to purely static analysis.
Iterative Evaluation with Safety Gates: Introduces a feedback loop driven by an evaluation agent, secured by code-completion and data-leakage checks, enabling rapid, safe, and continuous model adaptation.

4. Experimental Results

The system was evaluated on enterprise Databricks workloads, comparing the AutoML approach against traditional manual feature engineering.

Feature Diversity: LeJOT-AutoML generated 200+ features (including log profiling, time-series, and driver node history) compared to 40+ in manual engineering.
Speed: The feature engineering and evaluation loop was reduced from ~1 month (manual) to 20–30 minutes (AutoML).
Prediction Accuracy:
- Manual: $R^2 = 0.91$, MAPE = 19.49%.
- AutoML: $R^2 = 0.81$, MAPE = 20.13%.
- Note: Manual engineering remained more accurate. The MAPE gap is small (0.64 percentage points), but the $R^2$ gap is larger (0.10). The AutoML approach trades that accuracy for much faster deployment.
Cost Savings: Integrated into the LeJOT orchestration pipeline, the AutoML solution achieved a 19.01% cost saving in the deployment setting.
Iterative Improvement: Over three evaluation iterations, the AutoML model's MAE dropped from 247.95 to 145.64, and $R^2$ improved from 0.61 to 0.81, demonstrating the effectiveness of the feedback loop.
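
For readers less familiar with the metrics quoted above, the sketch below shows their standard definitions in plain Python. The sample values are made up for illustration and are unrelated to the paper's data. Only the final line uses the reported MAE figures.

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target (e.g., seconds)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error; undefined if any true value is zero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual variance / total variance."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot


# Made-up runtimes (seconds) for three jobs.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]
print(mae(y_true, y_pred), mape(y_true, y_pred), r2(y_true, y_pred))

# Relative MAE reduction across the three reported iterations.
relative_mae_reduction = (247.95 - 145.64) / 247.95
print(f"{relative_mae_reduction:.1%}")
```

Applied to the reported figures, the drop in MAE from 247.95 to 145.64 is a reduction of roughly 41%.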

5. Significance and Conclusion

LeJOT-AutoML addresses cloud cost optimization by automating the most labor-intensive part of the ML pipeline: feature engineering.

Scalability: It enables continuous learning and adaptation to workload drift without human intervention, addressing the "staleness" problem in dynamic cloud environments.
Practicality: Despite an accuracy gap relative to expert-crafted features ($R^2$ of 0.81 versus 0.91), the system's ability to generate hundreds of features in minutes and achieve a 19.01% cost saving makes it valuable for large-scale operations.
Future Direction: The authors note that current limitations lie in capturing configuration-change trajectories and pricing context. Future work aims to enrich the tool interface with these signals to further close the accuracy gap with manual engineering.

In summary, LeJOT-AutoML demonstrates that LLM-driven agents, when coupled with safe toolchains, can effectively replace brittle manual processes, delivering a scalable, self-improving system for cloud resource optimization.