MedFeat: Model-Aware and Explainability-Driven Feature Engineering with LLMs for Clinical Tabular Prediction

MedFeat is a feedback-driven framework that uses Large Language Models to perform model-aware, explainability-guided feature engineering. By prioritizing informative signals that downstream models struggle to learn directly, it achieves robust, clinically meaningful improvements across diverse healthcare tabular prediction tasks.

Zizheng Zhang, Yiming Li, Justin Xu, Jinyu Wang, Rui Wang, Lei Song, Jiang Bian, David W Eyre, Jingjing Fu

Published 2026-03-04

Imagine you are a doctor trying to predict which patients might get sick or pass away soon. You have a massive spreadsheet (a "tabular dataset") filled with numbers: age, heart rate, blood pressure, lab results, and more.

In the world of machine learning, there's a long-standing debate: Should we use simple, classic math tools (like decision trees) or complex, deep neural networks (like the ones that power AI art) to make these predictions?

Surprisingly, for medical spreadsheets, the simple tools often win. But they need help. They need someone to look at the raw numbers and say, "Hey, if you combine age and blood pressure in this specific way, it tells us something new." This is called Feature Engineering.
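As a toy illustration of what feature engineering means here (the column names and formulas below are invented for illustration, not taken from the paper):

```python
# Toy feature engineering: deriving new columns from raw vitals so a simple
# model can use the combined signal directly. All names and formulas here
# are illustrative, not the paper's actual features.
patients = [
    {"age": 70, "sys_bp": 180, "dia_bp": 100},
    {"age": 35, "sys_bp": 120, "dia_bp": 80},
]
for p in patients:
    # Pulse pressure is a real clinical quantity (systolic minus diastolic).
    p["pulse_pressure"] = p["sys_bp"] - p["dia_bp"]
    # An interaction term a linear model could not learn from raw columns.
    p["age_x_sys_bp"] = p["age"] * p["sys_bp"]
```

A linear model given only `age` and `sys_bp` cannot represent their product; handing it `age_x_sys_bp` as a ready-made column is exactly the kind of help the article describes.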

Traditionally, this was done by human experts. It was slow, expensive, and hard to scale. Then came LLMs (Large Language Models) like the one you are talking to now. They know a lot of medical facts. But early attempts to use them were like throwing a dart at a board while blindfolded: they guessed random combinations without checking if the computer model actually needed them.

Enter MedFeat.

Here is how MedFeat works, explained through a simple analogy:

The Analogy: The Master Chef and the Picky Eater

Imagine you are a Master Chef (The LLM) trying to create a new dish to impress a very Picky Eater (The Machine Learning Model).

  1. The Problem: The Picky Eater has a specific palate.

    • If the Eater is a Logistic Regression model, they only like simple, straight-line flavors. They can't taste complex curves unless you explicitly mix ingredients for them.
    • If the Eater is XGBoost (a tree-based model), they are great at spotting complex patterns on their own, but they might miss subtle, long-term trends or global statistics.
    • Old methods would just throw random ingredients at the Eater and see what sticks.
    • MedFeat is different. It asks the Eater, "What are you struggling to taste right now?" and then tells the Chef exactly what to cook.
  2. The "Feedback Loop" (The Taste Test):
    MedFeat doesn't just guess. It runs a simulation:

    • Step 1: It trains the model on the current data.
    • Step 2: It uses a tool called SHAP (think of this as a "flavor analyzer") to see which ingredients the model is already using and which ones it is ignoring.
    • Step 3: It tells the LLM: "The model is good at spotting high heart rates, but it's missing the instability of the heart rate over time. Also, it's ignoring the patient's age. Go make a new ingredient that combines those two."
  3. The "Island" Strategy (Don't Eat the Whole Buffet):
    Medical spreadsheets have hundreds of columns. If you ask the LLM to look at all of them at once, it gets confused (like trying to read a whole library in one second).

    • MedFeat groups the most important ingredients into small "Islands" (tiny subsets of data).
    • It sends just one Island to the LLM at a time. This keeps the instructions short, focused, and cheap to run.
  4. The Memory Bank (Learning from Mistakes):
    If the LLM suggests a "feature" (a new calculation) that makes the model worse, MedFeat remembers: "Don't do that again." If it suggests something great, it remembers: "Do more of that." This creates a cycle of continuous improvement.
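The taste test, the islands, and the memory bank can be sketched as a single loop. Everything below is hypothetical: a toy scorer stands in for training the model and computing SHAP values, and a fixed candidate list stands in for the LLM's proposals.

```python
def evaluate(features):
    """Stand-in for: train the model, return a validation score (e.g. AUROC)."""
    gains = {"age": 0.02, "hr_mean": 0.03,
             "hr_instability": 0.05, "age_x_bp": 0.04}
    return 0.70 + sum(gains.get(f, -0.01) for f in features)

def make_islands(ranked_features, size=2):
    """Group top-ranked columns into small subsets ('islands') so each
    LLM prompt stays short, focused, and cheap."""
    return [ranked_features[i:i + size]
            for i in range(0, len(ranked_features), size)]

def propose_feature(island, existing, memory):
    """Stand-in for the LLM call: given one island's metadata as context,
    suggest a candidate feature that hasn't already been tried."""
    for c in ["hr_instability", "age_x_bp", "noise_ratio"]:
        if c not in memory["rejected"] and c not in existing:
            return c
    return None

features = ["age", "hr_mean"]
memory = {"accepted": [], "rejected": []}   # the "memory bank"
best = evaluate(features)

for _ in range(5):                          # refinement rounds
    island = make_islands(features)[0]      # only this island goes in the prompt
    cand = propose_feature(island, features, memory)
    if cand is None:
        break
    score = evaluate(features + [cand])     # the "taste test"
    if score > best:                        # keep it, and remember it worked
        features.append(cand)
        memory["accepted"].append(cand)
        best = score
    else:                                   # "don't do that again"
        memory["rejected"].append(cand)
```

In this toy run the loop accepts `hr_instability` and `age_x_bp` (they raise the score), rejects `noise_ratio` (it lowers it), and then stops because no untried candidates remain.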

Why is this a big deal?

  • It's Privacy-Safe: The LLM never sees the actual patient names or private records. It only sees the "recipe" (metadata) and the "flavor scores" (importance rankings). No patient data leaves the hospital.
  • It's Robust: The paper tested this on data from different years and different hospitals. The features MedFeat discovered (like "how unstable a patient's vitals are") worked well even when the data changed. It's like discovering a universal law of physics rather than a rule that only works on Tuesdays.
  • It's Explainable: Because the LLM is guided by the model's actual needs, the new features it creates make sense to doctors. They aren't just random math; they are clinically meaningful insights.
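To make the privacy point concrete, here is a hypothetical sketch of what the prompt payload could look like: column metadata and importance rankings only, never patient rows. The field names and values are invented for illustration.

```python
import json

# Hypothetical privacy-safe prompt payload: the LLM sees column metadata
# and importance rankings, not the underlying patient records.
columns = [
    {"name": "hr_mean", "dtype": "float", "description": "mean heart rate over 24h"},
    {"name": "age", "dtype": "int", "description": "patient age in years"},
]
importance = {"hr_mean": 0.41, "age": 0.07}   # e.g. mean |SHAP| per column

payload = {
    "task": "24-hour mortality prediction",
    "columns": columns,
    "importance_ranking": sorted(importance, key=importance.get, reverse=True),
}
prompt = json.dumps(payload, indent=2)  # this string, not patient data, is sent
```

Only the "recipe" crosses the boundary: the serialized payload describes the dataset's shape and which columns the model leans on, which is all the LLM needs to propose new features.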

The Result

In the paper, MedFeat was tested on predicting things like 24-hour mortality (will the patient die in the next day?) and heart failure.

  • Without MedFeat: The models were okay, but missed subtle signals.
  • With MedFeat: The models got significantly better at spotting high-risk patients, even without needing hours of expensive tuning.

In short: MedFeat is a smart assistant that talks to a computer model, asks what it's missing, and then uses a medically knowledgeable AI to invent the new data points that fill those gaps. It turns a messy spreadsheet into a much clearer crystal ball for patient care.
