AI-Driven Feature Selection Using Only Survey Variable Descriptions: Large Language Models Identify Adolescent Vaping Predictors

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to solve a mystery: Why do some teenagers start vaping, while others don't?

Usually, to solve this, you'd need to interview thousands of kids, asking them hundreds of questions about their friends, family, school, and feelings. You'd then feed all that data into a super-computer to find the patterns. But what if you couldn't see the answers? What if you only had the list of questions themselves?

That is exactly what this paper does. The researchers asked a kind of AI called a Large Language Model (LLM) to look only at the descriptions of survey questions and guess which ones would be the most important clues to solve the mystery.

Here is the breakdown of their adventure:

1. The Cast of Characters

The Mystery: Predicting if a 12-to-16-year-old who has never used tobacco will start vaping in the next year.
The Data: A massive survey called the PATH study, which has over 200 different questions (variables) about kids' lives.
The Detectives (The AI): The researchers didn't use just one detective; they hired four different "super-brains" (GPT-4o, LLaMA 3.1, Qwen 2.5, and DeepSeek-V3). These are advanced AI models that are really good at understanding human language.

2. The Challenge: The "Menu" vs. The "Meal"

Normally, to train a computer to predict something, you need the Meal (the actual data: "Kid A said yes, Kid B said no").
But in this experiment, the researchers only gave the AI the Menu (the list of question titles and descriptions, like "How often do your friends smoke?" or "Do you think vaping is dangerous?").

They asked the AI: "Based on the description of this question, how important is it for predicting if a kid will start vaping?"

The AI had to use its "common sense" and knowledge of the world to rank the questions, without ever seeing a single real answer from a real kid.

3. The Experiment

The researchers had each of the four AI detectives score every question, then kept each detective's top 50 clues (questions), then the top 45, then 40, and so on in steps of five, all the way down to 10.

Then, they took those AI-selected clues and fed them into a standard computer program (called LightGBM) to see if it could actually predict the future. They compared this to a program that tried to use all 214 questions at once.

4. The Results: The AI Got It Right!

The results were surprisingly impressive:

Agreement: Even though the four AI models were built differently and trained on different data, they mostly agreed on the same clues. It's like four different experts looking at a menu and all pointing to the same three ingredients as the most important for the recipe.
The "Sweet Spot": When the program used just the 30 questions picked by the best detective, it predicted the outcome better than when it tried to use all 214 questions.
- Analogy: It's like trying to find a needle in a haystack. The AI didn't just find the needle; it told you exactly which 30 pieces of hay to look at, ignoring the other 184 that were just distractions.
The Winner: The model named Qwen 2.5 was the star of the show, achieving the highest score (an AUC of 0.791, where 1.0 is perfect and 0.5 is a coin flip) with just 30 selected variables.

5. Why This Matters (The "So What?")

This is a big deal for three reasons:

Privacy: You don't need to see the private answers of thousands of kids to find the important patterns. You just need the list of questions. This protects privacy.
Speed & Cost: Instead of running complex, expensive computer simulations on massive datasets, you can just ask an AI to read the survey questions and tell you what matters. It's a "lightweight" solution.
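Reliability: The fact that different AIs agreed on the same factors (like peer pressure, family influence, and risk perception) suggests that these are meaningful drivers of vaping, not just random noise in the data. Agreement between AIs is not proof on its own, but these factors also match what earlier tobacco research has found.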

The Bottom Line

Think of this study as teaching a computer to be a smart editor. Instead of drowning in a sea of more than 200 survey questions, the AI can read the "table of contents" and tell researchers, "Hey, you only really need to focus on these 30 chapters to understand the story."

This opens the door for faster, cheaper, and more private ways to study health problems, using the power of AI to cut through the noise and find the signal.

1. Problem Statement

The study addresses the challenge of identifying reliable predictors for Electronic Nicotine Delivery Systems (ENDS) initiation among adolescents using high-dimensional survey data.

Context: Traditional statistical methods (e.g., regression) and standard machine learning feature selection techniques (e.g., RFE, Lasso, SHAP) often rely on iterative retraining with raw data, require expert domain knowledge for covariate selection, and can be unstable or sensitive to sample-specific biases.
Gap: While Large Language Models (LLMs) have shown promise in text analysis, their ability to perform zero-shot feature selection using only textual variable descriptions (without accessing raw individual-level data) in tobacco regulatory science remains under-explored.
Goal: To evaluate whether instruction-tuned LLMs can effectively identify the most predictive survey variables for adolescent vaping initiation solely based on variable names and descriptions, thereby creating a privacy-preserving and scalable framework.

2. Methodology

Data Source and Preprocessing

Dataset: Data from the Population Assessment of Tobacco and Health (PATH) Study, specifically merged waves 4.5 (Dec 2017–Dec 2018) and 5 (Dec 2018–Nov 2019).
Population: 7,943 tobacco-naïve adolescents aged 12–16 years at baseline (Wave 4.5).
Outcome: Binary ENDS use status at Wave 5 (used in the past 30 days: Yes/No).
Variables: Started with 1,396 survey variables; filtered to 214 variables after removing those with >2.5% missing values, low variation, or irrelevance.
Privacy Constraint: The LLMs were provided only with the variable names and their textual descriptions. No raw survey data or individual responses were fed into the LLMs.
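
The first two filtering rules can be sketched as follows, assuming each variable's responses arrive as a Python list with None marking a missing value. The 2.5% missing-value cutoff comes from the summary above. The `max_dominant` threshold for "low variation" and all function names are illustrative assumptions, since the summary does not say how low variation was defined. Relevance screening is a judgment call and is not modeled here.

```python
from collections import Counter

def missing_rate(values):
    """Fraction of responses recorded as missing (None in this sketch)."""
    return sum(v is None for v in values) / len(values)

def dominant_share(values):
    """Share of non-missing responses taken by the single most common answer."""
    observed = [v for v in values if v is not None]
    if not observed:
        return 1.0  # nothing observed: treat as having no variation
    return Counter(observed).most_common(1)[0][1] / len(observed)

def keep_variable(values, max_missing=0.025, max_dominant=0.99):
    """Keep a variable if it has few missing values and some variation.

    max_missing mirrors the 2.5% rule; max_dominant is an assumed stand-in
    for the unspecified "low variation" criterion.
    """
    return missing_rate(values) <= max_missing and dominant_share(values) < max_dominant

# Toy columns (hypothetical data, 100 respondents each).
balanced = [0, 1] * 50              # no missing values, two answers
too_sparse = [None] * 5 + [0, 1] * 47 + [0]   # 5% missing
constant = [1] * 100                # every respondent gave the same answer

kept = [name for name, col in
        {"balanced": balanced, "too_sparse": too_sparse, "constant": constant}.items()
        if keep_variable(col)]
```

Only the `balanced` column survives in this toy example: `too_sparse` exceeds the missing-value cutoff and `constant` carries no information.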

LLM-Based Feature Selection

Models Evaluated: Four state-of-the-art instruction-tuned LLMs:
1. GPT-4o
2. LLaMA 3.1-70B
3. Qwen 2.5-72B-Instruct
4. DeepSeek-V3
Process:
- Each variable (name + description) was given to the LLMs in a prompt asking for an importance score (0–1) reflecting its predictive power for future ENDS use.
- Stability Check: The process was repeated for 15 independent runs per model to calculate mean importance scores and assess consistency (using Relative Mean Deviation, Coefficient of Variation, and Variance).
- Selection: Variables were ranked by mean scores. Top-$k$ subsets were created for $k \in \{50, 45, 40, 35, 30, 25, 20, 15, 10\}$.
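
The scoring, stability, and selection steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the prompt wording, the variable names, and the toy scores are hypothetical, and the formulas used for Relative Mean Deviation (mean absolute deviation divided by the mean) and Coefficient of Variation (population standard deviation divided by the mean) are common definitions that the summary does not spell out. No LLM is called; `parse_score` only shows how a numeric reply might be read.

```python
import re
import statistics

K_VALUES = list(range(50, 5, -5))  # 50, 45, ..., 10, as listed in the text

def build_prompt(name, description):
    """Hypothetical prompt wording; the paper's exact prompt is not given here."""
    return (
        "Rate from 0 to 1 how important this survey variable is for predicting "
        "future ENDS use by a tobacco-naive adolescent.\n"
        f"Variable: {name}\nDescription: {description}\n"
        "Answer with a single number."
    )

def parse_score(reply):
    """Pull the first number out of a model reply and clamp it to [0, 1]."""
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        raise ValueError(f"no score found in reply: {reply!r}")
    return min(1.0, max(0.0, float(match.group())))

def stability(scores):
    """Mean, RMD, CV, and variance of one variable's scores across runs."""
    mean = statistics.fmean(scores)
    var = statistics.pvariance(scores)
    if mean == 0:
        return {"mean": 0.0, "rmd": 0.0, "cv": 0.0, "var": var}
    mad = statistics.fmean([abs(s - mean) for s in scores])
    return {"mean": mean, "rmd": mad / mean, "cv": var ** 0.5 / mean, "var": var}

def top_k(run_scores, k):
    """Rank variables by mean score over runs (ties broken by name), keep k."""
    means = {name: statistics.fmean(s) for name, s in run_scores.items()}
    return sorted(means, key=lambda n: (-means[n], n))[:k]

# Toy scores from three runs (the study used 15); names are made up.
runs = {
    "peer_use": [0.9, 0.9, 0.8],
    "harm_perception": [0.8, 0.7, 0.8],
    "household_size": [0.2, 0.3, 0.1],
}
selected = top_k(runs, 2)
```

In this toy example `selected` holds `peer_use` and `harm_perception`. In the full workflow, `top_k` would be called once per value in `K_VALUES` for each model.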

Predictive Modeling

Classifier: LightGBM (Gradient Boosting Decision Trees) was used to predict ENDS use status.
Training Strategy:
- Data split: 80% training, 20% hold-out test.
- Hyperparameter tuning: Optuna (Bayesian optimization) with 5-fold cross-validation to maximize CV-AUC.
- Class imbalance handling: LightGBM's is_unbalance=True setting.
- Evaluation: Models were trained 100 times with different random seeds (1–100) to compute mean AUC and standard deviation.
Baseline: A LightGBM model trained on all 214 variables (AUC ~0.768).
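
The evaluation metric and the seed-level summary can be illustrated without any modeling library. The sketch below does not train LightGBM or run Optuna; it only shows one way to compute AUC from labels and predicted scores, and to reduce per-seed AUCs to a mean and standard deviation. The function names and toy numbers are hypothetical, and the use of the sample standard deviation is an assumption, since the summary does not say which form was reported.

```python
import statistics

def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win. This pairwise
    form is quadratic in the sample size, which is fine for a small sketch.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def summarize(aucs):
    """Mean and sample standard deviation of AUCs collected over seeds."""
    return statistics.fmean(aucs), statistics.stdev(aucs)

# Toy hold-out set: 1 = used ENDS at follow-up, 0 = did not.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
example_auc = roc_auc(labels, scores)  # 3 of 4 positive-negative pairs ranked correctly

# Toy AUCs from three seeds (the study used seeds 1-100).
mean_auc, sd_auc = summarize([0.78, 0.80, 0.79])
```

Because AUC depends only on how cases are ranked, it is not distorted by the class imbalance that the is_unbalance setting addresses during training.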

3. Key Contributions

Zero-Shot Feature Selection: Demonstrated that LLMs can identify high-value predictors using only textual descriptions, eliminating the need to expose sensitive raw data during the feature selection phase.
Cross-Model Consistency: Showed that diverse LLM architectures (from different vendors and trained on different data) converge on a highly overlapping set of predictors (31 variables were common across all top-50 lists), suggesting a robust, shared semantic understanding of risk factors.
Performance Superiority: Showed that models trained on LLM-selected subsets (specifically 30–40 features) outperformed or matched models trained on the full 214-variable dataset, reducing dimensionality without sacrificing predictive power.
Privacy-Preserving Framework: Established a workflow where feature engineering is decoupled from data access, enhancing privacy and scalability for public health research.

4. Results

Stability and Consistency

Internal Consistency: All four LLMs demonstrated high stability across 15 runs (RMD: 0–0.15; CV: 0–0.12).
Overlap: The top 50 variables selected by the four models shared 31 variables in common. These variables spanned critical domains:
- Peer and household influence.
- Risk perception regarding tobacco.
- Exposure to tobacco-related cues and advertisements.
- Personal attitudes and intentions.
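
The overlap count is a set intersection across the four top-50 lists. A minimal sketch, using made-up variable names and much shorter lists than the study's:

```python
def shared_variables(top_lists):
    """Variables that appear in every model's top-k list, sorted by name."""
    sets = [set(names) for names in top_lists.values()]
    return sorted(set.intersection(*sets))

# Hypothetical top-4 lists; the study compared top-50 lists and found 31 shared.
top_lists = {
    "GPT-4o": ["peer_use", "harm_perception", "ad_exposure", "curiosity"],
    "LLaMA 3.1-70B": ["peer_use", "harm_perception", "household_use", "curiosity"],
    "Qwen 2.5-72B-Instruct": ["peer_use", "curiosity", "harm_perception", "grades"],
    "DeepSeek-V3": ["harm_perception", "peer_use", "curiosity", "ad_exposure"],
}
common = shared_variables(top_lists)
```

Here `common` contains the three variables that all four toy lists share: `curiosity`, `harm_perception`, and `peer_use`.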

Predictive Performance (AUC)

Baseline: LightGBM on all 214 variables achieved an AUC of 0.768 (SD: 0.027).
Best Performance:
- Qwen 2.5-72B-Instruct achieved the highest AUC of 0.791 (SD: 0.024) using only 30 features.
- LLaMA 3.1-70B achieved an AUC of 0.789 (SD: 0.029) with 40 features.
- GPT-4o achieved 0.784 (SD: 0.027) with 35 features.
- DeepSeek-V3 achieved 0.772 (SD: 0.032) with 35 features.
Comparison: The LLM-selected subsets outperformed the full-dataset baseline in mean AUC, particularly in the 30–40 feature range, though each reported gain is smaller than the corresponding standard deviation.
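
The size of these differences can be checked with simple arithmetic on the reported numbers. The sketch below uses only the AUC values listed above; the variable names are illustrative.

```python
BASELINE_AUC, BASELINE_SD = 0.768, 0.027

# Best reported mean AUC for each model, taken from the list above.
best_auc = {
    "Qwen 2.5-72B-Instruct": 0.791,
    "LLaMA 3.1-70B": 0.789,
    "GPT-4o": 0.784,
    "DeepSeek-V3": 0.772,
}

# Gain over the 214-variable baseline, rounded to the reported precision.
gains = {model: round(auc - BASELINE_AUC, 3) for model, auc in best_auc.items()}

# Whether each gain is smaller than the baseline's run-to-run SD.
within_one_sd = {model: gain < BASELINE_SD for model, gain in gains.items()}
```

The gains come out to 0.023, 0.021, 0.016, and 0.004 respectively, each below the baseline standard deviation of 0.027, so the improvement is real in the means but modest relative to seed-to-seed variation.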

5. Significance and Implications

Efficiency: The approach cuts the predictor set from 214 variables to roughly 30–40, which lowers computational cost and makes the resulting models easier to interpret.
Validation: The selected variables align with established tobacco regulatory research and previous studies using SHAP/XGBoost, which supports the semantic reasoning capabilities of LLMs in epidemiology.
Scalability: This framework is applicable to other high-dimensional public health datasets where data privacy is a concern or where rapid variable screening is needed before data collection.
Future Directions: The authors suggest that while current results are promising, future work should explore domain-specific fine-tuning of LLMs and soft-thresholding integration to capture weak but complementary predictors.

Conclusion: The study suggests that instruction-tuned LLMs can be effective tools for text-based feature selection in behavioral health, offering a scalable, interpretable, and privacy-preserving alternative to traditional data-driven feature engineering.