Classification of Adolescent Drinking via Behavioral, Biological, and Environmental Features: A Machine Learning Approach with Bias Control

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

🍷 The Big Picture: Finding the "Drinkers" Without the Crystal Ball

Imagine you are a school principal trying to figure out which students are sneaking alcohol. You have a huge list of data on every student: their grades, how they sleep, what their friends are like, their personality, and their family history.

The problem? Most students don't drink. In this study roughly 80% of them are non-drinkers, so finding the drinkers is like looking for needles in a haystack.

Also, there are two big traps in this data:

The Age Trap: Older teens are more likely to drink simply because they are older. If your computer program just learns "Older = Drinks," it's cheating. It's not finding the real reasons; it's just guessing based on birthdays.
The "Other Drugs" Trap: If you ask, "Do they smoke?" and they say "Yes," the computer might guess they drink too. But that's circular logic. We want to know why they drink, not just that they smoke.

The Goal: The researchers wanted to build a smart computer program that can spot a teen who drinks, using only everyday questions (like "How do you sleep?" or "What do you think about parties?"), while ignoring the "cheating" clues like age and other drugs.

🛠️ The Solution: "FocalTab" – The Super Detective

The team built a new tool they call FocalTab. Think of it as a super-detective with two special powers:

1. The "TabPFN" Brain (The Experienced Intern)

Usually, to teach a computer to recognize patterns, you have to feed it thousands of examples and let it study for a long time.

The Analogy: Imagine a new intern who has to read every single book in the library to learn how to spot a thief.
The Innovation: TabPFN is like an intern who has already read millions of different books (synthetic data) before they even got to your office. They already know the general rules of how data works. They can look at your specific list of students and say, "I've seen patterns like this before," almost instantly.

2. The "Focal Loss" Lens (The Magnifying Glass)

Remember the "needle in a haystack" problem? Most students are non-drinkers. If the computer just tries to be right most of the time, it will just guess "Non-drinker" for everyone and get 80% accuracy. That's useless!

The Analogy: Imagine a security guard who waves everyone through without looking. If 99 out of 100 visitors are good guys, he is "right" 99% of the time, yet he misses the one bad guy every single time.
The Innovation: "Focal Loss" is a special rule that tells the computer: "Stop worrying about the easy cases (the non-drinkers). Focus all your energy on the hard cases (the drinkers)." It forces the computer to pay extra attention to the minority group so it doesn't ignore them.

🚫 The "Bias Control" Filter

Before the detective starts working, they have to clean the evidence. The researchers did two crucial things:

The Age Filter: Clues that mostly just track getting older (like "I have a driver's license" or "I have more money") were either dropped or had the age effect mathematically subtracted out, and age itself was removed. Now, the computer has to figure out who drinks based on behavior, not just birthdays.
The "Other Drugs" Filter: They removed any questions about smoking or marijuana. They wanted to see if the computer could spot alcohol use on its own, without relying on the fact that the kid also smokes weed.

🏆 The Results: Who Won the Game?

The researchers tested their new detective (FocalTab) against old-school detectives (like standard Random Forests or simple math models).

The Old Detectives: When they were allowed to use "Age" and "Other Drugs" as clues, they were great at guessing. But as soon as you took those clues away, they crashed. They could no longer tell the two groups apart, and they got only about 12–24% of the non-drinkers right.
The New Detective (FocalTab): Even when stripped of the "cheating" clues (Age and Other Drugs), FocalTab still performed incredibly well.
- It correctly identified 80% of the drinkers.
- It correctly identified 80% of the non-drinkers.
- Why? Because it learned the real signs of drinking behavior, not just the easy shortcuts.

🔍 What Did the Computer Actually Learn?

After the computer got good at its job, the researchers asked it, "What clues did you use?" (This is called SHAP analysis).

The computer didn't care about height or weight. It cared about these three things:

The "Party" Mindset: Teens who thought drinking would make them more fun, cooler, or better at socializing were more likely to drink.
The "Worry" Factor: Teens with symptoms of panic, OCD, or PTSD were more likely to drink (perhaps to "self-medicate" or calm down).
The "Lifestyle" Clues:
- Sleep: Teens with messy sleep schedules.
- Friends: How easily teens made friends, and how much time they spent in unstructured nighttime activities.
- Money: Teens who had more disposable cash to spend on fun things.

💡 The Takeaway

This paper suggests that we don't need expensive brain scans (MRIs) to tell whether a teen is drinking. We may just need to ask the right questions about their life, their feelings, and their habits.

By using a smarter computer model that ignores "cheating" clues (like age) and focuses hard on the rare cases (the drinkers), we can build better tools to catch at-risk teens early and help them before they get into serious trouble. It's like upgrading from a rusty metal detector to a high-tech scanner that ignores the rocks and only beeps for the gold.

1. Problem Statement

The paper addresses the critical need for early identification of adolescent alcohol use to facilitate targeted interventions. While machine learning (ML) has been applied to adult alcohol use disorder (AUD), its application to adolescents faces several methodological challenges:

Data Accessibility: Previous studies often rely on expensive neuroimaging (MRI/fMRI), limiting scalability. Clinical measurements (demographics, interviews) are more practical but underutilized for this specific classification task.
Confounding Bias:
- Age Bias: Alcohol use prevalence increases significantly with age. Models often inadvertently learn age-related developmental patterns rather than alcohol-specific signals.
- Substance Use Bias: Other substance use (tobacco, cannabis) is highly correlated with alcohol use. Including these as features can lead to "data leakage," artificially inflating performance without capturing independent risk factors.
Class Imbalance: Adolescent datasets typically contain far more non-drinkers than drinkers (e.g., ~5:1 ratio). Standard ML approaches struggle with this, often achieving high sensitivity (detecting drinkers) but failing specificity (misclassifying non-drinkers as drinkers).
Generalizability: Many prior studies focus on narrow age ranges or specific cohorts, limiting applicability to the broader adolescent population (ages 12–22).

2. Methodology

The authors propose FocalTab, a novel framework integrating TabPFN (a transformer-based foundation model for tabular data) with Focal Loss to address class imbalance and bias.

A. Data Source and Preprocessing

Dataset: Baseline data from the NCANDA (National Consortium on Alcohol and Neurodevelopment in Adolescence) study.
Sample: 801 participants (661 non-drinkers, 140 drinkers) aged 12–22.
Feature Selection:
- 167 baseline features were retained across 13 categories (e.g., alcohol expectancy, personality, psychiatric symptoms, socioeconomic status, daily routine).
- Exclusion Criteria: Constant features, features with >50% missing values, and all direct substance use variables were removed to prevent circularity.
Confound Regression: To mitigate age bias, variables highly correlated with age ($|\rho| > 0.3$) were excluded. Moderately correlated variables were residualized using linear (for numeric) or logistic (for binary) regression against age. The age variable itself was removed from the feature set.
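The age-control step can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes Pearson correlation for $\rho$, residualizes every numeric feature that survives the 0.3 cutoff (the source does not say where "moderately correlated" begins), and omits the logistic variant for binary features. The function names and toy values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (non-constant inputs)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def residualize(feature, age):
    """Subtract the ordinary-least-squares linear fit on age from a numeric feature."""
    n = len(age)
    ma, mf = sum(age) / n, sum(feature) / n
    slope = (sum((a - ma) * (f - mf) for a, f in zip(age, feature))
             / sum((a - ma) ** 2 for a in age))
    intercept = mf - slope * ma
    return [f - (intercept + slope * a) for a, f in zip(age, feature)]

def control_for_age(features, age, drop_threshold=0.3):
    """Drop features with |rho| above the threshold; residualize the rest (assumption)."""
    kept = {}
    for name, values in features.items():
        if abs(pearson(values, age)) > drop_threshold:
            continue  # too strongly tied to age: excluded outright
        kept[name] = residualize(values, age)
    return kept

# Toy data (hypothetical): six participants, two numeric features.
age = [12, 14, 16, 18, 20, 22]
features = {
    "allowance": [5, 10, 15, 20, 25, 31],  # rises almost linearly with age
    "sleep_hours": [7, 9, 6, 8, 7, 8],     # only weakly related to age
}
kept = control_for_age(features, age)
```

In this toy run the age-driven feature is dropped, and the surviving feature ends up with essentially zero linear correlation with age. Note that age itself never appears in the output feature set.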

B. Model Architecture: FocalTab

Base Model (TabPFN): A pretrained transformer-based model that performs in-context learning. Unlike traditional models requiring iterative training on the target dataset, TabPFN approximates Bayesian inference in a single forward pass using a prior trained on synthetic datasets.
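Loss Function (Focal Loss): To handle the roughly 5:1 class imbalance, the authors replaced standard Cross-Entropy with Focal Loss, which down-weights easy-to-classify samples (mostly non-drinkers) and concentrates training on hard-to-classify ones (mostly drinkers). Because a loss function only takes effect when model weights are updated, this implies that FocalTab fine-tunes the pretrained TabPFN on the target data rather than using it purely in-context.
- Formula: $FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$, where $p_t$ is the predicted probability of the true class, $\alpha_t$ is a class-balancing weight, and $\gamma \ge 0$ is the focusing parameter; $\gamma = 0$ recovers $\alpha$-weighted Cross-Entropy.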
Training Strategy:
- 5-fold Cross-Validation (CV).
- Hyperparameter tuning (learning rate, optimizer, focal loss $\alpha$ and $\gamma$) via grid search on the validation set.
- Strict separation of training, validation, and test sets to ensure unbiased evaluation.
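To make the Focal Loss formula concrete, here is a small per-sample sketch in plain Python. The values `alpha=0.75` and `gamma=2.0` are illustrative assumptions only; the summary does not report the values the authors selected in their grid search, and a real training loop would compute this over batches in a deep-learning framework.

```python
import math

def focal_loss(p_drinker, is_drinker, alpha=0.75, gamma=2.0):
    """Binary focal loss for one sample.

    p_drinker  : model's predicted probability that the sample is a drinker
    is_drinker : True if the sample's true label is "drinker"
    alpha      : weight on the drinker class (1 - alpha goes to non-drinkers)
    gamma      : focusing parameter; larger values shrink the loss on easy samples
    """
    p_t = p_drinker if is_drinker else 1.0 - p_drinker   # probability of the true class
    alpha_t = alpha if is_drinker else 1.0 - alpha       # class-balancing weight
    p_t = min(max(p_t, 1e-12), 1.0)                      # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy non-drinker (model is confident and correct) versus a hard drinker
# (model assigns only 20% probability to the true class).
easy_non_drinker = focal_loss(0.05, is_drinker=False)
hard_drinker = focal_loss(0.20, is_drinker=True)

# With gamma = 0 the focusing term vanishes and only the alpha weight remains.
weighted_ce = focal_loss(0.30, is_drinker=True, alpha=0.5, gamma=0.0)
```

The confident, correct non-drinker contributes a loss several orders of magnitude smaller than the misclassified drinker, which is the behavior the formula is designed to produce.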

C. Experimental Design

The study evaluated the model across four variable selection strategies to test robustness against bias:

w/ Age w/ Substances: Full feature set (baseline for comparison).
w/ Age w/o Substances: Excluded substance use, kept age-correlated features.
w/o Age w/ Substances: Excluded age-correlated features, kept substance use.
w/o Age w/o Substances (Strict Setting): Excluded both substance use and age-correlated features; age effects regressed out.
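One way to picture the four settings is as two on/off switches applied to a tagged feature list. The sketch below is purely illustrative: the column names and their tags are hypothetical, and the residualization applied in the strict setting is not shown.

```python
def select_features(columns, keep_age, keep_substances):
    """Return the feature names allowed under one variable selection strategy.

    columns maps a feature name to two flags (both hypothetical tags):
      "age_correlated" : the feature is strongly tied to age
      "substance"      : the feature directly measures other substance use
    """
    selected = []
    for name, tags in columns.items():
        if tags["substance"] and not keep_substances:
            continue
        if tags["age_correlated"] and not keep_age:
            continue
        selected.append(name)
    return selected

# Hypothetical feature catalogue, for illustration only.
columns = {
    "alcohol_expectancy": {"age_correlated": False, "substance": False},
    "sleep_schedule":     {"age_correlated": False, "substance": False},
    "school_grade_level": {"age_correlated": True,  "substance": False},
    "cannabis_use":       {"age_correlated": False, "substance": True},
}

settings = {
    "w/ Age w/ Substances":   select_features(columns, True, True),
    "w/ Age w/o Substances":  select_features(columns, True, False),
    "w/o Age w/ Substances":  select_features(columns, False, True),
    "w/o Age w/o Substances": select_features(columns, False, False),
}
```

Only the features that are neither age-driven nor substance-related survive in the strict setting, which is why that setting is the hardest test of the model.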

The model was also compared against Logistic Regression, Random Forest, MLP (standard and with Focal Loss), and TabPFN (without Focal Loss) under different imbalance handling strategies (Original, SMOTE, Balanced Downsampling).
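Of the imbalance handling strategies named above, balanced downsampling is the simplest to sketch: randomly discard majority-class samples until both classes are the same size. The snippet below is an assumption-laden illustration using the reported class counts; it is not the authors' code, and SMOTE (which synthesizes new minority samples) is not shown. In practice this would be applied to the training fold only.

```python
import random

def balanced_downsample(labels, seed=0):
    """Return sorted indices of a class-balanced subset (0 = non-drinker, 1 = drinker)."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    minority_n = min(labels.count(0), labels.count(1))
    keep = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        keep.extend(rng.sample(idx, minority_n))  # sample without replacement
    return sorted(keep)

# Class counts from the sample description: 661 non-drinkers, 140 drinkers.
labels = [0] * 661 + [1] * 140
subset = balanced_downsample(labels)
```

The cost of this approach is visible in the counts: balancing to 140 per class throws away 521 of the 661 non-drinkers, which is one reason a loss-based approach that keeps all the data can be attractive.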

3. Key Contributions

Clinical-Only Framework: Developed a high-performance classifier using exclusively clinical measurements (no neuroimaging), enhancing scalability for real-world screening.
Bias Control: Implemented a rigorous preprocessing pipeline to remove age and substance use confounds, ensuring the model learns alcohol-specific patterns rather than developmental or co-occurring substance trends.
Algorithmic Innovation: Introduced FocalTab, combining the in-context learning capabilities of TabPFN with Focal Loss to effectively handle severe class imbalance without synthetic oversampling (SMOTE).
Broad Age Range: Expanded the study scope to ages 12–22, capturing the full trajectory of adolescent neurodevelopment.

4. Results

The performance was evaluated using Accuracy, F1-score, Sensitivity, Specificity, and AUC.
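As a rough illustration of why accuracy alone is not enough on this dataset, the sketch below computes the threshold-based metrics from predicted labels, treating "drinker" as the positive class. It uses the reported class counts (661 non-drinkers, 140 drinkers) with a deliberately useless baseline that labels everyone a non-drinker. AUC is omitted because it requires predicted scores rather than hard labels; the function name is hypothetical.

```python
def screening_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and F1 with drinker (1) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    f1_denominator = 2 * tp + fp + fn
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,  # drinkers correctly flagged
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,  # non-drinkers correctly cleared
        "f1": 2 * tp / f1_denominator if f1_denominator else 0.0,
    }

# Baseline that predicts "non-drinker" for all 801 participants.
y_true = [1] * 140 + [0] * 661
y_pred = [0] * 801
baseline = screening_metrics(y_true, y_pred)
```

This baseline reaches about 82.5% accuracy and perfect specificity while detecting no drinkers at all (sensitivity and F1 are both zero), which is why sensitivity and specificity are reported alongside accuracy.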

Superior Performance in Strict Settings: In the most stringent setting (w/o Age w/o Substances), FocalTab achieved:
- Accuracy: 84.3%
- Specificity: 80.0% (Crucial for correctly identifying non-drinkers)
- Sensitivity: 80.0%
- AUC: 0.902
- Comparison: Competing models (Logistic Regression, Random Forest, MLP) collapsed in this setting, with specificity dropping to 12–24%, far below FocalTab's 80.0%, meaning most non-drinkers were misclassified as drinkers. For example, Random Forest specificity dropped from 76.7% (with biases) to 15.3% (without biases).
Impact of Bias Removal:
- Removing age and substance use variables caused a drastic performance drop in all baseline models, confirming they relied heavily on these confounds.
- FocalTab maintained robust performance, indicating it successfully learned independent behavioral and environmental predictors.
Class Imbalance Handling:
- Focal Loss vs. SMOTE: FocalTab (using Focal Loss on the original imbalanced data) significantly outperformed models using SMOTE. SMOTE actually degraded TabPFN's specificity to 10.7%, whereas FocalTab maintained 80.0%.
- This suggests that, on this dataset, algorithmic weighting (Focal Loss) handles the imbalance better than synthetic data generation.
Feature Importance (SHAP Analysis):
- The top predictors identified by SHAP were:
  1. Alcohol Expectancies: Beliefs about social enhancement, sexual benefits, and cognitive/motor improvement.
  2. Psychiatric Symptoms: Panic, OCD, and PTSD.
  3. Lifestyle/Environment: Sleep schedules, ease of making friends, unstructured nighttime activities, and spending habits.

5. Significance and Conclusion

This study demonstrates that accurate classification of adolescent drinking is possible using only accessible clinical data, provided that rigorous bias control and advanced handling of class imbalance are applied.

Clinical Utility: The model offers a scalable tool for early screening in primary care or school settings where MRI is unavailable.
Methodological Rigor: The paper highlights the danger of "confound leakage" in ML studies. By explicitly removing age and substance use variables, the authors reveal that previous high-accuracy models may have been measuring developmental stage rather than alcohol risk.
Future Directions: The authors plan to validate the framework on independent cohorts and extend the approach to longitudinal prediction of drinking trajectories.

In summary, FocalTab represents a significant advancement in adolescent health informatics, offering a robust, interpretable, and bias-controlled solution for identifying at-risk youth without relying on expensive imaging or confounded data.