Validation of a Small Language Model for DSM-5 Substance Category Classification in Child Welfare Records

This study validates that a locally hosted 20-billion-parameter small language model can reliably classify specific DSM-5 substance categories within child welfare investigation narratives, achieving near-perfect agreement with human experts for five major substance types despite limitations with low-prevalence categories.

Brian E. Perron, Dragan Stoll, Bryan G. Victor, Zia Qia, Andreas Jud, Joseph P. Ryan


Imagine a massive library where every book is a story about a family in trouble. These aren't novels; they are official reports written by social workers investigating child welfare cases. Inside these reports, there are thousands of pages of handwritten notes and typed summaries describing what happened in a home.

Often, these notes mention that a parent was using drugs or alcohol. But here's the problem: the computer systems that store these reports are very old and simple. They can only check a box that says "Drugs: Yes" or "Drugs: No." They can't tell the difference between a parent struggling with alcohol, another with meth, or a third with opioids. It's like having a grocery list that just says "Fruit" without telling you if it's apples, bananas, or grapes. This makes it hard for agencies to understand the specific problems families face or to track how drug trends change over time.

The New "Smart Librarian"

This paper is about testing a new kind of "Smart Librarian"—a small, powerful computer program (called a Small Language Model) that lives on a local computer, not on the internet.

Think of this program as a very sharp, well-read intern who has read millions of these family stories. The researchers wanted to see if this intern could do more than just say "Yes, drugs are mentioned." They wanted to know: Can this intern read the story and tell you exactly which drug is being talked about?

The Experiment: A Two-Step Dance

The researchers set up a two-step test using 900 real family stories:

  1. Step 1 (The Gatekeeper): First, the program checks whether there is any mention of substance use at all. (Earlier work had already shown the model handles this step well.)
  2. Step 2 (The Detective): If it finds substance use, the program acts like a detective. It reads the text carefully to decide: Is this about alcohol? Cannabis? Opioids? Stimulants? Sedatives? Hallucinogens? Or Inhalants?

They asked the computer to do this for seven different categories of drugs, based on the official medical guide (DSM-5).
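
For readers curious what this two-step pipeline might look like in practice, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than the authors' actual setup: the prompts, the model name, and the localhost endpoint (an OpenAI-compatible local server) are placeholders.

```python
# Minimal sketch of the two-step pipeline, assuming a locally hosted model
# served behind an OpenAI-compatible API on localhost. Prompts, model name,
# and endpoint are illustrative placeholders, not the paper's actual setup.
import json
import urllib.request

DSM5_CATEGORIES = [
    "alcohol", "cannabis", "opioids", "stimulants",
    "sedatives", "hallucinogens", "inhalants",
]

def ask_local_model(prompt, url="http://localhost:8000/v1/chat/completions"):
    """Send one prompt to the local model and return its text reply."""
    payload = json.dumps({
        "model": "local-20b",  # hypothetical name for the locally hosted model
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,      # deterministic answers for classification
    }).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"].strip().lower()

def classify_narrative(narrative):
    """Step 1: screen for any substance use. Step 2: label each DSM-5 category."""
    # Step 1 - the gatekeeper
    gate = ask_local_model(
        "Does this child welfare narrative mention any parental substance use? "
        "Answer only yes or no.\n\n" + narrative
    )
    if not gate.startswith("yes"):
        return {category: False for category in DSM5_CATEGORIES}

    # Step 2 - the detective, one question per DSM-5 category
    labels = {}
    for category in DSM5_CATEGORIES:
        answer = ask_local_model(
            f"Does this narrative indicate use of {category}? "
            "Answer only yes or no.\n\n" + narrative
        )
        labels[category] = answer.startswith("yes")
    return labels
```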

The Results: A Star Performer with a Few Hiccups

The results were surprisingly good, like a student acing a difficult exam but stumbling on a few tricky questions.

  • The A-Students: For five of the seven categories (Alcohol, Cannabis, Opioids, Stimulants, and Sedatives), the computer was almost perfect. It agreed with human experts 94% to 100% of the time.

    • Analogy: Imagine a human expert and the computer both reading a story. If the story says "The father smelled like beer," they both agree: "That's Alcohol." If it says "Found a bag of white powder," they both agree: "That's likely an Opioid or Stimulant." They were on the same page almost every time.
  • The Struggling Students: Two categories performed poorly: Hallucinogens and Inhalants.

    • Why? This is where the "Smart Librarian" got confused by wordplay.
    • The "Gas" Trap: The word "gas" can mean a car running on fuel, a chemical in a lab, or someone sniffing glue (an inhalant). The computer sometimes saw "gas" and thought, "Ah, inhalant!" when the story was actually just about a broken pipe in the house.
    • The "Acid" Trap: Similarly, "acid" can mean a chemical solvent used to make other drugs, or it can mean LSD (a hallucinogen). The computer got tripped up by these double meanings.
    • The Rarity Problem: These drugs also appear very rarely in the reports. When a category is rare, even a handful of mistakes drags its scores way down, because there are so few true cases for the correct answers to outweigh them. (The short example below makes this concrete.)
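
To see why rarity alone can sink a score, here is a tiny, made-up illustration. The counts are invented for the example and are not the paper's data: with a common substance, a few errors barely dent the numbers; with a rare one, the same few errors dominate.

```python
# Made-up counts illustrating why rare categories punish small mistakes.
# These numbers are invented for the example; they are NOT the paper's results.

def precision_recall(true_pos, false_pos, false_neg):
    """Precision: how often a 'yes' was right. Recall: how many real cases were caught."""
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall

# Common category: plenty of true cases, so five errors barely matter.
print(precision_recall(true_pos=200, false_pos=5, false_neg=5))  # ~0.98 precision, ~0.98 recall

# Rare category (think inhalants): the same five false alarms overwhelm
# the handful of true cases, and the scores collapse.
print(precision_recall(true_pos=3, false_pos=5, false_neg=2))    # ~0.38 precision, 0.60 recall
```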

Why This Matters: Privacy and Power

The most exciting part of this study isn't just that the computer is smart; it's where it lives.

  • The "Cloud" vs. The "Local" Server: Big, famous AI models (like the ones you might chat with online) live in giant data centers far away. Sending sensitive family secrets to the internet is risky and expensive.
  • The Local Solution: This "Small Language Model" fits on a standard computer in the social worker's office. It never sends data out. It's like having a private detective who works in your own living room, reading your files without ever leaving the house.
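
As a hedged sketch of what "never sends data out" could look like in practice, the loop below batch-processes a folder of narrative files against the same local endpoint and writes one tidy row per report. It reuses the classify_narrative function from the earlier sketch; the folder layout and file names are hypothetical.

```python
# Batch-processing sketch: every report stays on the local machine.
# Assumes classify_narrative() from the earlier sketch is defined in the same
# script; the folder layout and file names here are hypothetical.
import csv
from pathlib import Path

def process_reports(report_dir, out_csv):
    rows = []
    for path in sorted(Path(report_dir).glob("*.txt")):
        narrative = path.read_text(encoding="utf-8")
        labels = classify_narrative(narrative)  # all inference happens locally
        rows.append({"report": path.name, **labels})
    if not rows:
        return
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

# Example: process_reports("investigation_narratives/", "substance_labels.csv")
```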

The Bottom Line

This paper shows that we don't need supercomputers the size of a building to understand these complex stories. A smaller model running on a local computer can read thousands of family reports and tell us, "Hey, in this county, opioid use is going down, but meth use is going up."

It turns messy, handwritten notes into clear, organized data. This helps social workers and researchers spot trends, fix problems faster, and help families more effectively—all while keeping their private information safe and sound.

In short: They taught a small, local computer to read between the lines of family stories, and for the most part, it did a fantastic job. It's a new tool that turns old, dusty files into a crystal-clear map of what's really happening in the community.