Multi-Criteria Validation of LLM-Inferred Depression Severity from Outpatient Psychiatry Notes

This study demonstrates that a HIPAA-compliant large language model can accurately infer depression severity scores from outpatient psychiatric notes, showing strong agreement with clinical assessments and predictive validity for treatment changes and emergency visits, thereby enabling standardized longitudinal phenotyping for real-world research and routine monitoring.

Cudic, M., Meyerson, W. U., Wang, B., Yin, Q., Khadse, P. N., Burke, T., Kennedy, C. J., Smoller, J. W.

Published 2026-03-12

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Idea: Giving a Voice to Silent Notes

Imagine you are a doctor treating a patient for depression. You write a detailed note after every visit describing how the patient is feeling, how they are sleeping, and how they are functioning. However, in the digital world of hospitals, these notes are often just "text on a page." They sit there, unread by computers, while the hospital's database only tracks simple checkboxes (like "Did the patient fill out a mood survey today?").

The problem? Patients don't fill out those mood surveys every time they visit. So, researchers and doctors are missing a huge chunk of the story.

This paper asks a simple question: Can a large language model (an "LLM," the kind of AI behind modern chatbots) read those messy, free-text doctor's notes and figure out how severe a patient's depression is, nearly as well as if the patient had filled out a survey?

The Experiment: The "AI Detective"

The researchers took a massive library of 91,000 real doctor's notes from a large hospital system in Boston. They fed these notes into a powerful AI (specifically, a version of OpenAI's GPT-5.2) and gave it a special job:

"Read this doctor's note. Ignore any numbers the patient wrote down. Based only on the doctor's description of the patient's mood and behavior, tell me: How severe is this person's depression?"

The AI was asked to give three different types of scores, like a translator speaking three different languages (a rough code sketch of this kind of scoring call follows the list):

  1. PHQ-9: The standard survey patients take.
  2. HAM-D: A detailed checklist doctors use.
  3. CGI-S: A simple "How bad is it?" global rating (1 to 7).
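
The study's scoring protocol is described here only in prose, so as a rough illustration, here is a minimal sketch of what one such note-scoring call could look like, assuming an OpenAI-style chat-completions API behind a HIPAA-compliant deployment. The prompt wording, the `score_note` helper, and the `DEPLOYMENT_MODEL` placeholder are assumptions, not the study's actual code:

```python
# Minimal sketch (not the study's code): asking a chat model to rate one clinical note.
import json
from openai import OpenAI  # assumes the openai Python package, v1+

client = OpenAI()  # in the study this would point at a HIPAA-compliant deployment
DEPLOYMENT_MODEL = "gpt-5.2"  # placeholder name; the real deployment ID will differ

PROMPT = (
    "Read the clinical note below. Ignore any questionnaire scores the patient "
    "reported. Based only on the clinician's description of mood, sleep, and "
    "functioning, estimate the severity of this person's depression. Return JSON: "
    '{"phq9": <0-27>, "hamd": <HAM-D total>, "cgi_s": <1-7>}.'
)

def score_note(note_text: str) -> dict:
    """Return PHQ-9-, HAM-D-, and CGI-S-style severity estimates for one note."""
    response = client.chat.completions.create(
        model=DEPLOYMENT_MODEL,
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": note_text},
        ],
        response_format={"type": "json_object"},  # ask for machine-readable output
        temperature=0,  # make the scoring as deterministic as possible
    )
    return json.loads(response.choices[0].message.content)

# Example: score_note("Patient reports low mood, early-morning waking, ...")
```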

The Results: How Did the AI Do?

To see if the AI was telling the truth, the researchers compared its guesses against three different "gold standards."

1. The "Self-Report" Test (Did the AI match the patient?)

  • The Analogy: Imagine the patient fills out a survey saying, "I feel a 6 out of 10." The AI reads the doctor's note and guesses, "I think the patient is at a 6."
  • The Result: The AI was pretty good. It matched the patient's own survey about 67% of the time. It wasn't perfect, but it was close enough to be useful.
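
The article doesn't spell out how a "match" was counted. One simple, illustrative way to measure this kind of agreement is to check whether the AI's PHQ-9 estimate falls in the same standard severity band as the patient's own survey; the paper's actual metric may differ:

```python
# Sketch: share of visits where the AI's PHQ-9 estimate lands in the same
# severity band as the patient's own questionnaire. "Same band" is just one
# possible definition of a "match"; the paper may define it differently.

def phq9_band(score: int) -> str:
    """Map a PHQ-9 total (0-27) to the standard severity bands."""
    if score <= 4:
        return "minimal"
    if score <= 9:
        return "mild"
    if score <= 14:
        return "moderate"
    if score <= 19:
        return "moderately severe"
    return "severe"

def band_agreement(ai_scores: list[int], patient_scores: list[int]) -> float:
    """Proportion of note/survey pairs assigned the same severity band."""
    matches = sum(phq9_band(a) == phq9_band(p)
                  for a, p in zip(ai_scores, patient_scores))
    return matches / len(ai_scores)

# band_agreement([6, 16, 3], [8, 14, 2]) -> 0.67: two of three visits match.
```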

2. The "Expert Judge" Test (Did the AI match human experts?)

  • The Analogy: Two human experts (a psychiatrist and a psychologist) read the notes and gave their own scores. Their scores were then compared to the AI's scores.
  • The Result: The AI did amazingly well. In fact, the AI agreed with the human experts better than the two human experts agreed with each other! This suggests the AI is very good at understanding the nuance in the text.
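
One standard way to put "the AI agreed with the experts better than the experts agreed with each other" into numbers is a chance-corrected agreement statistic such as weighted Cohen's kappa. The sketch below is illustrative and may not be the statistic the paper actually used:

```python
# Sketch: quadratically weighted Cohen's kappa for AI-vs-expert agreement,
# compared against expert-vs-expert agreement on the same notes.
from sklearn.metrics import cohen_kappa_score  # assumes scikit-learn is installed

def compare_agreement(expert_a, expert_b, ai):
    """All three arguments are lists of severity ratings (e.g., CGI-S 1-7) for the
    same notes. Returns (expert-expert kappa, mean AI-expert kappa)."""
    human_kappa = cohen_kappa_score(expert_a, expert_b, weights="quadratic")
    ai_kappa = (cohen_kappa_score(ai, expert_a, weights="quadratic") +
                cohen_kappa_score(ai, expert_b, weights="quadratic")) / 2
    return human_kappa, ai_kappa

# The headline finding would show up here as ai_kappa > human_kappa.
```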

3. The "Crystal Ball" Test (Did the AI predict the future?)

  • The Analogy: The researchers asked: "If the AI says a patient is very depressed, is that patient more likely to go to the Emergency Room or need to switch their medication later?"
  • The Result: Yes. The AI's scores predicted these future crises about as well as the actual surveys or the doctors' own quick risk assessments. This suggests the AI isn't just guessing; it is actually picking up on the severity of the illness.
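
To make "just as good at predicting" concrete, one illustrative check is to fit a simple model that predicts a later event (say, an ER visit) from each severity score and compare how well the scores discriminate. The outcome definitions and model below are assumptions, not the paper's actual analysis:

```python
# Sketch: does a severity score predict a later event (e.g., an ER visit)?
# Fit a one-variable model and compare discrimination (AUROC) for the AI's
# score versus the patient's survey score. Illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def predictive_auc(severity: np.ndarray, had_event: np.ndarray) -> float:
    """AUROC for predicting a binary future event from one severity score.
    (A real analysis would evaluate on held-out visits, not the training data.)"""
    X = severity.reshape(-1, 1)
    model = LogisticRegression().fit(X, had_event)
    return roc_auc_score(had_event, model.predict_proba(X)[:, 1])

# Similar AUROCs for predictive_auc(ai_phq9, er_visit) and
# predictive_auc(survey_phq9, er_visit) would support the "crystal ball" claim.
```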

The Catch: Not Everyone Gets the Same Score

The researchers found a worrying gap. The AI was slightly less accurate when reading notes about Black and Hispanic patients compared to White patients.

  • The Metaphor: Imagine the AI is a translator. It speaks English perfectly, but it stumbles a bit when the doctor uses specific cultural phrases or slang common in Black or Hispanic communities. The AI might miss the depth of the sadness because it's not "listening" to the cultural context correctly.
  • Why it matters: This is a major red flag. If we use this tool in the real world, we don't want it to underestimate the pain of minority patients.
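
In practice, this kind of gap is found by computing the same agreement measure separately within each patient group. The sketch below shows the idea with illustrative field names, not the paper's actual analysis:

```python
# Sketch: the same agreement measure computed separately within each patient group.
# The "group", "ai_band", and "survey_band" field names are illustrative.
from collections import defaultdict

def agreement_by_group(records: list[dict]) -> dict[str, float]:
    """records: one dict per visit, e.g.
    {"group": "Hispanic", "ai_band": "moderate", "survey_band": "severe"}.
    Returns the share of matching severity bands within each group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["group"]] += 1
        hits[r["group"]] += int(r["ai_band"] == r["survey_band"])
    return {g: hits[g] / totals[g] for g in totals}

# A consistently lower value for one group is the kind of gap the authors flag.
```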

The Bottom Line: Why This Matters

Think of the hospital's electronic records as a library. Right now, the library only has a few books on the "Depression" shelf (the surveys). The rest of the information is scattered in thousands of unorganized notebooks (the doctor's notes).

This study shows that AI can act as a librarian who can read all those notebooks, summarize the stories, and put them on the shelf in an organized way.

Why is this a big deal?

  1. More Data: We can now study depression using every visit, not just the ones where a survey happened.
  2. Better Research: Scientists can study how drugs work or how genetics affect depression with much more accurate data.
  3. Better Care: In the future, a doctor could walk into a room, and the computer could instantly say, "Based on your last 10 visits, your depression has been getting worse, even though you didn't fill out a survey today."

The Warning:
The authors are careful to say this isn't ready for prime time yet. It needs to be tested in other hospitals and fixed so it works equally well for people of all races and backgrounds. But, it's a powerful first step toward turning "text on a page" into life-saving data.
