Multi-Criteria Validation of LLM-Inferred Depression Severity from Outpatient Psychiatry Notes

This study demonstrates that a HIPAA-compliant large language model can accurately infer depression severity scores from outpatient psychiatric notes, showing strong agreement with clinical assessments and predictive validity for treatment changes and emergency visits, thereby enabling standardized longitudinal phenotyping for real-world research and routine monitoring.

Cudic, M., Meyerson, W. U., Wang, B., Yin, Q., Khadse, P. N., Burke, T., Kennedy, C. J., Smoller, J. W.

Published 2026-03-12

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Idea: Giving a Voice to Silent Notes

Imagine you are a doctor treating a patient for depression. You write a detailed note after every visit describing how the patient is feeling, how they are sleeping, and how they are functioning. However, in the digital world of hospitals, these notes are often just "text on a page." They sit there, unread by computers, while the hospital's database only tracks simple checkboxes (like "Did the patient fill out a mood survey today?").

The problem? Patients don't fill out those mood surveys every time they visit. So, researchers and doctors are missing a huge chunk of the story.

This paper asks a simple question: Can a large language model (an "LLM," the kind of AI behind modern chatbots) read those messy, free-text doctor's notes and figure out how severe a patient's depression is, nearly as well as if the patient had filled out a survey?

The Experiment: The "AI Detective"

The researchers took a massive library of 91,000 real doctor's notes from a large hospital system in Boston. They fed these notes into a powerful AI (specifically, a version of OpenAI's GPT-5.2) and gave it a special job:

"Read this doctor's note. Ignore any numbers the patient wrote down. Based only on the doctor's description of the patient's mood and behavior, tell me: How severe is this person's depression?"

The AI was asked to give three different types of scores, like a translator speaking three different languages (a rough code sketch of this kind of scoring call follows the list):

  1. PHQ-9: The standard survey patients take.
  2. HAM-D: A detailed checklist doctors use.
  3. CGI-S: A simple "How bad is it?" global rating (1 to 7).
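
The study's scoring protocol is described here only in prose, so as a rough illustration, here is a minimal sketch of what one such note-scoring call could look like, assuming an OpenAI-style chat-completions API behind a HIPAA-compliant deployment. The prompt wording, the `score_note` helper, and the `DEPLOYMENT_MODEL` placeholder are assumptions, not the study's actual code:

```python
# Minimal sketch (not the study's code): asking a chat model to rate one clinical note.
import json
from openai import OpenAI  # assumes the openai Python package, v1+

client = OpenAI()  # in the study this would point at a HIPAA-compliant deployment
DEPLOYMENT_MODEL = "gpt-5.2"  # placeholder name; the real deployment ID will differ

PROMPT = (
    "Read the clinical note below. Ignore any questionnaire scores the patient "
    "reported. Based only on the clinician's description of mood, sleep, and "
    "functioning, estimate the severity of this person's depression. Return JSON: "
    '{"phq9": <0-27>, "hamd": <HAM-D total>, "cgi_s": <1-7>}.'
)

def score_note(note_text: str) -> dict:
    """Return PHQ-9-, HAM-D-, and CGI-S-style severity estimates for one note."""
    response = client.chat.completions.create(
        model=DEPLOYMENT_MODEL,
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": note_text},
        ],
        response_format={"type": "json_object"},  # ask for machine-readable output
        temperature=0,  # make the scoring as deterministic as possible
    )
    return json.loads(response.choices[0].message.content)

# Example: score_note("Patient reports low mood, early-morning waking, ...")
```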

The Results: How Did the AI Do?

To see if the AI was telling the truth, the researchers compared its guesses against three different "gold standards."

1. The "Self-Report" Test (Did the AI match the patient?)

  • The Analogy: Imagine the patient fills out a survey saying, "I feel a 6 out of 10." The AI reads the doctor's note and guesses, "I think the patient is at a 6."
  • The Result: The AI was pretty good. It matched the patient's own survey about 67% of the time. It wasn't perfect, but it was close enough to be useful.
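
The article doesn't spell out how a "match" was counted. One simple, illustrative way to measure this kind of agreement is to check whether the AI's PHQ-9 estimate falls in the same standard severity band as the patient's own survey; the paper's actual metric may differ:

```python
# Sketch: share of visits where the AI's PHQ-9 estimate lands in the same
# severity band as the patient's own questionnaire. "Same band" is just one
# possible definition of a "match"; the paper may define it differently.

def phq9_band(score: int) -> str:
    """Map a PHQ-9 total (0-27) to the standard severity bands."""
    if score <= 4:
        return "minimal"
    if score <= 9:
        return "mild"
    if score <= 14:
        return "moderate"
    if score <= 19:
        return "moderately severe"
    return "severe"

def band_agreement(ai_scores: list[int], patient_scores: list[int]) -> float:
    """Proportion of note/survey pairs assigned the same severity band."""
    matches = sum(phq9_band(a) == phq9_band(p)
                  for a, p in zip(ai_scores, patient_scores))
    return matches / len(ai_scores)

# band_agreement([6, 16, 3], [8, 14, 2]) -> 0.67: two of three visits match.
```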

2. The "Expert Judge" Test (Did the AI match human experts?)

  • The Analogy: Two human experts (a psychiatrist and a psychologist) read the notes and gave their own scores. Their scores were then compared to the AI's scores.
  • The Result: The AI did amazingly well. In fact, the AI agreed with the human experts better than the two human experts agreed with each other! This suggests the AI is very good at understanding the nuance in the text.
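
One standard way to put "the AI agreed with the experts better than the experts agreed with each other" into numbers is a chance-corrected agreement statistic such as weighted Cohen's kappa. The sketch below is illustrative and may not be the statistic the paper actually used:

```python
# Sketch: quadratically weighted Cohen's kappa for AI-vs-expert agreement,
# compared against expert-vs-expert agreement on the same notes.
from sklearn.metrics import cohen_kappa_score  # assumes scikit-learn is installed

def compare_agreement(expert_a, expert_b, ai):
    """All three arguments are lists of severity ratings (e.g., CGI-S 1-7) for the
    same notes. Returns (expert-expert kappa, mean AI-expert kappa)."""
    human_kappa = cohen_kappa_score(expert_a, expert_b, weights="quadratic")
    ai_kappa = (cohen_kappa_score(ai, expert_a, weights="quadratic") +
                cohen_kappa_score(ai, expert_b, weights="quadratic")) / 2
    return human_kappa, ai_kappa

# The headline finding would show up here as ai_kappa > human_kappa.
```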

3. The "Crystal Ball" Test (Did the AI predict the future?)

  • The Analogy: The researchers asked: "If the AI says a patient is very depressed, is that patient more likely to go to the Emergency Room or need to switch their medication later?"
  • The Result: Yes. The AI's scores predicted these future crises about as well as the actual surveys or the doctors' own quick risk assessments. This suggests the AI isn't just guessing; it is actually picking up on the severity of the illness.
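
To make "just as good at predicting" concrete, one illustrative check is to fit a simple model that predicts a later event (say, an ER visit) from each severity score and compare how well the scores discriminate. The outcome definitions and model below are assumptions, not the paper's actual analysis:

```python
# Sketch: does a severity score predict a later event (e.g., an ER visit)?
# Fit a one-variable model and compare discrimination (AUROC) for the AI's
# score versus the patient's survey score. Illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def predictive_auc(severity: np.ndarray, had_event: np.ndarray) -> float:
    """AUROC for predicting a binary future event from one severity score.
    (A real analysis would evaluate on held-out visits, not the training data.)"""
    X = severity.reshape(-1, 1)
    model = LogisticRegression().fit(X, had_event)
    return roc_auc_score(had_event, model.predict_proba(X)[:, 1])

# Similar AUROCs for predictive_auc(ai_phq9, er_visit) and
# predictive_auc(survey_phq9, er_visit) would support the "crystal ball" claim.
```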

The Catch: Not Everyone Gets the Same Score

The researchers found a worrying gap. The AI was slightly less accurate when reading notes about Black and Hispanic patients compared to White patients.

  • The Metaphor: Imagine the AI is a translator. It speaks English perfectly, but it stumbles a bit when the doctor uses specific cultural phrases or slang common in Black or Hispanic communities. The AI might miss the depth of the sadness because it's not "listening" to the cultural context correctly.
  • Why it matters: This is a major red flag. If we use this tool in the real world, we don't want it to underestimate the pain of minority patients.
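
In practice, this kind of gap is found by computing the same agreement measure separately within each patient group. The sketch below shows the idea with illustrative field names, not the paper's actual analysis:

```python
# Sketch: the same agreement measure computed separately within each patient group.
# The "group", "ai_band", and "survey_band" field names are illustrative.
from collections import defaultdict

def agreement_by_group(records: list[dict]) -> dict[str, float]:
    """records: one dict per visit, e.g.
    {"group": "Hispanic", "ai_band": "moderate", "survey_band": "severe"}.
    Returns the share of matching severity bands within each group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["group"]] += 1
        hits[r["group"]] += int(r["ai_band"] == r["survey_band"])
    return {g: hits[g] / totals[g] for g in totals}

# A consistently lower value for one group is the kind of gap the authors flag.
```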

The Bottom Line: Why This Matters

Think of the hospital's electronic records as a library. Right now, the library only has a few books on the "Depression" shelf (the surveys). The rest of the information is scattered in thousands of unorganized notebooks (the doctor's notes).

This study shows that AI can act as a librarian who can read all those notebooks, summarize the stories, and put them on the shelf in an organized way.

Why is this a big deal?

  1. More Data: We can now study depression using every visit, not just the ones where a survey happened.
  2. Better Research: Scientists can study how drugs work or how genetics affect depression with much more accurate data.
  3. Better Care: In the future, a doctor could walk into a room, and the computer could instantly say, "Based on your last 10 visits, your depression has been getting worse, even though you didn't fill out a survey today."

The Warning:
The authors are careful to say this isn't ready for prime time yet. It needs to be tested in other hospitals and fixed so it works equally well for people of all races and backgrounds. But, it's a powerful first step toward turning "text on a page" into life-saving data.
