This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine you are a detective trying to solve a mystery: Who is at risk of developing a serious mental health condition called psychosis?
Currently, the only way to solve this mystery is to send a highly trained specialist (a "detective") to interview a person for up to two hours. The detective listens carefully, takes notes, and then uses their years of experience to decide if the person is at risk. This is like having a master chef taste a soup to decide if it needs more salt. It's accurate, but it's slow, expensive, and there aren't enough master chefs to go around. As a result, many people slip through the cracks and don't get help until it's too late.
This paper asks a bold question: Can we teach a super-smart computer (an AI) to be the detective?
Here is the story of what they found, explained simply:
1. The Experiment: Teaching the AI to Listen
The researchers took 678 of these interviews (specifically, the first 30 minutes of each conversation), converted to text transcripts. They fed these transcripts into 11 different "Large Language Models" (LLMs). Think of these LLMs as different students in a class:
- The "Small" Students: Fast, cheap, but maybe a bit naive (like a smart high schooler).
- The "Medium" Students: Balanced (like a college senior).
- The "Big" Students: Massive, powerful, and expensive (like a PhD professor with a library in their head).
The AI's job was to listen to the text and do two things:
- Give a Score: Rate how severe and frequent the strange thoughts or feelings were (on a scale of 0 to 6).
- Make a Verdict: Decide if the person is "At Risk" or "Not At Risk."
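The paper doesn't publish its exact prompts or output format, but the two-part task above amounts to turning a model's reply into a 0-to-6 score plus a binary verdict. Here is a minimal sketch of that parsing step, assuming a hypothetical JSON reply format; the field names `severity` and `at_risk` are illustrative, not from the paper:

```python
import json

def parse_llm_reply(reply_text):
    """Parse a (hypothetical) structured LLM reply into a 0-6 severity
    score and an at-risk verdict, rejecting out-of-range scores."""
    data = json.loads(reply_text)
    score = int(data["severity"])
    if not 0 <= score <= 6:
        raise ValueError(f"severity {score} is outside the 0-6 scale")
    verdict = "At Risk" if data["at_risk"] else "Not At Risk"
    return score, verdict

# Example: a well-formed reply from the model.
print(parse_llm_reply('{"severity": 4, "at_risk": true}'))
# -> (4, 'At Risk')
```

Validating the score range matters in practice: a model that "hallucinates" a 9 on a 0-6 scale should be caught before its answer reaches a clinician.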
2. The Results: The AI Can Do the Job!
The results were surprisingly good.
- The Big Winners: The largest AI models (the "PhD professors") got it right about 80% of the time. They were incredibly good at spotting the warning signs (93% sensitivity), meaning they rarely missed someone who was actually at risk.
- The Trade-off: Because they were so eager to catch every case, they sometimes sounded the alarm for people who were actually fine (a "false positive"). However, in a screening situation, it's often better to be a little too cautious than to miss a real danger.
- The Small Students: Even the smaller, cheaper AI models did a decent job. They weren't perfect, but they were surprisingly competitive, especially considering they run much faster and cost less to operate.
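The numbers above come from standard screening metrics: sensitivity (how many truly at-risk people are caught) versus specificity (how many healthy people are correctly cleared). A minimal sketch, using made-up counts chosen to mimic the paper's reported profile, not its actual data:

```python
def screening_metrics(tp, fn, fp, tn):
    """Sensitivity: share of truly at-risk people the screen catches.
    Specificity: share of not-at-risk people it correctly clears.
    tp/fn/fp/tn = true positive, false negative, false positive, true negative counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return sensitivity, specificity, accuracy

# Hypothetical counts: the screen catches 93 of 100 at-risk people
# (93% sensitivity) but falsely flags 30 of 100 healthy people --
# the "eager alarm" trade-off described above.
sens, spec, acc = screening_metrics(tp=93, fn=7, fp=30, tn=70)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
```

This is why a high-sensitivity screen is paired with human follow-up: the false positives get cleared at the next step, while the true risks are rarely missed at the first one.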
3. The "Hallucination" Check: Did the AI Make Things Up?
A major fear with AI is that it might "hallucinate"—make up facts that aren't there. The researchers checked this carefully.
- The Good News: The AI was very faithful to the text. It rarely made up symptoms.
- The Bad News: When it did make a mistake, it usually over-diagnosed. For example, if someone said, "I felt suspicious because I was bullied," the AI might mark that as a serious mental health symptom, whereas a human might realize it's a normal reaction to being bullied. The AI sometimes treats normal human worries as medical emergencies.
4. Is the AI Fair?
The researchers checked if the AI treated different groups of people fairly (based on age, race, gender, or where they were interviewed).
- The Verdict: The AI was mostly fair across age, race, and gender.
- The Glitch: The AI performed differently depending on where the interview happened. Interviews from different cities or clinics had different "accents" or styles, and the AI got confused by these regional differences. It's like an AI trained on New York English struggling to understand a specific dialect in Texas.
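A fairness check like the one described above boils down to computing the same metric separately for each subgroup and comparing. Here is a toy sketch; the site names and records are invented for illustration and are not the paper's data:

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Compare screening accuracy across subgroups (e.g. interview site).
    Each record is (group, predicted_label, true_label)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, pred, truth in records:
        totals[group] += 1
        hits[group] += (pred == truth)
    return {g: hits[g] / totals[g] for g in totals}

# Made-up toy data in which the model does worse at "site_B":
records = [
    ("site_A", 1, 1), ("site_A", 0, 0), ("site_A", 1, 1), ("site_A", 0, 0),
    ("site_B", 1, 0), ("site_B", 0, 0), ("site_B", 1, 1), ("site_B", 0, 1),
]
print(accuracy_by_group(records))
# -> {'site_A': 1.0, 'site_B': 0.5}
```

A gap like the one between `site_A` and `site_B` here is the kind of signal the researchers flagged: the model isn't uniformly reliable, so results from some clinics would need extra human review.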
5. The "Speed vs. Power" Dilemma
The researchers also looked at the cost.
- The Big Models are like Ferraris: They are incredibly fast and powerful, but they need a massive fuel tank (huge computer memory) and are expensive to run.
- The Small Models are like Hybrid Cars: They are slower and less powerful, but they are efficient and can run on much smaller, cheaper computers.
- The Sweet Spot: They found a "Goldilocks" model (a medium-sized one) that offered a great balance: good accuracy without needing a supercomputer.
The Bottom Line
This paper shows that AI can act as a powerful assistant for mental health screening.
Imagine a future where a doctor doesn't have to spend two hours manually scoring an interview. Instead, the AI listens to the recording, instantly highlights the risky parts, gives a preliminary score, and writes a summary. The human doctor then just reviews the AI's work to make the final call.
This doesn't replace the doctor; it gives them a super-powered magnifying glass. It could help us screen millions more people, catch risks earlier, and get help to the people who need it before their condition gets worse.
In short: We are teaching computers to listen to our stories and spot the warning signs of mental illness. They aren't perfect yet, but they are getting very good at it, and they could soon help us save lives by making mental health care faster and more accessible.