Sentiment in Clinical Notes: A Predictor for Length of Stay?

This study finds that zero-shot sentiment analysis of clinical notes correlates only weakly with length of stay. Directly prompting a large language model to estimate hospitalization duration predicts the outcome substantially better, suggesting that future systems should prioritize direct outcome extraction over sentiment analysis.

Boyne, A., Feygin, M., Sholeen, J., Zimolzak, A.

Published 2026-03-18

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine a hospital as a busy airport. The "Length of Stay" (LOS) is simply how long a passenger (the patient) stays at the gate before boarding their flight home (being discharged). Hospital managers need to predict this accurately to know how many gates to keep open and how many staff to schedule.

Usually, they look at the passenger's ticket and passport (structured data like age, blood pressure, and lab results) to make this guess. But this study asked a different question: Can we guess the flight time just by reading the pilot's handwritten logbook?

Here is the story of that experiment, broken down simply:

The Experiment: Reading Between the Lines

The researchers took 4,503 admission notes for patients with pneumonia. These notes are the "pilot's logbooks"—unstructured, messy paragraphs written by doctors describing what's wrong with the patient.

They wanted to see if the tone or mood of these notes (Sentiment Analysis) could predict how long the patient would stay. They used four different "readers" to analyze the text:

  1. The Rule-Book Readers (VADER & TextBlob): These are like strict grammar teachers who follow a list of rules. If they see the word "bad," they mark it negative.
  2. The Context Reader (Longformer): This is a smart student who can read a whole long essay and understand how the beginning connects to the end.
  3. The Super-Brain (GPT-oss-20B): This is a massive Artificial Intelligence (AI) that has read almost everything on the internet. They asked it two things:
    • "How negative is this note?" (Sentiment)
    • "How long will this patient stay?" (Direct Guess)

The Results: A Surprising Twist

1. The Mood Doesn't Match the Medicine
The researchers thought that if a doctor wrote a very "negative" or "scary" note, the patient would stay longer.

  • The Reality: The connection was very weak. It's like trying to guess how long a movie will last just by looking at the color of the poster.
  • Why? Doctors are trained to be robots. They write facts: "Patient has fever," "Patient is intubated." They don't write, "Oh no, this is terrible!" Even though the situation is bad, the words don't sound emotional. The "Rule-Book Readers" got confused because they were looking for human emotions (like anger or sadness) that simply aren't there in medical charts.

2. The "Super-Brain" Got the Hint
The big AI (GPT) was asked to guess the length of stay directly.

  • The Result: It did better than the mood detectors, but it was still only a "C-" student. It could guess slightly better than random chance, but it wasn't a crystal ball.
  • The Catch: It was incredibly slow. While the simple readers could analyze 100 notes in a few seconds, the Super-Brain took over 6 minutes to do the same job. It's like using a supercomputer to calculate 2+2; it works, but it's overkill and too slow for a busy airport.
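A quick back-of-envelope check puts a number on that speed gap. Here "a few seconds" is assumed to mean about 5 seconds; only the 6-minute figure comes from the article.

```python
# Rough per-100-notes timings.
simple_seconds = 5.0     # rule-based readers (assumed "a few seconds")
llm_seconds = 6 * 60.0   # large model, "over 6 minutes" (from the article)

slowdown = llm_seconds / simple_seconds
print(slowdown)  # 72.0 -- roughly a 70x slowdown
```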

3. The "Context Reader" Was the Best of the Small Guys
The Longformer model (the smart student) was the most efficient. It didn't need to be a giant AI to find a tiny signal in the text. It could spot patterns in the long notes that the simple rule-followers missed, but it still only explained about 2% of the variation in stay times.
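"Explained about 2% of the variation" has a precise meaning: for a linear fit, the fraction of variance explained (R²) is the square of the correlation coefficient. A correlation of roughly 0.14 corresponds to that ~2% figure; the 0.14 value is back-calculated here for illustration, not quoted from the paper.

```python
def variance_explained(r: float) -> float:
    """Coefficient of determination R^2: the share of outcome
    variance accounted for by a linear fit with correlation r."""
    return r ** 2

# A correlation of ~0.14 explains only about 2% of the variance:
print(variance_explained(0.14))  # ~0.0196, i.e. about 2%
```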

The Big Takeaway: Why is this so hard?

Think of a clinical note like a weather report written by a robot.

  • If you ask a human, "Is it a bad day?" they might say, "Yes, it's storming!" (Negative sentiment).
  • But the robot says, "Precipitation: 100%. Wind: 50mph." (Neutral sentiment).

The study found that the "robot language" of doctors is too objective. The words "severe" or "critical" don't trigger the same "negative" alarm in AI models as the word "sad" does. Therefore, trying to predict a patient's stay based on the emotional tone of the note is like trying to predict the stock market by reading the weather report—it's the wrong tool for the job.

The Conclusion

The study concludes that while AI can find a tiny, hidden signal in these notes, it's not good enough to run the hospital on its own.

  • Don't throw away the structured data: The "ticket and passport" (age, labs, vitals) are still the best predictors.
  • Don't rely on "mood": Doctors aren't writing diaries; they are writing medical facts.
  • The Future: We need to build AI that is smart enough to read the "robot language" and understand that "intubated" means "very sick," without needing to feel "sad" about it. Until then, the best way to predict how long a patient stays is to look at their hard numbers, not their doctor's mood.
