Temporally Phenotyping GLP-1RA Case Reports with Large Language Models: A Textual Time Series Corpus and Risk Modeling


This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to understand a patient's medical journey, but instead of a neat, organized timeline on a spreadsheet, you have a messy, handwritten diary full of stories. Some entries say, "I felt sick three days after starting the new pill," while others say, "Two weeks later, the doctor changed the dose."

This is the problem with Type 2 Diabetes research, specifically regarding a popular class of drugs called GLP-1RAs (the "weight-loss and diabetes" drugs like Ozempic or Wegovy). Doctors know these drugs work, but they don't fully understand the long-term story: When do side effects happen? How does the disease progress over years?

Most medical data is like a library where books are sorted by title but have no page numbers. You know the story exists, but you can't find the specific chapters to see the sequence of events.

Here is how this paper solves that puzzle, using a few creative analogies:

1. The Problem: The "Messy Diary"

Traditional medical records are like a checklist. They tell you what happened (e.g., "Patient took drug X") and when (e.g., "January 5th"), but they often miss the story in between. They are great for short-term hospital stays but terrible for tracking a patient's life over 10 years.

On the other hand, Case Reports (stories doctors write about specific patients) are like rich, detailed diaries. They contain the full narrative: "The patient felt nauseous, stopped the drug, tried a different one, and two months later, their blood sugar improved." But these diaries are written in plain English, making them impossible for computers to read and analyze quickly.

2. The Solution: The "AI Translator"

The researchers built a digital translator using Large Language Models (LLMs), the same kind of AI that powers chatbots.

Think of the LLM as a super-intelligent librarian who can read thousands of these messy medical diaries and instantly turn them into a structured movie script.

Input: "Patient started semaglutide on Monday. By Wednesday, they had a headache. Two weeks later, they were hospitalized for kidney issues."
Output (The Timeline):
- Time 0: Start Drug.
- Time +2 days: Headache.
- Time +14 days: Hospitalization (Kidney).

The researchers taught this AI to extract these events and assign each one a time offset in hours relative to a reference point: hospital admission if the report states one, otherwise the earliest clinical encounter. They created a database of 136 of these "movie scripts" specifically for GLP-1 drugs.

3. The Quality Check: The "Human Editors"

To make sure the AI wasn't just hallucinating (making things up), the researchers had two clinically trained experts act as editors.

The experts manually read the same 136 stories and wrote their own timelines.
They compared the AI's timelines against the human editors' timelines.
The Result: The best AI (GPT-5) matched about 87% of the events in the expert timelines and put matched events in the right order about 84% of the time. Those scores are comparable to how closely the two human experts agreed with each other (about 81% and 80%), and the AI approach scales far more easily than manual annotation.

4. The Discovery: The "Respiratory Shield"

Once they had these clean, organized timelines, they ran a statistical test to see if taking the GLP-1 drugs changed the risk of certain problems.

Imagine they are looking for a shield that protects patients from specific dangers.

Heart & Kidneys: The data was a bit fuzzy here. The AI couldn't clearly say if the drugs made heart or kidney problems better or worse in these specific stories.
Lungs (Respiratory): Here, the pattern was clearer. In these case reports, patients who took the GLP-1 drugs had roughly a quarter of the hazard of developing respiratory (lung) problems compared to those who didn't (hazard ratio 0.259, p = 0.040).

This finding is like discovering that while the drug might not fix the car's engine (heart) or brakes (kidneys) in every story, it may act as an umbrella against the rain (lung issues). This matches what other studies have suggested, although case reports are a selective sample, so the result is a signal worth investigating rather than proof.

5. Why This Matters

This paper is a proof of concept. It shows that we don't need to wait for perfect, structured data to understand long-term health trends. We can use AI to turn messy, unstructured stories into powerful data.

The Analogy: Before, trying to find a pattern in medical stories was like trying to find a needle in a haystack by looking at the whole haystack at once. Now, the AI acts as a magnet that pulls out the needles (events) and arranges them in a neat row, so we can actually see the pattern.

The Takeaway

The researchers have built a new tool that turns stories into data. They proved that AI can read medical case reports, understand the timeline of a patient's life, and help doctors predict risks. In this specific test, it suggested that GLP-1 drugs might offer a surprising bonus: better lung health.

The best part? They plan to release the AI-extracted timelines and the expert annotations as a benchmark, so other scientists can apply the same approach to heart disease, cancer, or any other condition where the "story" matters more than the "checklist."

1. Problem Statement

Type 2 Diabetes (T2D) progression and the long-term effects of Glucagon-like peptide-1 receptor agonists (GLP-1RAs) are critical areas of study. However, existing research faces significant limitations:

Data Fragmentation: Traditional studies rely on structured Electronic Health Records (EHRs) and claims data, which often lack the narrative context required to reconstruct medication-centered disease dynamics (e.g., indications, tolerability, specific side effects, and clinical decision points).
Temporal Ambiguity in Text: Unstructured clinical narratives (case reports) contain rich longitudinal details, but event timing is often expressed in relative, free-text terms (e.g., "two weeks after starting semaglutide") rather than absolute timestamps.
Scarcity of Annotated Corpora: Progress in clinical temporal reasoning is hindered by a lack of large, richly annotated datasets. Previous efforts (e.g., i2b2 2012) were limited by small, single-institution datasets and metadata-based timestamps that miss fine-grained event sequences.

The core challenge is to convert unstructured clinical narratives into structured, time-resolved clinical timelines to enable longitudinal risk forecasting and personalized treatment planning.

2. Methodology

The authors developed a pipeline to extract, annotate, and evaluate textual time series (TTS) from case reports.

A. Data Extraction

Source: PubMed Open Access (PMOA) repository (1.48M manuscripts).
Filtering:
1. Identified 145,571 candidate case reports using regex patterns (e.g., "case report," "year-old").
2. Filtered for single-patient reports using an LLM-based filter (124,699 reports).
3. Selected GLP-1RA cohorts via keyword matching against a curated lexicon (including drug names like semaglutide, liraglutide, and class-level terms).
4. Final Dataset: 136 GLP-1RA case reports.
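
The regex and keyword stages of this filter can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the two patterns are modeled on the examples quoted above, the lexicon contains only the two drug names mentioned plus assumed class-level terms, and the LLM-based single-patient filter (step 2) is left out entirely.

```python
import re

# Hypothetical patterns modeled on the quoted examples ("case report", "year-old").
# The authors' full pattern list is not given in this summary.
CASE_REPORT_PATTERNS = [
    re.compile(r"\bcase report\b", re.IGNORECASE),
    re.compile(r"\b\d{1,3}[- ]year[- ]old\b", re.IGNORECASE),
]

# Illustrative lexicon: semaglutide and liraglutide are named in the text;
# the class-level terms below are assumptions.
GLP1RA_LEXICON = ["semaglutide", "liraglutide", "glp-1 receptor agonist", "glp-1ra"]


def is_candidate_case_report(text: str) -> bool:
    """Step 1: keep documents matching any case-report pattern."""
    return any(pattern.search(text) for pattern in CASE_REPORT_PATTERNS)


def mentions_glp1ra(text: str) -> bool:
    """Step 3: case-insensitive keyword match against the lexicon."""
    lowered = text.lower()
    return any(term in lowered for term in GLP1RA_LEXICON)


def select_glp1ra_reports(documents: list[str]) -> list[str]:
    """Apply steps 1 and 3; the LLM single-patient filter (step 2) is omitted."""
    return [
        doc for doc in documents
        if is_candidate_case_report(doc) and mentions_glp1ra(doc)
    ]


docs = [
    "A 54-year-old woman with type 2 diabetes started semaglutide.",
    "We review GLP-1RA trials in a meta-analysis.",
    "Case report: a 60 year old man with pneumonia.",
]
selected = select_glp1ra_reports(docs)
```

Only the first toy document passes both stages: the second never matches a case-report pattern, and the third mentions no GLP-1RA term.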

B. Textual Time Series (TTS) Annotation

LLM-as-Annotator: Multiple Large Language Models (DeepSeek R1, Llama3.3, GPT5, O1, O3, O4mini) were prompted to extract TTS from the 136 reports.
Definition of TTS: A set $S = \{(e_1, t_1), \ldots, (e_n, t_n)\}$, where $e_i$ is a clinical finding and $t_i$ is the time in hours relative to a reference point ($t = 0$).
- Reference Point: Hospital admission (if stated) or the earliest clinical encounter.
- Events: Symptoms, diagnoses, procedures, treatments, outcomes, and pertinent negatives.
- Normalization: Natural language time expressions (e.g., "3-day history") were converted to hour offsets. Events before $t=0$ are negative; events after are positive.
Feature Extraction: GPT5 was used to extract demographics (age, sex, ethnicity) and generate diagnosis lists, which were then mapped to Unified Medical Language System (UMLS) concepts using ScispaCy.
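
To make the TTS definition and the normalization rule concrete, here is a small sketch of how a timeline might be represented and how a relative time expression becomes a signed hour offset. In the paper this conversion is done by the LLM; the `TimedEvent` class, the helper names, and the regex are all hypothetical and only cover hours, days, and weeks.

```python
import re
from dataclasses import dataclass

# Fixed-length units only; months and years are skipped because their
# length in hours is ambiguous.
HOURS_PER_UNIT = {"hour": 1, "day": 24, "week": 24 * 7}


@dataclass(frozen=True)
class TimedEvent:
    event: str    # clinical finding e_i
    hours: float  # offset t_i from the reference point t = 0


def to_hours(quantity: float, unit: str, before_reference: bool = False) -> float:
    """Convert a quantity and unit to hours; negative if before t = 0."""
    hours = quantity * HOURS_PER_UNIT[unit.lower().rstrip("s")]
    return -hours if before_reference else hours


# Hypothetical pattern for phrases such as "3-day history".
_HISTORY = re.compile(r"(\d+)[- ](hour|day|week)s? history", re.IGNORECASE)


def parse_history(expression: str) -> float:
    """Map an 'N-unit history' phrase to a negative hour offset."""
    match = _HISTORY.search(expression)
    if match is None:
        raise ValueError(f"unsupported time expression: {expression!r}")
    return to_hours(int(match.group(1)), match.group(2), before_reference=True)


# A toy timeline anchored at hospital admission (t = 0), sorted by offset.
timeline = sorted(
    [
        TimedEvent("hospital admission", 0),
        TimedEvent("nausea", parse_history("3-day history of nausea")),
        TimedEvent("acute kidney injury", to_hours(2, "weeks")),
    ],
    key=lambda e: e.hours,
)
```

Here the nausea lands at -72 hours (before admission) and the kidney injury at +336 hours, matching the sign convention described above.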

C. Evaluation Framework

Gold Standard: Two clinically trained experts manually annotated the 136 reports independently to create a reference timeline.
Metrics:
- Event Matching: Recursive best-match procedure using PubMedBERT sentence embeddings. Two events are counted as a match if the cosine distance between their embeddings is $\le 0.1$.
- Temporal Concordance: C-index (probability that matched event pairs maintain the same temporal order).
- Timestamp Accuracy: Area Under the Log-Time CDF (AULTC), which quantifies how concentrated timestamp errors are near zero (higher is better).
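
The first two metrics can be illustrated with a short sketch. It makes several assumptions: a greedy closest-pair-first loop stands in for the paper's recursive best-match procedure, tiny hand-made vectors stand in for PubMedBERT embeddings, and the concordance function uses the common pairwise definition (ties in the reference are skipped, ties in the prediction count as half). AULTC is not sketched because its exact formula is not given here.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are nonzero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def greedy_match(pred_emb, ref_emb, threshold=0.1):
    """Pair each predicted event with at most one reference event,
    taking the closest pairs first and stopping once distance > threshold."""
    pairs = sorted(
        (cosine_distance(p, r), i, j)
        for i, p in enumerate(pred_emb)
        for j, r in enumerate(ref_emb)
    )
    used_pred, used_ref, matches = set(), set(), []
    for dist, i, j in pairs:
        if dist > threshold:
            break
        if i in used_pred or j in used_ref:
            continue
        used_pred.add(i)
        used_ref.add(j)
        matches.append((i, j))
    return matches


def concordance(pred_times, ref_times):
    """Fraction of matched event pairs whose order agrees with the reference."""
    concordant, comparable = 0.0, 0
    n = len(ref_times)
    for a in range(n):
        for b in range(a + 1, n):
            if ref_times[a] == ref_times[b]:
                continue  # order undefined in the reference
            comparable += 1
            if pred_times[a] == pred_times[b]:
                concordant += 0.5
            elif (pred_times[a] < pred_times[b]) == (ref_times[a] < ref_times[b]):
                concordant += 1.0
    return concordant / comparable if comparable else float("nan")


# Toy 3-dimensional "embeddings" (illustrative only).
pred = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
ref = [[0.99, 0.05, 0], [0, 0.98, 0.1], [1, 1, 0]]
matches = greedy_match(pred, ref)
match_rate = len(matches) / len(ref)
```

In this toy case two of the three reference events find a partner within the 0.1 threshold, so the match rate is 2/3. The timestamps of the matched pairs would then be passed to `concordance`.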

D. Downstream Analysis

Survival Modeling: A Cox proportional hazards model was used to analyze time-to-onset for kidney, cardiovascular, and respiratory outcomes.
Cohorts:
- Treatment: GLP-1RA users with diabetes (initiation within 72 hours of $t=0$).
- Control: Non-users and late-initiators (treated as unexposed).
Adjustments: Models were adjusted for age and sex.
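
For readers unfamiliar with the model, an age- and sex-adjusted Cox proportional hazards model is conventionally written as below. The covariate symbols are assumptions introduced for illustration: $x_{\mathrm{GLP}}$ is 1 for the treatment cohort and 0 for the control cohort, and the summary does not state how age and sex were encoded.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(
    \beta_{\mathrm{GLP}}\, x_{\mathrm{GLP}}
  + \beta_{\mathrm{age}}\, x_{\mathrm{age}}
  + \beta_{\mathrm{sex}}\, x_{\mathrm{sex}}
\right),
\qquad
\mathrm{HR}_{\mathrm{GLP}} = \exp\!\left(\beta_{\mathrm{GLP}}\right)
```

Here $h_0(t)$ is the unspecified baseline hazard, and the hazard ratio for GLP-1RA exposure compares two patients of the same age and sex who differ only in exposure status. A value below 1 indicates a lower hazard among users.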

3. Key Contributions

Novel Corpus: Creation of the first GLP-1RA textual time-series corpus derived from 136 PMOA case reports, converting unstructured narratives into structured, time-stamped timelines.
Gold Standard Benchmark: A manually annotated reference set by two domain experts to rigorously evaluate LLM temporal extraction capabilities.
LLM Benchmarking: Comprehensive evaluation of multiple foundation and instruction-tuned LLMs, demonstrating that GPT5 achieves the best trade-off between event coverage and temporal fidelity.
Clinical Utility Demonstration: Successful application of the extracted timelines to perform time-to-event survival analysis, revealing outcome-specific associations between GLP-1RA exposure and clinical sequelae.
Open Release: The authors plan to release both the LLM-extracted timelines and expert annotations as a pilot benchmark corpus.

4. Results

Descriptive Statistics

Demographics: Median age 49; nearly balanced sex distribution (49% male, 49% female). Ethnicity was rarely reported (78% unspecified).
Timeline Density: Timelines varied from tens to hundreds of timesteps (median ~50–110 events).
Temporal Span: Median follow-up was 7 years (2,565 days), with a mean of 11 years, reflecting the longitudinal nature of case reports.
Diagnoses: The cohort was dominated by cardiometabolic conditions (Hypertension, Obesity, Type 2 Diabetes). Diabetes prevalence in the cohort was 87.9% (vs. 11.3% in general US adults), confirming the selection bias inherent in case reports.

Model Performance

Best Performer: GPT5 consistently achieved the highest event match rates (0.871) and reliable temporal sequencing (Concordance 0.843) across symptoms, diagnoses, and treatments.
Comparison: GPT5 outperformed other models (O3, O4mini, Llama variants) and even showed better temporal ordering than one of the human annotators (Annotator 2) at similar match rates.
Human Baseline: Inter-annotator agreement was 0.811 (match rate), 0.798 (concordance), and 0.702 (AULTC).

Survival Modeling Findings

Respiratory Outcomes: GLP-1RA users showed a significantly lower risk of respiratory sequelae compared to non-users (Hazard Ratio [HR] = 0.259, $p = 0.040$). This aligns with prior literature suggesting respiratory benefits.
Cardiovascular Outcomes: No clear association was found (HR = 0.927, $p = 0.835$).
Kidney Outcomes: The point estimate suggested a higher hazard (HR = 1.675), but the result was not statistically significant ($p = 0.239$). The authors attribute this to case-report selection bias and limited covariate adjustment rather than a true biological effect.
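
As a reading aid, the hazard ratios above can be turned into percent changes in hazard and checked against a conventional 0.05 significance threshold. The snippet below only does this arithmetic on the reported numbers; the threshold is an assumption about how "significant" is being used, and the function name is hypothetical.

```python
# Reported (hazard ratio, p-value) pairs from the survival models above.
results = {
    "respiratory": (0.259, 0.040),
    "cardiovascular": (0.927, 0.835),
    "kidney": (1.675, 0.239),
}

ALPHA = 0.05  # assumed conventional significance threshold


def percent_change_in_hazard(hazard_ratio: float) -> float:
    """HR of 0.259 means a 74.1% lower hazard; HR of 1.675 means 67.5% higher."""
    return (hazard_ratio - 1.0) * 100.0


summary = {
    outcome: {
        "percent_change": round(percent_change_in_hazard(hr), 1),
        "significant": p < ALPHA,
    }
    for outcome, (hr, p) in results.items()
}

for outcome, row in summary.items():
    print(f"{outcome}: {row['percent_change']:+.1f}% hazard, significant={row['significant']}")
```

Only the respiratory estimate falls below the threshold. The kidney estimate points toward higher hazard, but with $p = 0.239$ it cannot be distinguished from no effect.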

5. Significance and Limitations

Significance:

Bridging the Gap: The study demonstrates that LLMs can effectively transform unstructured clinical narratives into structured, time-resolved data, overcoming the limitations of structured EHRs for long-term trajectory modeling.
Scalability: The "LLM-as-annotator" approach offers a scalable alternative to the costly and time-intensive manual annotation of clinical timelines.
Risk Modeling: The framework successfully enabled time-to-event analysis, proving that textual time series can support downstream clinical utility and risk stratification.

Limitations:

Selection Bias: Case reports inherently overrepresent rare, severe, or complex cases and are not representative of the general population.
Annotation Cost: While LLMs scale the process, the gold standard requires expensive expert manual annotation.
Temporal Precision: Outcomes are based on "time to first documentation" in the text, which may differ from biological onset.
Error Propagation: Reliance on LLMs for extraction introduces potential subtle errors in event identification and timestamping that may affect downstream analyses.

Future Directions:
The authors suggest extending this framework to other disease domains (e.g., sepsis, imaging-focused reports) and integrating multimodal data (lab trends, imaging) to reconcile differences between documented and perceived event timing.