PaReGTA: An LLM-based EHR Data Encoding Approach to Capture Temporal Information

Imagine you are trying to understand a patient's health story by looking at their medical records. These records are like a massive, messy library of notes, prescriptions, and test results collected over years.

The problem? Most computer programs treat these records like a grocery list. They just count how many times "aspirin" or "depression" appears. They throw away the order of events and when they happened. It's like trying to understand a movie by just counting how many times the word "car" appears, without knowing if the car chase happened at the beginning or the end.

This paper introduces PaReGTA, a new way to teach computers how to read these medical stories properly. Here is how it works, broken down into simple concepts:

1. Turning Data into a Story (The "Translator")

Instead of feeding the computer a spreadsheet of codes, PaReGTA acts like a translator. It takes raw medical data (like "Lasmiditan 100mg taken on Sept 1st") and turns it into short, readable sentences, almost like a diary entry.

The Magic: It doesn't just say "Medicine taken." It says, "62 days after the last visit, the patient took Lasmiditan."
Why it matters: This preserves the timeline. It tells the computer that time has passed between events, which is crucial for understanding diseases like migraines that change over time.
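To make the "diary entry" idea concrete, here is a minimal sketch of visit-level textualization with an inter-visit gap cue. The function name `textualize_visits`, the input layout (a list of date and event-list pairs), and the sentence template are all illustrative assumptions, not the paper's actual templates.

```python
from datetime import date

def textualize_visits(visits):
    """Turn (visit_date, [events]) records into gap-annotated sentences.

    `visits` is a hypothetical list of (date, list_of_event_strings) tuples;
    the wording of the template is an assumption made for illustration.
    """
    ordered = sorted(visits, key=lambda v: v[0])  # chronological order
    sentences = []
    previous_date = None
    for visit_date, events in ordered:
        if previous_date is None:
            prefix = "At the first visit"
        else:
            # Explicit temporal cue: days elapsed since the previous visit.
            gap_days = (visit_date - previous_date).days
            prefix = f"{gap_days} days after the previous visit"
        sentences.append(f"{prefix}, the patient record lists: {', '.join(events)}.")
        previous_date = visit_date
    return sentences

visits = [
    (date(2023, 9, 1), ["Lasmiditan 100mg"]),
    (date(2023, 7, 1), ["migraine diagnosis"]),
]
for sentence in textualize_visits(visits):
    print(sentence)
```

Each visit becomes its own sentence, so the gap between visits travels with the text rather than being lost in a count vector.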

2. The "Smart Reader" (The LLM)

The system uses a pre-trained "Smart Reader" (a type of Large Language Model, or LLM). Think of this reader as a super-librarian who has already read millions of books and knows how words relate to each other.

The Trick: Instead of teaching the librarian from scratch (which takes forever and needs huge data), the researchers give the librarian a quick "refresher course" (called SimCSE) specifically on migraine patient notes.
The Result: The librarian learns to understand that "Lasmiditan" and "Migraine" are closely linked, even if they haven't seen that exact combination before. It creates a "mental map" (embedding) of the patient's health.
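The "refresher course" can be pictured with a small numerical sketch. Unsupervised SimCSE encodes the same sentence twice under different dropout noise and trains the two views to match each other while differing from the other sentences in the batch. The snippet below is only an illustration under stated assumptions: random vectors stand in for encoder outputs, dropout is applied directly to those vectors, and the temperature of 0.05 and dropout rate of 0.1 are assumed values, not settings reported in this summary.

```python
import numpy as np

def info_nce_loss(view_a, view_b, temperature=0.05):
    """InfoNCE over a batch: row i of view_a should match row i of view_b."""
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature              # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # positives sit on the diagonal

def dropout(x, rate, rng):
    """Inverted dropout: zero random entries and rescale the survivors."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 64))    # stand-in for 8 encoded visit sentences
view_a = dropout(features, 0.1, rng)   # first noisy pass
view_b = dropout(features, 0.1, rng)   # second pass with a different mask

aligned = info_nce_loss(view_a, view_b)          # correct pairing
shuffled = info_nce_loss(view_a, view_b[::-1])   # deliberately wrong pairing
print(aligned, shuffled)
```

The loss is small when each sentence is paired with its own noisy copy and large when the pairing is scrambled, which is the signal the fine-tuning step optimizes.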

3. The "Time Machine" (Hybrid Pooling)

Once the librarian has read all the patient's visits, how do we summarize the whole story into one score?

The Problem: If you just average everything, the most recent visit (which might be the most important) gets lost in the noise of visits from 10 years ago.
The Solution: PaReGTA uses a hybrid spotlight.
- Spotlight 1 (Recency): It shines a bright light on the most recent visits because they usually tell us the most about the patient's current state.
- Spotlight 2 (Importance): It also shines a light on visits that are globally important, even if they happened a while ago (like a major surgery or a chronic condition diagnosis).
The Mix: It combines these two spotlights into a single summary of the patient's health journey.

4. The "What-If" Detective (PaReGTA-RSS)

One of the biggest complaints about AI in medicine is that it's a "black box": you get a prediction, but you don't know why.

The Innovation: The authors created a tool called PaReGTA-RSS. Imagine you are a detective trying to solve a crime. You ask, "What if the suspect hadn't taken this specific medicine?"
How it works: The system takes the patient's story, removes a specific factor (like "high blood pressure"), re-reads the story, and sees how the prediction changes.
The Output: It gives a signed score: "Removing 'Depression' from this patient's story changed the prediction by X amount." This shows doctors which factors are driving the AI's decision, which makes the prediction easier to check and trust.

Why is this a Big Deal?

It works with messy data: Real-world medical records are messy. Drugs are listed by brand names, not categories. PaReGTA can read the raw brand names and understand them because of its "Smart Reader" training, so it doesn't need expensive, manual cleanup.
It works with small groups: You don't need millions of patients to train it. Because it starts with a smart pre-trained reader, it works well even with smaller groups of patients (like the roughly 39,000 migraine patients they tested it on).
It beats the old ways: In their tests, PaReGTA reached about 92% accuracy at predicting whether a migraine was "chronic" (severe/long-term) or "episodic" (occasional), compared with about 84% for one-hot encoding, one of the old methods that just count codes.

In a nutshell: PaReGTA turns a messy pile of medical receipts into a coherent story, uses a smart reader to understand the timeline, and then acts as a detective to explain why it thinks a patient is at risk. It bridges the gap between complex AI and real-world doctors.

1. Problem Statement

Electronic Health Records (EHRs) contain rich longitudinal data (diagnoses, prescriptions, comorbidities) that are crucial for modeling disease trajectories. However, current practical approaches face significant limitations:

Loss of Temporal Information: Traditional methods (one-hot encoding, aggregated count vectors) collapse visit-level records into unordered summaries, discarding clinically meaningful temporal dynamics (e.g., the sequence of disease emergence).
Limitations of Sequence Models: While Recurrent Neural Networks (RNNs) and Transformers preserve sequence data, they are often data-hungry, computationally expensive, and unstable when applied to sparse, irregular, and heterogeneous real-world EHR data.
Data Heterogeneity & Normalization: EHR medication records often exist as raw product names rather than standardized concepts (e.g., RxNorm). Standardizing these requires costly manual mapping or rigid taxonomies, which hinders scalability.
Interpretability Gap: Deep learning models (including LLMs) act as "black boxes." Standard feature importance tools (like SHAP or LIME) are computationally prohibitive when applied to pipelines involving text-to-embedding conversion and pooling, as they require re-running the entire pipeline for every perturbation.

2. Methodology: PaReGTA

The authors propose PaReGTA (Patient Representation Generation with Temporal Aggregation), an end-to-end framework that converts structured EHR data into fixed-dimensional patient representations using Large Language Models (LLMs) while preserving temporal cues.

A. Framework Overview

The pipeline consists of three main stages:

Visit-Level Textualization:
- Raw EHR records are partitioned into clinically meaningful concepts (medications, comorbidities).
- Records are converted into short, templated sentences at the visit level (not patient level).
- Temporal Tokenization: Explicit temporal cues are injected into the text. The paper evaluates several strategies: absolute dates, inter-visit gaps (e.g., "214 days after previous"), month aggregations, and time since the last visit.
- Raw Medication Handling: Unlike baselines that map drugs to classes, PaReGTA uses raw product names as recorded in the EHR, leveraging the pre-trained LLM's semantic knowledge to understand drug families and indications without manual mapping.
Domain Adaptation via Contrastive Learning (SimCSE):
- A pre-trained sentence-embedding model (GTE-base-v1.5) is used as the base encoder.
- To adapt the model to the specific clinical domain without labeled sentence pairs, the authors employ unsupervised SimCSE. This involves applying dropout noise to the same input sentence to create positive pairs, optimizing an InfoNCE loss to pull semantically similar embeddings together and push unrelated ones apart.
Hybrid Temporal Pooling:
- Visit-level embeddings are aggregated into a single patient-level vector using a hybrid weighting scheme:
  - Time-Decay Weighting: Assigns higher weights to recent visits ( $r_i = e^{-\gamma(t_N - t_i)}$ ).
  - Attention-Based Weighting: Identifies globally informative visits regardless of recency by computing similarity to a global context vector.
- The final representation is a convex combination of these weights, followed by L2 normalization.
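The hybrid pooling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the global context vector is assumed to be the mean of the visit embeddings, both weight sets are normalized to sum to one before mixing, and the values of `gamma` and the mixing coefficient `alpha` are placeholders.

```python
import numpy as np

def hybrid_pool(visit_embeddings, visit_days, gamma=0.01, alpha=0.5):
    """Pool visit embeddings with a mix of time-decay and attention weights.

    gamma (decay rate) and alpha (mixing coefficient) are assumed values.
    """
    E = np.asarray(visit_embeddings, dtype=float)   # shape (N, d)
    t = np.asarray(visit_days, dtype=float)         # shape (N,), days since first visit

    # Time-decay weights r_i = exp(-gamma * (t_N - t_i)), then normalized.
    decay = np.exp(-gamma * (t.max() - t))
    decay /= decay.sum()

    # Attention weights: softmax of similarity to a global context vector.
    context = E.mean(axis=0)        # assumption: mean embedding as context
    scores = E @ context
    scores -= scores.max()          # numerical stability
    attn = np.exp(scores) / np.exp(scores).sum()

    # Convex combination of the two weightings, then L2 normalization.
    weights = alpha * decay + (1.0 - alpha) * attn
    pooled = weights @ E
    return pooled / np.linalg.norm(pooled), weights

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 8))   # stand-in for 4 visit embeddings
days = [0, 30, 244, 306]
patient_vec, weights = hybrid_pool(embeddings, days)
print(weights, np.linalg.norm(patient_vec))
```

Setting `alpha=1.0` reduces the scheme to pure recency weighting and `alpha=0.0` to pure attention, which makes the contribution of each component easy to inspect.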

B. Interpretability: PaReGTA-RSS

To address the "black box" nature of LLM encoders, the authors introduce PaReGTA-RSS (Representation Shift Score):

Mechanism: For a specific clinical factor (e.g., a medication class), the factor is removed from the visit-level text to create a perturbed version.
Calculation: The model generates embeddings for both the original and perturbed texts. The shift in the representation ( $\Delta r = r_{clean} - r_{perturbed}$ ) is projected through a downstream classifier (specifically Logistic Regression) to calculate the change in the decision score (logit).
Output: This yields a signed, additive attribution score quantifying the contribution of that factor to the prediction, applicable at both the cohort and individual patient levels.
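Because the downstream classifier is a logistic regression, its logit is linear in the representation, so projecting the shift $\Delta r$ onto the classifier weights gives exactly the change in the decision score. The sketch below illustrates that identity with made-up numbers; the weight vector, intercept, and embeddings are random placeholders, and `representation_shift_score` is a hypothetical helper name.

```python
import numpy as np

def representation_shift_score(r_clean, r_perturbed, coef):
    """Signed logit change attributed to the removed factor: w . (r_clean - r_perturbed)."""
    delta_r = np.asarray(r_clean, dtype=float) - np.asarray(r_perturbed, dtype=float)
    return float(np.dot(coef, delta_r))

def logit(r, coef, intercept):
    """Decision score of a linear (logistic regression) classifier."""
    return float(np.dot(coef, r) + intercept)

rng = np.random.default_rng(0)
coef = rng.normal(size=16)        # placeholder for fitted classifier weights
intercept = 0.3                   # placeholder intercept
r_clean = rng.normal(size=16)     # embedding of the original patient text
r_perturbed = r_clean + 0.1 * rng.normal(size=16)  # embedding with one factor removed

score = representation_shift_score(r_clean, r_perturbed, coef)
direct = logit(r_clean, coef, intercept) - logit(r_perturbed, coef, intercept)
print(score, direct)
```

The intercept cancels in the difference, so the score depends only on the weights and the shift. A positive score means the removed factor was pushing the prediction toward the positive class.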

3. Key Contributions

PaReGTA Framework: A novel, model-agnostic encoding pipeline that transforms longitudinal EHRs into visit-level templated text, utilizes lightweight contrastive fine-tuning (SimCSE), and employs hybrid temporal pooling. It avoids training from scratch, making it effective for data-limited cohorts.
Temporal Tokenization Ablation: A systematic evaluation of how different temporal representations (absolute date vs. inter-visit gap) affect downstream performance, identifying "inter-visit gap" as the most effective cue.
Robustness to Heterogeneous Medication Data: Demonstrates the ability to encode raw, product-level medication strings directly, bypassing the need for expensive concept mapping and handling missing/confounded data gracefully.
PaReGTA-RSS: A novel factor importance method tailored for LLM-based encoders that quantifies clinical factor importance via representation shifts, enabling interpretable AI in healthcare.
Real-World Validation: Extensive validation on the All of Us (AoU) Research Program dataset (39,088 migraine patients), showing superior performance over sparse baselines, whereas deep sequential models proved unstable on this cohort.

4. Results

The study was evaluated on a migraine type classification task (Chronic vs. Episodic) using the AoU dataset.

Performance:
- PaReGTA significantly outperformed traditional baselines (One-hot encoding and Count Bag-of-Codes).
- Best Performance: Using the Gap temporal tokenization scheme with LightGBM, PaReGTA achieved 92.33% Accuracy and 0.9524 AUC.
- Baseline Comparison: One-hot encoding achieved ~84% Accuracy and ~0.76 AUC.
- Deep Learning Baselines: Standard deep sequential models (RETAIN, T-LSTM) failed to converge or yield stable results on this specific cohort, highlighting the robustness of the PaReGTA approach in data-sparse/irregular settings.
Temporal Ablation:
- Gap (inter-visit time) outperformed Date (absolute) and Last (time since last visit).
- Removing temporal tokens entirely caused a significant performance drop, confirming the necessity of temporal context.
Embedding Quality:
- The combination of visit-level segmentation and SimCSE fine-tuning resulted in embeddings with superior Uniformity and Isotropy compared to naive full-text concatenation or untuned models.
Factor Importance (RSS):
- The method successfully identified clinically relevant factors. For example, OnabotulinumtoxinA (Botox) and CGRP-targeting therapies showed the highest importance for chronic migraine, aligning with clinical guidelines (prophylactic use).
- Subgroup analysis revealed heterogeneity: Depression and PTSD were more influential for male patients, while Fibromyalgia and Temporomandibular Disorders were more influential for female patients.

5. Significance

Clinical Utility: PaReGTA bridges the gap between the theoretical power of LLMs and the practical constraints of clinical data (sparsity, heterogeneity, lack of standardization). It allows for high-performance modeling without requiring massive, perfectly curated datasets.
Interpretability: By introducing PaReGTA-RSS, the work addresses a critical barrier to clinical adoption: the inability to explain why an LLM-based model made a specific prediction. It provides factor-level attributions that clinicians can inspect.
Scalability: The framework is modular. It can easily swap in newer, stronger sentence-embedding models without re-engineering the entire pipeline, ensuring longevity as LLM technology advances.
Real-World Robustness: The ability to handle raw medication strings and irregular visit intervals makes this approach immediately applicable to diverse, multi-institutional EHR systems where data harmonization is incomplete.

In conclusion, PaReGTA demonstrates that leveraging pre-trained LLMs with lightweight domain adaptation and explicit temporal encoding is a superior strategy for EHR analysis compared to both traditional sparse representations and complex deep sequence models, particularly when interpretability and data heterogeneity are primary concerns.
