This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine your medical history not as a giant, messy pile of paper files, but as a storybook where every chapter is a visit to a doctor, a trip to the hospital, or a new prescription.
The problem with this storybook is that it's written in a very strange way:
- The pages are scattered: Some chapters are written every day, others only once every five years.
- The language is mixed: One page might have a diagnosis code, another a surgery code, and another a medication code, all jumbled together.
- The gaps are confusing: Sometimes the time between chapters matters a lot (a fever today vs. a fever next year), but computers usually just count the pages, ignoring the time in between.
Enter HealthFormer. Think of it as a super-smart librarian who has read millions of these medical storybooks and learned how to understand the story, not just the words.
Here is how it works, broken down into simple concepts:
1. The Two-Level Reading Strategy (The "Dual-Level" Part)
Most computer programs try to read a medical record by flattening it into a long list of words. HealthFormer is smarter. It reads in two layers:
Layer 1: The "Event" Reader (Intra-Event):
Imagine you walk into a doctor's office. You might have a fever, a rash, and a prescription for antibiotics all at once. A normal computer might see "Fever," "Rash," "Antibiotic" as three separate, unrelated items.
HealthFormer's first job is to look at that specific visit and say, "Ah, these three things happened together in this specific context." It bundles them into a single "event package" before moving on. It understands that a rash and an antibiotic often go hand-in-hand during a specific visit.
Layer 2: The "Timeline" Reader (Inter-Event):
Once it has the "event packages," it looks at the whole timeline. It asks, "How long was it between this visit and the last one?"
Instead of just counting "Visit 1, Visit 2," it uses a special Time-Sense. It knows that a gap of 2 days is very different from a gap of 2 years. It uses a technique called ALiBi (Attention with Linear Biases) that lets it pay more attention to recent events while still remembering what happened years ago, without getting confused by the irregular gaps.
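The Time-Sense idea can be sketched in a few lines: instead of treating visits as evenly spaced pages, an ALiBi-style bias subtracts a penalty from the attention score between two visits in proportion to the number of days between them. This is a toy illustration of the general technique, not the paper's actual implementation; the function name, the slope value, and the day offsets are all made up.

```python
import numpy as np

def time_bias_attention(scores, visit_days, slope=0.1):
    """ALiBi-style time bias: attention between two visits is penalized
    in proportion to the day gap between them, so nearby visits matter
    more while distant ones are still visible.

    scores      -- (n_visits, n_visits) matrix of raw attention scores
    visit_days  -- each visit's day offset from the first visit
    slope       -- how quickly attention decays per day (illustrative value)
    """
    days = np.asarray(visit_days, dtype=float)
    gaps = np.abs(days[:, None] - days[None, :])   # day gaps between every pair of visits
    biased = scores - slope * gaps                 # far-apart visits get a larger penalty
    # Softmax over each row turns biased scores into attention weights.
    e = np.exp(biased - biased.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Three visits: day 0, day 2, and day 730 (two years later).
scores = np.zeros((3, 3))  # equal raw scores, so the time bias alone decides
weights = time_bias_attention(scores, [0, 2, 730])
print(weights[0])  # the first visit attends far more to the visit 2 days away
                   # than to the one 2 years away
```

Because the penalty is linear in the gap, a 2-day gap barely changes the weights while a 2-year gap almost zeroes them out, which is exactly the "recent events matter more" behavior described above.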
2. Learning Without a Teacher (Self-Supervised Pretraining)
You might ask, "How does this librarian learn?" It didn't have a teacher telling it, "This patient will get cancer." Instead, it played a massive game of "Fill in the Blanks" using millions of anonymous medical records from Hungary.
It was given four challenges:
- Hide and Seek (Masked Prediction): The computer covered up a diagnosis code (like "Diabetes") and tried to guess it based on the other codes in the same visit and the patient's history.
- Guess the Type: It covered up the type of visit (e.g., "Was this a surgery or a check-up?") and had to guess the type based on the surrounding visits.
- The Crystal Ball (Next Event): It looked at today's visit and tried to guess what kind of visit would happen next.
- Time Travel (Time Prediction): It tried to guess exactly how many days it would be until the next visit.
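The "Hide and Seek" game above boils down to building training pairs: hide some codes, remember the answers, and train the model to recover them from context. Here is a minimal, generic masked-prediction sketch (the function name, mask token, and 15% default rate are common conventions, not details taken from the paper):

```python
import random

def mask_codes(visit_codes, mask_token="[MASK]", p=0.15, rng=None):
    """Randomly replace a fraction p of medical codes with a mask token.

    Returns the masked sequence and an 'answer key' mapping each masked
    position back to the true code. A model is then trained to predict
    the hidden codes from the surrounding, unmasked context.
    """
    rng = rng or random.Random(0)  # fixed seed here just for reproducibility
    masked, targets = [], {}
    for i, code in enumerate(visit_codes):
        if rng.random() < p:
            masked.append(mask_token)
            targets[i] = code          # remember the true code at this position
        else:
            masked.append(code)
    return masked, targets

visit = ["FEVER", "RASH", "ANTIBIOTIC", "CHECKUP", "BLOODTEST", "DIABETES"]
masked, answers = mask_codes(visit, p=0.5)
print(masked)   # some codes replaced by [MASK]
print(answers)  # positions mapped back to the hidden codes
```

The other three games follow the same pattern, just with a different thing hidden: the visit type, the next event, or the number of days until the next visit.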
By playing these games millions of times, HealthFormer learned the hidden patterns of human health. It learned that certain codes often appear together, and that time gaps are crucial clues.
3. The Magic of "Fine-Tuning"
Once the librarian has read millions of books and learned the patterns, it becomes a universal expert.
If you want to predict Colorectal Cancer, you don't need to build a new computer from scratch. You just take this smart librarian, show it a few examples of cancer patients, and say, "Hey, look for these specific patterns." The librarian instantly adapts.
The paper tested this by trying to predict two types of cancer (Colon and Prostate) 30, 60, and 90 days before they were officially diagnosed.
- The Result: HealthFormer was significantly better than traditional methods (like simple math models that just count how many times a patient visited a doctor). It caught the signs of cancer much earlier and more accurately.
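The fine-tuning step can be pictured as keeping the pretrained "librarian" frozen and training only a small prediction head on a few labeled examples. The sketch below is purely illustrative: the encoder is a toy code-counting stand-in (the real model outputs a learned embedding), and the codes, labels, and logistic head are invented for the example, not taken from the paper.

```python
import numpy as np

VOCAB = ["FEVER", "RASH", "ANTIBIOTIC", "COLONOSCOPY", "ANEMIA", "CHECKUP"]

def pretrained_encoder(history):
    """Stand-in for the pretrained encoder: turns a patient's code sequence
    into a fixed-size vector (here just normalized code counts)."""
    vec = np.zeros(len(VOCAB))
    for code in history:
        if code in VOCAB:
            vec[VOCAB.index(code)] += 1.0
    return vec / max(len(history), 1)

def finetune_head(histories, labels, lr=1.0, steps=500):
    """Train only a small logistic 'head' on top of the frozen encoder:
    the cheap, task-specific part of fine-tuning."""
    X = np.stack([pretrained_encoder(h) for h in histories])
    y = np.asarray(labels, dtype=float)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted risk per patient
        w -= lr * X.T @ (p - y) / len(y)          # gradient step on log-loss
        b -= lr * (p - y).mean()
    return w, b

def predict_risk(history, w, b):
    return 1.0 / (1.0 + np.exp(-(pretrained_encoder(history) @ w + b)))

# Toy labeled examples: histories with ANEMIA + COLONOSCOPY labeled positive.
train = [(["CHECKUP", "FEVER"], 0), (["RASH", "ANTIBIOTIC"], 0),
         (["ANEMIA", "COLONOSCOPY"], 1), (["COLONOSCOPY", "ANEMIA", "FEVER"], 1)]
w, b = finetune_head([h for h, _ in train], [y for _, y in train])
print(predict_risk(["ANEMIA", "COLONOSCOPY"], w, b))  # high risk
print(predict_risk(["CHECKUP"], w, b))                # low risk
```

The point of the sketch is the division of labor: the expensive pattern-learning happened once during pretraining, and each new prediction task only needs this small head trained on a handful of labeled patients.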
Why This Matters (The "So What?")
- It respects the messiness: Real life isn't a neat spreadsheet. People get sick at weird times. HealthFormer handles the chaos naturally.
- It understands context: It knows that a "Surgery" code means something different if it's followed by "Recovery" a week later versus "Complication" a day later.
- It's a general tool: Once trained, it can be used for any prediction task (predicting heart failure, predicting hospital readmission, etc.) without needing to be rebuilt from the ground up.
In a nutshell: HealthFormer is an AI that learned to read the complex, irregular, and messy story of human health by understanding both the individual chapters (visits) and the time gaps between them, allowing doctors to spot serious illnesses like cancer much earlier than before.