Generating Counterfactual Patient Timelines from Real-World Data

This paper demonstrates that an autoregressive generative model trained on real-world data from over 300,000 patients can successfully generate clinically plausible counterfactual patient timelines, accurately reproducing known clinical patterns such as the impact of age, inflammation, and kidney function on COVID-19 outcomes.

Yu Akagi, Tomohisa Seki, Toru Takiguchi, Hiromasa Ito, Yoshimasa Kawazoe, Kazuhiko Ohe

Published 2026-04-06

Imagine you are a doctor standing at a crossroads. A patient is sick, and you have to decide: Do I give them this medicine? Do I send them home? What if they were 10 years older? What if their fever was higher?

In the real world, you can only take one path. You can't go back in time to try a different route and see what would have happened. This is the problem of counterfactuals: reasoning about "what if" scenarios.

This paper introduces a new kind of AI Time Machine that helps doctors explore these "what if" scenarios without risking a real patient's life.

The Core Idea: The "Medical Story Generator"

Think of a patient's medical history not as a boring spreadsheet, but as a long, complex story.

  • Chapter 1: The patient arrives with a cough.
  • Chapter 2: They get a blood test.
  • Chapter 3: The doctor prescribes a pill.
  • Chapter 4: They get better (or worse).

The researchers built an AI that learned how to write these stories. They fed it 400 million pages of real medical stories from over 300,000 patients. The AI didn't just memorize them; it learned the rhythm and logic of how diseases progress and how doctors react.

It's like teaching a child to write by letting them read every book in a library. Eventually, the child can write a new story that sounds exactly like the real ones, even if they've never seen that specific plot before.
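In miniature, "learning the rhythm of the stories" is next-event prediction over tokenized timelines. Here is a toy sketch, assuming each event is a single token, with the model shrunk down to simple transition counts (the actual system is a neural autoregressive model trained on hundreds of millions of events, not a counter):

```python
from collections import Counter, defaultdict

# Hypothetical toy timelines: each patient's "story" as a sequence of event
# tokens. (Illustrative only; the real vocabulary covers diagnoses, labs,
# prescriptions, and more.)
timelines = [
    ["cough", "blood_test", "antibiotic", "recovered"],
    ["cough", "blood_test", "antibiotic", "recovered"],
    ["fever", "blood_test", "antiviral", "recovered"],
]

# "Autoregressive" in miniature: estimate P(next event | previous event)
# by counting transitions -- the next-token objective, scaled way down.
transitions = defaultdict(Counter)
for story in timelines:
    for prev, nxt in zip(story, story[1:]):
        transitions[prev][nxt] += 1

def most_likely_next(event):
    """Return the most frequent continuation seen after `event`, or None."""
    counts = transitions[event]
    return counts.most_common(1)[0][0] if counts else None

print(most_likely_next("blood_test"))  # → antibiotic
```

The real model conditions on the entire history, not just the last event, which is what lets it capture long-range plot logic like "this drug was stopped because of a lab result three chapters ago."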

The Experiment: The "What If" Game

To test if their AI was smart enough, the researchers used it on patients who had COVID-19 in 2023. They asked the AI to rewrite the stories of these patients by changing just one detail, like a director editing a movie script:

  1. The "Older" Edit: "What if this patient were 15 years older?"
  2. The "Sicker" Edit: "What if their inflammation (CRP) was much higher?"
  3. The "Kidney" Edit: "What if their kidney function was worse?"

Then, they let the AI generate the next 7 days of the story for each of these "what if" versions.
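The "edit one detail, then regenerate" step can be pictured as copying the patient record and overwriting a single field before handing it back to the model. A minimal sketch (all field names and values are hypothetical, and the 7-day generation step itself is left abstract):

```python
import copy

# Hypothetical patient snapshot at the edit point (illustrative values).
patient = {"age": 60, "crp_mg_l": 20.0, "egfr": 80.0}

def counterfactual(record, **edits):
    """Return a copy of the record with the given fields rewritten,
    leaving the factual record untouched (the 'movie script edit')."""
    twin = copy.deepcopy(record)
    twin.update(edits)
    return twin

older = counterfactual(patient, age=patient["age"] + 15)  # the "Older" edit
sicker = counterfactual(patient, crp_mg_l=150.0)          # the "Sicker" edit
renal = counterfactual(patient, egfr=25.0)                # the "Kidney" edit

# The real model would now generate the next 7 days of events for each twin;
# comparing those rollouts to the factual rollout is the counterfactual test.
```

The key property is that only one detail changes per twin, so any difference in the generated week can be attributed to that one edit.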

The Results: Did the AI Get It Right?

The AI acted like a seasoned doctor. When the researchers changed the inputs, the AI's predicted outcomes consistently matched real-world medical logic:

  • If the patient was older: The AI predicted a higher chance of death. (Makes sense: older bodies are more fragile).
  • If inflammation was high: The AI predicted the patient would stay in the hospital longer and was more likely to die. (Makes sense: high inflammation means a severe infection).
  • If kidney function was bad: The AI predicted the doctor would stop giving a specific drug called Remdesivir.
    • Why? Because that drug can hurt kidneys. The AI "learned" this rule from the hundreds of thousands of real stories it read, even though no one explicitly programmed it with a rulebook. It just figured out the pattern: Bad kidneys = No Remdesivir.
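A pattern like "bad kidneys = no Remdesivir" shows up as a gap in prescription rates between the two groups of generated timelines. A toy sketch with invented rollout data (the real study samples thousands of timelines from the model, not eight hand-written pairs):

```python
# Hypothetical generated rollouts: (kidney-function bucket, was Remdesivir
# given?) pairs, standing in for model-sampled counterfactual timelines.
rollouts = [
    ("normal", True), ("normal", True), ("normal", False), ("normal", True),
    ("low", False), ("low", False), ("low", True), ("low", False),
]

def prescribe_rate(bucket):
    """Fraction of rollouts in this kidney-function bucket where the
    drug was given."""
    given = [prescribed for b, prescribed in rollouts if b == bucket]
    return sum(given) / len(given)

print(prescribe_rate("normal"))  # → 0.75
print(prescribe_rate("low"))     # → 0.25
```

If the low-kidney-function counterfactuals show a markedly lower rate, the model has absorbed the clinical rule from data alone.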

Why Is This a Big Deal?

1. The "In-Silico" Clinical Trial
Usually, to test a new treatment or understand a risk, we need to run expensive, slow, and sometimes risky clinical trials on real humans. This AI allows us to run "In-Silico" (computer-based) trials. We can simulate thousands of "what if" scenarios in minutes to see what might happen, helping doctors make better decisions before they ever treat a real person.

2. It Learned Without a Teacher
The AI wasn't taught with a textbook. It wasn't told, "If CRP goes up, mortality goes up." It learned this on its own by reading the data. This is called Self-Supervised Learning. It's like a student who learns physics just by watching how balls fall, without ever opening a physics book.

3. Personalized Medicine
Imagine a doctor saying to a patient: "Based on your age and blood work, if we don't start treatment today, the AI predicts your risk of staying in the hospital for a week goes up by 20%." This tool could help doctors give highly personalized advice.
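A statement like "your risk goes up by 20%" can be read as a Monte Carlo estimate over generated timelines: sample many weeks with and without the edit, then compare the outcome rates. A minimal sketch, where `simulate_week` is a made-up stand-in for the real generative model and its probabilities (including the 20-point gap) are invented purely for illustration:

```python
import random

random.seed(0)  # reproducible toy run

def simulate_week(treated):
    """Hypothetical stand-in for the generative model: samples whether one
    generated 7-day timeline ends in a week-long hospital stay. The
    probabilities here are made up for illustration."""
    p = 0.30 if treated else 0.50
    return random.random() < p

def week_long_stay_risk(treated, n=10_000):
    """Monte Carlo estimate: fraction of sampled timelines with the outcome."""
    return sum(simulate_week(treated) for _ in range(n)) / n

with_treatment = week_long_stay_risk(treated=True)
without_treatment = week_long_stay_risk(treated=False)
print(f"estimated risk change: {without_treatment - with_treatment:+.1%}")
```

Because each estimate is just a fraction of sampled stories, the doctor also gets a natural sense of uncertainty: more samples, tighter estimate.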

The Catch (Limitations)

The authors are honest about the flaws:

  • It's a Simulator, Not a Crystal Ball: It predicts probabilities, not certainties.
  • It's Still New: They only tested it on COVID-19. They need to prove it works for cancer, heart disease, and other conditions.
  • Complexity: Changing multiple things at once (e.g., "What if they are older AND have bad kidneys AND take a different drug?") is still very hard for the AI to get right.

The Bottom Line

This paper shows that we are moving toward a future where AI can act as a "Flight Simulator" for medicine. Just as pilots practice in a simulator before flying a real plane, doctors might soon use these AI models to practice different treatment strategies on virtual patients, ensuring that when they treat real people, they are making the safest, most informed choices possible.
