Imagine you are trying to predict the future of a chaotic city. You have a map of every bus, taxi, and pedestrian moving around, but the traffic patterns change constantly. Sometimes a bus causes a traffic jam (excitation), sometimes a road closure stops everything (inhibition), and sometimes nothing happens at all.
For decades, scientists have built "traffic cops" (AI models) to predict these events. But there was a catch: you had to hire a new, specialized traffic cop for every single city. If you wanted to predict traffic in New York, you trained a model on New York data. If you wanted to predict it in Tokyo, you had to start from scratch and train a completely new model. It was slow, expensive, and inefficient.
This paper introduces a revolutionary new approach called FIM-PP (Foundation Inference Model for Point Processes). Think of it as training a super-intelligent, universal traffic detective who learns the rules of traffic, not just the specific traffic of one city.
Here is how it works, broken down into simple concepts:
1. The Problem: The "One-Size-Fits-None" Trap
Traditional AI models for event prediction (like when a tweet goes viral, when a stock trades, or when a neuron fires) are like custom-tailored suits. They fit one specific dataset perfectly but are useless on any other. If the data changes even slightly, the model breaks.
2. The Solution: The "Universal Detective"
The authors created a model that doesn't just memorize data; it learns the underlying physics of time and events.
- The Training Ground (Synthetic Data): Instead of just looking at real-world data (which is messy and limited), they created a massive, artificial universe in a computer. They simulated millions of different "worlds" with different rules:
  - Some worlds where events trigger more events (like a viral tweet).
  - Some where events stop other events (like a roadblock).
  - Some where events happen randomly (like raindrops).
  - They even invented weird, complex patterns the model had never seen before.
- The "In-Context" Superpower: Once the model was trained on this massive synthetic universe, it became a Foundation Model. This means it has a "common sense" about how time and events interact.
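Those synthetic "worlds" correspond to classic temporal point processes. As a minimal sketch, the "viral tweet" world can be simulated as a self-exciting (Hawkes) process via Ogata's thinning algorithm, and the "raindrops" world as a plain Poisson process. (This is an illustrative reconstruction with made-up parameter values, not the paper's actual simulators, which are far more varied.)

```python
import math
import random

def simulate_poisson(rate, horizon, rng):
    """The 'raindrops' world: events arrive at random, memoryless times."""
    t, events = 0.0, []
    while True:
        t += rng.expovariate(rate)
        if t > horizon:
            return events
        events.append(t)

def simulate_hawkes(mu, alpha, beta, horizon, rng):
    """The 'viral tweet' world: each event temporarily raises the rate of
    future events (exponential kernel), simulated by Ogata's thinning."""
    def intensity(t, events):
        return mu + sum(alpha * math.exp(-beta * (t - s)) for s in events)

    t, events = 0.0, []
    while True:
        # Between events the intensity only decays, so the current value
        # is a valid upper bound for the thinning step.
        lam_bar = intensity(t, events)
        t += rng.expovariate(lam_bar)
        if t > horizon:
            return events
        if rng.random() <= intensity(t, events) / lam_bar:
            events.append(t)  # accept: an event fires and boosts the rate

rng = random.Random(0)
poisson_events = simulate_poisson(rate=1.0, horizon=50.0, rng=rng)
hawkes_events = simulate_hawkes(mu=1.0, alpha=0.8, beta=1.2, horizon=50.0, rng=rng)
```

An "inhibition" world (the roadblock) works the same way, except each event subtracts from the rate instead of adding to it.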
3. How It Works in Real Life: The "Context" Trick
Here is the magic part. When you want to use this model on a new real-world problem (like predicting taxi rides in London), you don't retrain it. You just show it a few examples of the current situation.
- The Analogy: Imagine you are a detective. You haven't seen a specific crime before. But you walk into the room, look at the clues (the "context" of recent events), and your brain instantly says, "Ah, this looks like the pattern I studied in my training. The next move is likely X."
- Zero-Shot Learning: The model can often make accurate predictions immediately, without any extra training. It's like a chef who has tasted every spice in the world and can instantly guess the recipe of a new dish just by smelling it.
- Few-Shot Learning: If the new situation is very strange, the model can be "fine-tuned" in just a few minutes (instead of days) to adapt perfectly.
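The workflow above can be sketched in plain Python. The real model is a trained network that reads the context events; here a deliberately tiny stand-in (my own illustrative baseline, not the paper's architecture) just averages the inter-event gaps in the context window and extrapolates one step. The point is the shape of the interaction: no retraining loop, no gradients, only a window of recent events fed in as context.

```python
def predict_next_event(context_times, query_time):
    """Zero-shot-style prediction from context alone: estimate the average
    inter-event gap in the context window and extrapolate one step ahead.
    (A stand-in for a trained foundation model; no fitting, no gradients.)"""
    gaps = [b - a for a, b in zip(context_times, context_times[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return query_time + mean_gap

# "Show it a few examples of the current situation" -- say, recent taxi pickups.
context = [0.0, 1.9, 4.1, 6.0, 8.05]
prediction = predict_next_event(context, query_time=8.05)  # next pickup ~10.06
```

A foundation model replaces the hard-coded averaging with learned "common sense" about event dynamics, but the calling pattern (context in, prediction out) is the same.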
4. What Can It Do?
The paper tested this "Universal Detective" on five different real-world scenarios:
- Taxi Rides: Predicting when and where taxis will be picked up or dropped off.
- Online Shopping: Guessing what a user will buy next on a site like Amazon or Taobao.
- Social Media: Predicting the next retweet on Twitter.
- Stack Overflow: Guessing when a user will earn a new badge.
The Result?
- Without extra training (Zero-Shot): It performed just as well as the best specialized models that had been trained specifically for those tasks.
- With a little extra training (Fine-tuned): It became the best model in the room, beating all the specialized competitors.
5. Why This Matters
This is a huge shift in how we do AI.
- Before: "I have a new dataset? Okay, let me spend 4 hours training a new model from scratch."
- Now: "I have a new dataset? Let me feed it to the Foundation Model. It understands the rules of time. It's ready in seconds."
The Bottom Line
The authors have built the first "Google Translate" for time-based events. Just as Google Translate learned the rules of language so it could translate any language without needing a new dictionary for each one, FIM-PP has learned the rules of time so it can predict any sequence of events, from stock markets to social media, instantly and accurately.
It turns the complex math of "Temporal Point Processes" into a tool that is flexible, fast, and ready for the real world.