Aurora: Towards Universal Generative Multimodal Time Series Forecasting

Imagine you are trying to predict the weather. If you only look at the temperature graph from the last week, you might guess it will keep raining. But what if you also read a news headline saying, "A massive cold front is sweeping in from the Arctic"? Suddenly, your prediction changes. You realize the rain might turn into a blizzard.

This is the core problem the Aurora paper sets out to solve.

The Problem: The "Blind" Forecaster

For a long time, computer models for predicting time series (like stock prices, traffic, or weather) have been like blindfolded chefs. They can taste the ingredients (the past numbers) and guess the flavor of the soup (the future numbers).

However, they often fail when the "recipe" changes.

Scenario A: A traffic graph looks like a busy morning commute.
Scenario B: A traffic graph looks exactly the same, but it's actually a parade route.

If the model only sees the numbers, it makes the same forecast in both cases, such as congestion that clears once rush hour ends. But in Scenario B, the slowdown is caused by a parade and will follow a different course. The model fails because it doesn't know the context (for example, a text description that mentions the parade).

The Solution: Aurora, the "Multimodal Detective"

The authors introduce Aurora, the first "Foundation Model" that can see, read, and predict all at once. Think of Aurora not as a blind chef, but as a super-detective who has three tools:

The Time Lens: Looks at the numbers (the past data).
The Reading Glasses: Reads the text descriptions (e.g., "NVIDIA announced a partnership," or "A flood warning is in effect").
The Camera: Looks at images generated from the data (which show the shape and patterns of the numbers).

How It Works (The Magic Trick)

1. The "Distillation" (Finding the Clues)

Aurora doesn't read every word in a 10-page report or look at every pixel in an image. That would be too slow.

Analogy: Imagine you are a detective summarizing a 500-page case file. You don't read every word; you extract the key clues.
Aurora does this: It uses "Token Distillation" to ignore the boring stuff and focus only on the critical words in the text or the most important shapes in the image that actually affect the future.

2. The "Guided Attention" (Listening to the Right Voice)

Once Aurora has the clues, it needs to decide how much weight to give them.

Analogy: Imagine you are driving. Your eyes (the data) see the road, but your GPS (the text) says, "Road closed ahead."
Aurora's "Modality-Guided Attention": This is like a smart co-pilot. It tells the model, "Hey, the numbers look normal, but the GPS says 'Road Closed,' so pay attention to the end of the road, not the beginning." It forces the model to focus on the parts of the history that match the new information.

3. The "Prototype Bank" (The Crystal Ball)

This is the most creative part. When predicting the future, diffusion-style generative models start with a blank slate (random noise) and try to guess the shape.

Analogy: Imagine you are trying to draw a picture of a future storm. Instead of starting with a blank white paper, you start with a stencil of a storm.
Aurora's "Prototype Bank": It has a library of 1,000 "future shapes" (prototypes) like "sudden spike," "slow decline," or "steady cycle." Based on the text and images, it picks the best stencil (prototype) to start with.
Flow Matching: Then, it gently morphs that stencil into the final prediction. According to the authors, this makes generation more efficient and accurate than starting from pure noise.

Why Is This a Big Deal?

Most current models are specialists.

One model is great at electricity prices but terrible at stock markets.
Another model needs you to re-train it every time you change the topic.

Aurora is a "Generalist" (Zero-Shot):
You can show it a dataset it has never seen before (like a new type of sensor data), give it a text description, and it will say, "Ah, this looks like a 'sudden drop' pattern I've seen in other contexts. Here is my prediction."

The Results

The paper tested Aurora on 5 major benchmarks (like TimeMMD and TSFM-Bench).

The Score: Aurora beat the previous "State-of-the-Art" models by a clear margin, with reported average error reductions ranging from about 13% to 38% depending on the baseline and task.
The Versatility: It works whether you give it text, images, or just numbers. It works for deterministic forecasts (one exact answer) and probabilistic forecasts (a range of possibilities with confidence levels).

In a Nutshell

Aurora is like upgrading from a calculator to a consultant.

Old Models: "The numbers went up, so I predict they will go up more."
Aurora: "The numbers went up, but the text says 'market saturation,' and the image shows a plateau. Therefore, I predict they will level off."

It's a universal tool for decision-making that understands that context is king.

What follows is a more detailed technical summary of the paper "Aurora: Towards Universal Generative Multimodal Time Series Forecasting".

1. Problem Statement

Time series forecasting faces a critical challenge in cross-domain generalization. While historical data patterns often appear similar across different domains (e.g., traffic flow vs. stock prices), the underlying future trends can diverge significantly due to domain-specific characteristics.

Limitations of Unimodal Foundation Models: Existing foundation models (e.g., Sundial, Chronos) are trained on massive time-series corpora but lack explicit access to domain-specific knowledge contained in auxiliary modalities (text, images). Consequently, they struggle to adapt when similar historical patterns lead to different futures based on external context.
Limitations of End-to-End Multimodal Models: Current multimodal supervised models (e.g., GPT4MTS, CALF) integrate text or image data but are typically tailored for specific end-to-end training scenarios. They lack the ability to perform zero-shot inference on unseen cross-domain tasks and often fail to support generative probabilistic forecasting effectively.

The paper argues that the next generation of time series models must be multimodal foundation models capable of zero-shot cross-domain inference by explicitly leveraging domain knowledge from text and images.

2. Methodology: The Aurora Architecture

Aurora is the first Multimodal Time Series Foundation Model designed for generative probabilistic forecasting. It is pretrained on a Cross-Domain Multimodal Time Series Corpus containing time series data paired with sample-wise text descriptions and endogenous images. The architecture has two main components, an encoder and a decoder:

A. Aurora Encoder: Cross-Modality Fusion

The encoder processes three modalities: Time Series, Text, and Images.

Multimodal Tokenization:
- Time Series: Processed via Instance Normalization and non-overlapping Patching (similar to PatchTST).
- Images: Endogenous images are generated by rendering the time series into 2D structures based on periodicity, then resized for ViT input.
- Text: Tokenized using a BERT vocabulary.
Token Distillation: To handle information redundancy, the model employs VisionDistiller and TextDistiller. These use learnable query vectors (semantic clustering centroids) and Multi-head Cross-Attention to compress the raw image and text tokens into distilled, key-information tokens ( $K_{image}$ and $K_{text}$ ).
Modality-Guided Multi-head Self-Attention (MG-MSA): This is the core innovation for temporal modeling.
- Instead of standard self-attention, Aurora uses TextGuider and VisionGuider to compute correlations between the time series tokens and the distilled text/image tokens.
- These correlations are fused into a Correlation Matrix ( $Corr$ ) which bridges the time series modality with external domain knowledge.
- The $Corr$ matrix is injected into the Self-Attention mechanism ( $S = (Q \cdot K^T + Corr) / \sqrt{d}$ ), guiding the model to focus on specific time steps that align with the domain context provided by text and images.
Modality Fuser: The final temporal representations are fused with the distilled text and image features via Cross-Attention to create a unified multimodal representation ( $X_{fuse}$ ).
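
The encoder steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single attention head, random weights in place of trained ones, and random vectors in place of real BERT/ViT token embeddings. The function names (`patchify`, `distill`, `guide`, `mg_self_attention`) are hypothetical, and the way `guide` builds the correlation matrix (two time steps are correlated when they attend to the same distilled tokens) is an assumption made only so that $Corr$ has the right shape to be added to $Q \cdot K^T$.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def patchify(series, patch_len):
    """Instance-normalise a 1D series and cut it into non-overlapping patches."""
    normed = (series - series.mean()) / (series.std() + 1e-8)
    n = len(normed) // patch_len
    return normed[: n * patch_len].reshape(n, patch_len)

def distill(tokens, queries):
    """Compress (n_raw, d) tokens into (n_q, d) tokens via cross-attention
    from learnable queries (here: random stand-ins)."""
    d = tokens.shape[-1]
    attn = softmax(queries @ tokens.T / np.sqrt(d))  # (n_q, n_raw)
    return attn @ tokens

def guide(ts_tokens, distilled):
    """ASSUMPTION: build an (n_ts, n_ts) correlation from each time token's
    affinity to the distilled tokens. The paper's guider may differ."""
    d = ts_tokens.shape[-1]
    affinity = softmax(ts_tokens @ distilled.T / np.sqrt(d))  # (n_ts, n_q)
    return affinity @ affinity.T

def mg_self_attention(ts_tokens, corr, Wq, Wk, Wv):
    """Self-attention with scores S = (Q K^T + Corr) / sqrt(d)."""
    d = Wq.shape[1]
    Q, K, V = ts_tokens @ Wq, ts_tokens @ Wk, ts_tokens @ Wv
    scores = (Q @ K.T + corr) / np.sqrt(d)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 16
series = np.sin(np.linspace(0, 8 * np.pi, 64)) + 0.1 * rng.normal(size=64)
patches = patchify(series, patch_len=8)               # (8, 8)
ts_tokens = patches @ rng.normal(size=(8, d))         # (8, d) patch embeddings

text_tokens = rng.normal(size=(40, d))    # stand-in for text token embeddings
image_tokens = rng.normal(size=(49, d))   # stand-in for ViT patch embeddings
k_text = distill(text_tokens, rng.normal(size=(4, d)))
k_image = distill(image_tokens, rng.normal(size=(4, d)))

# Fuse the two guidance signals (simple sum; the fusion rule is an assumption).
corr = guide(ts_tokens, k_text) + guide(ts_tokens, k_image)
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
out = mg_self_attention(ts_tokens, corr, Wq, Wk, Wv)
print(k_text.shape, corr.shape, out.shape)
```

The point of the sketch is the data flow: many raw text and image tokens shrink to a handful of distilled tokens, those tokens produce a time-step-by-time-step bias, and that bias shifts where self-attention looks before the softmax is applied.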

B. Aurora Decoder: Prototype-Guided Flow Matching

The decoder generates future tokens using a novel generative approach.

Condition Decoding: A Causal-Transformer and Cross-Transformer generate multimodal conditions ( $X_{cond}$ ) for the future horizon based on the fused representations.
Prototype Bank & Retrieval:
- A Prototype Bank contains $M$ learnable prototypes initialized with trigonometric, exponential, and polynomial bases to represent various periodic and trend patterns.
- A PrototypeRetriever (Transformer-based) takes text and image representations as input to retrieve a weighted combination of these prototypes ( $\tilde{P}$ ). This provides an intelligent starting point for generation, encoding the expected "future shape" (trend/periodicity) derived from domain knowledge.
Prototype-Guided Flow Matching:
- Unlike DDPM-style diffusion (which starts sampling from pure Gaussian noise), Aurora uses Flow Matching, which learns a velocity field and generates samples by integrating an ODE, starting from the retrieved prototype $\tilde{P}$ plus noise.
- The model learns a velocity field $v_\theta$ to map the initial prototype to the target ground truth.
- The objective minimizes the difference between the predicted velocity and the target velocity field, conditioned on the multimodal context. This simplifies the generation process and enhances stability.
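
The decoder idea can be illustrated with a toy NumPy sketch. Everything here is an assumption for illustration: the bank holds only seven fixed shapes (the real prototypes are learnable), random logits stand in for the PrototypeRetriever output, the interpolation path is the common linear one $x_t = (1 - t)\,x_0 + t\,x_1$ with target velocity $x_1 - x_0$, and an oracle velocity function replaces the trained network $v_\theta$. The function names are hypothetical.

```python
import numpy as np

def build_prototype_bank(horizon):
    """Toy bank initialised from trigonometric, exponential, polynomial bases."""
    t = np.linspace(0.0, 1.0, horizon)
    shapes = [np.sin(2 * np.pi * k * t) for k in (1, 2, 4)]   # trigonometric
    shapes += [np.exp(-3.0 * t), np.exp(3.0 * (t - 1.0))]     # exponential
    shapes += [t, t ** 2]                                     # polynomial
    return np.stack(shapes)                                   # (M, horizon)

def retrieve_prototype(bank, logits):
    """Softmax-weighted combination of prototypes (the retrieved starting shape)."""
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ bank

def flow_matching_pair(x0, x1, t):
    """Point on the linear path from x0 to x1 and its target velocity."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

def fm_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))

def euler_sample(x0, velocity_fn, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

rng = np.random.default_rng(0)
horizon = 24
bank = build_prototype_bank(horizon)
logits = rng.normal(size=bank.shape[0])        # stand-in for retriever scores
p_tilde = retrieve_prototype(bank, logits)

x0 = p_tilde + 0.1 * rng.normal(size=horizon)  # start: prototype plus noise
x1 = np.sin(2 * np.pi * np.linspace(0.0, 1.0, horizon)) + 0.5  # toy target

x_t, v_target = flow_matching_pair(x0, x1, t=0.3)

def oracle(x, t):
    # Perfect velocity field for this single pair; a trained v_theta would
    # approximate this from the multimodal condition instead.
    return x1 - x0

x_hat = euler_sample(x0, oracle)
print(fm_loss(oracle(x_t, 0.3), v_target), np.abs(x_hat - x1).max())
```

With the oracle velocity the Euler integration lands on the target, which shows why a good starting point helps: the closer the prototype is to the true future, the shorter the path the learned velocity field has to cover.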

3. Key Contributions

First Multimodal Time Series Foundation Model: Aurora pioneers the pretraining of a foundation model on a cross-domain multimodal corpus, enabling zero-shot inference across diverse scenarios.
Novel Encoding Mechanism: Introduction of Modality-Guided Self-Attention, which explicitly injects domain-specific knowledge from text and images into temporal feature extraction, significantly improving cross-domain adaptability.
Prototype-Guided Flow Matching: A new decoding strategy that replaces random noise initialization with retrieved prototypes containing trend and periodicity information. This enhances the efficiency and accuracy of generative probabilistic forecasting.
Comprehensive Performance: Aurora supports unimodal, multimodal, deterministic, and probabilistic forecasting tasks within a single unified framework.

4. Experimental Results

Aurora was evaluated on 5 well-recognized benchmarks: TimeMMD, TSFM-Bench, ProbTS, TFB, and EPF.

Multimodal Zero-Shot (TimeMMD): In the zero-shot setting, Aurora outperformed unimodal foundation models (Sundial, VisionTS) with average MSE reductions of 27.0% and 31.2%, respectively. In a few-shot setting using only 10% of the training data, it also surpassed full-shot multimodal supervised models (GPT4MTS, CALF), with average MSE reductions of 12.8% and 24.5%.
Unimodal Zero-Shot (TSFM-Bench & ProbTS): Even without text/image inputs (simulating modality absence), Aurora achieved state-of-the-art performance. It reduced MSE by 15.1% compared to Time-MoE and 22.9% compared to ROSE on deterministic tasks. On probabilistic tasks, it reduced CRPS by 21.5% vs. CSDI and 38.3% vs. MOIRAI.
Short-Term Forecasting (TFB & EPF): Aurora demonstrated superior performance on datasets with limited historical context, outperforming both foundation models and full-shot supervised models (e.g., TimeXer, iTransformer).
Ablation Studies: Removing the Modality-Guided Attention or the Prototype mechanism caused significant performance drops, confirming the necessity of both components.
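
For readers unfamiliar with the metrics quoted above, the sketch below shows one common way to compute them. It is an illustration, not the benchmarks' evaluation code: the CRPS estimator is the standard sample-based form $\mathbb{E}|X - y| - \tfrac{1}{2}\,\mathbb{E}|X - X'|$, the function names are hypothetical, and the numbers in the demo are made up rather than taken from the paper.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error of a deterministic forecast."""
    return float(np.mean((y_true - y_pred) ** 2))

def crps_from_samples(samples, y_true):
    """Sample-based CRPS, averaged over the horizon.

    samples: (n_samples, horizon) draws from the forecast distribution
    y_true:  (horizon,) observed values
    """
    term1 = np.mean(np.abs(samples - y_true), axis=0)
    pairwise = np.abs(samples[:, None, :] - samples[None, :, :])
    term2 = np.mean(pairwise, axis=(0, 1))
    return float(np.mean(term1 - 0.5 * term2))

def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

rng = np.random.default_rng(0)
y = np.linspace(0.0, 1.0, 12)
samples = y + 0.2 * rng.normal(size=(100, 12))   # illustrative forecast draws

print(mse(y, samples.mean(axis=0)))
print(crps_from_samples(samples, y))
print(relative_reduction(baseline=0.40, ours=0.30))  # illustrative values only
```

A statement such as "reduced MSE by 15.1% compared to Time-MoE" corresponds to `relative_reduction` applied to the two models' average MSE. Lower is better for both MSE and CRPS, and CRPS collapses to the mean absolute error when the forecast is a single point.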

5. Significance

Universal Applicability: Aurora bridges the gap between domain-specific knowledge and general time series modeling, offering a "plug-and-play" tool for decision intelligence in complex, cross-domain scenarios.
Generative Probabilistic Forecasting: By integrating Flow Matching with prototype retrieval, it provides robust uncertainty quantification, which is crucial for risk-sensitive applications like finance and healthcare.
Efficiency: The distillation and prototype-guided mechanisms allow the model to achieve high accuracy with fewer parameters and less training data compared to full-shot supervised baselines.
Open Science: The authors have released the code and model checkpoints, fostering reproducibility and further research in multimodal time series analysis.

In conclusion, Aurora marks a shift in time series forecasting from unimodal pattern recognition to multimodal, knowledge-guided generative modeling, aimed squarely at the cross-domain generalization problem.