MM-ISTS: Cooperating Irregularly Sampled Time Series Forecasting with Multimodal Vision-Text LLMs

This paper presents MM-ISTS, a multimodal framework that leverages vision-text large language models and a novel two-stage encoding mechanism to enhance irregularly sampled time series forecasting. It integrates temporal, visual, and textual modalities for improved pattern recognition and contextual understanding.

Zhi Lei, Chenxi Liu, Hao Miao, Wanghui Qiu, Bin Yang, Chenjuan Guo

Published 2026-03-09

Imagine you are trying to predict the weather for a small town, but you have a very strange problem: your weather station is broken. Sometimes it sends you a temperature reading, sometimes a wind speed, and sometimes it sends nothing at all. Worse, the readings arrive at random times—sometimes every hour, sometimes every three days, and sometimes two sensors report at the same time while others are silent.

This is the world of Irregularly Sampled Time Series (ISTS). It's how real-world data often looks in hospitals, traffic systems, and climate science. Most computer models are like students who only know how to study when the teacher hands them a neat, perfect schedule. If the schedule is messy, they get confused.

The paper introduces MM-ISTS, a new "super-student" that doesn't just look at the messy numbers; it looks at the numbers, sees a picture of them, and reads a story about them all at the same time.

Here is how MM-ISTS works, explained through simple analogies:

1. The Problem: The "Messy Notebook"

Traditional models try to force this messy data into a neat grid (like a spreadsheet with empty cells). But when you force a jagged, irregular timeline into a straight line, you lose the most important clues: when the data was missing and how long the gaps were. It's like trying to understand a conversation by only reading the words, ignoring the long pauses and who was speaking.

2. The Solution: The "Three-Pronged Detective"

MM-ISTS acts like a detective who uses three different tools to solve the case, rather than just one.

Tool A: The "Specialized Sketch" (Vision)

Instead of just looking at numbers, MM-ISTS turns the data into a 3-channel image (like an RGB photo).

  • Channel 1 (The Picture): Shows the actual values (e.g., the temperature).
  • Channel 2 (The Mask): Shows where the data is missing (like a "hole" in the picture).
  • Channel 3 (The Timeline): Shows the time gaps between readings.
  • The Analogy: Imagine a detective sketching a crime scene. They don't just write "gunshot at 2 PM." They draw the bullet hole, the blood spatter, and the distance between the victim and the shooter. This helps the AI "see" the irregularity rather than just calculating it.
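The three channels above can be sketched in code. The paper's exact rasterisation isn't spelled out here, so this is a minimal 1-D version under stated assumptions: the real model presumably renders richer 2-D plots per variable, and `series_to_image` and its arguments are illustrative names, not the paper's API.

```python
import numpy as np

def series_to_image(times, values, grid_size=32, v_min=0.0, v_max=1.0):
    """Rasterise one irregular series into a 3-channel 'image':
    channel 0 = normalised values, channel 1 = observation mask,
    channel 2 = time gap since the previous observation."""
    img = np.zeros((3, grid_size), dtype=np.float32)
    t_min, t_max = float(times.min()), float(times.max())
    span = max(t_max - t_min, 1e-9)
    prev_t = t_min
    for t, v in zip(times, values):
        col = int(round((t - t_min) / span * (grid_size - 1)))  # pixel index
        img[0, col] = (v - v_min) / max(v_max - v_min, 1e-9)    # the picture
        img[1, col] = 1.0                                       # the mask (a "hole" stays 0)
        img[2, col] = t - prev_t                                # the timeline (gap length)
        prev_t = t
    return img
```

Note that missing observations need no special handling: any pixel never written to keeps a mask of 0, so the irregularity is visible to the vision encoder rather than silently interpolated away.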

Tool B: The "Contextual Story" (Text)

The system also writes a text summary of the data. It says things like, "This sensor usually reports temperatures between 20°C and 30°C, but today it's missing 40% of its data."

  • The Analogy: This is like a detective reading the police report. The numbers tell you what happened, but the text tells you why it might be weird (e.g., "The sensor was broken during the storm"). This gives the AI "common sense" about the data.
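A minimal sketch of what such a summary generator might look like. The paper's actual prompt template is not given, so the wording and the `describe_series` helper are illustrative assumptions, not the real implementation.

```python
def describe_series(name, values, mask, unit="°C"):
    """Build a plain-language summary of one variable for the text
    branch: its typical range plus how much data is missing."""
    observed = [v for v, m in zip(values, mask) if m]
    missing_pct = 100.0 * (1 - len(observed) / len(mask))
    return (f"Sensor '{name}' usually reports between "
            f"{min(observed):.1f}{unit} and {max(observed):.1f}{unit}; "
            f"{missing_pct:.0f}% of today's readings are missing.")
```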

Tool C: The "Mathematical Brain" (The ISTS Encoder)

While the Vision and Text tools use a giant, pre-trained AI (a Multimodal Large Language Model) to understand the big picture, MM-ISTS also has a specialized math brain that focuses strictly on the patterns in the numbers.

  • The Analogy: The Vision/Text tools are like a wise old professor who knows the history of the town. The Math Brain is like a brilliant calculator who is great at spotting specific trends in the numbers. You need both.
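One classic way to build such a "math brain" for irregular gaps is a decaying hidden state, in the spirit of GRU-D. The paper's actual ISTS encoder is more elaborate; this toy version only shows the core idea that long silences weaken the influence of stale readings.

```python
import numpy as np

def decay_encoder(times, values, mask, tau=1.0):
    """Minimal time-aware encoder: the hidden state decays toward
    zero during gaps, then blends in each new real observation."""
    h, prev_t = 0.0, times[0]
    for t, v, m in zip(times, values, mask):
        h *= np.exp(-(t - prev_t) / tau)  # decay over the gap
        if m:                             # update only on real readings
            h = 0.5 * h + 0.5 * v
        prev_t = t
    return h
```

The longer the gap, the closer `exp(-gap / tau)` gets to zero, so after a three-day silence the encoder effectively forgets the last reading instead of trusting it blindly.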

3. The "Smart Filter" (Adaptive Query)

Here is the tricky part: the "Professor" (the giant AI) is huge and talks in thousands of tokens. The "Calculator" (the math brain) only needs a few key facts. If you try to feed the whole conversation to the calculator, it gets overwhelmed.

MM-ISTS uses a Smart Filter (Adaptive Query-Based Feature Extractor).

  • The Analogy: Imagine the Professor is giving a 3-hour lecture. The Calculator only needs a 5-minute summary. The Smart Filter is a brilliant secretary who listens to the Professor, picks out the exact 5 minutes relevant to the specific variable (e.g., "Temperature Sensor #4"), and hands that summary to the Calculator. It throws away the noise and keeps the gold.
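The secretary can be sketched as cross-attention with a handful of learned query vectors, in the style of Q-Former-like extractors. The paper's exact projection layers are omitted here, and the function name is an illustrative assumption.

```python
import numpy as np

def adaptive_query_extract(queries, llm_tokens):
    """A few learned query vectors (one set per variable) attend over
    the long MLLM token sequence and return one compact summary each:
    the '5-minute summary' of the '3-hour lecture'."""
    d = queries.shape[-1]
    scores = queries @ llm_tokens.T / np.sqrt(d)    # (n_queries, n_tokens)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over tokens
    return weights @ llm_tokens                     # (n_queries, d) summaries
```

Whatever the length of the lecture, the output size is fixed by the number of queries, which is what keeps the downstream "Calculator" from being overwhelmed.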

4. The "Traffic Controller" (Multimodal Alignment)

Finally, the system has to decide how much to trust the Professor versus the Calculator.

  • The Analogy: Imagine a traffic light.
    • If a sensor is working perfectly (lots of data), the light turns Green for the Calculator (the math brain). It trusts the numbers.
    • If a sensor is broken (lots of missing data), the light turns Green for the Professor (the multimodal AI). It says, "We don't have numbers, so let's use our general knowledge and the text description to guess what's happening."
    • This is called Modality-Aware Gating. It dynamically switches between trusting the hard data and trusting the "common sense" AI depending on how messy the data is.
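A toy version of the traffic light: in the real model the gate is learned end-to-end from the data, but a sigmoid of the missing-data ratio (a stand-in assumption) captures the switching behaviour described above.

```python
import numpy as np

def gated_fuse(numeric_feat, mllm_feat, missing_ratio, steepness=10.0):
    """Blend the numeric encoder and the MLLM branch: the more data
    is missing, the more weight shifts toward the MLLM features."""
    g = 1.0 / (1.0 + np.exp(-steepness * (missing_ratio - 0.5)))  # gate in (0, 1)
    return g * mllm_feat + (1.0 - g) * numeric_feat
```

With complete data (`missing_ratio=0`) the gate stays near 0 and the output tracks the numeric features; with a mostly-broken sensor (`missing_ratio=1`) it swings toward the multimodal features instead.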

Why is this a big deal?

Previous methods were like trying to fix a broken car with only a wrench. MM-ISTS brings a wrench, a diagnostic computer, and a mechanic's manual all at once.

  • It handles broken data better: It doesn't panic when data is missing; it uses the "picture" of the missingness and the "story" about the data to fill in the gaps.
  • It's smarter: By using a giant AI that has read millions of books (the LLM), it understands the context of the data, not just the math.
  • It's efficient: Even though it uses a giant AI, the "Smart Filter" ensures it doesn't waste time processing unnecessary information.

In short, MM-ISTS is a system that learns to "read" messy, broken time-series data by turning it into a picture, a story, and a math problem simultaneously, then letting a smart traffic controller decide which clue is most important for making a prediction.