UTICA: Multi-Objective Self-Distillation Foundation Model Pretraining for Time Series Classification

Imagine you are trying to teach a robot to recognize different types of music just by listening to short clips. Some songs are jazz, some are rock, and some are classical.

For a long time, the best way to teach this robot was Contrastive Learning. Think of this like a game of "Spot the Difference." You show the robot two clips: "This is Jazz" and "This is NOT Jazz." The robot learns by pushing the "Not Jazz" examples far away in its mind and pulling the "Jazz" examples closer together.

The Problem: In the world of time series (like heartbeats, stock prices, or weather data), this "Spot the Difference" game has a flaw. Imagine you show the robot two different heartbeats from two different people. They might look very similar because they are both healthy heartbeats. If the robot assumes they are "different" just because they come from different people, it gets confused. It starts thinking, "Wait, these look the same, but you told me they are different!" This creates confusion and stops the robot from learning the true patterns of a heartbeat.

Enter "Utica": The New Teacher-Student System

The authors of this paper propose Utica, a smarter way to teach the robot. Instead of playing "Spot the Difference," they use a Teacher-Student system, inspired by how humans learn from mentors.

Here is how it works, using simple analogies:

1. The Teacher and the Student

Imagine a Teacher (who is very wise and calm) and a Student (who is eager to learn).

The Student looks at a messy, noisy, or cut-up version of a time series (like a song with static or a missing verse).
The Teacher looks at the clean, full version of the same song.
The Student tries to guess what the Teacher sees. If the Student guesses right, they get a high score. The Teacher is not trained directly; instead, it is a slowly updated average of the Student, so it stays a steady guide while the Student improves. Over time, both become better at recognizing the patterns.

2. The Two Special Tricks (The "Secret Sauce")

The paper introduces two specific ways to mess with the data to make the Student smarter. Think of these as two different training drills:

Drill A: The "Zoom and Crop" (DINO Loss)
Imagine you have a long video of a bird flying.
- Global View: You show the Student a zoomed-out clip of the whole flight path.
- Local View: You also show the Student tiny, zoomed-in clips of just the bird's wings flapping.
- The Lesson: The Student learns that even if they only see a tiny part of the wing (local) or the whole flight path (global), it's still the same bird. This teaches the robot to recognize patterns no matter the scale or speed.

Drill B: The "Blindfold" (iBOT Loss)
Imagine you show the Student a sentence, but you cover up 50% of the words with black boxes.
- The Lesson: The Student has to guess what the missing words were based on the context of the words they can see.
- Why it helps: This forces the robot to understand the structure and details of the data, not just the general vibe. It learns that if it sees "The sky is..." it should expect "blue," not "banana."

3. The "Uniformity" Rule (KoLeo Loss)

There's a third rule to make sure the robot doesn't get lazy. Sometimes, a student might cheat by just saying "Everything is the same!" to get a passing grade.

The KoLeo Loss is like a strict coach who says, "You can't just group everything together. You need to spread your knowledge out." It forces the robot to keep different types of data distinct in its memory, ensuring it doesn't collapse into a boring, one-size-fits-all answer.

The Results: Why It Matters

The authors tested this new "Utica" robot on two huge libraries of time series data (UCR and UEA), which include everything from earthquake sensors to medical heart monitors.

The Old Way (Contrastive): The robot was good, but sometimes confused by similar-looking data.
The New Way (Utica): The robot became the champion. It beat all the previous top models in both "frozen" mode (where it just uses what it learned) and "fine-tuned" mode (where it adapts to a specific new task).

The Big Picture

This paper is a breakthrough because it says: "Stop fighting against the data; start understanding its structure."

By moving away from the "Spot the Difference" game and using a "Teacher-Student" approach that looks at both the big picture and the tiny details, we can build AI that understands time series data much better. This means better tools for:

Doctors: Detecting heart issues earlier.
Engineers: Predicting when a machine will break before it happens.
Scientists: Understanding complex weather patterns.

In short, Utica is a new, smarter way to teach AI to listen to the world's rhythms without getting confused by the noise.

1. Problem Statement

The paper addresses the limitations of current Time Series Foundation Models (TSFMs), particularly regarding their suitability for classification tasks (e.g., fault detection, medical diagnostics) versus forecasting.

Limitations of Existing Methods:
- Forecasting-Oriented Objectives: Most TSFMs use autoregressive or masked reconstruction objectives. These prioritize local temporal consistency but often fail to capture global semantic structures vital for classification.
- Contrastive Learning Flaws: While contrastive learning (e.g., Mantis) has been successful, it relies on the assumption that different samples within a batch are semantically distinct. In time series, samples often share similar dynamics or frequency content, leading to "false negatives" that degrade representation quality.
- Single-View Self-Distillation: Previous self-distillation approaches (e.g., Pieper et al., NuTime) rely on a single view-generation strategy (either masking only or global crops only), which limits the model's ability to learn both invariant global features and fine-grained local structures simultaneously.

2. Methodology: UTICA

The authors propose Utica, a pretraining framework that adapts the DINOv2 self-distillation paradigm (successful in computer vision) to time series data. It utilizes a Student-Teacher framework with a multi-objective loss function.

A. Architecture & Backbone

Backbone: A Transformer encoder based on the Mantis architecture.
Tokenization: Inputs are processed using three complementary transformations:
1. Instance-normalized series.
2. First-order difference of the series (to reduce non-stationarity such as trends).
3. Patch-level encodings of mean and standard deviation.
Embedding: These are concatenated, projected to dimension $D=256$, and processed through 6 Transformer layers with a learnable [CLS] token and sinusoidal positional encodings.
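
To make the tokenization step concrete, here is a minimal NumPy sketch of the three transformations and the projection to $D=256$. It is an illustration under assumptions, not the authors' implementation: the function name `patch_features`, the choice of 32 patches, the series length of 512, and the random projection matrix `W` (standing in for a learned layer) are all hypothetical.

```python
import numpy as np

def patch_features(x, num_patches=32, eps=1e-8):
    """Build per-patch features from a 1-D series (illustrative only).

    Assumes len(x) is divisible by num_patches; num_patches=32 is an assumption.
    """
    x = np.asarray(x, dtype=np.float64)
    # 1. Instance normalization: zero mean, unit variance for this series.
    x_norm = (x - x.mean()) / (x.std() + eps)
    # 2. First-order difference; prepending the first value keeps the length.
    x_diff = np.diff(x_norm, prepend=x_norm[0])
    # 3. Patch-level mean and standard deviation of the raw series.
    raw = x.reshape(num_patches, -1)
    stats = np.stack([raw.mean(axis=1), raw.std(axis=1)], axis=1)
    # Concatenate the three views into one feature vector per patch.
    return np.concatenate(
        [x_norm.reshape(num_patches, -1), x_diff.reshape(num_patches, -1), stats],
        axis=1,
    )

rng = np.random.default_rng(0)
series = rng.normal(size=512).cumsum()           # toy random-walk series
tokens = patch_features(series)                  # one row per patch
W = rng.normal(scale=0.02, size=(tokens.shape[1], 256))  # stand-in for a learned projection
embedded = tokens @ W
print(embedded.shape)  # (32, 256)
```

In a real model the projection would be learned, and the [CLS] token and positional encodings would be added before the Transformer layers.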

B. Data Generation (Pretraining)

Synthetic Data: Following recent findings, Utica is pretrained entirely on synthetic data generated via a Causal Directed Acyclic Graph (DAG).
Generation Process: Root nodes are sampled from Gaussian Processes with non-stationary means and random covariance kernels. Non-root nodes are generated via weighted sums of parents followed by random non-linearities. This ensures diverse temporal dynamics without relying on scarce labeled real-world data.
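
The generation process can be sketched roughly as follows. This is a simplified illustration, not the paper's generator: the RBF kernel with a random lengthscale, the linear trend used as a non-stationary mean, the set of non-linearities, and all function names and sizes are assumptions chosen for brevity.

```python
import numpy as np

def sample_gp(rng, length, lengthscale):
    """Draw one series from a GP with an RBF kernel plus a random linear trend."""
    t = np.linspace(0.0, 1.0, length)
    cov = np.exp(-0.5 * ((t[:, None] - t[None, :]) / lengthscale) ** 2)
    cov += 1e-6 * np.eye(length)      # jitter for numerical stability
    trend = rng.normal() * t          # assumed form of the non-stationary mean
    return trend + rng.multivariate_normal(np.zeros(length), cov)

def sample_dag_series(rng, num_nodes=5, num_roots=2, length=128):
    """Generate one series per node of a random DAG (hypothetical sizes)."""
    nonlinearities = [np.tanh, np.sin, lambda z: np.maximum(z, 0.0)]
    nodes = []
    for i in range(num_nodes):
        if i < num_roots:
            # Root node: sampled from a GP with a random kernel lengthscale.
            nodes.append(sample_gp(rng, length, rng.uniform(0.05, 0.5)))
        else:
            # Parents are drawn only from earlier nodes, so the graph stays acyclic.
            k = rng.integers(1, i + 1)
            parents = rng.choice(i, size=k, replace=False)
            weights = rng.normal(size=k)
            mix = sum(w * nodes[p] for w, p in zip(weights, parents))
            f = nonlinearities[rng.integers(len(nonlinearities))]
            nodes.append(f(mix))      # weighted sum followed by a random non-linearity
    return np.stack(nodes)

rng = np.random.default_rng(0)
batch = sample_dag_series(rng)
print(batch.shape)  # (5, 128)
```

Each row of `batch` is one synthetic series; sampling many such graphs would yield a pretraining corpus without any labels.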

C. Student-Teacher Framework

Teacher: Updated via Exponential Moving Average (EMA) of Student weights.
Student: Updated via gradient descent.
Views:
- Teacher: Processes only global views (random crops resized to 512).
- Student: Processes all views, including global crops and local crops (smaller segments resized to 256), plus masked versions of global views.
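
A small sketch may help show how the views and the Teacher update fit together. The crop-size ranges, the linear-interpolation resizing, the EMA momentum of 0.99, and all function names are assumptions for illustration; the source only specifies the output lengths (512 and 256), the crop counts (2 global, 8 local), and that the Teacher is an EMA of the Student.

```python
import numpy as np

def resize(x, length):
    """Linearly interpolate a 1-D series to a fixed length."""
    src = np.linspace(0.0, 1.0, len(x))
    dst = np.linspace(0.0, 1.0, length)
    return np.interp(dst, src, x)

def random_crop(rng, x, min_frac, max_frac, out_len):
    """Cut a random contiguous segment and resize it to out_len."""
    frac = rng.uniform(min_frac, max_frac)
    size = max(2, int(frac * len(x)))
    start = rng.integers(0, len(x) - size + 1)
    return resize(x[start:start + size], out_len)

def make_views(rng, x, n_global=2, n_local=8):
    # Crop fractions below are assumed, not taken from the paper.
    global_views = [random_crop(rng, x, 0.5, 1.0, 512) for _ in range(n_global)]
    local_views = [random_crop(rng, x, 0.1, 0.4, 256) for _ in range(n_local)]
    return global_views, local_views

def ema_update(teacher, student, momentum=0.99):
    """teacher <- m * teacher + (1 - m) * student, parameter by parameter."""
    return {k: momentum * teacher[k] + (1.0 - momentum) * student[k] for k in teacher}

rng = np.random.default_rng(0)
x = rng.normal(size=1000).cumsum()
global_views, local_views = make_views(rng, x)
teacher_inputs = global_views                  # Teacher sees global views only
student_inputs = global_views + local_views    # Student sees every view

teacher = {"w": np.zeros(3)}                   # toy parameter dictionaries
student = {"w": np.ones(3)}
teacher = ema_update(teacher, student)
print(len(teacher_inputs), len(student_inputs))  # 2 10
```

Because the Teacher only ever moves a small step toward the Student, its targets change slowly, which is what makes it a stable reference during training.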

D. Multi-Objective Loss Function

The total loss $L$ is a weighted sum of three distinct objectives:
$L = L_{DINO} + L_{iBOT} + 0.1 L_{KoLeo}$

DINO Loss (Global/Local Alignment):
- Encourages invariance to temporal scale and noise.
- Computes cross-entropy between Student and Teacher [CLS] token distributions.
- Uses a multi-crop strategy: The Student sees 2 global crops and 8 local crops; the Teacher sees only 2 global crops.
- Prevents collapse using Sinkhorn-Knopp centering and temperature sharpening.
iBOT Loss (Local Patch Reconstruction):
- Encourages learning of dense local features.
- Applies patch-level masking (10%–70% ratio) to the Student's global views.
- The Student predicts the token distribution of masked patches, while the Teacher observes the unmasked original signal.
KoLeo Regularizer:
- Applies the Kozachenko-Leonenko differential entropy estimator to the Student's global [CLS] tokens.
- Encourages features to spread uniformly within the batch, which guards against representation collapse.
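
The three objectives above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: the Teacher output is centered by subtracting a batch mean (a stand-in for the Sinkhorn-Knopp centering the paper uses), the temperatures 0.1 and 0.04 are assumed values, and random arrays stand in for real network outputs. The same distillation function serves for both the [CLS]-level (DINO) and masked-patch (iBOT) terms.

```python
import numpy as np

def softmax(z, temp):
    z = z / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def log_softmax(z, temp):
    z = z / temp
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distill_loss(student_logits, teacher_logits, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between sharpened Teacher targets and Student predictions."""
    # Mean-centering here is a simplification, not Sinkhorn-Knopp.
    p_teacher = softmax(teacher_logits - center, t_teacher)
    return -(p_teacher * log_softmax(student_logits, t_student)).sum(axis=-1).mean()

def koleo_loss(z, eps=1e-8):
    """Negative mean log nearest-neighbor distance of L2-normalized features."""
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)       # ignore each point's distance to itself
    return -np.log(d.min(axis=1) + eps).mean()

rng = np.random.default_rng(0)
B, K, D = 8, 16, 4                    # toy batch size, prototype count, feature dim
cls_s, cls_t = rng.normal(size=(B, K)), rng.normal(size=(B, K))
patch_s, patch_t = rng.normal(size=(B * 5, K)), rng.normal(size=(B * 5, K))  # masked patches only
feats = rng.normal(size=(B, D))       # Student global [CLS] features

l_dino = distill_loss(cls_s, cls_t, cls_t.mean(axis=0))
l_ibot = distill_loss(patch_s, patch_t, patch_t.mean(axis=0))
l_koleo = koleo_loss(feats)
total = l_dino + l_ibot + 0.1 * l_koleo   # weighting from the loss formula
print(np.isfinite(total))  # True
```

The KoLeo term grows when features bunch together, since nearest-neighbor distances shrink, so minimizing it pushes the batch to spread out.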

3. Key Contributions

Non-Contrastive Paradigm for Time Series: Demonstrates that non-contrastive self-distillation (DINO-style) is a superior and complementary strategy to contrastive learning for time series classification, avoiding the "false negative" problem inherent in batch-based contrastive methods.
Multi-Objective Pretraining: Proposes a novel combination of multi-crop augmentation (for scale invariance) and patch masking (for local structure) within a single self-distillation framework.
Synthetic Pretraining Efficiency: Validates that high-performance TSFMs can be pretrained entirely on synthetic causal data, reducing dependency on large labeled datasets.
State-of-the-Art Performance: Achieves new SOTA results on standard benchmarks (UCR and UEA), supporting the effectiveness of the approach.

4. Experimental Results

The model was evaluated on the UCR (128 univariate datasets) and UEA (21 multivariate datasets) archives under two regimes: Linear Probing (frozen backbone) and Fine-tuning.

Baselines: Compared against Mantis (Contrastive), Moment (Masked Autoencoder), NuTime (Self-distillation), and GPT4TS.
Performance Highlights:
- UCR Linear Probing: Utica achieved 0.794 average accuracy (52/128 wins), outperforming Mantis (0.792) and Moment (0.779).
- UCR Fine-tuning: Utica reached 0.857 average accuracy (60 wins), ahead of Mantis (0.850).
- UEA Benchmarks: Utica achieved the best average rank in both settings (1.60 for linear probing, 1.50 for fine-tuning).
Ablation Study:
- Combining DINO and iBOT losses yielded 0.794 accuracy.
- Using iBOT alone dropped performance to 0.735.
- Using DINO alone dropped performance to 0.747.
- This confirms that global alignment and local reconstruction provide complementary supervision signals.

5. Significance

The paper establishes that non-contrastive self-distillation is a powerful, underexplored paradigm for time series foundation models. By transferring the DINOv2 self-distillation recipe to the temporal domain and combining global invariance with local structural learning, Utica sets a new benchmark for time series classification. This suggests a shift away from purely contrastive or forecasting-based pretraining toward multi-objective self-distillation for building robust, universal time series representations.
