VETime: Vision Enhanced Zero-Shot Time Series Anomaly Detection

Imagine you are a security guard watching a massive, 24-hour surveillance feed of a factory's machinery. Your job is to spot anything weird.

Sometimes, a machine makes a sudden, loud BANG (a Point Anomaly). Other times, the machine doesn't make a noise, but it starts vibrating in a weird, slow rhythm that lasts for hours (a Context Anomaly).

For a long time, security systems had to choose between two types of guards, and neither was perfect:

The "Microscope" Guard (1D Time Models): This guard looks at the data second-by-second. They are amazing at spotting the sudden BANG. But because they are so focused on the immediate second, they miss the slow, weird vibration that happens over an hour. They lack "big picture" vision.
The "Wide-Angle" Guard (Vision Models): This guard looks at the whole hour of footage at once, like a photograph. They are great at seeing the slow, weird vibration. But because they squint at a whole hour compressed into one image, they miss the tiny, sudden BANG. Also, when they try to zoom in to find the exact second of the BANG, the image gets blurry, and they can't pinpoint it.

The Dilemma: You need a guard who can see the whole picture clearly and spot the tiny details instantly.

Enter VETime: The "Super-Spy" Guard

The paper introduces VETime (Vision Enhanced Zero-Shot Time Series Anomaly Detection). Think of VETime as a super-spy who combines the best skills of both guards into one person. It doesn't just look at the numbers or the pictures; it learns to speak both languages fluently.

Here is how VETime works, using simple analogies:

1. The Magic Camera (Reversible Image Conversion)

Usually, turning a line graph (time series) into a photo (image) is like trying to fold a long piece of paper into a tiny square. You lose information, and the lines get messy.

VETime's Trick: It uses a special "Magic Camera." Instead of just squashing the data, it folds the time series into a 2D image in a very smart way. It separates the "Trend" (the big picture) from the "Noise" (the tiny details) and paints these two, together with the original signal, into the Red, Green, and Blue channels (like an RGB photo).
The Result: The resulting image is so rich in detail that if you look at it, you can see the weird vibrations and the sudden spikes. It's like taking a high-definition photo of a sound wave.

2. The Time-Traveler's Map (Patch-Level Temporal Alignment)

The problem with turning data into a photo is that the photo loses the "clock." In a photo, you don't know if a pixel happened at 1:00 PM or 1:05 PM.

VETime's Trick: It uses a "Time-Traveler's Map." It looks at the photo and says, "Okay, this red patch in the top-left corner corresponds exactly to the 5th second of the original data."
The Result: It forces the "Photo Brain" and the "Number Brain" to agree on the exact timeline. Now, the Vision model knows exactly when something happened, not just that it happened.

3. The Detective's Training (Anomaly Window Contrastive Learning)

How does the system learn what "weird" looks like without being shown thousands of examples?

VETime's Trick: It plays a game of "Spot the Difference."
- Local Game: It looks at a tiny window (a few seconds) and asks, "Does the picture match the numbers here?" If the numbers spike but the picture looks normal, it flags it.
- Global Game: It looks at a long window (an hour) and asks, "Does the overall shape of the trend look right?"
The Result: By constantly comparing the "Local" view with the "Global" view, the system learns to spot both the sudden BANG and the slow vibration simultaneously.

4. The Smart Manager (Task-Adaptive Multi-Modal Fusion)

Finally, VETime has a manager who decides which guard to listen to.

The Scenario: If the system needs to find a sudden spike, the manager says, "Listen to the Microscope Guard!" If it needs to find a slow trend change, the manager says, "Listen to the Wide-Angle Guard!"
The Result: The system dynamically switches its focus. It doesn't just average the two opinions; it picks the best expert for the specific job at that exact moment.

Why is this a big deal?

Zero-Shot Superpower: Most security guards need to be trained on your specific factory for weeks. VETime is like a genius who has studied every factory in the world. You can drop it into a brand-new factory, and it works immediately without any training.
Speed: Previous "Vision" methods were slow because they tried to process huge images. VETime is reported to run about 100 times faster than those vision-based competitors.
Precision: It doesn't just say, "Something is wrong between 1:00 and 2:00." It says, "Something is wrong at exactly 1:14:32."

In summary: VETime is the first system that successfully combines the "big picture" vision of a camera with the "fine-grained" precision of a stopwatch, allowing it to catch both sudden spikes and slow, drawn-out irregularities quickly and accurately.

1. Problem Statement

Time-Series Anomaly Detection (TSAD) faces a fundamental challenge in simultaneously identifying two distinct types of anomalies:

Point Anomalies: Instantaneous, abrupt numerical deviations (e.g., spikes).
Context Anomalies: Large-scale, contiguous irregularities in trends or periodicity.
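
The two anomaly types above can be illustrated on toy data. The following is a minimal sketch, assuming a noisy sine wave as the "normal" signal; the function names, magnitudes, and positions are hypothetical and are not taken from the paper's synthetic data pipeline.

```python
import numpy as np

def make_series(length=400, period=50, seed=0):
    """Toy 'normal' signal: a sine wave plus small Gaussian noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(length)
    return np.sin(2 * np.pi * t / period) + 0.05 * rng.standard_normal(length)

def inject_point_anomaly(x, index, magnitude=5.0):
    """Point anomaly: a single-step spike added at `index`."""
    y = x.copy()
    y[index] += magnitude
    return y

def inject_context_anomaly(x, start, end, period=50):
    """Context anomaly: x[start:end] is replaced by a wave at twice the
    frequency, so values stay in the normal range but the rhythm is wrong."""
    y = x.copy()
    t = np.arange(start, end)
    y[start:end] = np.sin(2 * np.pi * t / (period / 2))
    return y

x = make_series()
x_point = inject_point_anomaly(x, index=120)
x_context = inject_context_anomaly(x, start=200, end=300)

# Ground-truth labels if both anomalies were injected into one series.
labels = np.zeros(len(x), dtype=int)
labels[120] = 1
labels[200:300] = 1
```

A simple threshold on the raw values would flag the spike at step 120 but not the segment from 200 to 299, because no single value in that segment is out of range; only the shape over many steps is unusual.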

Existing approaches suffer from a "unimodal dilemma":

1D Temporal Models: Excel at fine-grained point localization due to local continuity but lack the global receptive field to detect long-range context anomalies.
2D Vision-Based Models: Capture global patterns well but suffer from information bottlenecks. Converting 1D sequences to fixed-size images (e.g., 224×224) causes temporal blurring, leading to coarse-grained detection windows that miss precise anomaly boundaries.
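
The temporal blurring described above can be made concrete with some back-of-the-envelope arithmetic. This sketch assumes a naive rendering in which the whole series is spread across the image width, and a ViT patch size of 16 pixels; both are illustrative assumptions, not details stated in the text.

```python
def steps_per_pixel(series_length, image_width=224):
    """Number of time steps that share one pixel column."""
    return series_length / image_width

def steps_per_patch(series_length, image_width=224, patch_size=16):
    """Number of time steps that share one column of ViT patches."""
    patches_per_row = image_width // patch_size  # 14 for 224 / 16
    return series_length / patches_per_row

for n in (224, 2240, 22400):
    print(n, steps_per_pixel(n), steps_per_patch(n))
```

Under these assumptions, a 22,400-step series packs 100 time steps into each pixel column and 1,600 into each patch column, which is why a detection made at patch level cannot pin an anomaly to a single time step.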

Furthermore, most models require dataset-specific training, making them impractical for zero-shot scenarios where data is scarce or deployment environments vary widely.

2. Methodology

The authors propose VETime, a framework that unifies temporal and visual modalities through fine-grained alignment and dynamic fusion. The architecture consists of four core components:

A. Reversible Image Conversion (RIC)

To transform 1D time series into information-dense 2D images without losing temporal fidelity:

Multi-Channel Intensity Mapping: The raw series $X$ is decomposed into Trend ($X_{trend}$) and Remainder ($X_{rem}$) components. These, along with the original series, are mapped to the R, G, and B channels respectively, creating a $1 \times L \times 3$ tensor.
Adaptive Folding: The 1D sequence is folded into a 2D grid based on estimated periodicity (using autocorrelation). The folding period is dynamically adjusted to be a multiple of the ViT patch size to prevent temporal discontinuity.
Dimension-Aware Scaling: The grid is scaled to a standard resolution (224×224). Crucially, linear interpolation is used along the time axis to preserve waveform continuity, while copy-padding is used along the period axis to avoid semantic distortion.
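
The first two steps above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the trend is estimated with a moving average (the text does not name the decomposition), the channel order is trend, remainder, original, each image row holds one period, the patch size is 16, and all function names are hypothetical. The final rescaling to 224×224 is omitted.

```python
import numpy as np

PATCH = 16  # assumed ViT patch size

def moving_average(x, window):
    """Centered moving average with edge padding (stand-in trend estimator)."""
    pad = window // 2
    padded = np.pad(x, (pad, window - 1 - pad), mode="edge")
    return np.convolve(padded, np.ones(window) / window, mode="valid")

def minmax(x):
    """Scale values to [0, 1] so they can be used as pixel intensities."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def to_channels(x, trend_window=25):
    """Build an (L, 3) array: trend, remainder, original (assumed order)."""
    trend = moving_average(x, trend_window)
    remainder = x - trend
    return np.stack([minmax(trend), minmax(remainder), minmax(x)], axis=-1)

def estimate_period(x, fallback=PATCH):
    """Lag of the largest autocorrelation value after the first negative lag."""
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    negative = np.flatnonzero(acf < 0)
    stop = len(x) // 2
    if negative.size == 0 or negative[0] >= stop:
        return fallback
    start = int(negative[0])
    return start + int(np.argmax(acf[start:stop]))

def snap_to_patch(period, patch=PATCH):
    """Round the period to the nearest positive multiple of the patch size."""
    return max(patch, int(round(period / patch)) * patch)

def fold(channels, period):
    """Fold (L, 3) into (rows, period, 3), repeating the last step as padding."""
    length = channels.shape[0]
    rows = -(-length // period)  # ceiling division
    padded = np.pad(channels, ((0, rows * period - length), (0, 0)), mode="edge")
    return padded.reshape(rows, period, channels.shape[1])

def unfold(image, length):
    """Invert fold(): flatten the rows and drop the padding."""
    return image.reshape(-1, image.shape[-1])[:length]

t = np.arange(500)
x = np.sin(2 * np.pi * t / 32)
channels = to_channels(x)
period = snap_to_patch(estimate_period(x))
image = fold(channels, period)
```

Because folding here is only a pad and a reshape, `unfold(image, len(x))` returns the original channel array exactly, which is the sense in which this toy conversion is reversible.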

B. Patch-Level Temporal Alignment (PTA)

This module bridges the structural gap between the 2D visual features and the 1D temporal features:

Visual features extracted from a frozen pre-trained ViT (MAE) are reshaped back to the 1D temporal domain by inverting the folding logic.
Temporal Ordering: Learnable positional encodings and self-attention layers are applied to the aligned visual features to recover temporal context and model intra-patch dependencies, so that the visual features ($F_V$) line up step by step with the temporal features ($F_{TS}$).
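
One way to picture the inverse-folding step is as an index lookup from time steps to patches. The sketch below assumes that each image row holds one period, that the folded grid was stretched to fill the full image, and that the ViT returns a square grid of patch embeddings; the function names are hypothetical, and the learnable positional encodings and self-attention layers that follow are not shown.

```python
import numpy as np

def time_to_patch(length, period, image_size=224, patch=16):
    """Map every time step to the (row, col) index of the patch that covers it."""
    t = np.arange(length)
    rows = -(-length // period)          # number of folded rows (ceiling)
    row, col = t // period, t % period   # position in the folded grid
    # Pixel coordinates after stretching the (rows, period) grid to the image.
    py = (row * image_size) // rows
    px = (col * image_size) // period
    return py // patch, px // patch

def align_patch_features(patch_feats, length, period, patch=16):
    """Gather a (G, G, D) grid of patch embeddings into a (length, D) sequence."""
    grid = patch_feats.shape[0]
    pi, pj = time_to_patch(length, period, image_size=grid * patch, patch=patch)
    return patch_feats[pi, pj]

rng = np.random.default_rng(0)
feats = rng.standard_normal((14, 14, 8))   # stand-in for frozen ViT outputs
f_v = align_patch_features(feats, length=500, period=32)
print(f_v.shape)  # (500, 8)
```

Neighbouring time steps that fall inside the same patch receive identical vectors after this lookup, which suggests why a further step that models intra-patch dependencies is needed before the visual features can be compared with per-step temporal features.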

C. Anomaly Window Contrastive Learning (AWCL)

To leverage the complementary strengths of both modalities, a hybrid contrastive learning strategy is employed:

Intra-Window Contrast: Targets Point Anomalies. It aligns the visual feature at the anomaly position with the corresponding temporal feature (positive pair) while pushing away normal temporal features within the same window (negative pairs).
Inter-Window Contrast: Targets Context Anomalies. It aggregates features over a window (average pooling) and contrasts the global anomaly window representation against normal windows.
This dual mechanism ensures the model learns discriminative features at both fine-grained (local) and coarse-grained (global) scales.
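
The two contrasts can be sketched with an InfoNCE-style loss. The text does not give the loss formula, so InfoNCE is used here as a stand-in; the temperature, the choice of positives for the inter-window case (pooled visual versus pooled temporal features of the same anomalous window), and all function names are assumptions.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """anchor, positive: (D,); negatives: (N, D). Returns a scalar loss."""
    anchor, positive, negatives = map(l2_normalize, (anchor, positive, negatives))
    logits = np.concatenate([[anchor @ positive], negatives @ anchor]) / temperature
    logits -= logits.max()  # numerical stability
    return float(-(logits[0] - np.log(np.exp(logits).sum())))

def intra_window_loss(f_v, f_ts, labels):
    """Point-level contrast inside one window.
    f_v, f_ts: (W, D) aligned features; labels: (W,) with 1 = anomalous."""
    normal = f_ts[labels == 0]
    losses = [info_nce(f_v[t], f_ts[t], normal) for t in np.flatnonzero(labels == 1)]
    return float(np.mean(losses)) if losses else 0.0

def inter_window_loss(f_v_win, f_ts_win, normal_wins):
    """Window-level contrast after average pooling.
    f_v_win, f_ts_win: (W, D) anomalous window; normal_wins: (K, W, D)."""
    return info_nce(f_v_win.mean(axis=0), f_ts_win.mean(axis=0), normal_wins.mean(axis=1))

# Toy check: the loss is small when the positive pair matches, large otherwise.
e = np.eye(4)
good = info_nce(e[0], e[0], e[1:3])
bad = info_nce(e[0], e[1], np.stack([e[0], e[2]]))
```

In this toy check `good` is close to zero and `bad` is roughly `1 / temperature`, which is the behaviour a contrastive objective needs: matched pairs are pulled together and normal features are pushed away.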

D. Task-Adaptive Multi-Modal Fusion (TMF)

A dynamic routing mechanism integrates the features for downstream tasks:

Dynamic Weighting: A router network computes patch-level weights to assign importance to three "experts": Time-Series features, Vision features, and Anomaly-Enhanced features.
Task Adaptation: The weights are task-specific. For Anomaly Detection, the model prioritizes Anomaly-Enhanced features (high-level semantics). For Sequence Reconstruction, it prioritizes Temporal features (low-level numerical continuity).
Auxiliary Reconstruction: A reconstruction head is used as an auxiliary constraint to prevent overfitting to sparse anomaly labels and encourage deep feature interaction.
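
The routing idea can be sketched as a softmax gate over the three experts, with a separate set of router weights per task. This is a minimal illustration with randomly initialised, untrained weights; the linear router, the concatenated input, and the class name `TaskRouter` are assumptions rather than the paper's architecture.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class TaskRouter:
    """Per-task linear router producing patch-level weights over the experts."""

    def __init__(self, dim, tasks=("detection", "reconstruction"), n_experts=3, seed=0):
        rng = np.random.default_rng(seed)
        # One weight matrix per task, so each task learns its own preferences.
        self.weights = {
            task: rng.normal(scale=0.02, size=(n_experts * dim, n_experts))
            for task in tasks
        }

    def __call__(self, experts, task):
        """experts: (N, P, D) stacked expert features.
        Returns fused features (P, D) and gate weights (P, N)."""
        n, p, d = experts.shape
        router_in = experts.transpose(1, 0, 2).reshape(p, n * d)
        gates = softmax(router_in @ self.weights[task])
        fused = np.einsum("pn,npd->pd", gates, experts)
        return fused, gates

rng = np.random.default_rng(1)
# Stand-ins for time-series, vision, and anomaly-enhanced features.
experts = rng.standard_normal((3, 10, 8))
router = TaskRouter(dim=8)
fused, gates = router(experts, task="detection")
```

Each row of `gates` sums to one, so the fused feature at every patch is a weighted blend of the three experts. With trained weights, the detection router would be expected to put more mass on the anomaly-enhanced expert and the reconstruction router on the temporal expert, as the text describes.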

3. Key Contributions

Unified Framework: VETime is the first TSAD framework to effectively unify temporal and visual modalities via fine-grained alignment and dynamic fusion, addressing the limitations of unimodal approaches.
Novel Modules:
- Reversible Image Conversion: Preserves discriminative details while enabling global visual context.
- Patch-Level Temporal Alignment: Establishes a shared timeline between 2D visual patches and 1D temporal steps.
Advanced Learning Mechanisms: Introduction of Anomaly Window Contrastive Learning (handling both point and context anomalies) and Task-Adaptive Multi-Modal Fusion (dynamically balancing reconstruction and detection needs).
Zero-Shot Superiority: The model operates strictly in a zero-shot setting, trained only on synthetic data, yet in many of the reported comparisons it outperforms full-shot baselines trained on the target datasets.

4. Experimental Results

The framework was evaluated on 11 public univariate datasets (TSB-AD benchmark) and 5 multivariate datasets.

Zero-Shot Performance: VETime achieved 25 first-place rankings out of 44 metrics in the zero-shot setting against other foundation models (e.g., TimeRCD, MOMENT, Chronos). It also outperformed full-shot (trained) baselines like TranAD and USAD in 23 instances.
Vision-Based Comparison: Compared to vision-based methods (VIT4TS, VLM4TS), VETime significantly improved detection accuracy (e.g., Affiliation-F1 on YAHOO: 97.15% vs. 60.66% for VIT4TS) while being ~100x faster computationally.
Ablation Studies: Removing any core component (RIC, PTA, AWCL, or TMF) resulted in significant performance drops, confirming the necessity of the full pipeline.
Qualitative Analysis: VETime successfully localized sharp point anomalies (avoiding the over-smoothing of pure vision models) and detected long-term trend shifts (avoiding the missed detections of pure temporal models).

5. Significance

Bridging the Modality Gap: VETime demonstrates that combining the local precision of 1D time-series models with the global context of 2D vision models can solve the inherent trade-off in anomaly detection.
Practical Deployment: By achieving state-of-the-art results in a zero-shot setting with lower computational overhead than vision-based approaches, VETime offers a highly practical solution for real-world scenarios where data collection for training is infeasible (cold-start regimes).
Generalization: The ability to generalize across diverse domains (from web traffic to power grids and spacecraft telemetry) without retraining highlights the robustness of the proposed multi-modal alignment and fusion strategies.