VETime: Vision Enhanced Zero-Shot Time Series Anomaly Detection

VETime is a zero-shot time-series anomaly detection framework that unifies 1D temporal and 2D visual modalities through reversible image conversion, patch-level alignment, and adaptive multi-modal fusion. It aims to achieve fine-grained point localization and global context awareness at the same time, and the authors report that it outperforms state-of-the-art models with lower computational overhead.

Yingyuan Yang, Tian Lan, Yifei Gao, Yimeng Lu, Wenjun He, Meng Wang, Chenghao Liu, Chen Zhang

Published 2026-02-19

Imagine you are a security guard watching a massive, 24-hour surveillance feed of a factory's machinery. Your job is to spot anything weird.

Sometimes, a machine makes a sudden, loud BANG (a Point Anomaly). Other times, the machine doesn't make a noise, but it starts vibrating in a weird, slow rhythm that lasts for hours (a Context Anomaly).

For a long time, security systems had to choose between two types of guards, and neither was perfect:

  1. The "Microscope" Guard (1D Time Models): This guard looks at the data second-by-second. They are amazing at spotting the sudden BANG. But because they are so focused on the immediate second, they miss the slow, weird vibration that happens over an hour. They lack "big picture" vision.
  2. The "Wide-Angle" Guard (Vision Models): This guard looks at the whole hour of footage at once, like a photograph. They are great at seeing the slow, weird vibration. But because they squint at a whole hour compressed into one image, they miss the tiny, sudden BANG. Also, when they try to zoom in to find the exact second of the BANG, the image gets blurry, and they can't pinpoint it.

The Dilemma: You need a guard who can see the whole picture clearly and spot the tiny details instantly.

Enter VETime: The "Super-Spy" Guard

The paper introduces VETime (Vision Enhanced Time Series Anomaly Detection). Think of VETime as a super-spy who combines the best skills of both guards into one person. It doesn't just look at the numbers or the pictures; it learns to speak both languages fluently.

Here is how VETime works, using simple analogies:

1. The Magic Camera (Reversible Image Conversion)

Usually, turning a line graph (time series) into a photo (image) is like trying to fold a long piece of paper into a tiny square. You lose information, and the lines get messy.

  • VETime's Trick: It uses a special "Magic Camera." Instead of just squashing the data, it folds the time series into a 2D image in a very smart way. It separates the series into components such as the "Trend" (the big picture) and the "Noise" (the tiny details) and paints them into the Red, Green, and Blue channels of the image (like an RGB photo). Because the conversion is reversible, the original series can be recovered from the image.
  • The Result: The resulting image is so rich in detail that if you look at it, you can see the weird vibrations and the sudden spikes. It's like taking a high-definition photo of a sound wave.
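To make the "folding" idea concrete, here is a minimal sketch of one way a series could be turned into a multi-channel image and back. Everything in it is an assumption made for illustration: the moving-average trend, the three channel names (raw, trend, residual), and the helper names are hypothetical and are not taken from the paper.

```python
def moving_average(series, window):
    """Centered moving average, clamped at the edges (stand-in for the 'trend')."""
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        smoothed.append(sum(series[lo:hi]) / (hi - lo))
    return smoothed


def fold(values, width, pad=0.0):
    """Fold a 1D list into rows of `width` values, padding the last row."""
    padded = list(values) + [pad] * (-len(values) % width)
    return [padded[i:i + width] for i in range(0, len(padded), width)]


def unfold(rows, length):
    """Undo `fold`: flatten the rows and drop the padding."""
    return [v for row in rows for v in row][:length]


def series_to_image(series, width, trend_window=5):
    """Build three hypothetical channels: raw values, trend, and residual."""
    trend = moving_average(series, trend_window)
    residual = [x - t for x, t in zip(series, trend)]
    return {
        "raw": fold(series, width),
        "trend": fold(trend, width),
        "residual": fold(residual, width),
    }


series = [float(i % 4) for i in range(10)]
series[6] = 9.0  # a sudden spike (point anomaly)
image = series_to_image(series, width=4)
recovered = unfold(image["raw"], len(series))  # the fold loses nothing
```

The point of the sketch is that folding, unlike plotting a line chart and rasterizing it, keeps every value at a known pixel, so the original series can be read back out.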

2. The Time-Traveler's Map (Patch-Level Temporal Alignment)

The problem with turning data into a photo is that the photo loses the "clock." In a photo, you don't know if a pixel happened at 1:00 PM or 1:05 PM.

  • VETime's Trick: It uses a "Time-Traveler's Map." It looks at the photo and says, "Okay, this red patch in the top-left corner corresponds exactly to the 5th second of the original data."
  • The Result: It forces the "Photo Brain" and the "Number Brain" to agree on the exact timeline. Now, the Vision model knows exactly when something happened, not just that it happened.
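The "map" can be pictured as simple index arithmetic. The sketch below assumes the series was folded row by row into an image of a given width and that the vision model looks at square patches; the function names and the row-major layout are illustrative assumptions, not the paper's actual alignment module.

```python
def patch_time_indices(patch_row, patch_col, patch_size, width, length):
    """Return the time steps covered by one square image patch.

    Assumes a row-major fold: time step t sits at row t // width, column t % width.
    Pixels that fall in the padding (t >= length) are skipped.
    """
    indices = []
    for r in range(patch_row * patch_size, (patch_row + 1) * patch_size):
        for c in range(patch_col * patch_size, (patch_col + 1) * patch_size):
            t = r * width + c
            if t < length:
                indices.append(t)
    return indices


def time_to_patch(t, patch_size, width):
    """Return the (patch_row, patch_col) that contains time step t."""
    row, col = divmod(t, width)
    return row // patch_size, col // patch_size


top_left = patch_time_indices(0, 0, patch_size=2, width=8, length=30)
last_patch = patch_time_indices(1, 3, patch_size=2, width=8, length=30)
owner = time_to_patch(9, patch_size=2, width=8)
```

With a lookup like this, a feature computed for an image patch can be matched against the features of exactly the time steps that patch covers, which is the kind of agreement the alignment step is after.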

3. The Detective's Training (Anomaly Window Contrastive Learning)

How does the system learn what "weird" looks like without being shown thousands of examples?

  • VETime's Trick: It plays a game of "Spot the Difference."
    • Local Game: It looks at a tiny window (a few seconds) and asks, "Does the picture match the numbers here?" If the numbers spike but the picture looks normal, it flags it.
    • Global Game: It looks at a long window (an hour) and asks, "Does the overall shape of the trend look right?"
  • The Result: By constantly comparing the "Local" view with the "Global" view, the system learns to spot both the sudden BANG and the slow vibration simultaneously.
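The "Spot the Difference" game is a form of contrastive learning. As a rough illustration only, the sketch below uses a generic InfoNCE-style loss that rewards a window's time-side embedding for matching its own vision-side embedding more closely than any other window's. The loss form, the temperature value, and the toy embeddings are assumptions; the paper's actual objective may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(time_embs, vision_embs, temperature=0.1):
    """Mean InfoNCE loss; entry i of each list describes the same window."""
    total = 0.0
    for i, t in enumerate(time_embs):
        logits = [cosine(t, v) / temperature for v in vision_embs]
        # Numerically stable log-sum-exp over all candidate windows.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(time_embs)


# Toy embeddings: two windows, each seen by the time model and the vision model.
time_embs = [[1.0, 0.0], [0.0, 1.0]]
matched = info_nce(time_embs, [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce(time_embs, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when the two views of the same window agree and large when they do not. The same function could be applied to short windows for the "Local Game" and long windows for the "Global Game".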

4. The Smart Manager (Task-Adaptive Multi-Modal Fusion)

Finally, VETime has a manager who decides which guard to listen to.

  • The Scenario: If the system needs to find a sudden spike, the manager says, "Listen to the Microscope Guard!" If it needs to find a slow trend change, the manager says, "Listen to the Wide-Angle Guard!"
  • The Result: The system dynamically switches its focus. It doesn't just average the two opinions; it picks the best expert for the specific job at that exact moment.
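One common way to build such a "manager" is a gate that assigns each expert a weight at every time step. The sketch below is a generic softmax gate, offered as an assumption about how this kind of fusion could look rather than as VETime's actual fusion layer. In a real model the gate logits would be produced by a learned network; here they are hard-coded.

```python
import math


def softmax(scores):
    """Turn raw gate scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(time_scores, vision_scores, gate_logits):
    """Blend two anomaly scores per time step using per-step gate weights."""
    fused = []
    for t, v, (gate_time, gate_vision) in zip(time_scores, vision_scores, gate_logits):
        w_time, w_vision = softmax([gate_time, gate_vision])
        fused.append(w_time * t + w_vision * v)
    return fused


# Step 0: the gate trusts the time model; step 1: it trusts the vision model.
fused = fuse(
    time_scores=[0.9, 0.1],
    vision_scores=[0.2, 0.8],
    gate_logits=[(5.0, -5.0), (-5.0, 5.0)],
)
balanced = fuse([1.0], [0.0], [(0.0, 0.0)])  # equal logits reduce to a plain average
```

When the gate logits are far apart, the output follows one expert almost entirely. When they are equal, the gate falls back to a simple average.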

Why is this a big deal?

  • Zero-Shot Superpower: Most security guards need to be trained on your specific factory for weeks. VETime is like a genius who has already studied many other factories. You can drop it into a brand-new factory, and it works immediately without any training on that factory's data.
  • Speed: Previous "Vision" methods were slow because they tried to process huge images. VETime is reported to be about 100 times faster than its competitors, with lower computational overhead.
  • Precision: It doesn't just say, "Something is wrong between 1:00 and 2:00." It says, "Something is wrong at exactly 1:14:32."

In summary: VETime combines the "big picture" vision of a camera with the "fine-grained" precision of a stopwatch, allowing it to catch both sudden spikes and slow-building anomalies quickly and accurately.
