Anomaly detection in time-series via inductive biases in the latent space of conditional normalizing flows

This paper proposes an anomaly detection framework for multivariate time-series that leverages conditional normalizing flows with explicit inductive biases, constraining latent representations to follow prescribed temporal dynamics; anomalies are then defined as violations of these dynamics rather than as low observation likelihoods.

David Baumgartner, Eliezer de Souza da Silva, Iñigo Urteaga

Published 2026-03-13

The Big Problem: The "Too Good to Be True" Trap

Imagine you are a security guard at a museum. Your job is to spot fake paintings (anomalies) among the real ones.

Most modern AI systems act like a guard who only checks how closely a painting resembles a real one. If a fake is painted so perfectly that it looks 99% like a real masterpiece, the AI says, "Wow, this is a great painting!" and lets it in.

In technical terms, this is called maximizing likelihood. The AI learns what "normal" data looks like and gives a high score to anything that fits that pattern. But here's the catch: sometimes, a fake painting (an anomaly) can look statistically very similar to a real one, even though it's completely wrong. The AI gets tricked because it's only looking at the surface appearance, not the story behind the painting.
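The trap can be sketched in a few lines. This toy 1-D example (my own illustration, not from the paper) fits a Gaussian to "normal" data by maximum likelihood and then scores two points purely by density:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
normal_data = rng.normal(0.0, 1.0, size=1000)

# "Learn" the normal model by maximum likelihood (fit mean and std).
mu, sigma = normal_data.mean(), normal_data.std()

# An anomaly produced by a *different* process that happens to land near the
# mean: its density is high, so a likelihood-only guard waves it through.
anomalous_point = 0.1   # wrong process, but statistically "typical"
obvious_outlier = 5.0

print("density of subtle anomaly:", norm.pdf(anomalous_point, mu, sigma))
print("density of obvious outlier:", norm.pdf(obvious_outlier, mu, sigma))
```

The subtle anomaly gets a much higher density than the obvious outlier, so a detector that only thresholds density never flags it.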

The Solution: The "Script" of Time

This paper proposes a smarter way to be the security guard. Instead of just asking, "Does this look like a real painting?", the new system asks, "Does this painting follow the correct script?"

The authors built a system that doesn't just look at a single moment in time; it watches the story of how things change over time.

1. The Translator (The Normalizing Flow)

Imagine you have a chaotic, noisy video of a busy street. It's hard to understand.
The first part of their system is a Translator. It takes that messy street video and translates it into a clean, simple language (called the Latent Space).

  • Old way: The translator just tries to make the video look pretty.
  • New way: The translator is forced to follow strict rules. It must translate the street traffic into a specific, predictable pattern of movement.
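The "translation" can be sketched with a single invertible layer. This toy affine map is an illustrative stand-in for the paper's conditional normalizing flow (the class name and parameters are mine, not the authors' architecture); the key property it shares with a real flow is exact invertibility with a known log-determinant:

```python
import numpy as np

class AffineFlow:
    """Invertible map z = (x - shift) / scale, with an exact log-determinant."""
    def __init__(self, shift, scale):
        self.shift, self.scale = shift, scale

    def forward(self, x):
        # data -> latent ("messy street video" -> "clean simple language")
        z = (x - self.shift) / self.scale
        log_det = -np.log(np.abs(self.scale)) * np.ones_like(x)
        return z, log_det

    def inverse(self, z):
        # latent -> data: nothing is lost in translation
        return z * self.scale + self.shift

flow = AffineFlow(shift=3.0, scale=2.0)
x = np.array([1.0, 3.0, 7.0])
z, _ = flow.forward(x)
print(flow.inverse(z))  # recovers x exactly
```

Because the map is invertible, the latent representation keeps all the information in the data; the paper's contribution is about what distribution the latent side is forced to follow, not the translation itself.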

2. The Script (Inductive Bias)

This is the secret sauce. The authors force the system to believe that "normal" behavior follows a specific Script (a mathematical rule called a Linear-Gaussian Latent Dynamical Model).

Think of it like a dance troupe.

  • Normal dancers: They all follow the choreography perfectly. They move in a smooth, predictable line.
  • The Anomaly: A dancer who suddenly starts breakdancing in the middle of a waltz.

Even if the breakdancer is wearing the exact same costume as everyone else (high "likelihood"), they are breaking the choreography (the inductive bias). The system ignores the costume and checks the dance moves.
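The dance-troupe picture corresponds to a linear-Gaussian latent dynamical model, z_t = A z_{t-1} + w_t with Gaussian noise w_t. Here is a minimal simulation (the matrices and noise level are illustrative choices of mine): the sequence follows the learned choreography A for 100 steps, then "breakdances" under different dynamics, and the one-step residuals against the script jump:

```python
import numpy as np

rng = np.random.default_rng(1)

# The learned choreography: smooth, predictable linear dynamics.
A = np.array([[0.9, 0.1], [0.0, 0.9]])
A_break = np.array([[0.0, -0.95], [0.95, 0.0]])  # sudden "breakdancing"
noise_std = 0.05

z = np.zeros(2)
residuals = []
for t in range(200):
    A_used = A if t < 100 else A_break            # choreography breaks at t=100
    z_next = A_used @ z + rng.normal(0.0, noise_std, size=2)
    # Deviation from the script's one-step prediction A @ z:
    residuals.append(np.linalg.norm(z_next - A @ z))
    z = z_next

print("mean residual before break:", np.mean(residuals[:100]))
print("mean residual after break: ", np.mean(residuals[100:]))
```

Each individual latent state can still look perfectly ordinary in isolation (same "costume"); it is the sequence of moves, measured against A, that gives the anomaly away.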

3. The Compliance Test (The Goodness-of-Fit)

Instead of producing a score that a human must then threshold, the system runs a Compliance Test.

  • It takes the new data, translates it into the "dance moves" (latent space), and checks: "Does this sequence of moves match the choreography we learned?"
  • If the moves are slightly off, the system sounds an alarm.
  • Crucially: This test doesn't need a human to say, "Okay, if the score is above 85, it's an anomaly." The test itself tells you if the data fits the rules or not. It's like a math equation that says "Yes, this fits" or "No, this doesn't."
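A goodness-of-fit test can be sketched as follows. If the data follows the learned script, the whitened one-step latent residuals should look like draws from a standard normal; a statistical test then answers "fits / doesn't fit" at a standard significance level. The Kolmogorov-Smirnov test below is my illustrative stand-in for the paper's specific goodness-of-fit procedure:

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(2)

# Whitened one-step residuals: under the script they should be N(0, 1).
conforming = rng.normal(0.0, 1.0, size=500)   # follows the choreography
breaking = rng.normal(0.0, 2.5, size=500)     # right mean, wrong spread

p_conforming = kstest(conforming, "norm").pvalue
p_breaking = kstest(breaking, "norm").pvalue

# Yes/no at a standard significance level; no hand-tuned "above 85" rule.
print("conforming fits the script:", p_conforming > 0.05)
print("breaking fits the script:", p_breaking > 0.05)
```

Note that the "breaking" residuals have the same mean as the normal ones; a detector looking only at typical values would miss them, while the distributional test rejects them decisively.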

Why is this better?

The paper shows that this method works even when the "fake" data looks very similar to the "real" data.

  • The Old Guard (Likelihood): "This fake painting looks so real! I'll give it a 10/10." (Fails to catch the anomaly).
  • The New Guard (Inductive Bias): "This painting looks real, but the brushstrokes don't follow the artist's usual style. It breaks the rules. I'm flagging it."

Real-World Examples from the Paper

The authors tested this on two types of data:

  1. Synthetic Data (Fake Numbers): They created fake time-series where they changed the frequency (how fast it wiggles) or the amplitude (how high it swings).

    • The old AI missed the amplitude changes because the data still looked "dense" and probable.
    • The new AI caught them immediately because the pattern of movement broke the script.
  2. Real-World Data (Stocks, Sensors): They tested it on real datasets like stock prices and sensor readings.

    • They found that the system could tell if the AI itself was "confused" during training. If the AI couldn't learn the script, the system would say, "Hey, I'm not ready to detect anomalies yet," acting as a built-in quality control check.
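The amplitude experiment above has a simple intuition, sketched here with illustrative numbers of my own: a sine wave whose amplitude drops mid-stream. Every post-change sample still lies inside the normal signal's range, so a pointwise density score sees nothing unusual, even though the dynamics have clearly changed:

```python
import numpy as np

t = np.linspace(0, 20 * np.pi, 2000)
signal = np.sin(t)
signal[1000:] *= 0.5  # amplitude anomaly after the midpoint

# Each anomalous sample stays within the normal wave's [-1, 1] range,
# so per-point likelihood has nothing to object to.
in_normal_range = np.abs(signal[1000:]) <= 1.0
print(in_normal_range.all())  # True
```

Only a detector that checks the evolution of the signal against a learned script notices that the wave stopped swinging as high as it should.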

The Takeaway

This paper is about moving from "Does this look normal?" to "Does this act normal?"

By forcing the AI to learn a specific "script" for how data should evolve over time, they created a detector that is much harder to trick. It doesn't just memorize what things look like; it understands how things should behave, making it a much more reliable security guard for time-series data.