Grammar of the Wave: Towards Explainable Multivariate Time Series Event Detection via Neuro-Symbolic VLM Agents

This paper introduces Knowledge-Guided TSED, a neuro-symbolic VLM agent framework built around a novel Event Logic Tree representation. The tree bridges natural-language event descriptions and multivariate time series data, enabling accurate zero-shot event detection and explainable reasoning while mitigating hallucinations in high-stakes domains.

Sky Chenwei Wan, Tianjun Hou, Yifei Wang, Xiqing Chang, Aymeric Jan

Published 2026-03-13

Imagine you are a detective trying to solve a mystery, but instead of looking for fingerprints or footprints, you are looking at a massive, chaotic graph of lines representing data from an oil rig. This graph shows things like pressure and volume changing over time.

Your boss hands you a note that says: "Look for the moment when the pressure bounces up quickly, then settles down, while the volume stays perfectly still."

Your job is to find that exact moment on the graph. This is the challenge of Time Series Event Detection.

Here is how this paper solves that problem, explained simply:

1. The Problem: Why Old Methods Fail

Traditionally, to teach a computer to find these moments, you would need to show it thousands of examples of "pressure bouncing up" and "volume staying still." You'd have to label every single one by hand.

  • The Issue: In real-world industries (like oil and gas or healthcare), getting those labeled examples is incredibly hard, expensive, and slow.
  • The Result: If you only have a few examples, the computer gets confused. If you try to use a super-smart AI (like a Large Language Model) without training, it often "hallucinates"—it guesses wildly and makes things up because it doesn't understand the strict rules of physics.

2. The Solution: "Grammar of the Wave"

The authors propose a new way: Don't show the AI thousands of examples. Just give it the rulebook.

They treat time series data like a language. Just as sentences have grammar (Subject + Verb + Object), events in data have "grammar" (Pressure goes up + Then + Volume stays flat).

They invented a new framework called Event Logic Tree (ELT). Think of this as a family tree for data events:

  • The Leaves (Primitives): These are the simple words. "Pressure rises," "Volume is flat."
  • The Branches (Logic): These are the connecting words. "Simultaneously," "After," "Inside of."
  • The Whole Tree: This is the full story. "A rise in pressure happens inside a period where volume is flat."
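To make the "family tree" concrete, here is a minimal sketch of what an Event Logic Tree could look like as a data structure. The class and relation names (`Primitive`, `Operator`, `DURING`) are illustrative assumptions, not the paper's actual API:

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Primitive:
    """A leaf: one simple behavior of one channel, e.g. 'pressure rises'."""
    channel: str   # which signal, e.g. "pressure"
    shape: str     # qualitative behavior, e.g. "rises", "flat"

@dataclass
class Operator:
    """A branch: a temporal/logical relation joining child events."""
    relation: str                 # e.g. "AFTER", "DURING", "SIMULTANEOUS"
    children: List["Node"]

Node = Union[Primitive, Operator]

# "A rise in pressure happens inside a period where volume is flat."
elt = Operator(
    relation="DURING",
    children=[
        Primitive(channel="pressure", shape="rises"),
        Primitive(channel="volume", shape="flat"),
    ],
)

def describe(node: Node) -> str:
    """Render the tree back into a readable event description."""
    if isinstance(node, Primitive):
        return f"{node.channel} {node.shape}"
    parts = [describe(c) for c in node.children]
    return f"({f' {node.relation} '.join(parts)})"

print(describe(elt))  # (pressure rises DURING volume flat)
```

The point of the tree shape is that leaves stay simple and checkable, while all the temporal logic lives in the branches, so nested stories ("A, then B, all during C") compose naturally.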

3. The Detective Team: SELA

To use this "grammar," they built a robot detective team called SELA. It uses two specialized agents working together, like a conductor and a musician:

  • The Logic Analyst (The Conductor):

    • Job: It reads the human's messy note ("Pressure bounces up...") and translates it into a strict, logical Event Logic Tree. It breaks the sentence down into the "family tree" structure.
    • Analogy: It's like an architect drawing the blueprints before construction starts.
  • The Signal Inspector (The Musician):

    • Job: It looks at the actual squiggly lines on the graph. It zooms in and out, checking if the "Pressure rises" part of the blueprint actually matches the real data.
    • Analogy: It's the construction worker checking if the bricks match the blueprint. If the brick (data) doesn't fit the spot (logic), it moves it.

The Magic: The Inspector doesn't just guess. It constantly checks its work against the Blueprint (the ELT). If the AI starts to hallucinate (make up a pattern that isn't there), the Blueprint stops it. "Wait," the Blueprint says, "You said the volume was flat, but your data shows it spiking. That violates the rules. Try again."
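That veto step can be sketched in a few lines: a candidate time interval is accepted only if every primitive in the blueprint actually holds on the data. The shape tests and thresholds below are assumptions for illustration, not the paper's implementation:

```python
def rises(x):
    """Net upward trend over the window (threshold is an assumption)."""
    return x[-1] - x[0] > 0.5

def flat(x, tol=0.2):
    """Stays within a narrow band over the window."""
    return max(x) - min(x) < tol

CHECKS = {"rises": rises, "flat": flat}

def verify(interval, data, constraints):
    """Reject any candidate that violates one primitive of the blueprint."""
    s, e = interval
    for channel, shape in constraints:
        if not CHECKS[shape](data[channel][s:e]):
            return False   # "That violates the rules. Try again."
    return True

# Toy data: pressure climbs steadily, but volume spikes at t=50.
data = {
    "pressure": [i * 2.0 / 99 for i in range(100)],
    "volume": [5.0 if i == 50 else 1.0 for i in range(100)],
}
constraints = [("pressure", "rises"), ("volume", "flat")]

print(verify((30, 70), data, constraints))  # False: the spike breaks "flat"
print(verify((0, 40), data, constraints))   # True: both primitives hold
```

However the proposal was generated, a hallucinated interval gets caught here, because the acceptance test is grounded in the raw numbers rather than the model's impression of them.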

4. The Test: The "KITE" Dataset

To prove this works, they built a test using real oil rig data from the North Sea.

  • The Challenge: They asked the AI to find specific events (like a "successful test" vs. a "lost seal") using only the text description, with zero training examples.
  • The Competition: They compared their new team (SELA) against:
    • Old-school computers trained on limited data (they failed).
    • Super-smart AI models just guessing (they hallucinated a lot).
    • Human experts (the gold standard).

5. The Result

The SELA team came in second place, right behind the human experts, and crushed the other AI models.

  • Why? Because the "Grammar of the Wave" (the Event Logic Tree) kept the AI honest. It forced the AI to follow the logical steps rather than just guessing based on a vague feeling.

Summary Metaphor

Imagine trying to find a specific song in a radio station that plays 24/7.

  • Old AI: You play the radio and hope it recognizes the song after hearing it 1,000 times.
  • Standard LLM: You ask the radio DJ, "What song is playing?" and the DJ guesses wildly because they've never heard it.
  • This Paper (SELA): You give the DJ a sheet of music (the Logic Tree) that says, "Find the part where the violin plays a high note, followed immediately by a drum beat." The DJ uses the sheet music to scan the radio, zooming in on the exact seconds where the violin and drum match the notes.

In short: This paper teaches AI to read the "grammar" of data so it can find specific events without needing to memorize a million examples, making it smarter, more reliable, and easier to trust.