Neural delay differential equations: learning non-Markovian closures for partially known dynamical systems

This paper introduces a constant-lag Neural Delay Differential Equation (NDDE) framework, inspired by the Mori-Zwanzig formalism, for learning non-Markovian dynamics from partially observed data: memory effects are captured through learned time delays. The method outperforms existing approaches such as LSTMs and ANODEs on synthetic, chaotic, and experimental datasets.

Thibault Monsel, Onofrio Semeraro, Lionel Mathelin, Guillaume Charpiat

Published Tue, 10 Ma

Here is an explanation of the paper "Neural Delay Differential Equations: learning non-Markovian closures for partially known dynamical systems," translated into simple language with creative analogies.

The Big Problem: The "Blindfolded" System

Imagine you are trying to predict the weather, but you only have one tiny thermometer in your backyard. You don't know the wind speed, the humidity, the pressure, or what's happening in the clouds miles away. In the world of science, this is called partial observability.

Most modern AI models (like Neural ODEs) assume they can see the entire system at once. They think, "If I know the exact state of everything right now, I can predict the future." But in the real world, we rarely have that luxury. We only have a few sensors.

Furthermore, many systems aren't just about "what is happening now." They are about memory. A system might react to what happened 5 minutes ago, or 2 hours ago. If you ignore that history, your prediction will fail.

The Solution: The "Time-Traveling" AI

The authors propose a new type of AI called Neural Delay Differential Equations (NDDEs).

Think of a standard AI model as a driver who only looks through the windshield. They see the road right now and steer accordingly.

  • The Problem: If the car is on a winding road with a blind curve, looking only at the road right now isn't enough. You need to know where the road was 10 seconds ago to understand the curve you are currently entering.

NDDEs are like a driver who has a rear-view mirror that shows the road from the past. They don't just look at the current state; they explicitly look at the state of the system at specific times in the past (e.g., "What was the temperature 30 seconds ago?").
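To make the rear-view mirror concrete, here is a minimal sketch (illustrative only, not the paper's code) of how a constant-lag delay differential equation is stepped forward in time: the derivative at time t depends on the state one delay tau in the past, read out of a history buffer.

```python
# Minimal sketch: forward-Euler integration of the constant-lag DDE
#     dx/dt = -x(t - tau)
# The "rear-view mirror" is the history buffer `xs`: at each step the
# derivative is computed from the state tau seconds in the past.

def integrate_dde(tau=1.0, dt=0.01, t_end=10.0, history_value=1.0):
    lag_steps = int(round(tau / dt))      # how many steps back to look
    # Pre-fill the history: x(t) = history_value for all t <= 0.
    xs = [history_value] * (lag_steps + 1)
    for _ in range(int(round(t_end / dt))):
        x_now = xs[-1]
        x_delayed = xs[-1 - lag_steps]    # the state tau seconds ago
        xs.append(x_now + dt * (-x_delayed))
    return xs

trajectory = integrate_dde()
```

Unlike an ODE, a DDE needs a whole function (the history on an interval of length tau) as its initial condition, not just a single starting point; that history is what the buffer stores.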

The Secret Sauce: Learning the "When"

Here is the clever part. In the past, scientists had to guess when to look back. They had to say, "Let's look at the data from exactly 5 seconds ago." If they guessed wrong, the model failed.

This paper introduces a method where the AI learns the time delays itself.

  • Analogy: Imagine you are trying to learn a song by listening to a recording, but you don't know the tempo. A standard model tries to guess the beat. This new model is like a musician who listens to the song and realizes, "Ah, the echo comes back exactly 0.4 seconds later." It learns the timing of the echo automatically.
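The echo-timing idea can be sketched with a toy example (the paper trains delays by gradient descent; this gradient-free scan is only an illustration). We generate data from a DDE with a hidden delay, then score candidate delays by how well each one explains the observed derivatives: the true delay fits best.

```python
# Toy delay identification (illustrative, not the paper's method).
# Data comes from dx/dt = -x(t - tau_true); we score candidate delays
# by the mismatch between observed finite-difference derivatives and
# the model prediction -x(t - tau_guess).

def simulate(tau, dt=0.01, t_end=20.0):
    lag = int(round(tau / dt))
    xs = [1.0] * (lag + 1)                # constant history x(t)=1, t<=0
    for _ in range(int(round(t_end / dt))):
        xs.append(xs[-1] + dt * (-xs[-1 - lag]))
    return xs

dt = 0.01
tau_true = 0.4
xs = simulate(tau_true, dt)

def score(tau_guess):
    lag = int(round(tau_guess / dt))
    err = 0.0
    for k in range(lag + 1, len(xs) - 1):
        deriv = (xs[k + 1] - xs[k]) / dt      # observed derivative
        err += (deriv - (-xs[k - lag])) ** 2  # model's explanation
    return err

candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
best = min(candidates, key=score)             # the "echo" that fits
```

In the actual framework the delays are continuous parameters trained jointly with the neural network, but the principle is the same: the data itself tells you when to look back.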

The paper proves mathematically that if you have enough of these "echoes" (delays), you can perfectly reconstruct the hidden parts of the system, even if you can't see them directly. This is based on a famous math result called Takens' theorem, which roughly says: "If you stack enough time-shifted copies of a single recording, you can reconstruct the whole melody."
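Takens-style delay coordinates can be seen in a toy oscillator (a sketch for intuition, not from the paper): observing only sin(t), a lag of pi/2 recovers the hidden cos(t) component up to sign, so the lagged observation literally reconstructs the unseen state variable.

```python
import math

# Toy Takens-style delay embedding. We observe only one coordinate of a
# harmonic oscillator, s(t) = sin(t); the hidden coordinate is cos(t).
# A delay of tau = pi/2 gives s(t - tau) = -cos(t), so the delayed
# observation carries the missing state (up to sign).

dt = 0.001
tau = math.pi / 2
lag = int(round(tau / dt))

series = [math.sin(k * dt) for k in range(20000)]

# Delay vectors (s(t), s(t - tau)) for every t >= tau.
embedded = [(series[k], series[k - lag]) for k in range(lag, len(series))]

# Pick one delay vector and note the time it corresponds to.
t0 = (lag + 5000) * dt
_, lagged = embedded[5000]   # should match -cos(t0)
```

For a general nonlinear system the delayed copies don't equal the hidden variables so neatly, but Takens' theorem guarantees they are related to them by a smooth, invertible map, which is exactly what the neural network in an NDDE can learn.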

The Physics Connection: The "Ghost" of the Past

The paper also leans on a physics concept called the Mori-Zwanzig formalism.

  • The Analogy: Imagine you are watching a billiard game, but you can only see the white ball. The other balls are hidden behind a curtain. When the white ball moves, it's because it was hit by a hidden ball.
  • The "hidden ball" is the unobserved variable.
  • The "hit" is the memory term.
  • The NDDE acts like a detective. It looks at the white ball's current path and its path from the past to deduce where the hidden balls must have been and how they are pushing the white ball around.
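This detective story has a textbook caricature (a standard two-variable toy model, not taken from the paper): take a linear system with an observed variable x and a hidden variable y, solve for y, and substitute it back. A memory term appears in the equation for x alone.

```latex
% Toy linear system: observed x, hidden y.
%   dx/dt = a x + b y,        dy/dt = c y + d x.
% Solving the y-equation and substituting eliminates the hidden variable:
\dot{x}(t) = a\,x(t)
  + \underbrace{b d \int_{0}^{t} e^{c(t-s)}\, x(s)\, \mathrm{d}s}_{\text{memory: the past of } x \text{ pushes the present}}
  + \underbrace{b\, e^{c t}\, y(0)}_{\text{effect of the unseen initial hidden state}}
```

The integral weighs the entire past trajectory of x, just as Mori-Zwanzig predicts; an NDDE approximates this continuous memory with a few discrete delayed values x(t - tau_i), which is why a handful of well-chosen delays can stand in for the hidden balls behind the curtain.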

How They Tested It (The Lab Experiments)

The team tested this on three very different scenarios:

  1. Population Growth (The Rabbit Model): They modeled how a rabbit population grows. Rabbits don't reproduce instantly; there's a delay between mating and birth. The AI learned this delay perfectly, predicting the boom-and-bust cycles better than other models.
  2. Chemical Reactions (The Brusselator): This is a chemical system that oscillates (pulses) like a heartbeat. The AI had to predict the pulse using only partial data. The NDDE was the only model that stayed stable over long prediction horizons instead of eventually blowing up.
  3. Fluid Dynamics (The Wind Tunnel): This was the "real world" test. They looked at air flowing over a cavity (like a hole in a car door). The air creates swirling vortices that bounce back and forth. This is a chaotic, noisy system.
    • The Result: The NDDE was the champion. It handled the noise better than the others. Why? Because by looking at the past, it could "average out" the random sensor noise and focus on the true physical rhythm of the wind.

Why This Matters

  1. It's Smarter: It doesn't just memorize data; it understands that the past influences the future.
  2. It's Efficient: Instead of needing a massive neural network with millions of parameters to "remember" everything (like a giant hard drive), it uses a few specific time delays (like a few key notes in a song) to capture the memory.
  3. It's Interpretable: Because the AI learns specific time delays (e.g., "The system reacts 2.5 seconds later"), scientists can actually look at the result and say, "Aha! The physics of this system has a 2.5-second lag." This gives us physical insight, not just a black-box prediction.

The Bottom Line

This paper gives us a new tool to predict complex systems when we don't have all the data. It teaches the AI to listen to the echoes of the past to understand the present, making it a powerful, efficient, and scientifically grounded way to model the messy, memory-filled real world.