Receding-Horizon Maximum-Likelihood Estimation of Neural-ODE Dynamics and Thresholds from Event Cameras

This paper proposes a receding-horizon maximum-likelihood estimator that jointly identifies Neural ODE dynamics and unknown contrast thresholds from asynchronous event camera streams by modeling events as a history-dependent marked point process and optimizing a log-likelihood objective via gradient steps on a sliding window.

Kazumune Hashimoto, Kazunobu Serizawa, Masako Kishida

Published 2026-03-06

Imagine you are trying to figure out how a car is driving, but you can't see the car itself. Instead, you only have a very strange, high-tech security camera that doesn't take photos.

The Camera: The "Event" Eye
This special camera (called an Event Camera) works differently from your phone's camera. Your phone takes a picture every fraction of a second, even if nothing is moving, creating a lot of blurry, repetitive data.

This event camera is like a hyper-alert guard. It only "blinks" (sends a signal) when it sees something change.

  • If a car moves across the screen, the camera sends a tiny message: "Hey, the light got brighter at this spot!" or "It got darker there!"
  • It sends these messages at the exact microsecond they happen.
  • The Catch: The camera has a "sensitivity setting" (a threshold). It only blinks if the change in light is strong enough to cross that line. If the setting is too high, it misses small movements. If it's too low, it gets noisy.
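The "hard switch" rule above can be sketched in a few lines. This is a simplified toy model, not the paper's implementation: the function name `hard_events`, the threshold value, and the sample trace are all made up for illustration. Real event cameras work per pixel on log intensity, which is what the trace below stands in for.

```python
def hard_events(log_intensity, C=0.2):
    """Emit +1/-1 events whenever the log intensity has drifted by more
    than the contrast threshold C since the last event (a hard switch).
    C is the camera's unknown 'sensitivity setting'."""
    events = []
    ref = log_intensity[0]               # reference level at the last event
    for t, L in enumerate(log_intensity[1:], start=1):
        while L - ref >= C:              # brightness rose past the threshold
            ref += C
            events.append((t, +1))       # "it got brighter here!"
        while ref - L >= C:              # brightness fell past the threshold
            ref -= C
            events.append((t, -1))       # "it got darker here!"
    return events

# A single pixel whose log intensity ramps up, then back down.
trace = [0.0, 0.1, 0.25, 0.45, 0.3, 0.05]
evs = hard_events(trace, C=0.2)          # three events: +1, +1, then -1
```

Note how the small wiggle from 0.45 down to 0.3 produces no event at all: it never crosses the threshold, which is exactly why the wrong guess for `C` corrupts everything downstream.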

The Problem: The Mystery Box
The researchers wanted to use these blinking signals to figure out two things:

  1. The Physics: How is the object actually moving? (Is it spinning? Slowing down? Following a curve?)
  2. The Camera's Secret: What is the exact sensitivity setting of the camera?

The problem is that the camera's sensitivity setting is often unknown and can change. If you guess the wrong setting, your math for how the object is moving will be wrong. It's like trying to solve a puzzle where you don't know the shape of the pieces and you don't know the picture you're trying to build.

The Solution: The "Sliding Window" Detective
The authors created a smart system to solve this puzzle in real-time. Here is how they did it, using some fun analogies:

1. The Neural ODE: The "Imagination Engine"

They built a digital brain (a Neural ODE) that acts like a movie director's imagination. It constantly guesses: "If the object is moving like this, what should the light look like right now?"

  • It doesn't just guess the position; it guesses the entire history of how the light changed to get to this point.
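A Neural ODE is just a small neural network used as the right-hand side of a differential equation, integrated forward in time. Here is a minimal sketch under that idea, using plain NumPy and simple Euler integration; the network shape, the step size, and the random weights are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def mlp_dynamics(x, params):
    """A tiny neural network f_theta(x) standing in for the learned
    vector field dx/dt = f_theta(x) — the 'imagination engine'."""
    W1, b1, W2, b2 = params
    h = np.tanh(W1 @ x + b1)
    return W2 @ h + b2

def integrate(x0, params, dt=0.01, steps=100):
    """Forward-Euler integration: roll the imagined state forward and
    keep the whole trajectory, not just the endpoint — the entire
    history of how the state (and hence the light) evolved."""
    traj = [x0]
    x = x0
    for _ in range(steps):
        x = x + dt * mlp_dynamics(x, params)
        traj.append(x)
    return np.stack(traj)

rng = np.random.default_rng(0)
params = (rng.normal(0, 0.5, (8, 2)), np.zeros(8),
          rng.normal(0, 0.5, (2, 8)), np.zeros(2))
traj = integrate(np.array([1.0, 0.0]), params)   # shape (101, 2)
```

Training then means adjusting `params` so that the trajectory's predicted brightness changes line up with the camera's actual blinks.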

2. The Smooth Surrogate: The "Soft Threshold"

In the real world, the camera is a hard switch: "Did the light change enough? Yes/No." This is bad for math because you can't easily calculate the "slope" of a switch.

  • The researchers invented a smooth, fuzzy version of this switch. Imagine the threshold isn't a hard wall, but a hill. The closer the light change gets to the top of the hill, the more likely the camera is to "blink."
  • This allows the computer to use calculus (gradients) to gently nudge its guesses until they match the observed blinks.
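The hard switch versus the soft hill can be shown side by side. This is a generic sigmoid surrogate, a common trick for smoothing thresholds; the sharpness parameter `beta` and both function names are assumptions for illustration, not the paper's exact surrogate.

```python
import math

def hard_switch(delta, C):
    """Hard rule: fire iff the light change |delta| exceeds C.
    Its slope is zero almost everywhere — useless for gradients."""
    return 1.0 if abs(delta) >= C else 0.0

def soft_switch(delta, C, beta=20.0):
    """Smooth surrogate: a sigmoid 'hill'. The closer |delta| gets to
    the threshold C, the closer the firing probability gets to 1, and
    the slope is nonzero everywhere, so gradient steps can adjust both
    the dynamics and the threshold estimate."""
    return 1.0 / (1.0 + math.exp(-beta * (abs(delta) - C)))
```

For example, with `C = 0.2` the hard switch jumps from 0 to 1 at exactly 0.2, while the soft switch passes smoothly through 0.5 there, giving the optimizer a usable slope on both sides.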

3. The Receding Horizon: The "Sliding Window"

This is the most clever part.

  • The Old Way: Imagine trying to solve a mystery by reading the entire history of a crime scene from the beginning of time up to the present moment every single time you get a new clue. It would take forever and crash your computer.
  • The New Way (Receding Horizon): The researchers only look at the last few seconds of the video.
    • They take a "window" of time (say, the last 15 seconds).
    • They use the data in that window to update their guesses about the movement and the camera settings.
    • Then, they slide the window forward, drop the oldest data, and add the newest data.
    • Why? It keeps the math fast and manageable, like a detective focusing only on the most recent clues rather than re-reading the whole case file every time.
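The sliding-window loop itself is simple to sketch. In this toy version, `fit_window` is a hypothetical stand-in for the paper's gradient-based likelihood maximization (here it just averages the window), and the events are plain numbers; only the windowing mechanics are the point.

```python
from collections import deque

def fit_window(events):
    """Placeholder 'optimizer': averages the window, standing in for
    a few gradient steps on the window's log-likelihood."""
    return sum(events) / len(events)

def receding_horizon(stream, window=3):
    """Keep only the newest `window` items, refit on each arrival,
    then slide forward. Old data falls off the deque automatically."""
    buf = deque(maxlen=window)
    estimates = []
    for event in stream:
        buf.append(event)            # add the newest clue
        estimates.append(fit_window(list(buf)))
    return estimates

est = receding_horizon([1, 2, 3, 4, 5, 6, 7], window=3)
```

Notice that once the buffer is full, each estimate depends only on the last three items: the cost per update stays constant no matter how long the stream runs.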

4. The Monte Carlo Subsampling: The "Spot Check"

To check if their guess is good, they have to compare their "Imagination Engine" against the actual camera blinks.

  • Normally, they would have to check every single pixel on the screen (thousands of them) to see if the math adds up. That's too slow.
  • Instead, they use Monte Carlo Subsampling. Imagine you are judging the quality of a giant pizza. Instead of tasting every single slice, you randomly pick 500 bites, taste them, and assume the whole pizza tastes like that.
  • The computer picks a random sample of pixels to check the math, saving massive amounts of time.
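The spot-check idea can be demonstrated with a toy per-pixel error array. The "errors" here are synthetic numbers, and the sample size of 500 is illustrative; the point is that the subsampled average lands very close to the exact one at a tiny fraction of the cost.

```python
import random

def full_loss(pixel_errors):
    """Exact objective: average mismatch over every pixel (slow when
    the image has hundreds of thousands of pixels)."""
    return sum(pixel_errors) / len(pixel_errors)

def subsampled_loss(pixel_errors, k=500, seed=0):
    """Monte Carlo estimate: check only k randomly chosen pixels.
    The expected value matches the full loss, but the cost is fixed
    at k regardless of image size."""
    rng = random.Random(seed)
    sample = rng.sample(range(len(pixel_errors)), k)
    return sum(pixel_errors[i] for i in sample) / k

# Synthetic per-pixel errors, roughly uniform on [0, 1).
errors = [((i * 2654435761) % 1000) / 1000 for i in range(100_000)]
exact = full_loss(errors)                 # averages 100,000 pixels
approx = subsampled_loss(errors, k=500)   # averages only 500
```

With 500 samples the estimate typically sits within a couple of percent of the exact average, which is plenty of accuracy for a single noisy gradient step.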

The Result

By combining these tricks, the system can:

  1. Learn the movement: It figures out the exact physics of the moving object (speed, direction, spin).
  2. Learn the camera: It figures out the camera's hidden sensitivity settings, even if they vary from pixel to pixel.
  3. Do it live: It updates these guesses instantly as the video plays, without getting bogged down by old data.

In a Nutshell:
The paper teaches a computer to watch a camera that only blinks when things change, and to figure out both how the object is moving and how sensitive the camera is, all by looking at a sliding window of recent history and taking quick "spot checks" of the data. It turns a chaotic stream of tiny blips into a clear, smooth understanding of the world.