Locally Adaptive Decay Surfaces for High-Speed Face and Landmark Detection with Event Cameras

This paper introduces Locally Adaptive Decay Surfaces (LADS), a novel event representation that modulates temporal decay based on local signal dynamics. By overcoming the limitations of fixed-parameter methods, LADS achieves state-of-the-art face detection and landmark localization accuracy at high frequencies while enabling the use of lighter network architectures.

Paul Kielty, Timothy Hanley, Peter Corcoran

Published 2026-02-27

Imagine you are trying to take a photo of a busy street using a camera that doesn't take pictures like normal cameras do. Instead of capturing full frames at a fixed rate, this special camera (called an Event Camera) reports individual pixels only when the brightness at that pixel changes: a car driving by, a bird flapping its wings, a person blinking. It's incredibly fast and efficient, but the data it sends back is a chaotic, scattered stream of dots rather than a clear picture.

The challenge for computer scientists is: How do you turn this scattered stream of dots into a clear picture that a computer can understand?

The Old Way: The "Static Timer"

Traditionally, researchers tried to build a picture by grouping these dots into time buckets (like 30 times a second). They used a simple rule: "Keep the dots for a fixed amount of time, then forget them."

Think of this like a bucket with a hole in the bottom.

  • If you pour water (events) into the bucket, it fills up.
  • The hole lets water leak out at a constant speed, no matter what.

The Problem:

  • When things are still: If a person is sitting perfectly still, the "water" leaks out too fast. The computer forgets the shape of their face before it can even recognize it.
  • When things move fast: If a person waves their hand wildly, the bucket fills up so fast that the water overflows and mixes everything together. The hand becomes a blurry mess, and the computer can't tell where the fingers are.

The old method used the same "leak speed" for the whole image. It was a one-size-fits-all approach that failed when the scene was complex.
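
The "leaky bucket" above is essentially an exponential time surface with one global decay constant. Here is a minimal sketch of that idea, assuming a simplified event format of `(x, y, t)` tuples; the paper's exact surface definition may differ.

```python
import numpy as np

def fixed_decay_surface(events, t_now, tau, shape):
    """Time surface with one global decay constant (the "leaky bucket").

    `events` is assumed to be a list of (x, y, t) tuples; real event
    streams also carry polarity and use a hardware-specific format.
    Every pixel's memory fades at the same rate tau, no matter how
    busy or quiet that part of the scene is.
    """
    last_ts = np.full(shape, -np.inf)          # timestamp of newest event per pixel
    for x, y, t in events:
        last_ts[y, x] = max(last_ts[y, x], t)
    # Exponential decay: recent events are bright, old ones fade toward zero.
    # Pixels that never fired have last_ts = -inf, so exp(...) gives exactly 0.
    return np.exp(-(t_now - last_ts) / tau)
```

With a single `tau`, the trade-off described above is unavoidable: a small `tau` erases still faces, while a large `tau` smears fast motion.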

The New Solution: LADS (The "Smart Sponge")

The authors of this paper introduced a new method called LADS (Locally Adaptive Decay Surfaces).

Imagine replacing that leaky bucket with a smart, magical sponge that changes its behavior based on what it's touching.

  • In quiet areas (like a nose or cheek): The sponge becomes sticky. It holds onto the "dots" (events) for a long time. This preserves the shape of the face so the computer doesn't lose track of it.
  • In busy areas (like a blinking eye or moving hand): The sponge becomes super absorbent and then instantly releases. It grabs the new movement, shows it clearly, and then immediately lets the old movement fade away. This prevents the "blur" that happens when things move too fast.

How Does the Sponge Know What to Do?

The paper tests three different ways for the sponge to decide when to be sticky and when to be fast:

  1. Event Rate: "Are lots of dots hitting this spot?" If yes, fade them out fast. If no, hold them tight.
  2. Edge Detection (LoG): "Are there sharp lines here?" If the dots form a sharp edge (like an eyelid), fade them quickly to keep the edge crisp.
  3. Frequency (FFT): "Is this area vibrating with high-energy activity?" If yes, clear it out fast.
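
To make the "smart sponge" concrete, here is a minimal sketch of the event-rate variant, under the same assumed `(x, y, t)` event format as above. This is a hypothetical simplification, not the paper's exact formulation: each pixel gets its own decay constant, interpolated between a slow and a fast value by local activity. The LoG and FFT variants would replace the event count with an edge-strength or frequency-energy map.

```python
import numpy as np

def adaptive_decay_surface(events, t_now, shape,
                           tau_slow=0.5, tau_fast=0.05, window=0.1):
    """Sketch of a locally adaptive decay surface (event-rate variant).

    Pixels with many events inside the recent `window` decay quickly
    (tau_fast, the "absorbent" sponge); quiet pixels hold their memory
    (tau_slow, the "sticky" sponge). All parameter values here are
    illustrative assumptions, not the paper's settings.
    """
    last_ts = np.full(shape, -np.inf)
    counts = np.zeros(shape)                   # local activity estimate
    for x, y, t in events:
        last_ts[y, x] = max(last_ts[y, x], t)
        if t_now - t <= window:
            counts[y, x] += 1
    peak = counts.max()
    activity = counts / peak if peak > 0 else counts   # normalise to [0, 1]
    tau = tau_slow + (tau_fast - tau_slow) * activity  # busy -> short tau
    return np.exp(-(t_now - last_ts) / tau)
```

The key design choice is that `tau` is now a per-pixel map rather than a scalar: a still cheek keeps its long memory while a blinking eye is refreshed almost instantly, in the same frame.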

Why This Matters (The Results)

The researchers tested this on a task that is very hard for computers: finding a face and pinpointing exactly where the eyes, nose, and mouth are (facial landmarks) using these event cameras.

  • At normal speed (30 Hz): The new "Smart Sponge" method was better than the old "Leaky Bucket" method. It found faces more accurately and located facial features with less error.
  • At super speed (240 Hz): This is where LADS truly shines. The old methods fell apart when the camera moved fast; the faces became unrecognizable blobs. But LADS kept the faces clear and sharp. It was so good that it outperformed previous records set at much slower speeds.

The Bonus: Lighter Computers

Because the "Smart Sponge" does such a good job of keeping the image clear before the computer even looks at it, the computer doesn't need to be as "smart" or powerful to do the work.

  • Old way: Needed a giant, heavy brain (a massive neural network) to try to fix the blurry images.
  • LADS way: The image is already clear, so the computer can use a tiny, lightweight brain. This means this technology can run on small, battery-powered devices like robots, drones, or cars without needing a supercomputer.

The Bottom Line

This paper is about teaching computers to be context-aware. Instead of treating every part of a scene the same way, LADS looks at what's happening in each tiny corner of the image and adjusts its memory accordingly. It holds onto stillness and lets go of chaos.

This makes event cameras much more useful for real-world applications like driver monitoring (watching if a driver is falling asleep), robotics, and human-computer interaction, allowing them to see fast movements clearly without getting confused or needing expensive hardware.
