Imagine you are trying to watch a high-speed race, but instead of a normal video camera, you have a special "event camera."
The Problem with Normal Cameras vs. Event Cameras
A normal video camera works like a flipbook: it takes a picture of the whole scene 30 or 60 times a second, whether anything is moving or not. It wastes a lot of energy recording empty sky or a static wall.
An event camera is different. It's like a room full of tiny, hyper-alert security guards. They only speak up when something changes in their specific corner of the room. If a car zooms past, they shout "I see movement!" If the wall is still, they stay silent. This makes them incredibly fast and efficient, perfect for things like self-driving cars that need to react instantly.
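The "shouts" have a concrete shape: an event camera emits an asynchronous stream of events, each typically a tuple of pixel coordinates, a timestamp, and a polarity (did that pixel get brighter or darker?). The sketch below imitates this in Python by comparing two brightness frames; the `threshold` value and the frame-based simulation are illustrative only (a real sensor does this per pixel, in analog hardware, with no frames at all):

```python
from dataclasses import dataclass

@dataclass
class Event:
    x: int          # pixel column
    y: int          # pixel row
    t: float        # timestamp (e.g., microseconds)
    polarity: int   # +1 = brighter, -1 = darker

def events_from_frames(prev, curr, t, threshold=0.15):
    """Emit an event only where brightness changed enough."""
    events = []
    for y, (row_p, row_c) in enumerate(zip(prev, curr)):
        for x, (p, c) in enumerate(zip(row_p, row_c)):
            diff = c - p
            if abs(diff) >= threshold:
                events.append(Event(x, y, t, 1 if diff > 0 else -1))
    return events

# A static wall produces no events; only the changed pixel "shouts".
prev = [[0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0]]
curr = [[0.0, 0.9, 0.0],
        [0.0, 0.0, 0.0]]
print(events_from_frames(prev, curr, t=1000.0))
```

Note that five of the six pixels produce nothing at all: the output size tracks how much the scene *changed*, not how big the scene is.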
The Old Way: The "One-by-One" Bottleneck
The problem with these event cameras is that the data comes in a chaotic, endless stream of individual "shouts" (events).
- Old AI models tried to listen to these shouts one by one, like a teacher calling on students individually. They could react fast (low latency), but because they had to process every single shout sequentially, they were slow to train and often missed details.
- Newer AI models tried to listen to everyone at once (parallel processing), which is great for learning, but they ended up listening to everything, even the silence. This made them slow and computationally expensive, defeating the purpose of the event camera.
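To make the tradeoff concrete, here is a toy operation count, assuming a 640x480 sensor and a small moving object that triggers 50 events. The two functions are hypothetical stand-ins, not the paper's models: one counts the unavoidable sequential steps of an event-by-event model, the other counts the per-frame work of a dense model that processes every pixel whether it fired or not:

```python
def sequential_ops(events):
    """One-by-one model: each event updates a running state,
    so the N steps cannot be parallelized across time."""
    state, steps = 0.0, 0
    for e in events:
        state += e
        steps += 1
    return steps

def dense_ops(height, width):
    """Frame-style parallel model: every pixel is processed each
    step, active or not, so cost is O(H*W) regardless of sparsity."""
    return height * width

events = [1.0] * 50            # 50 events from a small moving object
print(sequential_ops(events))  # 50 sequential steps
print(dense_ops(480, 640))     # 307200 operations, mostly on silence
```

The sequential model does little work but must do it serially; the dense model parallelizes beautifully but burns over 300,000 operations to handle 50 events. SSLA aims to keep the small number from the first and the parallelism from the second.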
The Solution: SSLA (The "Smart Neighborhood" System)
The authors of this paper, "Low-latency Event-based Object Detection with Spatially-Sparse Linear Attention," invented a new system called SSLA.
Think of the camera's view as a giant city map.
- The "Mixture of Spaces" (The Neighborhoods): Instead of having one giant brain trying to remember the whole city at once, SSLA divides the city into many small, overlapping neighborhoods (patches).
- The "Scatter-Compute-Gather" (The Efficient Workflow):
  - Scatter: When a car moves, it triggers shouts only in the specific neighborhoods it passes through. The system instantly routes those shouts to the correct neighborhood teams. It ignores the empty neighborhoods.
  - Compute: Each neighborhood team processes its own local shouts simultaneously (parallel training). Because they only focus on their small area, they are super fast.
  - Gather: Finally, the system brings all the local reports back together to form a complete picture of the car.
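The three steps above can be sketched in a few lines. This is a deliberately simplified stand-in, assuming a hypothetical 4-pixel patch size and using a plain average as the "compute" step (in SSLA itself this would be a local linear-attention update, not an average):

```python
from collections import defaultdict

PATCH = 4  # hypothetical neighborhood (patch) size in pixels

def scatter(events, patch=PATCH):
    """Route each (x, y, feature) event to its spatial patch."""
    buckets = defaultdict(list)
    for x, y, feat in events:
        buckets[(x // patch, y // patch)].append((x, y, feat))
    return buckets  # empty patches simply never appear here

def compute(buckets):
    """Toy per-patch computation: average the local features.
    Each patch's work is independent, so all run in parallel."""
    return {key: sum(f for _, _, f in evs) / len(evs)
            for key, evs in buckets.items()}

def gather(patch_outputs):
    """Merge per-patch results back into one sparse feature map."""
    return dict(sorted(patch_outputs.items()))

events = [(1, 1, 0.5), (2, 1, 0.7),   # two events in patch (0, 0)
          (9, 2, 1.0)]                # one event in patch (2, 0)
out = gather(compute(scatter(events)))
print(out)  # only the two active patches carry any work
```

The key property is in `scatter`: patches with no events never even get a dictionary entry, so the silent parts of the scene cost exactly zero compute.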
The Secret Sauce: "Position-Aware Projection"
There's a tricky part: if a car is in the top-left corner of a neighborhood, that's different from it being in the bottom-right. The system uses a special "compass" (Position-Aware Projection) that adjusts how each event is encoded based on exactly where inside the neighborhood it happened. This ensures the AI doesn't get confused about the object's shape or location.
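One way to picture this: instead of a single shared projection, the weights used to encode an event depend on its offset within the patch. The sketch below is a hypothetical illustration of that idea, not the paper's actual formulation; the patch size, weight values, and `make_position_weights` helper are all invented for the example (in practice the weights would be learned):

```python
PATCH = 4  # hypothetical patch size, matching the neighborhood sketch

def make_position_weights(patch=PATCH):
    """One small weight vector per in-patch offset. Fixed values
    here for illustration; these would be learned in a real model."""
    return {(dx, dy): [0.1 * (dx + 1), 0.1 * (dy + 1)]
            for dx in range(patch) for dy in range(patch)}

def project(event, weights, patch=PATCH):
    """Encode an event using weights chosen by its in-patch offset."""
    x, y, feat = event
    w = weights[(x % patch, y % patch)]
    return [feat * wi for wi in w]

w = make_position_weights()
# Same feature value, different corners of the patch:
top_left     = project((0, 0, 1.0), w)
bottom_right = project((3, 3, 1.0), w)
print(top_left, bottom_right)  # two different encodings
```

Two identical events at different corners produce different encodings, which is exactly the "compass" the text describes: position inside the neighborhood is baked into the representation rather than lost.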
The Results: Fast, Accurate, and Efficient
The team built a full object detector called SSLA-Det using this system.
- Speed: It's incredibly fast. It processes new events in microseconds (millionths of a second), faster than the camera sensor can even send the data.
- Efficiency: It uses 20 times less computing power than the previous best methods. It's like getting the same result with a bicycle instead of a semi-truck.
- Accuracy: Despite being so efficient, it actually detects objects better than previous fast methods, achieving top-tier scores on standard driving datasets.
In a Nutshell
This paper solves the "speed vs. smarts" dilemma for event cameras. By organizing the AI into small, specialized teams that only work when and where they are needed, they created a system that is fast enough for real-time driving, smart enough to see small details, and efficient enough to run on small hardware. It's the difference between a chaotic shouting match and a well-organized, hyper-efficient emergency response team.