Latent Replay Detection: Memory-Efficient Continual Object Detection on Microcontrollers via Task-Adaptive Compression

This paper introduces Latent Replay Detection (LRD), a memory-efficient continual object detection framework for microcontrollers. LRD uses task-adaptive FiLM-based compression and spatially diverse exemplar selection to learn new object categories within a strict 64 KB memory budget.

Bibin Wilson

Published 2026-03-03

Imagine you have a tiny, battery-powered robot dog. You teach it to recognize your shoes and your coffee mug. It's great! But then, you bring home a new pet, a cat, and a new toy.

In the world of traditional AI, the robot has a major problem: It has a tiny brain (memory) and no way to learn new things without forgetting the old ones.

If you try to teach it about the cat, it might start thinking your shoes are cats. If you try to save pictures of the shoes to remember them, the robot's brain (which is only the size of a postage stamp) fills up instantly. It can't store thousands of photos.

This paper introduces a clever solution called Latent Replay Detection (LRD). Think of it as teaching the robot a new way to "remember" things that fits in its tiny pocket.

Here is how it works, using simple analogies:

1. The Problem: The "Photo Album" vs. The "Sketchbook"

Usually, to remember what it learned yesterday, an AI needs to save raw photos (like a photo album).

  • The Issue: A single photo of a coffee mug takes up a lot of space (like a heavy brick). The robot's memory can only hold 3 or 4 bricks. It can't learn a new category without running out of room.

The LRD Solution: Instead of saving the photo, the robot saves a tiny, compressed sketch (a "latent" representation).

  • The Analogy: Imagine you need to remember a complex painting. Instead of carrying the whole canvas (the photo), you write down a few key notes about the colors and shapes (the sketch).
  • The Result: One sketch takes up almost no space. The robot can now carry 400+ sketches in its pocket, whereas it could only carry 3 photos. This allows it to remember the shoes, the mug, the cat, and the toy all at once.
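If you like concrete numbers, the back-of-the-envelope arithmetic looks something like this. The sizes below are illustrative assumptions chosen to match the analogy, not figures taken from the paper:

```python
# Illustrative replay-memory arithmetic (all sizes are made-up assumptions,
# not the paper's exact numbers).
BUDGET = 64 * 1024                  # a 64 KB replay budget

raw_image_bytes = 80 * 80 * 3       # one small RGB "photo" (uint8) = 19,200 bytes
latent_bytes = 160                  # one compressed "sketch" (int8 latent vector)

photos_that_fit = BUDGET // raw_image_bytes
sketches_that_fit = BUDGET // latent_bytes

print(photos_that_fit)              # -> 3   (the "3 bricks")
print(sketches_that_fit)            # -> 409 (the "400+ sketches")
```

The point is the ratio, not the exact numbers: shrinking each memory by two orders of magnitude turns "a handful of examples" into "hundreds of examples" inside the same tiny budget.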

2. The Smart Compression: The "Chameleon Filter"

You might think, "Can't we just use a standard filter to shrink the photos?" The authors say no.

  • The Issue: A standard filter (like a fixed black-and-white filter) treats every object the same. But a "shoe" looks very different from a "cat." A one-size-fits-all filter loses important details.
  • The LRD Solution: They use Task-Adaptive Compression (FiLM).
  • The Analogy: Imagine a Chameleon Filter. When the robot is looking at shoes, the filter turns green to highlight the laces. When it looks at a cat, the filter turns orange to highlight the whiskers. The filter changes its shape depending on what it is trying to remember. This ensures the most important details aren't lost during the shrinking process.
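The "chameleon" mechanism is FiLM (Feature-wise Linear Modulation): each task gets its own per-channel scale (gamma) and shift (beta) that reshape the feature map before compression. Here is a minimal sketch of that idea; the task names and parameter values are invented for illustration:

```python
import numpy as np

# Minimal FiLM sketch: each task supplies a per-channel scale (gamma) and
# shift (beta) that modulate the feature map before compression.
def film(features, gamma, beta):
    # features: (channels, height, width); gamma, beta: (channels,)
    return gamma[:, None, None] * features + beta[:, None, None]

features = np.ones((4, 2, 2))  # a toy 4-channel feature map

# Hypothetical per-task parameters (the "chameleon" changing color per task)
task_params = {
    "shoes": (np.array([1.5, 0.5, 1.0, 0.0]), np.zeros(4)),
    "cats":  (np.array([0.2, 2.0, 1.0, 1.0]), np.full(4, 0.1)),
}

gamma, beta = task_params["cats"]
modulated = film(features, gamma, beta)
print(modulated[1, 0, 0])  # channel 1 scaled by 2.0 and shifted by 0.1 -> 2.1
```

Because gamma and beta are just one number per channel, switching tasks costs almost no memory, yet the compressor can emphasize different details (laces vs. whiskers) for each task.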

3. The Smart Picking: The "Map of the Room"

When the robot needs to practice (rehearse) what it learned, it picks some of those sketches to review.

  • The Issue: If you just pick sketches randomly, you might end up picking 10 sketches of shoes all sitting in the corner of the room. You forget what shoes look like in the middle of the room or on a table. This is called "localization bias."
  • The LRD Solution: They use Spatial-Diverse Selection.
  • The Analogy: Imagine the robot is a security guard. Instead of looking at 10 cameras all pointed at the front door, it forces itself to look at cameras in the kitchen, the hallway, the backyard, and the ceiling. It picks sketches that cover every corner of the room so it doesn't get confused about where objects are.
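One simple way to realize the "cameras in every room" idea is greedy farthest-point selection over the bounding-box centers: always pick the next exemplar that is farthest from everything already chosen. This is an illustrative stand-in for the paper's selection rule, not its exact algorithm:

```python
import math

# Greedy farthest-point selection over bounding-box centers: a simple way to
# pick spatially diverse exemplars (illustrative, not the paper's exact method).
def spatially_diverse(centers, k):
    chosen = [centers[0]]
    while len(chosen) < k:
        # pick the candidate farthest from its nearest already-chosen point
        best = max(
            (c for c in centers if c not in chosen),
            key=lambda c: min(math.dist(c, p) for p in chosen),
        )
        chosen.append(best)
    return chosen

# Several detections clustered in one corner, plus a few spread around the room
centers = [(0.1, 0.1), (0.12, 0.11), (0.11, 0.13),
           (0.9, 0.9), (0.1, 0.9), (0.9, 0.1)]
print(spatially_diverse(centers, 3))
```

A random pick would likely grab several of the near-duplicate corner detections; the greedy rule spreads the chosen exemplars across the scene, which is exactly the cure for the localization bias described above.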

4. The Real-World Test: The "Tiny Brain"

The authors didn't just do this on a powerful computer. They actually put this system on real, tiny microcontrollers (the brains inside smart sensors, industrial robots, and wearable cameras).

  • The Hardware: They tested it on chips like the STM32 and ESP32. These chips have less memory than a basic calculator.
  • The Result: The robot learned new things, didn't forget the old things, and did it all while using very little battery power. It could run a full "school day" of learning in a fraction of a second.

Why Does This Matter?

Before this paper, if you wanted a smart device to learn new things in the real world (like a warehouse robot learning to spot new packages), you had to send the data back to a giant cloud server, retrain the model there, and push the update back down to the device. This is slow, expensive, and requires an internet connection.

With LRD:

  • The robot can learn on the spot.
  • It fits in tiny, cheap devices.
  • It saves battery life.
  • It never forgets what it learned yesterday.

In short: This paper gave tiny robots a "super-memory" that fits in their pockets, allowing them to grow smarter every day without needing a massive computer in the cloud.