Event-Based Visual Teach-and-Repeat via Fast Fourier-Domain Cross-Correlation

Imagine you are teaching a robot how to walk through a maze. In the old way, you would drive the robot through the maze once, recording a video of the path. Later, you would play the robot a video of that path and ask it to "look at the screen and copy what it sees."

The problem with this old method is that standard cameras are like a slow-talking person. They take a full picture (a frame) every 1/30th of a second. If the robot moves fast, the picture blurs. If the lights change, the picture looks different. And because the camera has to wait to take the next picture, the robot often has to pause to "think," making it slow and clumsy.

This paper introduces a very fast robot navigation system built around a special kind of camera called an Event Camera. Here is how it works, explained simply:

1. The Camera: The "Motion Detective"

Instead of taking full photos like a normal camera, this Event Camera is like a motion detective. It doesn't care about the whole room; it only screams out when something changes.

If a wall is sitting still, the camera is silent.
If a shadow moves or a corner passes by, the camera instantly shouts, "Hey! A pixel just got brighter!" or "A pixel just got darker!"
It does this thousands of times a second with incredible precision.

2. The "Teach" Phase: Recording the Rhythm

When you first teach the robot the path, it doesn't just record a video. It counts the "shouts" (events).

The Analogy: Imagine you are walking a path and you decide to take a step every time you hear a specific bird chirp. You aren't counting seconds; you are counting events.
The robot records these "chirps" (events) into little chunks, each holding the same number of chirps. If the robot moves fast, it collects chirps quickly and a chunk fills up sooner. If it moves slowly, a chunk takes longer to fill. But the pattern of the chirps inside each chunk remains the same. This makes the robot's memory very flexible; whether it walks fast or slow later, the "song" of the path sounds the same.

3. The "Repeat" Phase: The Lightning-Fast Match

Now, the robot has to walk the path again on its own. This is where the magic happens.

The Old Way: The robot would take a photo, compare it to a stored photo, and say, "Hmm, I'm a little to the left." This takes time.
The New Way: The robot uses a mathematical trick called Fast Fourier Transform (FFT).
- The Analogy: Imagine you have two huge jigsaw puzzles. The old way is to pick up every single piece and try to fit it with every other piece (very slow).
- The new way is like turning both puzzles into a soundwave. Instead of looking at the pieces, you listen to the "hum" of the puzzle. If the hums match, you know the puzzles are aligned.
- Because the robot only cares about the "motion changes" (the events), the soundwave is very simple and quiet. The robot can compare the "hum" of the current view with the "hum" of the stored path in the blink of an eye.

4. Why This is a Game-Changer

The authors built this system on a small robot and tested it in a giant warehouse and outside on a university campus (day and night).

Speed: The robot makes decisions more than 300 times a second. That is faster than a hummingbird typically flaps its wings. A normal robot might make decisions 30 times a second.
Accuracy: The robot stayed within 15 centimeters (6 inches) of the perfect path, even when walking over grass, carpets, or in the dark.
Robustness: Because the camera only sees changes, it doesn't get confused by shadows, moving people, or flickering lights. It just ignores the static stuff and focuses on the movement.

The Bottom Line

This paper is about teaching a robot to "listen" to the world's motion instead of "watching" the world's pictures. By using a special camera that only notices changes and a super-fast math trick to match those changes, they created a robot that navigates complex paths at a much higher update rate than frame-based systems, with comparable or better accuracy, even in the dark.

In short: They turned a slow, blurry video game into a high-speed, motion-sensing rhythm game, and the robot is now the champion player.

1. Problem Statement

Visual Teach-and-Repeat (VT&R) allows robots to autonomously retrace demonstrated paths using visual feedback. However, conventional VT&R systems rely on frame-based cameras, which suffer from:

Fixed Frame Rates: This creates latency between perception and action, limiting update rates and responsiveness.
Motion Blur and Dynamic Range Issues: Standard cameras struggle in high-speed or low-light conditions.
Computational Cost: Matching dense image frames in real-time is computationally intensive, often preventing high-frequency control loops on resource-constrained robots.

The paper addresses the need for a low-latency, high-frequency VT&R system that can operate robustly in diverse lighting (day/night) and dynamic environments. It proposes replacing standard cameras with event cameras, which asynchronously report pixel-level brightness changes, offering microsecond temporal resolution and high dynamic range.

2. Methodology

The proposed system formulates event-stream matching as a frequency-domain cross-correlation problem, replacing spatial cross-correlation with efficient element-wise multiplication in Fourier space.

A. Event Representation & Accumulation

Binary Event Frames: Instead of processing individual events, the system accumulates events into binary frames ( $I_k \in \{0, 1\}$ ) based on a fixed event count ( $N$ ) rather than a fixed time window.
- Significance: This ensures that frames contain consistent information content regardless of robot velocity. If the robot moves faster, it accumulates events faster, but the frame size (in terms of event count) remains constant, maintaining appearance consistency between teach and repeat phases.
Polarity Agnostic: Event polarity (brightness increase/decrease) is discarded to create binary frames. This prevents matching failures caused by polarity reversals during angular corrections.
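
The accumulation step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `accumulate_event_frames`, the parameter `events_per_frame` (standing in for $N$), and the toy event stream are all assumptions made for the example.

```python
import numpy as np

def accumulate_event_frames(xs, ys, height, width, events_per_frame):
    """Split an event stream into binary frames holding a fixed number of events.

    xs, ys: integer pixel coordinates of the events, in arrival order.
    Polarity is never read, so the resulting frames are polarity agnostic.
    """
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    n_frames = len(xs) // events_per_frame  # the incomplete tail is dropped
    frames = np.zeros((n_frames, height, width), dtype=np.uint8)
    for k in range(n_frames):
        sl = slice(k * events_per_frame, (k + 1) * events_per_frame)
        frames[k, ys[sl], xs[sl]] = 1  # a pixel is 1 if it fired at least once
    return frames

# Hypothetical toy stream: 10 events on a 4x4 sensor, 5 events per frame.
xs = [0, 1, 2, 3, 0, 1, 1, 2, 2, 3]
ys = [0, 0, 0, 0, 1, 2, 2, 3, 3, 3]
frames = accumulate_event_frames(xs, ys, height=4, width=4, events_per_frame=5)
```

Because a frame closes after a fixed number of events rather than a fixed duration, timestamps do not enter the computation at all, which is the property the velocity-invariance argument relies on.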

B. The VT&R Pipeline

The system operates in two phases:

Teach Phase: The robot is teleoperated along a path. Event frames and odometry poses are recorded at regular intervals of linear distance ( $\Delta d$ ) or angular displacement ( $\Delta \alpha$ ) to build a Topometric Map (an ordered list of event frames and poses).
Repeat Phase: The robot autonomously follows the path.
- Odometry Drift Correction: A low-level controller drives the robot to target poses. To correct for odometry drift (lateral and along-path), the system matches the current incoming event frame against a search space of stored teach frames.
- Frequency-Domain Matching: Cross-correlation is performed using the Fast Fourier Transform (FFT).
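  - Equation: $P_j = \mathcal{F}^{-1}\left(\mathcal{F}(I_j) \cdot \mathcal{F}(\hat{I})^{*}\right)$ , where $\cdot$ is element-wise multiplication and ${}^{*}$ denotes the complex conjugate.
  - This reduces computational complexity from $O(N^2)$ to $O(N \log N)$ , where $N$ here denotes the number of pixels being correlated rather than the event count defined earlier.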
- Correction Generation:
  - Lateral Correction: The pixel offset with the maximum correlation is converted to a rotational correction ( $\Delta \theta$ ) to adjust the robot's heading.
  - Along-Path Correction: A weighted average of correlation scores across the search space estimates the robot's position relative to the target, adjusting the distance to the next goal.
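
The matching and correction steps above can be sketched as follows. This is a hedged illustration under stated assumptions rather than the paper's code: the function names, the circular (wrap-around) correlation, the linear pixel-to-angle model with a hypothetical horizontal field of view `hfov_rad`, and the convention that `frame_offsets` are frame indices relative to the target frame are all choices made for the example.

```python
import numpy as np

def cross_correlate_fft(teach, current):
    """Circular cross-correlation F^-1(F(teach) * conj(F(current)))."""
    spectrum = np.fft.rfft2(teach) * np.conj(np.fft.rfft2(current))
    return np.fft.irfft2(spectrum, s=teach.shape)

def best_horizontal_shift(teach, current):
    """Return (shift_px, peak_score).

    shift_px is the signed column roll that aligns `current` with `teach`.
    """
    corr = cross_correlate_fft(teach, current)
    _, dx = np.unravel_index(np.argmax(corr), corr.shape)
    width = teach.shape[1]
    if dx > width // 2:  # map wrap-around indices to negative shifts
        dx -= width
    return int(dx), float(corr.max())

def heading_correction(shift_px, width, hfov_rad):
    """Assumed linear pixel-to-angle model; hfov_rad is a hypothetical value."""
    return shift_px * hfov_rad / width

def along_path_offset(peak_scores, frame_offsets):
    """Correlation-weighted mean of frame offsets relative to the target frame."""
    w = np.asarray(peak_scores, dtype=float)
    return float(np.sum(w * np.asarray(frame_offsets)) / np.sum(w))

rng = np.random.default_rng(0)
teach = (rng.random((32, 64)) < 0.1).astype(float)  # sparse binary frame
current = np.roll(teach, 3, axis=1)  # simulate a 3-pixel horizontal shift
shift, peak = best_horizontal_shift(teach, current)
```

In this sketch the recovered `shift` is the roll that undoes the simulated displacement, and `peak` equals the number of active pixels in `teach`, since the two frames overlap perfectly at that lag.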

C. Computational Optimizations

To achieve real-time performance on consumer hardware, two specific optimizations are applied:

Event-Frame Compression: Since event frames are sparse (mostly zeros), a 1D summation kernel is applied to compress the image width before the FFT, significantly reducing the size of the arrays that must be transformed and multiplied.
Horizontal Concatenation: Instead of performing multiple FFTs for every frame in the search space, all teach-phase frames in the search window are concatenated horizontally into a single extended frame. A single FFT is performed on this combined frame, and individual correlation scores are extracted by cropping the result.
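
Both optimizations can be illustrated with a short sketch. It reflects one possible reading of the description, not the authors' implementation: `compress_width` assumes the 1D summation kernel sums groups of adjacent columns, and `match_against_strip` assumes the current frame is zero-padded to the width of the concatenated strip and that only zero vertical lag is kept. All names and parameters are hypothetical.

```python
import numpy as np

def compress_width(frame, factor):
    """Sum each group of `factor` adjacent columns (assumed form of the kernel)."""
    h, w = frame.shape
    w_trim = w - w % factor  # drop leftover columns, a simplification
    return frame[:, :w_trim].reshape(h, w_trim // factor, factor).sum(axis=2)

def match_against_strip(teach_frames, current, max_shift):
    """Correlate `current` against all teach frames using a single strip FFT.

    Returns an array of shape (n_frames, 2 * max_shift + 1); row j holds the
    scores of teach frame j for horizontal lags -max_shift..max_shift.
    """
    h, w = current.shape
    strip = np.hstack(teach_frames)            # shape (h, n_frames * w)
    padded = np.zeros(strip.shape, dtype=float)
    padded[:, :w] = current                    # zero-pad to the strip width
    spectrum = np.fft.rfft2(strip) * np.conj(np.fft.rfft2(padded))
    corr = np.fft.irfft2(spectrum, s=strip.shape)[0]  # zero vertical lag only
    lags = np.arange(-max_shift, max_shift + 1)
    starts = np.arange(len(teach_frames)) * w  # lag at which each frame begins
    # Crop the lags around each frame start; indices wrap at the strip ends.
    return corr[(starts[:, None] + lags[None, :]) % strip.shape[1]]

rng = np.random.default_rng(1)
teach_frames = [(rng.random((16, 48)) < 0.1).astype(float) for _ in range(3)]
current = np.zeros((16, 48))
current[:, 2:] = teach_frames[1][:, :-2]  # frame 1 shifted right by 2 px

scores = match_against_strip(teach_frames, current, max_shift=4)
best_frame, best_lag_idx = np.unravel_index(np.argmax(scores), scores.shape)
best_shift = int(best_lag_idx) - 4

compressed = [compress_width(f, 4) for f in teach_frames]  # 48 -> 12 columns
```

Here the strongest score falls in the row for teach frame 1 at a lag of -2 pixels, matching the simulated displacement. In a full pipeline the compressed frames would presumably be the ones concatenated and matched, which shrinks the strip before the FFT is taken.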

3. Key Contributions

First Event-Based VT&R System: The authors present the first implementation of a VT&R system specifically designed for event cameras on real-world ground robots, bridging the gap between event-based perception and trajectory following.
High-Speed Frequency-Domain Processing: They introduce an FFT-based correlation framework optimized for the sparse, binary nature of event data. This achieves a processing latency of 2.88 ms (approx. 347 Hz), which is ~3.5x faster than optimized conventional camera baselines.
Extensive Field Validation: The system was tested over 3,000+ meters of indoor and outdoor trajectories (including night-time) on an AgileX Scout Mini robot with a Prophesee EVK4 HD camera.

4. Experimental Results

The system was evaluated against three baselines: an odometry-only controller and two conventional frame-based VT&R approaches (Dall'Osto et al. and Nourizadeh et al.).

Success Rate: The proposed system achieved a 100% success rate (18/18 trials) across all trajectories. In contrast, the odometry-only baseline failed in every trial (0/18), often due to drift causing collisions or falling off curbs.
Accuracy (Cross-Track Error - XTE):
- Indoor: Mean XTE of 8.04 cm.
- Outdoor: Mean XTE of 9.87 cm.
- Night-time: Maintained a mean XTE of 11.07 cm with 100% success, demonstrating robustness in low-light where standard cameras struggle.
- These results are comparable to or better than conventional camera-based baselines, despite the event camera's sparsity.
Velocity Invariance: Experiments showed that using fixed-event count binning allows the system to successfully repeat trajectories at different speeds (e.g., teaching at 0.33 m/s and repeating at 1.00 m/s), whereas fixed-time binning failed due to appearance divergence.
Latency: The total processing time was 2.88 ms, compared to significantly higher latencies for frame-based methods (e.g., 13.31 ms for normalized cross-correlation (NCC) matching in the baselines).
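
As a quick check on the latency figures, the achievable update rate is the reciprocal of the processing time per update. The second value below is only the ceiling implied by the quoted NCC matching time, assuming nothing else in that pipeline adds latency; the symbols $f_{\text{event}}$ and $f_{\text{NCC}}$ are introduced here for illustration.

$$
f_{\text{event}} = \frac{1}{2.88 \times 10^{-3}\ \text{s}} \approx 347\ \text{Hz},
\qquad
f_{\text{NCC}} = \frac{1}{13.31 \times 10^{-3}\ \text{s}} \approx 75\ \text{Hz}
$$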

5. Significance and Impact

Real-Time Viability: The paper demonstrates that event-based perception is not just theoretically attractive for high-speed applications but is practically viable for real-time navigation on standard hardware.
Energy Efficiency: By leveraging the low-power nature of event cameras and reducing computational load via FFT and compression, the system is suitable for energy-constrained mobile robots and drones.
Robustness: The system demonstrates superior performance in challenging conditions (low light, high dynamic range, varying speeds) where traditional cameras fail or require heavy preprocessing.
Future Research: The authors release the dataset and code, establishing a baseline for future neuromorphic navigation research. They suggest future work could integrate 3D structural understanding or multi-modal fusion to further enhance robustness in highly dynamic environments.

In summary, this work demonstrates that by combining event cameras with Fourier-domain cross-correlation and sparse data optimizations, robots can achieve high-frequency, low-latency, and highly accurate autonomous navigation that outperforms traditional frame-based systems in both speed and environmental robustness.