Imagine you are a scuba diver trying to take a series of photos of a specific coral reef over the course of several years. Your goal is to return to the exact same spot every time to see how the coral is growing or changing.
In the world of underwater robots (AUVs), this is incredibly hard. Unlike a hiker on a mountain who can check a GPS signal or sight a distinct peak, underwater robots get no GPS at all, because the signals don't penetrate seawater. They rely on expensive, finicky acoustic navigation systems whose position estimates drift and whose calibration can change between deployments. If the robot tries to return to a spot years later, it might end up 10 meters away, making it impossible to compare the "before" and "after" photos accurately.
This paper is like a new, super-precise map and a rulebook for helping these robots find their way home, even after years of wandering.
Here is the breakdown of their work using simple analogies:
1. The Problem: The "Drifting GPS"
Underwater robots usually navigate with sound waves (sonar) and dead reckoning, adding up their own motion to estimate where they are. But each measurement carries a tiny error, and those errors never cancel out; the position estimate drifts further off the longer the mission runs (the sketch below shows how quickly this adds up). On top of that, the equipment can get bumped or recalibrated differently each time.
- The Analogy: Imagine trying to meet a friend at a park, but your GPS is off by a few meters. If you are walking on flat grass, it doesn't matter much. But if you are walking on a jagged cliffside, being "close" might mean you are standing on a ledge while your friend is in a cave. You aren't really at the same spot.
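To see why drift is such a menace, here is a minimal, purely illustrative sketch. The noise level and mission length are invented, not taken from the paper: the point is just that a robot integrating nearly-perfect velocity readings still ends up meters off course after an hour.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative numbers only, not from the paper.
true_velocity = np.array([0.5, 0.0])  # m/s: the robot swims straight east
dt = 1.0                              # seconds between navigation updates
n_steps = 3600                        # one hour of mission time

# Dead reckoning: integrate velocity readings, each with a tiny error.
position_estimate = np.zeros(2)
for _ in range(n_steps):
    measured = true_velocity + rng.normal(0.0, 0.02, size=2)  # noisy reading
    position_estimate += measured * dt

true_position = true_velocity * dt * n_steps
drift = np.linalg.norm(position_estimate - true_position)
print(f"Position drift after one hour: {drift:.1f} m")  # typically a meter or two
```

And that is just one hour. Across surveys separated by years, with different hardware and calibrations, the mismatch only grows.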
2. The Solution: A New "Photo Album" (The Dataset)
The authors created a massive, organized library of underwater photos.
- What's in it: They took high-definition photos of five different underwater "neighborhoods" (some with dense coral, some with soft sand, some with boulders) over a period of up to six years.
- Why it's special: Most underwater datasets are one-time snapshots; this one is like a time-lapse movie. The authors also color-corrected the images (water absorbs red light, so raw underwater photos come out blue-green and murky), letting the robots see the true colors of the reef.
- The Goal: This gives researchers a "test track" to check whether new robot software can actually recognize these places after years have passed (a toy sketch of indexing such a dataset follows this list).
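To make the "test track" idea concrete, here is a hedged sketch of how one might index such a dataset for cross-year experiments. The folder layout (site/year/image) is entirely hypothetical; the paper does not prescribe one.

```python
from collections import defaultdict
from pathlib import Path

def index_surveys(root: str) -> dict:
    """Group images by site and survey year, e.g. for asking:
    'given a 2024 photo of site X, find its 2018 counterpart.'

    Assumes a hypothetical layout: root/<site>/<year>/<image>.png
    """
    index = defaultdict(lambda: defaultdict(list))
    for img in Path(root).rglob("*.png"):
        site, year = img.parts[-3], img.parts[-2]
        index[site][year].append(img)
    return index

# surveys = index_surveys("reef_dataset")
# queries  = surveys["site_boulders"]["2024"]
# database = surveys["site_boulders"]["2018"]
```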
3. The New Rulebook: The "Footprint" vs. The "Radius"
This is the cleverest part of the paper. To test whether a robot found the right spot, you need a rigorous way to say, "Yes, that photo shows the same place."
- The Old Way (The Radius): "If the robot is within 2 meters of the target, it's a match."
- The Flaw: In the ocean, the seafloor isn't flat. If the robot flies high over a flat sandbar, a 2-meter radius is fine. But if the robot flies over a jagged rock wall, being 2 meters away horizontally might mean it's looking at a completely different part of the wall (or a different wall entirely).
- The New Way (The Footprint): Instead of measuring distance, they measure overlap (sketched in code after this list).
- The Analogy: Imagine dropping a shadow (a footprint) from the robot's camera onto the seafloor. If the shadow from the new photo overlaps with the shadow from the old photo, then they are looking at the same patch of ground.
- Why it matters: This accounts for hills, valleys, and the robot's altitude. It ensures the robot isn't just "nearby," but actually looking at the same visual content.
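Here is a minimal sketch of the difference between the two tests, using the shapely geometry library. For simplicity the footprints are given directly as polygons; in practice they would be computed from the camera's pose, altitude, and the 3D terrain. Intersection-over-union is one natural overlap score, but the paper's exact criterion may differ.

```python
import math
from shapely.geometry import Polygon

def radius_match(pos_a, pos_b, threshold_m=2.0):
    """The old way: a 'match' whenever two camera positions are close."""
    return math.dist(pos_a, pos_b) <= threshold_m

def footprint_overlap(corners_a, corners_b):
    """The new way (sketch): intersection-over-union of the two camera
    footprints projected onto the seafloor. 0 = disjoint, 1 = same patch."""
    a, b = Polygon(corners_a), Polygon(corners_b)
    if not a.intersects(b):
        return 0.0
    return a.intersection(b).area / a.union(b).area

# Two passes over a steep wall: the cameras are only ~1.4 m apart,
# so the radius test says "same place," yet their footprints barely touch.
fp_2018 = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
fp_2024 = [(1.8, 1.8), (3.8, 1.8), (3.8, 3.8), (1.8, 3.8)]
print(radius_match((1.0, 1.0), (2.0, 2.0)))   # True: "close enough"
print(footprint_overlap(fp_2018, fp_2024))    # ~0.005: almost no shared content
```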
4. The Test: Can the Robots "Remember"?
The authors took eight of the best "Visual Place Recognition" (VPR) AI models currently available and tested them on this new dataset.
- The Result: The models struggled. Their success rates were much lower than on land benchmarks or in simpler underwater tests.
- The Lesson: The ocean is a much harder place to navigate than a city street. The coral grows, the sand shifts, and the lighting changes. AI models that are great at recognizing streets and landmarks on land get confused by the dynamic underwater world.
- The Winner: A model called MegaLoc (built on an AI architecture called a Vision Transformer) performed best, but even it only got about 20-50% of the spots right, depending on the terrain. (A toy version of this evaluation is sketched below.)
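For readers who want to see what "got the spot right" means in code, here is a toy version of a VPR benchmark loop. The descriptor embeddings, cosine-similarity retrieval, and Recall@1 metric are standard VPR ingredients, but they are assumptions here rather than a transcription of the paper's pipeline; the ground-truth matrix would come from a footprint-overlap test like the one above.

```python
import numpy as np

def recall_at_1(query_desc, db_desc, is_same_place):
    """Toy VPR evaluation.

    query_desc:    (Q, D) L2-normalized descriptors of the new survey's photos
    db_desc:       (N, D) L2-normalized descriptors of the old survey's photos
    is_same_place: (Q, N) boolean ground truth, e.g. footprint overlap
                   above some threshold

    For each query, retrieve the single most similar database image and
    count a hit only if ground truth agrees it shows the same place.
    """
    similarity = query_desc @ db_desc.T   # cosine similarity (unit vectors)
    top1 = similarity.argmax(axis=1)      # best database match per query
    hits = is_same_place[np.arange(len(top1)), top1]
    return float(hits.mean())
```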
5. The Big Takeaway
This paper is a wake-up call and a toolkit for the future.
- We need better ground truth: We can't just rely on "being close" to a location. We need to know if the robot is actually looking at the same rocks and coral.
- The ocean is tough: Long-term underwater monitoring is harder than we thought. Robots need to be much smarter to handle the changing seasons and years of growth.
- The "Footprint" method is the future: By using 3D shadows (footprints) to define "the same place," we can stop fooling ourselves with false positives.
In a nutshell: The authors built a "Time-Traveling Photo Album" of the ocean floor and a new "Shadow-Check" system to prove that robots are actually looking at the same spot. They found that current robots are still a bit lost in the deep, but now we have the tools to teach them how to find their way home.