Imagine you are dropped into a strange city with your eyes closed. You can't see the Eiffel Tower or the Statue of Liberty. But you can hear the city. You hear a specific type of double-decker bus, the distinct ring of a church bell, and a bird that only sings in London. Even without seeing a thing, your brain can guess, "I'm probably in London."
This paper is about teaching computers to do exactly that, but for videos. It's called Audiovisual Geolocation.
Here is the story of how the researchers built a "super-sleuth" AI to solve this mystery, explained in simple terms.
The Problem: The "Blind Spot" of Current AI
Right now, if you ask a computer to find where a video was taken, it usually just looks at the picture.
- The Visual Problem: A park in New York looks a lot like a park in London. Trees, grass, and benches are everywhere. If the AI only looks, it gets confused and guesses wrong.
- The Audio Problem: If the AI only listens, it's even harder. A city sounds like a messy mix of traffic, sirens, and people talking. It's like trying to pick out a single instrument in a rock concert while wearing earplugs.
The researchers realized that to solve this, the AI needs to be a detective that uses both its eyes and ears, but it needs to understand the clues in a very specific way.
The Solution: A Three-Step Detective Process
The team built a system with three distinct stages, like a detective solving a case:
Step 1: The "Sound De-Mixer" (Perception)
Imagine you are handed a smoothie. It's a mix of strawberries, bananas, and milk. If you just taste it, you can't tell exactly how much of each fruit is in there.
- What the AI does: The researchers created a special tool (called an IC-SAE) that acts like a magical blender in reverse. It takes the messy, noisy audio of a video and "de-mixes" it into individual ingredients.
- The Result: Instead of hearing "noisy city," the AI hears: "1. A siren," "2. A specific bird call," "3. The hum of a subway." These are called "Acoustic Atoms." It's like separating the smoothie back into its individual ingredients, so you know exactly what went into it.
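For the curious, here is a tiny code sketch of the general idea, assuming the IC-SAE behaves roughly like a standard sparse autoencoder with a "keep only the strongest atoms" rule. All names, sizes, and the top-k rule below are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SparseAudioAutoencoder(nn.Module):
    """Toy sparse autoencoder: encodes a dense audio embedding into a
    large, mostly-zero vector of 'acoustic atoms', then reconstructs it."""

    def __init__(self, embed_dim=512, num_atoms=4096, top_k=16):
        super().__init__()
        self.encoder = nn.Linear(embed_dim, num_atoms)
        self.decoder = nn.Linear(num_atoms, embed_dim)
        self.top_k = top_k  # keep only the k strongest atoms per clip

    def forward(self, audio_embedding):
        # Project the mixed audio embedding onto a large dictionary of atoms.
        activations = torch.relu(self.encoder(audio_embedding))
        # Sparsity: zero out everything except the top-k strongest activations,
        # so each clip is described by a handful of "ingredients".
        values, indices = activations.topk(self.top_k, dim=-1)
        sparse = torch.zeros_like(activations).scatter(-1, indices, values)
        # Reconstruct the original embedding from the sparse code.
        reconstruction = self.decoder(sparse)
        return sparse, reconstruction

# Usage: one 512-dim embedding of a noisy street clip -> 4096-dim sparse code.
model = SparseAudioAutoencoder()
clip_embedding = torch.randn(1, 512)
atoms, recon = model(clip_embedding)
print(atoms.nonzero().shape)  # only ~16 active atoms, e.g. "siren", "bird call"
```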
Step 2: The "Sherlock Holmes" (Reasoning)
Now the AI has the visual clues (a brick building) and the audio clues (a specific bird and a siren). But it needs to put the puzzle together.
- What the AI does: They used a powerful AI brain (a Large Language Model) trained to act like a detective. It looks at the "Acoustic Atoms" and the picture and asks: "Okay, I see a brick building. I hear a siren that sounds European. I hear a robin. Where in the world do these three things exist together?"
- The Trick: They taught this AI brain to be very careful. If it's not sure, it shouldn't guess wildly; it should admit, "I'm not 100% sure, but it's likely this area." This prevents the AI from making up fake facts.
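Here is a toy sketch of what handing those clues to an LLM might look like. The prompt wording and the commented-out call_llm() helper are hypothetical stand-ins, not the paper's actual prompts or model.

```python
def build_geolocation_prompt(visual_clues, acoustic_atoms):
    """Combine visual and audio clues into a single detective-style question."""
    clues = "\n".join(
        [f"- Seen: {c}" for c in visual_clues]
        + [f"- Heard: {a}" for a in acoustic_atoms]
    )
    return (
        "You are a careful geolocation analyst.\n"
        f"Clues from one video clip:\n{clues}\n"
        "Where was this most likely filmed? If you are unsure, say so and "
        "give a region rather than an exact city. Do not invent details."
    )

prompt = build_geolocation_prompt(
    visual_clues=["brick terraced houses", "narrow street"],
    acoustic_atoms=["two-tone European siren", "robin song"],
)
print(prompt)
# answer = call_llm(prompt)  # hypothetical helper wrapping whatever LLM you use
```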
Step 3: The "Globe Spinner" (Prediction)
Finally, the AI has to point to a spot on the map.
- The Problem: The Earth is a sphere (a ball), but computers usually think in flat squares (like a piece of paper). If you try to draw a map on a flat piece of paper, countries get stretched and distorted (like Greenland looking huge on some maps).
- The Solution: The researchers used a special math technique called Riemannian Flow Matching. Think of it as a GPS that understands the Earth is a ball. Instead of guessing a flat coordinate, it learns where the location probably is directly on the curved surface of the Earth, so the predictions never get warped by map distortion.
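To make the "GPS on a ball" idea concrete, here is a small sketch of the key ingredient: moving between two places along the great circle of the sphere rather than a straight line on a flat map. This is a simplified illustration with helper names of my own, not the full Riemannian Flow Matching training recipe.

```python
import numpy as np

def latlon_to_unit_vector(lat_deg, lon_deg):
    """Convert latitude/longitude (degrees) to a 3-D point on the unit sphere."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def geodesic_point(x0, x1, t):
    """Point a fraction t of the way from x0 to x1 along the great circle
    (spherical linear interpolation). A flow on the sphere follows this
    curved path instead of a straight line that would cut through the Earth."""
    omega = np.arccos(np.clip(np.dot(x0, x1), -1.0, 1.0))  # angle between points
    if omega < 1e-8:  # nearly identical points
        return x0
    return (np.sin((1 - t) * omega) * x0 + np.sin(t * omega) * x1) / np.sin(omega)

# Usage: halfway along the great circle from London to New York.
london = latlon_to_unit_vector(51.5, -0.13)
new_york = latlon_to_unit_vector(40.7, -74.0)
midpoint = geodesic_point(london, new_york, 0.5)
print(np.linalg.norm(midpoint))  # ~1.0: still a valid point on the sphere
```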
The New "Case File" (The Dataset)
To train this detective, the researchers couldn't just use random YouTube videos, because many have added background music or voiceovers that confuse the AI.
- They built a massive new library called AVG (Audiovisual Geolocation).
- It contains 20,000 video clips from 1,000 different places around the world.
- They were very strict: they only kept videos where the sound you hear matches exactly what you see (no background music, no narrators). This is the "training ground" where the AI learned to be a pro.
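As a rough illustration of that kind of strictness, here is a toy filtering rule that keeps a clip only if an audio tagger finds no music or narration. The tag names, scores, and threshold are made up for the example; the paper's actual curation pipeline may differ.

```python
# Hypothetical filtering sketch; the (tag, confidence) pairs stand in for the
# output of whatever audio tagger a curation pipeline might use.

UNWANTED_TAGS = {"music", "speech", "narration", "voice-over"}

def keep_clip(audio_tags, min_ambient_score=0.5):
    """Keep a clip only if it sounds like the real scene:
    no added music or narration, and enough genuine ambient sound."""
    if any(tag in UNWANTED_TAGS for tag, _ in audio_tags):
        return False
    ambient_score = max((score for tag, score in audio_tags
                         if tag not in UNWANTED_TAGS), default=0.0)
    return ambient_score >= min_ambient_score

# Usage with made-up tagger output for two clips.
clip_a = [("traffic", 0.8), ("siren", 0.6)]   # real street sounds -> keep
clip_b = [("music", 0.9), ("traffic", 0.4)]   # added soundtrack -> discard
print(keep_clip(clip_a), keep_clip(clip_b))   # True False
```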
The Results: Why It Matters
When they tested their new detective against old methods:
- Old AI (Eyes only): Got confused by similar-looking parks.
- Old AI (Ears only): Got lost in the noise.
- New AI (Eyes + Ears): Solved the mystery much better.
The Big Takeaway:
The paper proves that sound is a secret superpower for finding places. Even when a place looks generic (like a forest or a city street), the sound of that place is unique. By teaching computers to "unmix" sounds and reason about them, we can pinpoint locations with incredible accuracy, even in places where cameras alone fail.
In a nutshell: They taught a computer to listen to the "soul" of a place, not just look at its "face," to find out exactly where it is on the globe.