Imagine you are at a crowded, noisy party. There are dozens of people talking, music playing, and glasses clinking. If you close your eyes, you can still tell who is speaking because you recognize their voice. If you open your eyes, you can point to exactly who is making that sound.
Now, imagine teaching a computer to do the same thing with a video. This is the challenge of Audio-Visual Instance Segmentation (AVIS). The computer needs to not only "see" a person or object but also "hear" them, and then draw a perfect outline around only the things making noise, ignoring the silent ones.
The paper introduces a new AI system called SeaVIS (Sound-Enhanced Association for Online Audio-Visual Instance Segmentation). Here is how it works, explained simply:
1. The Problem: The "Time Travel" Mistake
Before SeaVIS, most AI models worked like a movie critic who watches the entire movie before writing a review. They look at the whole video at once (offline).
- The Flaw: In the real world, videos stream in real-time (like a live Zoom call or a security camera). You can't wait for the future to understand the present.
- The Confusion: Even if a model could watch in real-time, it often gets confused. If a dog barks, then stops, then barks again, a standard camera-only AI might think the dog disappeared and reappeared as a new dog. It doesn't understand that the silence is part of the same dog's story.
2. The Solution: SeaVIS
SeaVIS is the first system designed to watch a video frame-by-frame (like a human watching a live stream) while remembering everything it has heard so far. It does this with two key ideas:
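Before getting into the two ideas, here is the streaming pattern the paragraph describes, as a toy sketch. `run_online` and `segment_frame` are hypothetical names for illustration, not the authors' code; the point is the buffer discipline: each frame is processed with only the audio heard up to that moment.

```python
def run_online(stream, segment_frame):
    """stream yields (frame, audio_chunk) pairs as they arrive in real time."""
    audio_history = []  # everything heard so far -- never the future
    masks = []
    for frame, audio_chunk in stream:
        audio_history.append(audio_chunk)
        # The model sees the current frame plus only past-and-present audio.
        masks.append(segment_frame(frame, audio_history))
    return masks

# Stand-in "model" that just reports how much audio it could hear:
demo = [("frame0", "a0"), ("frame1", "a1"), ("frame2", "a2")]
out = run_online(demo, lambda frame, history: len(history))
```

Each frame's result depends on a strictly growing audio buffer, which is what makes the method usable on a live stream.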
A. The "Time-Traveling Ear" (Causal Cross-Attention Fusion)
Imagine you are trying to identify a friend in a crowd. You see them, but you aren't 100% sure. Then, you hear their unique laugh from a few seconds ago. That laugh helps you confirm, "Yes, that's them!"
- How it works: SeaVIS has a special module called CCAF. It looks at the current video frame but also listens to the entire history of audio that happened before it.
- The Catch: It respects "causality." It can listen to the past, but it cannot peek into the future. This ensures it works in real-time, just like a human. It uses the audio history to sharpen the visual picture of the current moment.
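The mechanism behind this is standard causal masking applied to cross-attention. The sketch below is a minimal NumPy illustration of the idea, not the paper's actual CCAF module: the dimensions, the single-head attention, and the function names are all assumptions. Frame t's visual query is allowed to attend only to audio features from frames 0 through t.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def causal_cross_attention(visual, audio):
    """visual: (T, D) per-frame visual queries;
       audio:  (T, D) per-frame audio features.
    Frame t may attend only to audio[0..t] -- no peeking into the future."""
    T, D = visual.shape
    scores = visual @ audio.T / np.sqrt(D)            # (T, T) attention scores
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[future] = -np.inf                          # block future audio
    weights = softmax(scores, axis=-1)
    fused = weights @ audio                           # audio-conditioned visual
    return fused, weights

rng = np.random.default_rng(0)
vis, aud = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
fused, w = causal_cross_attention(vis, aud)
# Frame 0 has heard only audio frame 0, so all its attention lands there:
assert np.allclose(w[0], [1, 0, 0, 0, 0])
```

The upper-triangular mask is the whole "causality" trick: the attention matrix is forced to be lower triangular, so the fused feature for any frame is computable the moment that frame arrives.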
B. The "Voice-Activated ID Card" (Audio-Guided Contrastive Learning)
This is the cleverest part. Standard AI tracks objects by how they look (a red car looks like a red car). But in a video, a car might be silent for a while. If the AI only looks at the car, it might lose track of it or confuse it with another silent car.
SeaVIS gives every object a "Voice ID Card."
- The Training: The AI is taught a special rule: "If an object is making noise, its ID card should look very different from when it is silent."
- The Result: When the AI sees a person who is not talking, it checks their ID card. The card says, "I am silent right now." The AI then thinks, "Okay, you are just background scenery," and ignores them. But the moment they start talking, their ID card changes, and the AI says, "Ah! You are the main character now!" and starts tracking them.
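The "ID card" rule above can be written as a contrastive loss. The sketch below is a generic margin-based contrastive objective, not the authors' exact loss function; the function name, margin value, and cosine-similarity formulation are assumptions. Embeddings of objects in the same state (both sounding, or both silent) are pulled together, while sounding/silent pairs are pushed apart.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def audio_guided_contrastive_loss(embeds, sounding, margin=0.5):
    """embeds:   (N, D) instance embeddings for one frame;
       sounding: (N,) bool flags, True if the object is audible right now.
    Same-state pairs are pulled together; cross-state pairs are pushed
    apart until their cosine similarity drops below 1 - margin."""
    z = l2norm(embeds)
    sim = z @ z.T                                     # (N, N) cosine similarities
    eye = np.eye(len(z), dtype=bool)
    same = (sounding[:, None] == sounding[None, :]) & ~eye
    diff = sounding[:, None] != sounding[None, :]     # diagonal never True here
    attract = (1.0 - sim)[same].sum()                 # pull same-state pairs close
    repel = np.maximum(0.0, sim - (1.0 - margin))[diff].sum()
    return (attract + repel) / max(same.sum() + diff.sum(), 1)
```

Trained this way, a silent object and a sounding object can never share an embedding, which is exactly what lets the tracker re-identify the same dog across bark/silence/bark without minting a "new dog."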
3. Why This Matters (The "Party" Analogy)
Think of the old methods as a security guard who takes a photo of the whole party, waits until the party is over, and then tries to figure out who was talking. They are slow and can't handle a live stream.
SeaVIS is like a super-human security guard who:
- Watches the room live, second by second.
- Listens to the conversation history to know who is speaking right now.
- Instantly draws a circle around the speaker and ignores the silent people in the background.
- Does all this fast enough to keep up with a live broadcast.
4. The Results
The researchers tested SeaVIS on a massive dataset of videos with music, speech, machines, and animals.
- Accuracy: It was better at finding the right sound sources and drawing the right outlines than any previous system.
- Speed: It works fast enough for real-time applications (like self-driving cars needing to know which pedestrian is shouting, or robots interacting with humans).
Summary
SeaVIS is a new AI that finally bridges the gap between "seeing" and "hearing" in real-time. It doesn't just look at a video; it listens to the story of the sound to understand exactly who is making the noise and when, filtering out the silence to focus on the action. It's like giving a computer the ability to "listen with its eyes."