Open-Vocabulary Domain Generalization in Urban-Scene Segmentation

This paper introduces Open-Vocabulary Domain Generalization in Semantic Segmentation (OVDG-SS), a new setting and benchmark for autonomous driving that tackles unseen domains and unseen categories at the same time. It also proposes S2-Corr, a state-space-driven mechanism that refines text-image correlations in Vision-Language Models, yielding robust performance across diverse urban environments.

Dong Zhao, Qi Zang, Nan Pu, Wenjing Li, Nicu Sebe, Zhun Zhong

Published 2026-03-10

Imagine you are teaching a robot driver how to navigate a city.

The Old Way: The "Strict Student"
Traditionally, we taught robots by showing them thousands of pictures of sunny days with clear roads, cars, and people. The robot learned to recognize these specific things.

  • The Problem: If you suddenly put this robot in a heavy rainstorm, a dark tunnel, or a construction zone with weird new objects (like a giant umbrella or a police barrier), the robot panics. It says, "I don't know what that is!" or "Is that a car? No, wait, it's raining!" It fails because it was only taught a fixed list of rules for a fixed list of things.

The New Idea: The "Open-Minded Explorer"
This paper introduces a new way of thinking called OVDG-SS (Open-Vocabulary Domain Generalization in Semantic Segmentation). Think of this as training the robot to be an explorer rather than a student.

  • Open Vocabulary: Instead of memorizing a list of 10 things, the robot learns to understand concepts. If you tell it, "Look for a 'traffic cone'," it knows what a cone is, even if it's never seen one before. It uses a "dictionary" (text descriptions) to understand the world.
  • Domain Generalization: The robot learns to recognize these concepts even when the "weather" changes. It shouldn't matter if the sun is shining, it's snowing, or the camera is blurry; the robot should still know what a "road" or a "person" is.
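The "dictionary" idea above can be sketched as a CLIP-style similarity check: each image region is matched to whichever text description it resembles most, so new classes can be added just by adding new words. This is a minimal illustrative sketch, not the paper's implementation; the feature vectors and class names below are made up, and real systems would get both from a pretrained Vision-Language Model.

```python
import numpy as np

def open_vocab_classify(pixel_feats, text_embs, class_names):
    """Assign each pixel feature the class whose text embedding matches best.

    pixel_feats: (N, D) image features; text_embs: (C, D) text features.
    Hypothetical sketch -- real systems use a pretrained VLM such as CLIP.
    """
    # Normalize both sides so the dot product becomes cosine similarity.
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sim = p @ t.T  # (N, C) text-image correlation map
    return [class_names[i] for i in sim.argmax(axis=1)]

# Toy example: 2-D features, three "classes" described only by text.
names = ["road", "person", "traffic cone"]
texts = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
pixels = np.array([[0.9, 0.1], [0.1, 0.8]])
print(open_vocab_classify(pixels, texts, names))  # → ['road', 'person']
```

Because the classes live in the text embeddings rather than in the network's output layer, adding "traffic cone" to the vocabulary needs no retraining, only a new row in `texts`.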

The Big Challenge: The "Noisy Radio"
The researchers found a major problem with current "smart" robots. They use a powerful tool (like a super-intelligent library called a Vision-Language Model) to connect pictures to words.

  • The Analogy: Imagine trying to listen to a radio station (the word "Road") while driving through a tunnel (a new environment). The tunnel causes static and noise. The signal gets distorted. The robot hears "Road" but the static makes it think it might be "Grass" or "Sky."
  • In technical terms, when the environment changes (lighting, weather, location), the connection between the image and the text gets "noisy" and confused. The robot starts seeing things that aren't there or missing things that are.

The Solution: S2-Corr (The "Signal Cleaner")
To fix this, the authors built a new module called S2-Corr. Think of it as a high-tech noise-canceling headphone for the robot's brain.

Here is how it works, using a simple metaphor:

  1. The Snake Scan (The Path):
    Imagine the robot is reading the image as one long line of text. Old methods read it rigidly: left-to-right, top-to-bottom. If there's a typo (noise) at the start, the robot stays confused for the rest of the sentence.

    • S2-Corr reads the text like a snake. It slithers back and forth (zig-zag). This helps it keep the context of the "neighborhood" (spatial structure) intact. If it sees a weird noise in one spot, it doesn't let that noise ruin the whole sentence because it keeps checking its surroundings.
  2. The Decay Gate (The Filter):
    Imagine the robot is remembering a story. Sometimes, old memories (from the training data) are wrong for the current situation.

    • S2-Corr has a special "forgetting gate." If a piece of information is too old or too noisy (like a static-filled memory of a sunny day while it's currently raining), the gate says, "Let that go." It filters out the bad data so only the clear, relevant information passes through.
  3. The Contextual Hint (The Translator):
    Before the robot tries to understand the image, S2-Corr gives it a little hint based on the current weather.

    • Analogy: If it's raining, the robot gets a note saying, "Hey, remember, things look darker and wetter today." This helps the robot adjust its expectations so it doesn't get confused by the rain.
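The first two ideas above can be sketched in a few lines: a zig-zag (snake) traversal that keeps spatial neighbours adjacent in the 1-D sequence, and a causal scan with a forgetting gate that lets stale or noisy context decay. This is an illustrative, Mamba-style state-space sketch under assumed toy shapes, not the paper's exact S2-Corr module; `snake_order`, `gated_scan`, and the `decay` value are my own names and choices.

```python
import numpy as np

def snake_order(h, w):
    """Zig-zag (boustrophedon) scan: reverse every other row so that
    spatial neighbours in the 2-D map stay adjacent in the 1-D sequence."""
    idx = np.arange(h * w).reshape(h, w)
    idx[1::2] = idx[1::2, ::-1]  # flip odd rows
    return idx.ravel()

def gated_scan(seq, decay=0.5):
    """Causal scan with a forgetting gate: state = decay*state + (1-decay)*x.
    Old (possibly wrong-for-this-weather) context fades exponentially
    instead of dominating the rest of the sequence."""
    state, out = 0.0, []
    for x in seq:
        state = decay * state + (1.0 - decay) * x
        out.append(state)
    return out

# A 2x3 feature map flattened in snake order.
order = snake_order(2, 3)
print(order.tolist())  # → [0, 1, 2, 5, 4, 3]

# One noisy spike in the map; the gate smooths it rather than letting
# it corrupt every later position.
feats = np.array([1.0, 1.0, 1.0, 9.0, 1.0, 1.0])
smoothed = gated_scan(feats[order])
```

The third idea, the contextual hint, would correspond to conditioning this scan on a summary of the current scene (e.g. adding a scene-level bias to each `x`), which adjusts the robot's expectations before it reads the sequence.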

The Result
By using this "Signal Cleaner," the robot can now:

  • Drive in the rain, snow, or at night.
  • Recognize new things it has never seen before (like a construction barrier or a stray dog) just by reading the name.
  • Do all this faster and with less computer power than previous methods.

In Summary
This paper is about teaching self-driving cars to be adaptable. Instead of just memorizing a map of a sunny city, they are learning to understand the concept of a city, no matter the weather or the strange new objects they encounter. They did this by building a smarter "noise filter" that keeps the robot's vision clear even when the world gets messy.