Imagine you have a super-smart but slightly clumsy photographer (the AI model) who has spent their whole life taking photos of sunny beaches and open fields. They are amazing at recognizing objects in those settings.
Now, you drop this photographer into a dark, cluttered living room (the new environment). Suddenly, they get confused. The lighting is weird, objects are hidden behind furniture, and the angles are strange. Their photos come out blurry or they miss the objects entirely.
The Old Way (Fine-Tuning):
Usually, to fix this, you would hire a tutor to retrain the photographer. You'd show them thousands of photos of living rooms and teach them how to spot a couch in the dark.
- The Problem: This is expensive, takes a long time, and often makes the photographer forget how to take great beach photos (a problem called "catastrophic forgetting"). Plus, you need a human to label every single photo with "this is a couch," which is a huge chore.
The New Way (Sea2 - See, Act, Adapt):
The authors of this paper propose a brilliant twist: Don't retrain the photographer. Instead, hire a smart guide.
The Cast of Characters
- The Photographer (Frozen Perception Model): This is the pre-trained AI. It stays exactly the same. It doesn't learn anything new. It just does what it's good at: looking at an image and saying, "I think that's a couch."
- The Guide (The VLM Agent): This is a Vision-Language Model (like a super-smart robot brain) that acts as the photographer's body. It holds the camera and decides where to stand.
- The Feedback Loop: The guide doesn't need a human to say "Good job!" or "Bad job!" It just listens to the photographer's confidence. If the photographer says, "I'm 90% sure that's a couch," the guide knows, "Great, I'm in the right spot!" If the photographer says, "I have no idea," the guide knows, "Okay, I need to move."
How It Works: The "See, Act, Adapt" Dance
Imagine you are trying to find a specific toy hidden in a messy room, but you can only see through a small hole in a box.
- See: The guide looks at the current view. The photographer says, "I see something small and blurry, but I'm not sure."
- Act: The guide thinks, "That object looks too far away and blocked by a chair. I need to move closer and to the left." It takes a step.
- Adapt: The guide looks again. The photographer now says, "Ah! That's definitely a red toy car! I'm very confident!" The guide stops moving.
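The dance above can be sketched as a simple control loop. Everything in this sketch is illustrative: the names (`perceive`, `choose_action`, `execute`), the confidence threshold, and the step budget are hypothetical stand-ins, not the paper's actual API. The key structural point is that the perception model is frozen and only its confidence drives the agent's movement.

```python
# Illustrative sketch of the See-Act-Adapt loop.
# All names here are hypothetical stand-ins, not the paper's API.

def see_act_adapt(frozen_model, agent, env, target, max_steps=20, threshold=0.9):
    """Move the camera until the frozen perception model is confident."""
    view = env.current_view()
    for _ in range(max_steps):
        # See: the frozen model reports a prediction and its confidence.
        prediction, confidence = frozen_model.perceive(view, target)
        if confidence >= threshold:
            return prediction          # Adapt: good viewpoint found, stop.
        # Act: the VLM agent picks a camera move based on the current view.
        action = agent.choose_action(view, target, confidence)
        view = env.execute(action)     # e.g. "move closer", "turn left"
    return prediction                  # best effort once the step budget runs out
```

Note that no ground-truth label ever enters the loop: the stopping condition is purely the frozen model's own confidence.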
The Magic Sauce:
The guide learns this dance using a two-step process:
- Step 1: The Training Wheels (Supervised Fine-Tuning): First, the guide learns basic rules from a human teacher. "If you can't see the object, turn until you find it. If it's off-center, move the camera to the middle. If it's too small, walk closer." This gives the guide a basic sense of direction.
- Step 2: The Playground (Unsupervised Reinforcement Learning): Now, the guide is on its own in the messy room. It tries different moves. Every time the photographer gets more confident, the guide gets a "virtual high-five" (a reward). Every time the photographer gets less confident, the guide gets a "virtual frown." The guide learns to maximize those high-fives without ever needing a human to tell it what the object actually is.
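The "virtual high-five" in Step 2 can be made concrete with a tiny sketch. The exact reward shaping in the paper may differ; this just shows the core idea, which is that the reward is the change in the frozen model's confidence, so no human labels are needed anywhere in the signal.

```python
# Illustrative sketch of a confidence-delta reward for Step 2.
# The paper's actual reward shaping may differ; names are my own.

def confidence_reward(conf_before, conf_after):
    """Positive when the move made the frozen model more confident,
    negative when it made the model less confident."""
    return conf_after - conf_before

def episode_return(confidences, gamma=1.0):
    """Discounted sum of per-step confidence deltas over one episode.
    With gamma=1 this telescopes to final - initial confidence: the agent
    is rewarded purely for how much it improved the photographer's view."""
    total, discount = 0.0, 1.0
    for before, after in zip(confidences, confidences[1:]):
        total += discount * (after - before)
        discount *= gamma
    return total
```

With `gamma=1.0`, an episode whose confidence goes 0.2 → 0.5 → 0.9 earns the same return as any other path from 0.2 to 0.9, which is exactly the "maximize high-fives" behavior described above.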
Why Is This a Big Deal?
- No New Labels Needed: You don't need to hire humans to draw boxes around objects in the new room. The system figures it out by itself just by asking the photographer, "Are you sure?"
- No Memory Loss: Since the photographer isn't being retrained, it doesn't forget how to recognize things in the beach photos. It keeps all its old knowledge.
- Plug-and-Play: You can swap out the photographer for a different one (e.g., one that's better at 3D shapes, another better at text) without changing the guide. The guide just learns to listen to the new photographer's confidence.
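The plug-and-play claim boils down to an interface contract. The `Perceiver` protocol below is my own illustration, not from the paper: any frozen model that can return a (prediction, confidence) pair slots in, because the guide only ever reads the confidence.

```python
# Illustrative interface sketch for swappable frozen perceivers.
# The protocol and function names are hypothetical, not the paper's API.
from typing import Protocol, Tuple

class Perceiver(Protocol):
    """Anything that looks at a view and reports a prediction plus a
    confidence score. Perceivers are interchangeable from the guide's
    point of view because it never inspects the prediction itself."""
    def perceive(self, view, target) -> Tuple[str, float]: ...

def guide_feedback(perceiver: Perceiver, view, target) -> float:
    """The only signal the guide consumes: the perceiver's confidence."""
    _, confidence = perceiver.perceive(view, target)
    return confidence
```

Swapping a 2D detector for a 3D-box estimator then means writing one new `perceive` method, with no change to the guide.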
The Results
The paper tested this on three tasks:
- Finding objects (Visual Grounding).
- Cutting out objects (Segmentation).
- Measuring 3D objects (3D Box Estimation).
In the cluttered indoor rooms, the "Guide + Photographer" team performed 13% to 27% better than the photographer standing still or moving randomly. It even beat a system that knew exactly where the objects were supposed to be (the "Shortest Path" baseline), suggesting that knowing where to look is just as important as knowing what to look for.
In short: Instead of trying to teach a smart AI to see better in new places, we just teach a smart robot to hold the camera in the perfect spot so the AI can do its best work. It's like realizing that sometimes, the best way to solve a problem isn't to change the expert, but to change their perspective.