Choose What to Observe: Task-Aware Semantic-Geometric Representations for Visuomotor Policy

This paper proposes a task-aware observation interface that canonicalizes raw RGB inputs into unified semantic-geometric representations using segmentation and depth injection, thereby significantly enhancing the robustness of visuomotor policies to out-of-distribution appearance shifts without requiring policy retraining.

Haoran Ding, Liang Ma, Yaxun Yang, Wen Yang, Tianyu Liu, Anqing Duan, Xiaodan Liang, Dezhen Song, Ivan Laptev, Yoshihiko Nakamura

Published Tue, 10 Ma

Imagine you are teaching a robot how to pick up a specific toy from a messy table. You show the robot a video of you doing it. The robot learns well, but here's the catch: the robot is a bit of a "snooper." It doesn't just look at the toy; it memorizes the color of the table, the pattern on the rug, and the lighting in the room.

If you move the robot to a new room with a different table color, the robot panics. It thinks, "Wait, this isn't the training video! I don't know what to do!" It freezes or grabs the wrong thing. This is what researchers call overfitting to "nuisance" visual factors.

This paper proposes a clever solution: Don't change the robot's brain; change what the robot is allowed to see.

The Core Idea: The "Magic Filter"

Instead of feeding the robot raw, messy photos of the real world, the authors built a "magic filter" (an observation interface) that sits between the camera and the robot's brain.

Think of it like a video game character that only sees a simplified, color-coded map.

  • The Robot's Eyes (Raw RGB): Sees a cluttered room with a red table, a blue chair, a green toy, and a shadow.
  • The Magic Filter (The Paper's Method): Instantly erases the background, the shadows, and the clutter. It paints the table a boring, constant gray. It paints the robot's arm bright blue and the target toy bright red.

Now, the robot sees a clean, simple picture: "Ah, there is a Blue Arm and a Red Object on a Gray Background. I know exactly what to do."

How the Filter Works (The Two Levels)

The authors created two versions of this filter, like a "Basic" and a "Pro" mode:

Level 0: The "Sticker" Mode (Semantic Repainting)

  • How it works: The system uses a super-smart AI (called SAM3) to find the "Robot" and the "Target Object" in the photo. It then takes a paintbrush and repaints the whole scene.
  • The Result: The background becomes a solid, boring color. The robot arm becomes a solid blue block. The target object becomes a solid red block.
  • Why it helps: It's like turning a complex, high-definition movie into a simple cartoon. The robot ignores the distracting details (like whether the table is wood or plastic) and focuses only on the shapes and positions of the important things.
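To make the "Sticker" mode concrete, here is a minimal sketch of semantic repainting. The boolean masks would come from a segmentation model (the paper uses SAM3); here they are fabricated for illustration, and the specific colors are just placeholders.

```python
import numpy as np

def repaint(rgb: np.ndarray, robot_mask: np.ndarray,
            object_mask: np.ndarray) -> np.ndarray:
    """Replace the scene with flat canonical colors:
    gray background, blue robot arm, red target object."""
    canonical = np.full_like(rgb, 128)    # everything starts as flat gray
    canonical[robot_mask] = (0, 0, 255)   # robot arm -> solid blue
    canonical[object_mask] = (255, 0, 0)  # target object -> solid red
    return canonical

# Toy 4x4 "image": top-left pixel is the robot, bottom-right is the object.
rgb = np.random.randint(0, 256, (4, 4, 3), dtype=np.uint8)
robot_mask = np.zeros((4, 4), dtype=bool)
robot_mask[0, 0] = True
object_mask = np.zeros((4, 4), dtype=bool)
object_mask[3, 3] = True

out = repaint(rgb, robot_mask, object_mask)
print(out[0, 0], out[3, 3], out[1, 1])  # blue, red, gray
```

Note that the table texture, lighting, and clutter never survive this step: every pixel outside the two masks is overwritten with the same gray, so the policy literally cannot condition on them.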

Level 1: The "3D Glasses" Mode (Adding Depth)

  • The Problem: Sometimes, just knowing where the object is (red block) isn't enough. The robot needs to know how far away it is or if it's a flat sticker or a 3D box.
  • The Solution: This mode takes the "Sticker" image from Level 0 and paints a special "depth map" onto the target object. Imagine looking at the red toy through 3D glasses; the red color now has shading that tells the robot how deep the object is.
  • Why it helps: It's like adding a ruler to the cartoon. Now the robot knows not just what to grab, but exactly how to reach for it in 3D space.
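The "3D Glasses" mode can be sketched the same way. This is a hedged illustration, not the paper's exact recipe: we take the repainted image, normalize depth inside the object mask, and use it to shade the object's color so nearer pixels look brighter. The `depth` map would come from a sensor or a monocular depth estimator; here it is fabricated.

```python
import numpy as np

def inject_depth(canonical: np.ndarray, object_mask: np.ndarray,
                 depth: np.ndarray) -> np.ndarray:
    """Shade the target object's red channel by normalized depth."""
    out = canonical.copy()
    d = depth[object_mask].astype(np.float32)
    if d.size:
        rng = d.max() - d.min()
        # normalize depth inside the mask to [0, 1]; guard constant depth
        norm = (d - d.min()) / rng if rng > 0 else np.zeros_like(d)
        # nearer (smaller depth) -> brighter red
        out[object_mask, 0] = ((1.0 - norm) * 255).astype(np.uint8)
    return out

# Toy example: two object pixels, one near (depth 0.5) and one far (1.0).
canonical = np.full((2, 2, 3), 128, dtype=np.uint8)
object_mask = np.array([[True, True], [False, False]])
canonical[object_mask] = (255, 0, 0)
depth = np.array([[0.5, 1.0], [0.0, 0.0]], dtype=np.float32)

shaded = inject_depth(canonical, object_mask, depth)
```

After this step the near pixel keeps a bright red channel while the far pixel's fades, so the object's color now carries its 3D shape rather than just its location.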

The Results: Why This is a Big Deal

The researchers tested this on many different robots and tasks (lifting blocks, closing cabinets, grabbing toys).

  1. The "Brittle" Robot: The standard robot (using raw photos) worked great in the training room but failed miserably when the table color changed or when new clutter appeared. It was like a student who memorized the answers to a specific test but failed if the teacher changed the font.
  2. The "Filtered" Robot: With the Magic Filter, the robot stayed calm. Even when the background changed completely, or the object was a different color, the robot saw the same simple "Red Object on Gray Background" and performed perfectly.
  3. No Brain Surgery: The best part? They didn't have to retrain the robot's brain or make it smarter. They just changed the input. It's like giving a driver better sunglasses instead of trying to teach them how to drive faster.
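The "no brain surgery" point boils down to a simple architecture: the trained policy stays a frozen black box, and the filter is just a preprocessing step bolted on in front of it. Here is a toy sketch; `policy` and `canonicalize` are hypothetical stand-ins, not the paper's code.

```python
from typing import Callable, List

def make_filtered_policy(policy: Callable, canonicalize: Callable) -> Callable:
    """Wrap a frozen policy so it only ever sees canonicalized observations."""
    def filtered_policy(raw_obs):
        return policy(canonicalize(raw_obs))  # policy weights untouched
    return filtered_policy

# Stand-ins: the "policy" sums its observation; the "filter" strips
# nuisance variation by reducing every value mod 2.
policy = lambda obs: float(sum(obs))
canonicalize = lambda raw: [x % 2 for x in raw]

wrapped = make_filtered_policy(policy, canonicalize)
result = wrapped([1, 2, 3])  # policy sees [1, 0, 1]
```

Swapping the filter, or deploying in a new room, never requires touching `policy` itself; only the interface in front of it changes.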

The Analogy: The "Chef's Recipe"

Imagine you are a chef (the robot) learning to make a cake.

  • Old Way: You watch a video of a chef making a cake in a kitchen with a red wall, a wooden table, and a cat in the corner. You memorize the red wall and the cat. When you go to a new kitchen with a blue wall and no cat, you can't make the cake because the "recipe" feels wrong.
  • New Way (This Paper): You watch a video where the background is erased, the cat is gone, and the ingredients are highlighted in bright neon colors. You learn: "Mix the Neon Flour with the Neon Eggs." It doesn't matter if the kitchen is red, blue, or green. You just follow the neon instructions.

Summary

This paper teaches us that for robots to be truly robust and adaptable, we shouldn't just make them "smarter" with bigger brains. Instead, we should give them better eyes—eyes that can filter out the noise and focus only on what actually matters for the task. By simplifying the visual world, we make the robot's job much easier and more reliable.