Imagine you are teaching a robot to clean your messy kitchen. You want it to pick up a banana, put it in a bowl, then grab a cup, and finally wipe the table. This is a long-horizon task—a big job made of many small steps.
The problem is that most robots are like students who only studied in a perfectly clean, empty classroom. If you put them in a real kitchen with clutter, distractions, and weird lighting, they freeze up or drop everything. They get confused by the "noise" (like a stray spoon or a toy on the counter) and forget how to do the simple job of picking up the banana.
This paper, "Compose by Focus," proposes a clever new way to teach robots so they don't get distracted. Here is the breakdown using simple analogies:
1. The Problem: The "Distracted Student"
Think of a standard robot policy (the robot's brain) as a student trying to solve a math problem while a circus is happening next door.
- The Old Way: The robot looks at the entire scene (the whole circus, the noise, the colors) as one giant, blurry picture. When the task changes slightly (e.g., "pick up the red apple" instead of "pick up the green apple"), the robot gets overwhelmed because it is trying to process too much irrelevant information.
- The Result: It works fine in a quiet room but fails miserably in a messy one.
2. The Solution: The "Spotlight" (Scene Graphs)
The authors introduce a Scene Graph. Imagine this as a smart spotlight or a highlighter pen that the robot uses before it even tries to move.
Instead of looking at the whole messy kitchen, the robot asks a smart assistant (an AI called a Vision-Language Model): "Hey, for this specific task of picking up the banana, what actually matters?"
The assistant draws a mental map (the Scene Graph) that includes only:
- The Robot's Hand.
- The Banana.
- The Bowl.
- Maybe a chair if it's in the way.
It completely ignores the stray toy, the cat, or the picture on the wall. It filters out the "circus" and focuses only on the "math problem."
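To make the "spotlight" idea concrete, here is a minimal sketch in plain Python. The names (`Node`, `build_scene_graph`) and the hard-coded list of relevant objects are illustrative; in the paper, a Vision-Language Model is what decides which objects matter for the current task.

```python
# Hypothetical sketch of building a task-focused scene graph.
# The relevance filter is hard-coded here; the paper queries a
# Vision-Language Model to decide what matters for the task.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    position: tuple  # (x, y, z) in the robot's workspace

def build_scene_graph(detected, relevant_names):
    """Keep only task-relevant objects and connect them with
    relative-position edges (the 'spotlight map')."""
    nodes = [n for n in detected if n.name in relevant_names]
    edges = {}
    for a in nodes:
        for b in nodes:
            if a.name != b.name:
                edges[(a.name, b.name)] = tuple(
                    pb - pa for pa, pb in zip(a.position, b.position)
                )
    return nodes, edges

detected = [
    Node("gripper", (0.0, 0.0, 0.3)),
    Node("banana", (0.2, 0.1, 0.0)),
    Node("bowl", (0.4, -0.1, 0.0)),
    Node("toy_car", (0.5, 0.5, 0.0)),  # clutter: gets filtered out
]
# A VLM would answer: "for 'put banana in bowl', what matters?"
nodes, edges = build_scene_graph(detected, {"gripper", "banana", "bowl"})
```

The toy car never enters the graph, so nothing downstream can be distracted by it.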
3. How It Works: The "LEGO" Analogy
The paper calls these small tasks Atomic Skills. Think of these like individual LEGO bricks.
- The Goal: Build a castle (the long task).
- The Old Way: You try to build the whole castle as one glued-together piece, learning the entire long task at once. If the instructions change even slightly, the whole structure fails and nothing can be reused.
- The New Way: You teach the robot how to snap one brick perfectly. Because you taught it to focus only on that one brick (ignoring the other 100 bricks on the table), it learns to snap them together perfectly every time, no matter how messy the table is.
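The brick-by-brick idea can be sketched as a plan that chains atomic skills. The skill names and the plan below are made up for illustration; the point is that a long-horizon task is just an ordered sequence of small skills, each of which only attends to its own small scene graph.

```python
# Illustrative sketch of composing atomic skills into a long task.
# Each function stands in for a learned atomic skill (a "LEGO brick").
def pick(obj):
    return f"picked {obj}"

def place(obj, target):
    return f"placed {obj} in {target}"

def wipe(surface):
    return f"wiped {surface}"

# The long-horizon kitchen task is an ordered list of atomic skills.
plan = [
    (pick, ("banana",)),
    (place, ("banana", "bowl")),
    (pick, ("cup",)),
    (place, ("cup", "sink")),
    (wipe, ("table",)),
]

log = [skill(*args) for skill, args in plan]
```

Because each skill was trained to ignore everything outside its own spotlight, the chain does not get more fragile as it gets longer.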
The robot uses a Graph Neural Network (GNN). Think of this as a super-smart translator that turns the "Spotlight Map" (the Scene Graph) into a language the robot's muscles understand. It connects the dots: "Hand is here, Banana is there, Bowl is over there. Action: Move hand to banana."
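As a rough intuition for what the GNN does, here is a toy message-passing step in plain Python. Real GNN layers use learned weight matrices; this stand-in just mixes each node's features with the average of its neighbors', to show how information flows along the graph's edges.

```python
# Toy message-passing over the scene graph (not a real GNN layer).
# The "update" here is plain averaging; a trained GNN would apply
# learned transformations instead.
def message_pass(features, edges):
    """One round: each node blends in the mean of its neighbors."""
    updated = {}
    for node, feat in features.items():
        neighbors = [features[b] for (a, b) in edges if a == node]
        if not neighbors:
            updated[node] = feat
            continue
        mean = [sum(vals) / len(neighbors) for vals in zip(*neighbors)]
        updated[node] = [(f + m) / 2 for f, m in zip(feat, mean)]
    return updated

features = {            # e.g. 3-D positions as node features
    "gripper": [0.0, 0.0, 0.3],
    "banana":  [0.2, 0.1, 0.0],
    "bowl":    [0.4, -0.1, 0.0],
}
edges = [("gripper", "banana"), ("banana", "gripper"),
         ("banana", "bowl"), ("bowl", "banana")]
out = message_pass(features, edges)
```

After a few rounds, each node's features encode where it sits relative to the other relevant objects, which is exactly the "Hand is here, Banana is there" signal the action head needs.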
4. The Magic Ingredient: Diffusion
The robot learns these skills using something called a diffusion policy.
- Imagine starting with a photo of the robot's hand movement that is pure static: random, blurry noise.
- Generating an action runs the diffusion process in reverse: like a sculptor, the model chips away the noise step by step to reveal the perfect, smooth movement underneath.
- Because the robot is only looking at the "Spotlight Map" (the relevant objects), there is far less to denoise, so it recovers the right movement faster and more accurately than if it had to clean up a photo of the whole messy room.
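The sculptor intuition can be shown with a toy denoising loop. This is not the paper's model: the "denoiser" below is a stand-in that already knows the clean action, so you can watch the noise shrink by half at every step; a real diffusion policy would replace it with a learned network conditioned on the scene graph.

```python
# Toy of the denoising idea behind diffusion policies (illustrative only).
import random

clean_action = [0.2, 0.1, -0.3]   # e.g. "move hand toward the banana"

def fake_denoiser(noisy, step, total):
    """Stand-in for a learned network: it simply predicts the
    clean action, so the loop below visibly converges."""
    return clean_action

def denoise(steps=10):
    # Start from pure noise, then repeatedly nudge halfway toward
    # the denoiser's prediction ("chipping away" the static).
    x = [random.gauss(0, 1) for _ in clean_action]
    for t in range(steps):
        pred = fake_denoiser(x, t, steps)
        x = [xi + 0.5 * (pi - xi) for xi, pi in zip(x, pred)]
    return x

random.seed(0)
result = denoise()
```

Each iteration halves the remaining noise, so after ten steps the recovered action is essentially indistinguishable from the clean one.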
5. The Results: From "Fragile" to "Robust"
The researchers tested this in two ways:
- Simulation: A virtual robot trying to stack blocks, sort colors, and use tools.
- Real World: A real robot arm trying to pick up vegetables from a cluttered table.
The Outcome:
- Old Robots: When asked to pick up one vegetable, they did okay. But on the long task of picking up five vegetables in a row in a messy room, their success rate collapsed to nearly zero. They got confused by the extra vegetables.
- The New Robot: It picked up the vegetables with near-perfect success (97-100%). Even when they added random obstacles or changed the background, the robot didn't care. It just turned on its "Spotlight," found the vegetable, and grabbed it.
Summary
This paper teaches robots to stop looking at the whole picture and start focusing on the specific parts that matter.
By turning a messy visual scene into a clean, structured list of "Important Objects and Relationships" (a Scene Graph), the robot can learn simple skills once and then combine them like LEGO bricks to solve complex, messy real-world problems without getting distracted. It's the difference between a student who panics in a noisy library and one who puts on noise-canceling headphones and gets straight to work.