GarmentPile++: Affordance-Driven Cluttered Garments Retrieval with Vision-Language Reasoning

GarmentPile++ is a pipeline that combines vision-language reasoning, visual affordance perception, and dual-arm cooperation to retrieve a single garment safely and precisely from a cluttered pile, bridging the gap between single-garment manipulation research and real-world scenarios.

Mingleyang Li, Yuran Wang, Yue Chen, Tianxing Chen, Jiaqi Liang, Zishun Shen, Haoran Lu, Ruihai Wu, Hao Dong

Published 2026-03-05

Imagine you come home after a long day, and you see a giant, messy pile of laundry on your bed. It's a tangled mess of shirts, pants, socks, and sweaters all mixed together. You want to grab just one specific red shirt to wear, or maybe you want the robot to start folding the whole pile one by one.

Doing this with a robot is incredibly hard. Clothes are floppy, they get tangled, and they look almost identical when crumpled up. Most robots are trained to handle one clean shirt at a time. If you give them a messy pile, they usually grab a whole bunch of clothes at once, or they get confused and drop everything.

GarmentPile++ is a new "brain" for robots that solves this problem. Think of it as a robot that has super-vision, common sense, and two helpful hands working together. Here is how it works, broken down into three simple steps:

1. The Detective Phase: "Which one do I pick?"

The Problem: When you look at a messy pile, it's hard to tell where one shirt ends and another begins. A robot might try to grab a shirt but accidentally pull up a pair of pants tangled underneath it.
The Solution: The robot uses a powerful image-segmentation model (called SAM2) that acts like a digital highlighter, trying to draw an outline around every single item in the pile.

  • The "Fine-Tuning" Trick: Sometimes, the highlighter gets it wrong (maybe it thinks two red shirts are one big blob). The robot's "brain" (a Vision-Language Model, or VLM) looks at the picture and says, "Wait, that doesn't look right." It then tells the robot to gently lift and shake the pile. This movement helps the robot see the clothes separate, allowing it to redraw the lines correctly.
  • The Decision: Once the robot sees the pile clearly, you tell it, "I want the red shirt." The robot's brain looks at the pile and decides, "Okay, the red shirt is on top and easiest to reach. Let's start with that one."
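The detective phase above boils down to a perceive-verify-act loop. Here is a minimal, self-contained sketch of that loop in Python. Everything in it is a stand-in: `segment`, `vlm_masks_look_right`, and `pick_target` are hypothetical placeholders for SAM2 and the VLM, not the paper's actual API.

```python
# Hypothetical sketch of the detective phase: segment the pile, have a
# VLM sanity-check the masks, and shake-then-resegment if they look wrong.

def segment(image, shaken=False):
    """Stand-in for SAM2: returns one mask per detected garment."""
    # Before shaking, two overlapping red garments merge into one blob.
    if not shaken:
        return [{"label": "red blob", "area": 900}]
    return [{"label": "red shirt", "area": 450},
            {"label": "red pants", "area": 450}]

def vlm_masks_look_right(masks, expected_items):
    """Stand-in VLM check: reject a segmentation that found too few items."""
    return len(masks) >= expected_items

def pick_target(masks, instruction):
    """Stand-in VLM decision: choose the mask matching the instruction."""
    for mask in masks:
        if instruction in mask["label"]:
            return mask
    return None

def detective_phase(image, instruction, expected_items=2):
    shaken = False
    masks = segment(image, shaken)
    # Interactive perception: if the VLM rejects the masks, perturb the
    # pile (lift-and-shake) and segment again.
    while not vlm_masks_look_right(masks, expected_items):
        shaken = True  # the real robot lifts and shakes the pile here
        masks = segment(image, shaken)
    return pick_target(masks, instruction)

target = detective_phase(image=None, instruction="shirt")
print(target["label"])  # → red shirt
```

The key idea this sketch captures is that perception is not one-shot: the robot is allowed to act (shake) purely to make the scene easier to see.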

2. The Strategy Phase: "Where should I grab it?"

The Problem: Even if the robot knows which shirt to grab, grabbing it in the wrong spot is a disaster. If you grab a shirt by a loose thread, it might rip. If you grab it by the bottom hem, the whole pile might get pulled up with it.
The Solution: The robot uses a special "feeling" map called an Affordance Model. Imagine the robot is painting the shirt with a heat map:

  • Red spots are "Great to grab here!" (strong, safe, won't pull other clothes).
  • Blue spots are "Don't touch here!" (weak, tangled, or risky).

The robot picks the "reddest" spot to grab. This ensures it lifts the shirt cleanly without dragging the whole laundry pile with it.
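Picking the "reddest" spot is just an argmax over the affordance map. A tiny illustrative sketch, with made-up scores standing in for the model's per-pixel output:

```python
# Minimal sketch of affordance-based grasp selection: the affordance
# model scores every point on the garment, and the robot grasps at the
# highest-scoring one. These scores are invented for illustration.

# A tiny 3x3 affordance map over the target shirt: higher = safer grasp.
affordance = [
    [0.1, 0.3, 0.2],  # hem region: risky, tangled with other clothes
    [0.4, 0.9, 0.5],  # collar area: strong fabric, lies on top
    [0.2, 0.6, 0.3],
]

def best_grasp_point(score_map):
    """Return the (row, col) of the maximum affordance score."""
    best_score, best_rc = float("-inf"), None
    for r, row in enumerate(score_map):
        for c, score in enumerate(row):
            if score > best_score:
                best_score, best_rc = score, (r, c)
    return best_rc

print(best_grasp_point(affordance))  # → (1, 1)
```

In the real system the map is dense (one score per pixel or point), but the selection rule is the same: grab where the model is most confident the lift will be clean.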

3. The Teamwork Phase: "Do I need help?"

The Problem: Some clothes are huge (like a king-size blanket) or very long. A robot with just one arm might struggle to lift a giant blanket without it dragging on the floor or getting tangled.
The Solution: The robot has two arms (a Master and a Slave).

  • The Check: After the first arm grabs and lifts the item, the robot pauses and asks its brain: "Did I accidentally pick up two things? Is this item too heavy or long for just one arm?"
  • The Team-Up: If the answer is "Yes," the second arm jumps in. The robot uses a tracking system to find the perfect spot for the second arm to grab, so they can lift the item together smoothly, like two people carrying a large table.
  • The Safety Net: If the robot realizes it grabbed two shirts instead of one, it immediately stops, puts them down, and tries again. It never forces a bad grab.
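The check, team-up, and safety-net steps above form a small decision rule that runs after the first arm lifts. A rough Python sketch; the thresholds and field names are illustrative assumptions, not values from the paper:

```python
# Rough sketch of the teamwork-phase decision logic: after the first
# arm lifts, decide whether to retry, call in the second arm, or proceed.

def after_lift_decision(lift_check):
    """Decide the next action after the first arm lifts the item.

    lift_check: dict of hypothetical VLM/tracking observations:
      - "items_grasped": how many garments came up with the grasp
      - "length_m": estimated hanging length of the lifted item
    """
    if lift_check["items_grasped"] > 1:
        # Safety net: never force a bad grasp; put it down and retry.
        return "release_and_retry"
    if lift_check["length_m"] > 0.8:  # assumed threshold for "too long"
        # Team-up: tracking finds a grasp point for the second arm.
        return "engage_second_arm"
    return "single_arm_ok"

print(after_lift_decision({"items_grasped": 2, "length_m": 0.5}))  # → release_and_retry
print(after_lift_decision({"items_grasped": 1, "length_m": 1.2}))  # → engage_second_arm
print(after_lift_decision({"items_grasped": 1, "length_m": 0.4}))  # → single_arm_ok
```

Note the ordering: the multi-item check comes first, because a tangled grab must be aborted before any second-arm planning makes sense.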

Why is this a big deal?

Think of previous robots as single-handed toddlers trying to pick up a messy pile of toys. They often grab too much or get stuck.

GarmentPile++ is like a smart, patient adult with two hands:

  1. They can see through the mess (using the camera and fine-tuning).
  2. They know exactly where to grab to avoid a mess (using the heat map).
  3. They know when to call for backup if the job is too big (using the second arm).

This system allows robots to finally handle real-world laundry piles safely and efficiently, paving the way for robots that can actually help us fold clothes, hang them up, or get dressed in the future.