GarmentPile++: Affordance-Driven Cluttered Garments Retrieval with Vision-Language Reasoning

GarmentPile++ is a pipeline that combines vision-language reasoning, visual affordance perception, and dual-arm cooperation to retrieve a single garment safely and precisely from a cluttered pile, bridging the gap between single-garment manipulation research and real-world scenarios.

Mingleyang Li, Yuran Wang, Yue Chen, Tianxing Chen, Jiaqi Liang, Zishun Shen, Haoran Lu, Ruihai Wu, Hao Dong

Published 2026-03-05

Imagine you come home after a long day, and you see a giant, messy pile of laundry on your bed. It's a tangled mess of shirts, pants, socks, and sweaters all mixed together. You want to grab just one specific red shirt to wear, or maybe you want the robot to start folding the whole pile one by one.

Doing this with a robot is incredibly hard. Clothes are floppy, they get tangled, and they look almost identical when crumpled up. Most robots are trained to handle one clean shirt at a time. If you give them a messy pile, they usually grab a whole bunch of clothes at once, or they get confused and drop everything.

GarmentPile++ is a new "brain" for robots that solves this problem. Think of it as a robot that has super-vision, common sense, and two helpful hands working together. Here is how it works, broken down into three simple steps:

1. The Detective Phase: "Which one do I pick?"

The Problem: When you look at a messy pile, it's hard to tell where one shirt ends and another begins. A robot might try to grab a shirt but accidentally pull up a pair of pants tangled underneath it.
The Solution: The robot uses a powerful image-segmentation model (called SAM2) that acts like a digital highlighter, trying to draw an outline around every single item in the pile.

  • The "Fine-Tuning" Trick: Sometimes, the highlighter gets it wrong (maybe it thinks two red shirts are one big blob). The robot's "brain" (a Vision-Language Model, or VLM) looks at the picture and says, "Wait, that doesn't look right." It then tells the robot to gently lift and shake the pile. This movement helps the robot see the clothes separate, allowing it to redraw the lines correctly.
  • The Decision: Once the robot sees the pile clearly, you tell it, "I want the red shirt." The robot's brain looks at the pile and decides, "Okay, the red shirt is on top and easiest to reach. Let's start with that one."
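The detective phase above boils down to a perceive-verify-act loop. Here is a minimal, self-contained sketch of that loop in Python. Everything in it is a stand-in: `segment`, `vlm_masks_look_right`, and `pick_target` are hypothetical placeholders for SAM2 and the VLM, not the paper's actual API.

```python
# Hypothetical sketch of the detective phase: segment the pile, have a
# VLM sanity-check the masks, and shake-then-resegment if they look wrong.

def segment(image, shaken=False):
    """Stand-in for SAM2: returns one mask per detected garment."""
    # Before shaking, two overlapping red garments merge into one blob.
    if not shaken:
        return [{"label": "red blob", "area": 900}]
    return [{"label": "red shirt", "area": 450},
            {"label": "red pants", "area": 450}]

def vlm_masks_look_right(masks, expected_items):
    """Stand-in VLM check: reject a segmentation that found too few items."""
    return len(masks) >= expected_items

def pick_target(masks, instruction):
    """Stand-in VLM decision: choose the mask matching the instruction."""
    for mask in masks:
        if instruction in mask["label"]:
            return mask
    return None

def detective_phase(image, instruction, expected_items=2):
    shaken = False
    masks = segment(image, shaken)
    # Interactive perception: if the VLM rejects the masks, perturb the
    # pile (lift-and-shake) and segment again.
    while not vlm_masks_look_right(masks, expected_items):
        shaken = True  # the real robot lifts and shakes the pile here
        masks = segment(image, shaken)
    return pick_target(masks, instruction)

target = detective_phase(image=None, instruction="shirt")
print(target["label"])  # → red shirt
```

The key idea this sketch captures is that perception is not one-shot: the robot is allowed to act (shake) purely to make the scene easier to see.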

2. The Strategy Phase: "Where should I grab it?"

The Problem: Even if the robot knows which shirt to grab, grabbing it in the wrong spot is a disaster. If you grab a shirt by a loose thread, it might rip. If you grab it by the bottom hem, the whole pile might get pulled up with it.
The Solution: The robot uses a special "feeling" map called an Affordance Model. Imagine the robot is painting the shirt with a heat map:

  • Red spots are "Great to grab here!" (strong, safe, won't pull other clothes).
  • Blue spots are "Don't touch here!" (weak, tangled, or risky).

The robot picks the "reddest" spot to grab. This ensures it lifts the shirt cleanly without dragging the whole laundry pile with it.
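Picking the "reddest" spot is just an argmax over the affordance map. A tiny illustrative sketch, with made-up scores standing in for the model's per-pixel output:

```python
# Minimal sketch of affordance-based grasp selection: the affordance
# model scores every point on the garment, and the robot grasps at the
# highest-scoring one. These scores are invented for illustration.

# A tiny 3x3 affordance map over the target shirt: higher = safer grasp.
affordance = [
    [0.1, 0.3, 0.2],  # hem region: risky, tangled with other clothes
    [0.4, 0.9, 0.5],  # collar area: strong fabric, lies on top
    [0.2, 0.6, 0.3],
]

def best_grasp_point(score_map):
    """Return the (row, col) of the maximum affordance score."""
    best_score, best_rc = float("-inf"), None
    for r, row in enumerate(score_map):
        for c, score in enumerate(row):
            if score > best_score:
                best_score, best_rc = score, (r, c)
    return best_rc

print(best_grasp_point(affordance))  # → (1, 1)
```

In the real system the map is dense (one score per pixel or point), but the selection rule is the same: grab where the model is most confident the lift will be clean.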

3. The Teamwork Phase: "Do I need help?"

The Problem: Some clothes are huge (like a king-size blanket) or very long. A robot with just one arm might struggle to lift a giant blanket without it dragging on the floor or getting tangled.
The Solution: The robot has two arms (a Master and a Slave).

  • The Check: After the first arm grabs and lifts the item, the robot pauses and asks its brain: "Did I accidentally pick up two things? Is this item too heavy or long for just one arm?"
  • The Team-Up: If the answer is "Yes," the second arm jumps in. The robot uses a tracking system to find the perfect spot for the second arm to grab, so they can lift the item together smoothly, like two people carrying a large table.
  • The Safety Net: If the robot realizes it grabbed two shirts instead of one, it immediately stops, puts them down, and tries again. It never forces a bad grab.
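The check, team-up, and safety-net steps above form a small decision rule that runs after the first arm lifts. A rough Python sketch; the thresholds and field names are illustrative assumptions, not values from the paper:

```python
# Rough sketch of the teamwork-phase decision logic: after the first
# arm lifts, decide whether to retry, call in the second arm, or proceed.

def after_lift_decision(lift_check):
    """Decide the next action after the first arm lifts the item.

    lift_check: dict of hypothetical VLM/tracking observations:
      - "items_grasped": how many garments came up with the grasp
      - "length_m": estimated hanging length of the lifted item
    """
    if lift_check["items_grasped"] > 1:
        # Safety net: never force a bad grasp; put it down and retry.
        return "release_and_retry"
    if lift_check["length_m"] > 0.8:  # assumed threshold for "too long"
        # Team-up: tracking finds a grasp point for the second arm.
        return "engage_second_arm"
    return "single_arm_ok"

print(after_lift_decision({"items_grasped": 2, "length_m": 0.5}))  # → release_and_retry
print(after_lift_decision({"items_grasped": 1, "length_m": 1.2}))  # → engage_second_arm
print(after_lift_decision({"items_grasped": 1, "length_m": 0.4}))  # → single_arm_ok
```

Note the ordering: the multi-item check comes first, because a tangled grab must be aborted before any second-arm planning makes sense.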

Why is this a big deal?

Think of previous robots as single-handed toddlers trying to pick up a messy pile of toys. They often grab too much or get stuck.

GarmentPile++ is like a smart, patient adult with two hands:

  1. They can see through the mess (using the camera and fine-tuning).
  2. They know exactly where to grab to avoid a mess (using the heat map).
  3. They know when to call for backup if the job is too big (using the second arm).

This system allows robots to finally handle real-world laundry piles safely and efficiently, paving the way for robots that can actually help us fold clothes, hang them up, or get dressed in the future.