FoldNet: Learning Generalizable Closed-Loop Policy for Garment Folding via Keypoint-Driven Asset and Demonstration Synthesis

Imagine you want to teach a robot to fold laundry. It sounds simple to us, but for a robot, a T-shirt is a nightmare. It's floppy, it twists, it gets tangled, and unlike a rigid box, it changes shape every time you touch it.

This paper introduces FoldNet, a clever system designed to teach robots how to fold clothes without needing thousands of humans to demonstrate the task in real life. Here's how it works, broken down into simple concepts:

1. The Problem: The "Blank Canvas" Dilemma

To teach a robot anything today, you usually need a massive amount of data (videos of humans doing the task). But for laundry, this is hard.

Real life is messy: You can't easily record 15,000 perfect folding sessions.
Simulation is fake: If you try to simulate clothes in a computer, the textures often look like plastic, and the robot gets confused when it sees a real shirt that looks different.

2. The Solution: Building a "Virtual Wardrobe"

The authors built a digital factory called FoldNet to create a massive library of synthetic clothes. Think of it like a video game character creator, but for laundry.

The Skeleton (Keypoints): Instead of drawing every single shirt from scratch, they created "skeletons" (templates) for T-shirts, hoodies, vests, and pants. These skeletons have special "anchor points" (keypoints) like the collar, cuffs, and hem.
The Skin (Textures): They used AI art generators (like Stable Diffusion) to paint these skeletons with realistic fabrics—stripes, florals, solids—so the robot sees a world full of variety, not just plain white shirts.
The Result: They generated thousands of unique, high-quality 3D clothes that look real enough to fool a robot's camera.

3. The Teacher: The "Safety Net" Strategy (KG-DAgger)

This is the most brilliant part of the paper. Usually, when we train robots in simulation, we only show them perfect examples.

The Flaw: If a robot tries to grab a shirt and misses (a common mistake), and it has only seen perfect grabs, it panics. It doesn't know what to do next, so it fails.
The Fix (KG-DAgger): The authors introduced a "Safety Net" strategy.
- Imagine a student learning to ride a bike. If they wobble and almost fall, a teacher who only ever demonstrates perfect riding has nothing to offer except, "Start over."
- KG-DAgger is like a coach who steps in while the student is wobbling. It says, "Whoops, you missed the handle! Here is how you correct your grip and try again."
- The system automatically detects when the robot is about to fail, steps in to fix the mistake, and records that "recovery" move as a new lesson.

By teaching the robot not just how to succeed, but how to recover from mistakes, the robot becomes much more robust.

4. The Results: From Lab to Laundry Room

They trained their robot using 15,000 of these simulated folding sessions (which equals about 2 million image-action pairs).

The Test: They took the robot out of the computer and into the real world with real clothes it had never seen before.
The Outcome:
- Without the "Safety Net" (KG-DAgger), the robot succeeded only 50% of the time.
- With the "Safety Net," the success rate jumped to 75%.
- They even tested a giant, pre-trained AI model (called $\pi_0$) on this data, and it learned to fold clothes in the real world without ever seeing a real human fold a shirt first.

The Big Picture Analogy

Think of this like learning to cook.

Old Way: You watch a master chef make a perfect omelet once. If you drop an egg, you give up because you don't know how to fix it.
FoldNet Way: You have a virtual kitchen where you can practice 10,000 times. If you drop an egg, a magical assistant instantly shows you how to scoop it up and keep cooking. By the time you try to cook for real, you aren't just a chef who knows the recipe; you're a chef who knows how to handle disasters.

In short: FoldNet creates a realistic digital playground and teaches robots how to fix their own mistakes, allowing them to master the messy, floppy art of folding laundry in the real world.

The remainder is a more detailed technical summary of the same paper.

1. Problem Statement

Robotic garment manipulation is a critical but highly challenging task due to the deformable nature of fabrics and their complex dynamics. While data-driven approaches like imitation learning have shown promise, they face two primary bottlenecks:

Data Scarcity and Quality: Collecting large-scale, diverse, high-quality real-world demonstration data is labor-intensive. Existing synthetic datasets often lack diverse garment assets, detailed semantic annotations, or realistic textures, limiting the generalization capabilities of learned policies.
Lack of Robustness (Error Recovery): Most prior work relies on open-loop approaches or perfect demonstrations. In long-horizon tasks like folding, small errors accumulate, leading to out-of-distribution states and task failure. Current methods struggle to teach robots how to recover from failures (e.g., a missed grasp) without explicit, expensive human intervention for every error case.

2. Methodology

The authors propose FoldNet, a comprehensive framework that integrates synthetic asset generation with a novel closed-loop training strategy.

A. Keypoint-Driven Garment Asset Synthesis

To address the lack of diverse assets, the authors developed a pipeline to generate high-quality, physically simulatable garment meshes:

Geometry Generation: Instead of manual modeling, they use template-based generation. For four categories (T-shirt, Vest, Hoodie, Trousers), they define a set of 2D semantic keypoints that control the garment's structure. Keypoints are randomized, connected via Bezier curves, and triangulated to create diverse mesh geometries.
Texture Synthesis: They employ a generative pipeline using Large Language Models (LLMs) to create texture descriptions and Stable Diffusion to generate corresponding texture maps.
Filtering: A Vision-Language Model (VLM) is used to filter and select the most realistic texture-mesh combinations, ensuring consistency between geometry and appearance.
Annotation: Crucially, every generated mesh comes with automatically annotated semantic keypoints, which serve as ground truth for both perception and control.
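
The geometry step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the template coordinates, the `jitter` and `bulge` parameters, and the function names are all hypothetical, and the final triangulation step is left out.

```python
import numpy as np

# Hypothetical 2D T-shirt template (counter-clockwise), in arbitrary units.
# Order: hem L, hem R, armpit R, sleeve bottom R, sleeve top R,
#        collar R, collar L, sleeve top L, sleeve bottom L, armpit L.
TSHIRT_TEMPLATE = np.array([
    [-0.25, 0.00], [0.25, 0.00], [0.25, 0.45], [0.45, 0.40], [0.50, 0.55],
    [0.10, 0.70], [-0.10, 0.70], [-0.50, 0.55], [-0.45, 0.40], [-0.25, 0.45],
])

def quadratic_bezier(p0, p1, p2, n=16):
    """Sample n points on a quadratic Bezier curve from p0 to p2 (control point p1)."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2

def garment_outline(template, rng, jitter=0.02, bulge=0.05, n=16):
    """Randomize the keypoints, then join neighbours with Bezier curves."""
    kps = template + rng.normal(scale=jitter, size=template.shape)
    segments = []
    for a, b in zip(kps, np.roll(kps, -1, axis=0)):
        edge = b - a
        normal = np.array([-edge[1], edge[0]])  # perpendicular to the edge
        ctrl = 0.5 * (a + b) + rng.uniform(-bulge, bulge) * normal
        # Drop the last sample: it is the first sample of the next segment.
        segments.append(quadratic_bezier(a, ctrl, b, n)[:-1])
    return kps, np.concatenate(segments, axis=0)

rng = np.random.default_rng(0)
keypoints, outline = garment_outline(TSHIRT_TEMPLATE, rng)
print(keypoints.shape, outline.shape)
```

Because the randomized keypoints are returned alongside the outline, the annotation comes for free: every generated shape carries its own keypoint labels. Turning the closed outline into a simulatable mesh would additionally need a triangulation that respects the concave boundary.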

B. Demonstration Generation & KG-DAgger

The core innovation lies in generating training data that includes error recovery, moving beyond "perfect" demonstrations:

Keypoint-Based Policy: An initial policy is designed to fold garments by manipulating specific keypoints (e.g., grabbing the sleeve keypoint).
KG-DAgger (Keypoint-Gated DAgger): This is a novel variant of the DAgger algorithm designed for simulation-to-real transfer.
- Process: During training, the policy attempts to execute actions. A keypoint-based monitoring strategy (using ground-truth keypoint locations and gripper states) detects potential failures (e.g., a grasp failure where the gripper closes but the garment doesn't move).
- Recovery: When a failure is detected, the system invokes a recovery strategy to correct the state (e.g., re-grasping).
- Data Integration: These "recovery trajectories" are added to the dataset. The loss function is modified to assign zero weight to incorrect actions during the failure phase and full weight to the recovery actions.
- Result: The final model learns an end-to-end vision-action policy that implicitly understands how to detect and recover from errors without needing explicit keypoint detection modules during inference.
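
The failure gate and the loss weighting described above might look roughly like this. It is a hedged sketch only: the function names, the `min_move` threshold, the step labels, and the use of a mean-squared error in place of the diffusion training loss are all assumptions made for illustration.

```python
import numpy as np

def grasp_failed(gripper_closed, kp_before, kp_after, min_move=0.01):
    """Keypoint gate: the gripper closed, yet the target keypoint barely moved."""
    moved = np.linalg.norm(np.asarray(kp_after) - np.asarray(kp_before))
    return bool(gripper_closed and moved < min_move)

def loss_weights(step_labels):
    """Zero weight for steps of a failed attempt, full weight for everything else."""
    return np.array([0.0 if label == "failed" else 1.0 for label in step_labels])

def weighted_imitation_loss(pred, target, weights):
    """Per-step squared error, averaged over the steps that carry weight."""
    per_step = np.mean((pred - target) ** 2, axis=-1)
    return float(np.sum(weights * per_step) / max(np.sum(weights), 1.0))

# Toy trajectory: a nominal step, a failed grasp, then the recovery step.
labels = ["nominal", "failed", "recovery"]
weights = loss_weights(labels)
pred = np.zeros((3, 2))
target = np.array([[1.0, 1.0], [5.0, 5.0], [1.0, 1.0]])
print(weights, weighted_imitation_loss(pred, target, weights))
```

The failed step stays in the trajectory, so the policy still sees the observations that precede a recovery, but its action contributes nothing to the gradient. The gate needs ground-truth keypoints and is therefore only available in simulation; at inference time the trained policy has to rely on the image alone.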

C. Model Architecture

The system uses a Diffusion Policy as the vision-action model. It takes a single-view RGB image and proprioceptive state (gripper positions/states) as input and outputs a sequence of actions (gripper positions and open/close states).
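
To make the input and output of the policy concrete, here is a small interface sketch. The array shapes, the number of grippers, the action horizon, and the placeholder `hold_pose_policy` are assumptions for illustration; the actual model is a learned Diffusion Policy, which this stand-in does not implement.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    rgb: np.ndarray           # (H, W, 3) single-view RGB image
    gripper_pos: np.ndarray   # (G, 3) position of each gripper
    gripper_open: np.ndarray  # (G,) 1.0 = open, 0.0 = closed

def hold_pose_policy(obs, horizon=8):
    """Placeholder policy: repeat the current gripper state for `horizon` steps.

    Returns an action chunk of shape (horizon, G, 4): xyz position plus an
    open/close flag per gripper, mirroring the output described in the text.
    """
    step = np.concatenate([obs.gripper_pos, obs.gripper_open[:, None]], axis=1)
    return np.repeat(step[None], horizon, axis=0)

obs = Observation(
    rgb=np.zeros((224, 224, 3), dtype=np.uint8),
    gripper_pos=np.array([[0.3, -0.2, 0.1], [0.3, 0.2, 0.1]]),
    gripper_open=np.array([1.0, 0.0]),
)
actions = hold_pose_policy(obs)
print(actions.shape)
```

In closed-loop use, only the first few actions of each predicted chunk would be executed before the policy is queried again with a fresh image, which is what lets it react to a failed grasp.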

3. Key Contributions

FoldNet Dataset: A synthetic dataset containing 15,000 trajectories (approx. 2M image-action pairs) across four garment categories. It features scalable, diverse garment meshes with automatic semantic keypoint annotations and realistic textures.
KG-DAgger Framework: A novel training strategy that automatically generates error-recovery demonstrations in simulation. This significantly improves the robustness of closed-loop policies by teaching the model to handle out-of-distribution states.
Generalizable Closed-Loop Policy: A model trained on this data achieves high success rates in the real world without requiring real-world fine-tuning, demonstrating strong sim-to-real transfer capabilities.
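
As a rough sanity check on the dataset size, and assuming each image-action pair corresponds to one timestep of one trajectory, the stated totals imply an average trajectory length of about 133 steps:

$$
\frac{2{,}000{,}000 \text{ pairs}}{15{,}000 \text{ trajectories}} \approx 133 \text{ pairs per trajectory}
$$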

4. Experimental Results

The authors validated their approach through keypoint detection and garment folding tasks in both simulation and the real world.

Keypoint Detection:
- Models trained on FoldNet assets achieved a 47.2% mAP on real-world images (zero-shot, no fine-tuning), outperforming models trained on other synthetic datasets (e.g., aRTF at 36.6%).
- This confirms the high visual fidelity and geometric consistency of the generated assets.
Garment Folding (Simulation & Real World):
- Impact of KG-DAgger: Models trained with KG-DAgger (including recovery data) achieved a 75% success rate in the real world. In contrast, baselines trained only on "perfect" demonstrations or on simple noise-augmented data reached only about 50% real-world success.
- Robustness: The KG-DAgger approach enabled the robot to retry grasps after failures, a capability absent in models trained on perfect data.
- Scalability: Performance improved with data scale, plateauing at 15K trajectories.
- VLA Fine-tuning: The dataset was successfully used to fine-tune a large Vision-Language-Action model ($\pi_0$), enabling it to generalize to unseen robots and garments in the real world without additional real-world data.
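
Putting the reported figures side by side, and treating the approximate baseline success rate as exactly 50% for the purpose of the arithmetic, the gains are:

$$
75\% - 50\% = 25 \text{ percentage points}, \qquad \frac{75}{50} = 1.5
$$

$$
47.2\% - 36.6\% = 10.6 \text{ mAP points}, \qquad \frac{47.2}{36.6} \approx 1.29
$$

So recovery data lifts real-world folding success by about half in relative terms, and FoldNet assets improve zero-shot keypoint detection by roughly 29% relative to the aRTF-trained model.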

5. Significance and Impact

Bridging the Sim-to-Real Gap: By focusing on error recovery and high-fidelity asset synthesis, FoldNet addresses the primary reasons why synthetic data often fails to transfer to reality.
Scalability: The keypoint-driven generation pipeline allows rapid creation of a practically unlimited variety of garments, overcoming the bottleneck of manual asset creation.
Robustness in Deformable Manipulation: The results suggest that teaching robots to handle failures (via KG-DAgger) can matter more than simply increasing the volume of perfect demonstrations, which has implications for how long-horizon deformable tasks are learned.
Open-Source Potential: The framework provides a blueprint for generating high-quality synthetic datasets for other complex manipulation tasks where real-world data collection is prohibitive.

In conclusion, FoldNet makes a strong case for synthetic data in robotic manipulation, showing that keypoint-driven asset generation combined with failure-aware training (KG-DAgger) can produce policies capable of robust, generalizable garment folding in the real world.