DressWild: Feed-Forward Pose-Agnostic Garment Sewing Pattern Generation from In-the-Wild Images

Imagine you have a single photo of a friend wearing a cool, unique jacket. They are striking a dynamic pose—maybe jumping, twisting, or reaching out. You want to know exactly how that jacket was made. You want the "blueprint" (the sewing pattern) so a tailor could cut the fabric and sew it, or a computer could simulate how it moves in a video game.

Usually, this is incredibly hard. If you just look at the photo, the fabric is bunched up, stretched, and hidden by the pose. It's like trying to figure out the shape of a crumpled piece of paper just by looking at the crumpled ball.

Enter "DressWild."

Think of DressWild as a super-smart, magical tailor's assistant that can look at that one chaotic photo and instantly "un-crumple" the garment in its mind to reveal the perfect, flat sewing pattern underneath.

Here is how it works, broken down into simple steps:

1. The "Magic Mirror" (Vision-Language Models)

First, the system looks at your photo of your friend jumping. It knows that the pose is tricky. So, it uses a powerful AI (called a Vision-Language Model) to imagine a "Magic Mirror." In this mirror, your friend is standing perfectly still, facing forward, with arms straight out (a "T-pose").

The AI doesn't just guess; it uses its knowledge of how clothes should look to mentally "re-dress" your friend in this perfect, neutral pose. This strips away the confusion of the jump or the twist, leaving only the pure shape of the jacket.

2. The "Detective Team" (Feature Extraction)

Now, the system has two clues:

Clue A: The original photo (showing the real-world details, wrinkles, and lighting).
Clue B: The "Magic Mirror" image (showing the clean, standard shape of the clothes).

It also acts like a skeleton detective, analyzing exactly how the human body is bent and twisted in the original photo. It separates the "body movement" from the "clothing shape."

3. The "Brain Swap" (Feature Fusion)

This is the secret sauce. The system takes the clues from the original photo, the clean "Magic Mirror" image, and the body movement data, and mixes them together in a special "blender" (a Transformer model).

Think of it like making a smoothie. If you only put in the "jumping" photo, the smoothie tastes like chaos. If you only put in the "standing still" photo, it tastes boring and fake. But when you blend them with the body movement data, you get the perfect flavor: the true shape of the clothes, regardless of how the person is posing.

4. The "Blueprint Generator" (Pattern Prediction)

Once the system understands the true shape, it doesn't just make a 3D model; it draws the 2D sewing pattern.

Imagine a tailor laying out flat pieces of fabric on a table: a piece for the front, a piece for the back, sleeves, and collars. DressWild draws these shapes, calculates exactly where the curves go, and tells you which edges need to be stitched together. It even figures out the texture (the fabric pattern) and wraps it around the clothes.

Why is this a big deal?

No More "Perfect Studio" Shots: Many previous methods expected controlled captures, with the model standing still in a standard pose. DressWild works on "in-the-wild" photos: snapshots from your phone, social media, or movies.
It's Fast: Older optimization-based methods reach an answer by repeatedly simulating the garment and adjusting their guess (like solving a maze by trial and error). DressWild does it in one quick pass (feed-forward), like a human expert who just knows the answer.
Ready for Real Life: The output isn't just a pretty picture. It's a "simulation-ready" blueprint. You can take these patterns and actually sew the clothes, or drop them into a video game engine to see them move realistically.

The Bottom Line

DressWild is like having a time machine for fashion. You take a snapshot of a person in any crazy pose, and it travels back in time to show you the flat, perfect sewing pattern that created that outfit. It turns a messy, 3D reality into a clean, 2D blueprint that anyone (or any computer) can use to recreate the garment.

1. Problem Statement

The paper addresses the challenge of generating simulation-ready, editable 2D sewing patterns and corresponding 3D garments from a single, arbitrary "in-the-wild" image (images with diverse poses, viewpoints, and backgrounds).

Limitations of Existing Methods:
- Data-driven feed-forward methods: Often restricted to canonical poses (e.g., A-pose or T-pose), so they struggle to generalize to diverse real-world poses. They typically require controlled multi-view inputs.
- Optimization-based methods: While they can handle diverse poses, they rely on iterative simulation and gradient-based optimization, making them computationally expensive, slow, and difficult to scale.
- Geometry-only approaches: Many recent methods generate 3D mesh geometry but fail to recover the underlying 2D sewing patterns, limiting editability, parametric control, and physical manufacturability.

2. Methodology: DressWild Pipeline

DressWild is a feed-forward pipeline that reconstructs physics-consistent 2D sewing patterns and 3D garments in a single pass without iterative optimization. The architecture consists of four main stages:

A. VLM-Guided Canonicalization & Data Curation

Normalization: The system leverages a Vision-Language Model (VLM), specifically NanoBanana Pro, to synthesize a canonical front-facing T-pose image ($I_c$) from the input wild image ($I$). This disentangles pose and viewpoint variations from the garment's appearance.
Data Augmentation: The training dataset is augmented using the VLM to generate diverse multi-pose and multi-view images from base T-pose templates, ensuring the model learns robust features across variations.

B. Feature Extraction

The system extracts three complementary feature streams from the input image ($I$) and the canonical image ($I_c$):

Image Appearance Features ($f_i$): Extracted from the segmented original image using Hunyuan3D.
Canonical-Space Features ($f_c$): Extracted from the synthesized T-pose image using Hunyuan3D. These provide pose-invariant structural cues.
Pose-Aware Features ($f_p$): Extracted from the original image using SAM3D-Body to explicitly encode human body articulation and pose.

C. Feature Fusion & Parameter Decoding

Hybrid Attention Fusion: The three feature streams are projected into a shared embedding space and concatenated. A Transformer-based encoder with self-attention fuses these features, allowing the model to selectively attend to complementary cues (appearance vs. structure vs. pose).
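As a rough illustration of this fusion step, the Python sketch below projects three token streams to a shared width, concatenates them along the token axis, and applies one single-head self-attention layer. It is not the authors' implementation. The token counts, feature widths, `d_model`, the random matrices standing in for learned weights, the choice of the token axis for concatenation, and the helper names (`matmul`, `softmax`, `random_matrix`, `fuse`) are all assumptions made for the example.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix (assumption: real weights are trained)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def fuse(f_i, f_c, f_p, d_model=8, seed=0):
    """Project each stream to d_model, concatenate tokens, run one self-attention layer."""
    rng = random.Random(seed)
    tokens = []
    for stream in (f_i, f_c, f_p):
        proj = random_matrix(len(stream[0]), d_model, rng)  # one projection per stream
        tokens.extend(matmul(stream, proj))                 # concatenate along the token axis
    w_q, w_k, w_v = (random_matrix(d_model, d_model, rng) for _ in range(3))
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    k_t = [list(col) for col in zip(*k)]                    # transpose of K
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_model) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Hypothetical inputs: 4 appearance tokens and 4 canonical tokens of width 6,
# plus 2 pose tokens of width 3.
rng = random.Random(1)
f_i = [[rng.random() for _ in range(6)] for _ in range(4)]
f_c = [[rng.random() for _ in range(6)] for _ in range(4)]
f_p = [[rng.random() for _ in range(3)] for _ in range(2)]
fused, attn = fuse(f_i, f_c, f_p)
print(len(fused), len(fused[0]))  # 10 8
```

Each row of `attn` holds one token's attention weights over all ten tokens, which is the mechanism that lets an appearance token draw on canonical or pose tokens. A full encoder would likely stack several multi-head layers with residual connections and normalization, which are omitted here.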
Autoregressive Decoding: A Transformer decoder predicts the sewing pattern parameters autoregressively. The output includes:
- 2D Panel Geometry: Vertices and edge definitions (straight lines or quadratic Bézier curves for curvature).
- 3D Placement: Rigid transformations (rotation and translation) for each panel.
- Stitching Topology: Labels defining how edges connect between panels.

The result is a structured, parametric representation directly compatible with physical simulation.
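To make the three kinds of output concrete, here is a minimal Python sketch of one possible in-memory representation: panels with 2D vertices and edges (straight or quadratic Bézier), a rigid 3D placement per panel, and stitches that pair edges. The class and field names (`Edge`, `Panel`, `Stitch`, `sample_edge`, `place_in_3d`) and the use of an absolute control point are assumptions for illustration, not the paper's actual data format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec2 = Tuple[float, float]
Vec3 = Tuple[float, float, float]

@dataclass
class Edge:
    start: int                      # index of the start vertex in the panel
    end: int                        # index of the end vertex in the panel
    control: Optional[Vec2] = None  # None means straight; otherwise a quadratic Bezier control point

@dataclass
class Panel:
    name: str
    vertices: List[Vec2]
    edges: List[Edge]
    rotation: List[List[float]]     # 3x3 rotation matrix
    translation: Vec3

@dataclass
class Stitch:
    panel_a: str                    # stitched edges are identified by panel name and edge index
    edge_a: int
    panel_b: str
    edge_b: int

def sample_edge(panel: Panel, edge: Edge, t: float) -> Vec2:
    """Return the 2D point on an edge at parameter t in [0, 1]."""
    p0, p1 = panel.vertices[edge.start], panel.vertices[edge.end]
    if edge.control is None:
        return ((1 - t) * p0[0] + t * p1[0], (1 - t) * p0[1] + t * p1[1])
    c = edge.control
    a, b, d = (1 - t) ** 2, 2 * (1 - t) * t, t ** 2  # quadratic Bernstein weights
    return (a * p0[0] + b * c[0] + d * p1[0], a * p0[1] + b * c[1] + d * p1[1])

def place_in_3d(panel: Panel, point: Vec2) -> Vec3:
    """Lift a 2D panel point to z = 0, then apply the panel's rotation and translation."""
    local = (point[0], point[1], 0.0)
    return tuple(
        sum(panel.rotation[r][k] * local[k] for k in range(3)) + panel.translation[r]
        for r in range(3)
    )

# Hypothetical front panel: a 2 x 3 rectangle whose top edge is curved.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
front = Panel(
    name="front",
    vertices=[(0.0, 0.0), (2.0, 0.0), (2.0, 3.0), (0.0, 3.0)],
    edges=[Edge(0, 1), Edge(1, 2), Edge(2, 3, control=(1.0, 2.0)), Edge(3, 0)],
    rotation=identity,
    translation=(0.0, 0.0, 0.5),
)
side_seam = Stitch("front", 1, "back", 3)  # the "back" panel is not defined in this sketch

mid = sample_edge(front, front.edges[2], 0.5)
print(mid)                      # (1.0, 2.5)
print(place_in_3d(front, mid))  # (1.0, 2.5, 0.5)
```

Keeping the curve as a control point rather than a dense polyline is what makes such a pattern editable: moving one point reshapes the whole edge, and a simulator can resample it at any resolution.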

D. Post-Processing (Texture & Simulation)

Texture Generation: Textures are synthesized on the 3D garment surface using Hunyuan3D-Paint and then projected onto the UV map of the sewing patterns to ensure seam consistency.
Garment Simulation: The generated patterns are draped onto a SMPL-X body model. The system uses Position-Based Dynamics (PBD) for initial collision avoidance and the Codimensional Incremental Potential Contact (CIPC) simulator for dynamic, multi-layer garment simulation.
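The summary does not say which constraints the PBD stage uses, so the following Python sketch only shows two textbook PBD projections: a collision constraint that pushes a cloth particle out of a body proxy, and a distance constraint that restores an edge's rest length. Approximating the body by a single sphere, the function names, and the default inverse masses `w1` and `w2` are assumptions for illustration. The CIPC stage is not sketched.

```python
import math

def project_outside_sphere(p, center, radius, margin=0.0):
    """Collision constraint: if p lies inside the sphere, move it to the surface."""
    d = [p[i] - center[i] for i in range(3)]
    dist = math.sqrt(sum(x * x for x in d))
    target = radius + margin
    if dist >= target:
        return tuple(p)  # constraint already satisfied
    if dist == 0.0:
        # Degenerate case: no direction is defined, so pick +y arbitrarily.
        return (center[0], center[1] + target, center[2])
    scale = target / dist
    return tuple(center[i] + d[i] * scale for i in range(3))

def project_distance(p1, p2, rest_length, w1=1.0, w2=1.0):
    """Distance constraint: move both particles so their separation equals rest_length."""
    d = [p1[i] - p2[i] for i in range(3)]
    dist = math.sqrt(sum(x * x for x in d))
    if dist == 0.0 or w1 + w2 == 0.0:
        return tuple(p1), tuple(p2)
    n = [x / dist for x in d]   # unit vector from p2 to p1
    c = dist - rest_length      # constraint violation
    new_p1 = tuple(p1[i] - w1 / (w1 + w2) * c * n[i] for i in range(3))
    new_p2 = tuple(p2[i] + w2 / (w1 + w2) * c * n[i] for i in range(3))
    return new_p1, new_p2

# A particle halfway inside a unit sphere is pushed to the surface.
pushed = project_outside_sphere((0.5, 0.0, 0.0), (0.0, 0.0, 0.0), 1.0)
print(pushed)  # (1.0, 0.0, 0.0)

# Two particles 2 units apart with rest length 1 each move halfway in.
a, b = project_distance((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), 1.0)
print(a[0], b[0])  # 0.5 1.5
```

A PBD solver would apply projections like these repeatedly over all particles and edges in each step until the garment no longer intersects the body.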

3. Key Contributions

First Feed-Forward Pose-Agnostic Pipeline: Introduces a method that generates diverse, simulation-ready 2D sewing patterns and 3D garments from a single in-the-wild image without requiring multi-view inputs or iterative optimization.
VLM-Powered Canonicalization: Effectively utilizes Vision-Language Models to normalize pose and viewpoint, enabling the model to leverage priors from curated datasets while generalizing to wild data.
Hybrid Feature Fusion: Proposes a novel architecture that fuses pose-aware, canonical-structure, and appearance features via a Transformer encoder, enabling robust recovery of garment topology under complex poses.
End-to-End Fabrication Readiness: The output is not just a 3D mesh but a parametric sewing pattern with stitching topology, directly applicable to physical simulation, texture synthesis, and virtual try-on.

4. Experimental Results

The authors evaluated DressWild against two state-of-the-art baselines: NeuralTailor (point-cloud-based) and SewFormer (single-image-based).

Quantitative Performance:
- Panel Accuracy: DressWild achieved 94.35%, significantly outperforming NeuralTailor (25.99%) and SewFormer (28.81%).
- Edge Accuracy: DressWild achieved 85.41%, compared to 29.05% and 34.56% for baselines.
- Geometric Error: Reduced Shape L2 error to 6.22 (vs. ~23 for baselines) and achieved the lowest Chamfer Distance (0.01899).

Qualitative Performance:
- DressWild successfully reconstructed coherent patterns for complex garments (dresses, jackets) in diverse poses (jumping, walking, arching back) where baselines produced fragmented or misaligned panels.
- Ablation studies confirmed that removing canonical features or pose features significantly degraded accuracy, validating the necessity of the multi-stream fusion design.
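For readers unfamiliar with the Chamfer Distance reported among the quantitative results, the Python sketch below computes one common symmetric variant between two point sets: the mean squared nearest-neighbor distance in each direction, summed. Which variant the paper uses (squared or not, summed or averaged) and how points are sampled from the garments are not stated in this summary, so the definition and the function name are assumptions.

```python
def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean squared nearest-neighbor distance, both directions."""
    def one_way(src, dst):
        # For each source point, find the squared distance to its closest destination point.
        return sum(
            min(sum((s - d) ** 2 for s, d in zip(p, q)) for q in dst)
            for p in src
        ) / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)

# Two tiny point sets that differ in one point by a unit offset along z.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(predicted, reference))  # 1.0
```

This brute-force version costs O(N·M) distance evaluations. Evaluations on dense garment meshes would normally use a spatial index or a batched GPU implementation instead.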

5. Significance

Scalability & Efficiency: By replacing slow optimization loops with a feed-forward network, DressWild offers a scalable solution for real-time or high-throughput garment generation.
Bridging the Gap: It bridges the gap between visual generation (3D meshes) and industrial application (sewing patterns), enabling realistic digital fashion that can be physically manufactured or simulated.
Robustness: The ability to handle "in-the-wild" images with arbitrary poses makes the technology applicable to real-world scenarios like e-commerce, social media content creation, and digital avatars, where controlled capture is unavailable.