cadrille: Multi-modal CAD Reconstruction with Reinforcement Learning

Imagine you have a physical object in your hand—a weirdly shaped coffee mug, a custom car part, or a piece of jewelry. Now, imagine you want to turn that physical object into a digital blueprint that engineers can edit, 3D print, or manufacture. This process is called CAD Reconstruction.

For a long time, doing this was like trying to translate a book written in a dead language without a dictionary. You needed expensive scanners, specialized skills, and the computer often got the details wrong.

Enter Cadrille (pronounced cad-ree-lee), a new AI model introduced in this paper that acts like a "Universal Translator" for the physical world, turning real-world objects into editable digital designs.

Here is the simple breakdown of how it works, using some everyday analogies.

1. The Problem: The "One-Trick Pony"

Before Cadrille, most AI models were like specialized chefs.

One chef could only cook if you gave them a pile of raw ingredients (a Point Cloud from a 3D scanner).
Another chef could only cook if you gave them a photo (an Image).
A third chef could only cook if you gave them a written recipe (a Text Description).

If you didn't have the exact ingredient they needed, they couldn't help you. Furthermore, even when they did cook, the food (the digital model) often came out broken or inedible (invalid code).

2. The Solution: The "Master Chef" (Cadrille)

Cadrille is different. It's a multimodal Master Chef.

It doesn't care if you hand it a 3D scan, a photo, or a text description like "a red cylinder with a hole in the middle."
It understands all three languages at once.
Instead of just guessing a shape, it writes Python code (specifically using a library called CadQuery). Think of this as the chef not just serving you a dish, but handing you the exact recipe so you can tweak the salt or change the shape later.

3. How It Learned: The "Apprentice" and the "Coach"

The paper describes a two-step training process, which is like training a new employee:

Step 1: The Internship (Supervised Fine-Tuning)
First, the AI is fed a massive library of roughly one million synthetic CAD models. It's like an apprentice watching thousands of hours of master craftsmen at work. It learns the rules: "If I see a circle here, I should write a command to draw a circle there."

The Catch: The apprentice is great at following rules but gets confused when the real world gets messy. If the data is slightly different from the training books, the apprentice freezes or makes mistakes.

Step 2: The Coaching Session (Reinforcement Learning)
This is the paper's big innovation. Instead of just memorizing more books, the AI is put in a "gym" where it tries to build models and gets instant feedback.

Imagine the AI tries to build a chair.
If the chair falls over (the code is invalid), the "Coach" (the computer program) gives it a harsh penalty: "No points! Try again."
If the chair stands up perfectly, it gets a reward.
Crucially, the AI learns from its mistakes in real-time. It figures out how to handle messy, real-world data (like a scan with noise or missing parts) that it never saw in the training books.

4. Why This is a Big Deal

The authors tested Cadrille on 10 different benchmarks (like final exams).

Versatility: It beat the best "specialized chefs" in every category, whether you gave it a photo, a scan, or text.
Reliability: Previous models often produced "broken" code that wouldn't run. Cadrille's "Coaching" phase made it incredibly reliable, almost never producing broken code.
Real-World Ready: They tested it on real-world scans (which are usually messy and imperfect). Cadrille handled them like a pro, whereas other models struggled.

The Analogy Summary

Old AI: A student who memorized a textbook perfectly but fails the test if the question is phrased slightly differently.
Cadrille: A student who memorized the textbook, and then spent months taking practice tests, failing, getting corrected, and learning exactly how to handle curveballs.

The Bottom Line

Cadrille is a breakthrough because it bridges the gap between the messy, imperfect real world and the precise, mathematical world of engineering. By using a "learn from mistakes" approach (Reinforcement Learning), it creates digital blueprints that are not only accurate but also editable, making high-tech design accessible to anyone with a camera or a scanner.

In short: It turns "I have this object" into "Here is the editable code for that object," no matter how you describe or scan it.

1. Problem Statement

Computer-Aided Design (CAD) is fundamental to engineering, yet creating 3D models manually is time-consuming and requires specialized skills. CAD reconstruction aims to automate this by generating editable CAD models from input data (e.g., point clouds, images, or text).

Current state-of-the-art (SOTA) methods face three critical limitations:

Modality Silos: Most existing models are single-modal (handling only point clouds, images, or text separately), limiting their generalizability and robustness.
Quality vs. Multimodality: Early multimodal attempts (e.g., CAD-MLLM, CAD-GPT) significantly underperform compared to single-modal SOTA methods.
Training-Testing Gap: Models trained on procedurally generated data often fail to generalize to real-world scans, while models trained on small, handcrafted datasets lack diversity. Furthermore, existing methods struggle to generate valid executable code, often producing syntax errors or non-manifold geometries.

2. Methodology: The Cadrille Framework

The authors propose Cadrille, a multimodal model that accepts point clouds, multi-view images, and textual descriptions, outputting executable Python code (specifically using the cadquery library) to reconstruct 3D CAD models.
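To make "executable Python code" concrete, the sketch below shows the kind of short CadQuery script such a model might emit. The script text is hand-written for illustration and is not an output from the paper. It is kept as a string and only syntax-checked here; `is_valid_python` and `try_execute` are hypothetical helper names, and actually running the script requires the `cadquery` package.

```python
import ast

# Hand-written illustration of a CadQuery script; not an actual model output.
SCRIPT = '''
import cadquery as cq

# A cylinder of radius 10 and height 20 with a through-hole of diameter 5.
result = (
    cq.Workplane("XY")
    .circle(10.0)
    .extrude(20.0)
    .faces(">Z")
    .workplane()
    .hole(5.0)
)
'''


def is_valid_python(script: str) -> bool:
    """Return True if the script parses as Python (a syntax check only)."""
    try:
        ast.parse(script)
        return True
    except SyntaxError:
        return False


def try_execute(script: str):
    """Run the script and return its `result`, or None if anything fails.

    Failure covers a missing cadquery install as well as CAD kernel errors.
    Only run scripts you trust: exec() executes arbitrary code.
    """
    namespace = {}
    try:
        exec(script, namespace)
    except Exception:
        return None
    return namespace.get("result")
```

Because the output is ordinary code, editing the design later amounts to changing a number such as the hole diameter and re-running the script.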

A. Architecture

Base Model: Built upon Qwen2-VL-2B, a Vision-Language Model (VLM) capable of processing text and images and generating code.
Multimodal Integration:
- Text & Images: Processed natively by the VLM's embedding and visual encoder layers.
- Point Clouds: Processed via a trainable linear projection layer (similar to CAD-Recode). Points are sampled via furthest point sampling (256 points, no normals) and embedded into the shared space.
Output: The model generates a sequence of tokens representing a Python script. When executed, this script produces a parametric Boundary Representation (B-Rep) of the 3D shape.
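The furthest point sampling step mentioned for point clouds can be sketched in plain Python as follows. This is a minimal greedy version written for illustration; the function name, the random test cloud, and the choice of the first point as the starting seed are assumptions, and the paper's implementation is presumably vectorised.

```python
import math
import random


def furthest_point_sampling(points, k, start=0):
    """Greedily pick k points, each one furthest from those already chosen."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return [points[i] for i in selected]


# Reduce a synthetic 2000-point cloud to 256 xyz points (no normals).
random.seed(0)
cloud = [(random.random(), random.random(), random.random()) for _ in range(2000)]
sample = furthest_point_sampling(cloud, 256)
```

Compared with uniform random sampling, this spreads the 256 points across the whole shape, so thin or small features are less likely to be missed.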

B. Training Pipeline: A Two-Stage Approach

Inspired by Large Language Model (LLM) training paradigms, Cadrille employs a distinct two-stage pipeline designed to bridge the gap between synthetic data and real-world application.

Stage 1: Supervised Fine-Tuning (SFT)

Data: Trained on a massive dataset of procedurally generated CAD models (approx. 1 million samples from CAD-Recode) combined with the DeepCAD dataset.
Goal: To teach the model the syntax of CAD code and the mapping from multimodal inputs to valid Python scripts.
Strategy: The model learns to minimize cross-entropy between ground-truth code and predictions. Crucially, the authors use only procedurally generated data for this stage to maximize diversity and volume, avoiding the "domain gap" issues seen when mixing small handcrafted datasets early on.
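The cross-entropy objective can be illustrated with a toy calculation over code tokens. The sketch assumes the loss is averaged over tokens (the summary does not say whether the paper averages or sums), and `sequence_cross_entropy` is a hypothetical name.

```python
import math


def sequence_cross_entropy(step_probs, target_ids):
    """Mean negative log-likelihood of the ground-truth tokens.

    step_probs[t] is the predicted distribution over the vocabulary at step t;
    target_ids[t] is the index of the ground-truth code token at that step.
    """
    if len(step_probs) != len(target_ids):
        raise ValueError("one distribution per target token is required")
    nll = [-math.log(probs[tok]) for probs, tok in zip(step_probs, target_ids)]
    return sum(nll) / len(nll)


# Toy vocabulary of 4 tokens and a 3-token target sequence.
uniform = [[0.25] * 4 for _ in range(3)]
confident = [[0.97, 0.01, 0.01, 0.01],
             [0.01, 0.97, 0.01, 0.01],
             [0.01, 0.01, 0.97, 0.01]]
targets = [0, 1, 2]
loss_uniform = sequence_cross_entropy(uniform, targets)
loss_confident = sequence_cross_entropy(confident, targets)
```

A model that guesses uniformly pays `log(4)` per token here, while one that concentrates probability on the correct tokens drives the loss toward zero.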

Stage 2: Reinforcement Learning (RL) Fine-Tuning

Motivation: SFT models often struggle with validity (generating code that crashes or produces invalid geometry) and generalization to real-world noise.
Data: Uses handcrafted datasets (DeepCAD, Fusion360) and real-world scans (CC3D). Crucially, this stage does not require ground-truth CAD sequences (code); it only requires the input (image/point cloud) and the ground-truth 3D mesh for evaluation.
Reward Function ($R$):
- $R(\tau) = r_{\text{IoU}}(\tau) + r_{\text{invalid}}(\tau)$
- $r_{\text{IoU}}$: Intersection over Union between the generated CAD mesh and the ground truth mesh (scaled by 10 to enforce precision).
- $r_{\text{invalid}}$: A heavy penalty (-10) if the generated code fails to execute or produces an invalid model; 0 otherwise.
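The reward described above can be sketched as a small function. Two assumptions are made for the sake of a self-contained example: shapes are represented as sets of occupied voxel coordinates rather than meshes, and the IoU term is taken to be 0 whenever the prediction is invalid. The constant and function names are hypothetical.

```python
IOU_SCALE = 10.0         # IoU is scaled by 10
INVALID_PENALTY = -10.0  # applied when code fails or the model is invalid


def voxel_iou(pred, target):
    """IoU between two shapes given as sets of occupied voxel coordinates."""
    union = pred | target
    if not union:
        return 0.0
    return len(pred & target) / len(union)


def reward(pred_voxels, target_voxels):
    """R = r_IoU + r_invalid; pred_voxels is None for an invalid prediction."""
    if pred_voxels is None:
        return INVALID_PENALTY  # r_IoU assumed to be 0 in this case
    return IOU_SCALE * voxel_iou(pred_voxels, target_voxels)


# A 2x2x2 target cube and a prediction shifted by one voxel along x.
target = {(x, y, z) for x in range(2) for y in range(2) for z in range(2)}
shifted = {(x + 1, y, z) for (x, y, z) in target}
r_shifted = reward(shifted, target)   # overlap 4 of 12 voxels
r_perfect = reward(target, target)
r_invalid = reward(None, target)
```

Under this reading, rewards range from -10 for an invalid output to 10 for a perfect match, so an invalid script always scores below any valid one.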
Algorithms:
- DPO (Direct Preference Optimization): Used for offline preference learning.
- Dr. CPPO: A hybrid of Dr. GRPO (removes reference model dependency) and CPPO (selects samples with the strongest signal). This online RL approach samples multiple outputs, calculates advantages, and updates the policy to maximize rewards.
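The "calculate advantages, keep the strongest signal" step can be illustrated as below. This is a sketch under assumptions, not the paper's implementation: advantages are computed as reward minus the group mean (the mean-centred form commonly associated with Dr. GRPO), and selection keeps the samples with the largest absolute advantage (in the spirit of CPPO). The policy-gradient update itself is omitted, and the function names are hypothetical.

```python
def group_advantages(rewards):
    """Advantage of each sampled output relative to its group's mean reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def strongest_signal(rewards, keep):
    """Return (index, advantage) for the `keep` samples with the largest |advantage|."""
    adv = group_advantages(rewards)
    ranked = sorted(range(len(adv)), key=lambda i: abs(adv[i]), reverse=True)
    return [(i, adv[i]) for i in sorted(ranked[:keep])]


# Rewards for four scripts sampled from the same input; the second was invalid.
group_rewards = [9.0, -10.0, 7.0, 8.0]
advantages = group_advantages(group_rewards)
kept = strongest_signal(group_rewards, keep=2)
```

In this toy group the invalid sample receives a strongly negative advantage and is retained for the update, which is how the penalty pushes the policy away from producing broken code.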
Hard Example Mining: RL is applied only to "hard" examples where the SFT model's initial reward is below a threshold ( $R_{th} = 7.5$ ), ensuring efficient use of compute.
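Hard example mining reduces to a filter on the SFT model's rewards. The sketch below assumes one reward value per training example (the summary does not say whether this is a single sample or an average over several); the example ids and the function name are hypothetical.

```python
R_TH = 7.5  # threshold quoted above


def select_hard_examples(sft_rewards, threshold=R_TH):
    """Return ids of examples whose SFT reward falls below the threshold."""
    return [ex_id for ex_id, r in sft_rewards.items() if r < threshold]


# Hypothetical rewards obtained by the SFT model on four training inputs.
sft_rewards = {"part_a": 9.2, "part_b": 7.5, "part_c": -10.0, "part_d": 6.9}
hard = select_hard_examples(sft_rewards)
```

Examples the SFT model already reconstructs well are skipped, so the RL compute goes to inputs where there is still reward to gain.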

3. Key Contributions

First Multimodal SOTA: Cadrille is the first model to achieve state-of-the-art results simultaneously across three input modalities (point clouds, images, text) within a unified framework, outperforming specialized single-modal baselines.
RL for Multimodal CAD: The paper shows empirically that Reinforcement Learning fine-tuning significantly improves multimodal CAD reconstruction, specifically enhancing code validity and geometric accuracy.
Novel Training Strategy: The authors demonstrate that separating SFT (on massive synthetic data) from RL (on scarce, high-quality handcrafted/real data) is superior to mixing these datasets during SFT. This approach bridges the domain gap without requiring massive handcrafted datasets for initial training.
Comprehensive Evaluation: The model sets new SOTA on 10 benchmarks across 4 datasets (DeepCAD, Fusion360, CC3D, Omni-CAD), including a real-world dataset (CC3D).

4. Experimental Results

The paper reports results using Chamfer Distance (CD), Intersection over Union (IoU), and Invalidity Ratio (IR).
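For readers unfamiliar with the other two metrics, here is a minimal sketch of Chamfer Distance and Invalidity Ratio. Conventions for Chamfer Distance vary (squared or unsquared distances, sum or mean, extra scaling); this version sums the two directed means of squared nearest-neighbour distances, which is an assumption rather than the paper's stated definition.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets (squared distances)."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)


def invalidity_ratio(valid_flags):
    """Fraction of predictions that did not yield a valid CAD model."""
    return sum(1 for ok in valid_flags if not ok) / len(valid_flags)


square = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
cd_same = chamfer_distance(square, square)
cd_point = chamfer_distance([(0, 0, 0)], [(1, 0, 0)])
ir = invalidity_ratio([True, True, False, True])
```

Lower is better for both CD and IR, while higher is better for IoU.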

Performance on DeepCAD: Cadrille outperforms all modality-specific baselines.
- Point Clouds: IoU 87.1% (vs. 77.3% for CAD-SIGNet), IR reduced to 0.0% (vs. 5.0%).
- Images: IoU 86.1%, with IR reduced to 0.0% (vs. 3.6% for CADCrafter).
- Text: IoU 82.1% (vs. 71.5% for Text2CAD).
Generalization (Zero-Shot):
- On Fusion360 and CC3D (real-world noisy scans), Cadrille achieves SOTA.
- On CC3D (point clouds), RL fine-tuning reduced the Invalidity Ratio from 7.7% (SFT only) to 0.1%, while improving IoU from 56.1% to 65.0%.
RL Impact:
- RL fine-tuning on images improved point cloud reconstruction performance, demonstrating cross-modal benefits.
- Dr. CPPO (Online RL) consistently outperformed DPO (Offline RL) and SFT, achieving near-zero invalidity ratios across all datasets.
Efficiency: Unlike previous methods that rely on test-time sampling (generating 10+ candidates to find a valid one), Cadrille achieves high validity with single-sample inference, maintaining speed comparable to single-modal baselines.
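The cost difference between test-time sampling and single-sample inference can be sketched with a stub generator. Everything here is illustrative: `sample_until_valid` and `looks_valid` are hypothetical names, the "model" is a fixed list of strings, and validity is reduced to a Python syntax check.

```python
def sample_until_valid(generate, is_valid, max_candidates=10):
    """Draw candidates until one passes is_valid; return (candidate, attempts)."""
    for attempt in range(1, max_candidates + 1):
        candidate = generate()
        if is_valid(candidate):
            return candidate, attempt
    return None, max_candidates


def looks_valid(script):
    """Syntax-only stand-in for 'the script runs and yields a valid model'."""
    try:
        compile(script, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


# Stub generator standing in for a model: two broken scripts, then a good one.
outputs = iter(["result = (", "x = = 1", "result = 1"])
script, attempts = sample_until_valid(lambda: next(outputs), looks_valid)
```

Each attempt is a full generation pass, so a model that is valid on the first try (`max_candidates=1`) avoids the multiplied inference cost.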

5. Significance and Impact

Democratization of Design: By supporting text, images, and point clouds, Cadrille lowers the barrier to entry for CAD creation, allowing non-experts to generate precise 3D models from simple descriptions or smartphone photos.
Robustness to Real-World Data: The successful application of RL on real-world scans (CC3D) suggests the model can handle noise, missing parts, and smoothed edges, making it viable for industrial reverse engineering.
Paradigm Shift: The paper establishes that programmatic feedback (executing code to verify geometry) is a powerful signal for training generative models in engineering domains, offering a blueprint for future research in AI-driven design.
Code Availability: The authors have open-sourced the code, facilitating further research and application development.

In conclusion, Cadrille represents a significant leap forward in CAD reconstruction, moving from fragile, single-modal prototypes to a robust, multimodal, RL-enhanced system capable of generating production-ready, editable 3D models from diverse real-world inputs.

