CAD-Tokenizer: Towards Text-based CAD Prototyping via Modality-Specific Tokenization

The paper proposes CAD-Tokenizer, a framework that employs modality-specific tokenization via a sequence-based VQ-VAE to overcome the limitations of standard LLM tokenizers, thereby significantly enhancing the quality and instruction-following capabilities of unified text-guided CAD prototyping.

Ruiyu Wang, Shizhao Sun, Weijian Ma, Jiang Bian

Published 2026-03-05

Imagine you are an architect trying to build a 3D model of a chair. You have two ways to talk to a computer about this:

  1. The "Raw Data" Way: You give the computer a list of millions of specific coordinates for every single point on the chair. It's like giving someone a recipe that lists the exact chemical composition of every grain of flour. It's precise, but it's overwhelming and hard to edit.
  2. The "CAD" Way (Computer-Aided Design): You give the computer a set of instructions: "Draw a circle, then pull it up to make a cylinder, then cut a hole in the side." This is how real engineers work. It's a sequence of logical steps (sketches and extrusions) that builds the object.
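The contrast can be sketched in a few lines of code. This is a toy illustration only (the point sampling and command names are invented, not the paper's actual data format): the same cylinder described as a dense list of surface points versus a short parametric command sequence.

```python
import math

# The "Raw Data" way: sample thousands of explicit surface points.
raw_points = [
    (math.cos(t) * 5, math.sin(t) * 5, h)  # points on a radius-5 cylinder wall
    for h in range(0, 11)                  # 11 height slices
    for t in [i * 2 * math.pi / 360 for i in range(360)]
]

# The "CAD" way: a handful of logical modeling steps.
cad_sequence = [
    ("sketch_circle", {"center": (0, 0), "radius": 5}),
    ("extrude", {"height": 10}),
]

print(len(raw_points))    # thousands of coordinates...
print(len(cad_sequence))  # ...versus two editable instructions
```

Changing the cylinder's height means editing one number in `cad_sequence`, but recomputing every point in `raw_points`.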

The problem is that AI models (like the ones powering ChatGPT) are trained to read and write human language. They are great at understanding words like "chair," "red," or "big." But when you ask them to write a CAD sequence, they get confused.

The Problem: The Wrong Dictionary

Think of a standard AI tokenizer (the part of the AI that breaks text into chunks it can understand) like a dictionary that only knows words.

If you ask a standard AI to write a CAD instruction like extrusion(10, 5), a standard tokenizer might chop it up into weird, meaningless pieces like ["extru", "sion", "(", "1", "0", ",", " ", "5", ")"].

It's like asking a chef to cook a meal, but the chef only understands the letters in the words "salt" and "pepper," not the concept of "seasoning." The AI loses the structure. It doesn't see that extrusion is a single, important action; it just sees a jumble of letters. This makes it terrible at building or editing complex 3D shapes.
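To make this concrete, here is a minimal sketch of the two tokenization styles. The subword split is illustrative (not the output of any specific tokenizer), and the primitive-level tokenizer is a toy regex, not the paper's actual VQ-VAE.

```python
import re

command = "extrusion(10, 5)"

# A generic subword tokenizer, seeing unfamiliar syntax, tends to
# fragment it into pieces with no CAD meaning (illustrative split).
subword_tokens = ["extru", "sion", "(", "1", "0", ",", " ", "5", ")"]

def primitive_tokenize(cmd: str) -> list[str]:
    """Toy example: treat each complete CAD command as one token."""
    return re.findall(r"[a-z_]+\([^)]*\)", cmd)

print(primitive_tokenize(command))  # ['extrusion(10, 5)']
```

One token per primitive means the model predicts "do an extrusion" as a single step, instead of assembling it letter by letter.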

The Solution: CAD-Tokenizer

The authors of this paper built a special translator called CAD-Tokenizer.

Imagine you are teaching a child to build with LEGO.

  • Old Way: You hand them a bag of loose bricks and say, "Build a house." They might just pile them up randomly because they don't know what a "wall" or a "roof" is as a single concept.
  • New Way (CAD-Tokenizer): You give them pre-assembled blocks: a "Wall Block," a "Roof Block," and a "Window Block." Now, when you say, "Build a house," they can snap these meaningful blocks together perfectly.

CAD-Tokenizer does exactly this for AI:

  1. It learns the "LEGO blocks" of CAD: Instead of breaking CAD code into random letters, it groups them into primitives (the basic building blocks like "draw a line," "make a curve," "extrude this shape").
  2. It speaks the AI's language: It translates these CAD blocks into a special code that the AI's brain (the Large Language Model) can understand and predict, just like it predicts the next word in a sentence.
  3. It follows the rules: CAD has strict grammar (you can't cut a hole before you draw the shape). The authors added a "rulebook" (a Finite State Automaton) that acts like a strict editor, ensuring the AI never makes a grammatical mistake in its 3D instructions.
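The "rulebook" idea can be sketched as a tiny finite state automaton. The states and transitions below are simplified assumptions for illustration, not the paper's actual CAD grammar: the only rule encoded is that an extrude must follow a completed sketch.

```python
# Toy finite state automaton enforcing "sketch before extrude".
TRANSITIONS = {
    "start":       {"sketch": "sketching"},
    "sketching":   {"line": "sketching", "arc": "sketching",
                    "close_sketch": "sketch_done"},
    "sketch_done": {"extrude": "solid", "sketch": "sketching"},
    "solid":       {"sketch": "sketching", "end": "done"},
}

def is_valid(sequence: list[str]) -> bool:
    """Walk the automaton; any move not in the table is a grammar violation."""
    state = "start"
    for token in sequence:
        if token not in TRANSITIONS.get(state, {}):
            return False
        state = TRANSITIONS[state][token]
    return True

print(is_valid(["sketch", "line", "close_sketch", "extrude"]))  # True
print(is_valid(["extrude", "sketch"]))                          # False
```

During generation, the same transition table can be used proactively: instead of rejecting finished sequences, the model's choices at each step are masked to only the tokens legal from the current state.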

What Does This Actually Do?

The paper shows that this new system can do two things in one go, which earlier systems struggled to handle within a single model:

  1. Text-to-CAD: You say, "Make a coffee mug," and the AI generates the perfect step-by-step instructions to build it.
  2. CAD Editing: You say, "Take that mug and make the handle bigger," and the AI knows exactly which step to change without breaking the whole model.
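Why editing becomes easy is worth spelling out. In a toy sketch (the command names and parameters below are invented for illustration, not the paper's format), editing means locating one step in the sequence and changing only its parameters, leaving every other step untouched:

```python
# Hypothetical mug as a sequence of (command, parameters) steps.
mug = [
    ("sketch_circle", {"radius": 40}),          # mug body profile
    ("extrude",       {"height": 90}),          # pull up into a cylinder
    ("sketch_circle", {"radius": 8}),           # handle profile
    ("sweep",         {"path": "handle_arc"}),  # sweep into a handle
]

def edit_step(sequence, index, **changes):
    """Return a new sequence with one step's parameters updated."""
    edited = [(cmd, dict(params)) for cmd, params in sequence]
    edited[index][1].update(changes)
    return edited

# "Make the handle bigger": change only step 2; the body is untouched.
bigger_handle = edit_step(mug, 2, radius=12)
print(bigger_handle[2])  # ('sketch_circle', {'radius': 12})
print(mug[2])            # original unchanged: ('sketch_circle', {'radius': 8})
```

A point-cloud representation has no such step 2 to point at; a localized edit there would mean regenerating or reshaping thousands of coordinates.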

The Result

By using this "LEGO block" approach instead of the "letter soup" approach, the AI becomes much smarter at design.

  • It's faster: It doesn't have to guess millions of tiny letters; it just picks the right building blocks.
  • It's more accurate: The 3D models it creates actually look like what you asked for.
  • It's more flexible: It can both create new things and fix old things, just like a human engineer.

In short: The paper teaches AI to stop thinking of 3D design as a jumble of letters and start thinking of it as a logical sequence of building blocks, making it a much better digital architect.
