FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation

Imagine you are trying to teach a robot how to draw a complex 3D statue, like a dragon or a car, but the robot only understands a very specific, rigid language: a long list of numbers.

The Old Way: The "Number-by-Number" Struggle

In the past, if you wanted the robot to build a 3D mesh (a wireframe model made of triangles), you had to describe every single corner (vertex) of every single triangle separately.

Think of a 3D model like a giant mosaic made of millions of tiny tiles.

The Problem: To describe one tile, you had to list the coordinates of its three corners. Since each corner has an X, Y, and Z coordinate, describing just one triangle required 9 numbers.
The Bottleneck: If your statue has 1,000 triangles, the robot has to read a list of 9,000 numbers just to understand the shape.
The Result: The robot's brain (the computer) gets overwhelmed. It's like trying to read a 9,000-page book just to understand a simple story. The process is slow, expensive, and the robot often gets tired and makes mistakes, resulting in a wobbly, low-quality statue.

The New Way: FACE (The "Tile-by-Tile" Revolution)

The paper introduces a new method called FACE. Instead of forcing the robot to read every single number one by one, FACE changes the rules of the game.

The "One-Face-One-Token" Strategy
Imagine you are building a wall out of bricks.

Old Method: You tell the builder where each brick goes by reading out its corner positions one number at a time, so a single brick takes many separate instructions.
FACE Method: You hand the builder a pre-assembled brick and say, "Here is one complete brick. Now, here is the next one."

In the world of 3D models, a "brick" is a triangle face. FACE treats an entire triangle (with all its 3 corners and 9 numbers) as a single, unified unit.

Why This is a Game-Changer

The Shortcut: By grouping the 9 numbers into 1 unit, the robot's "reading list" becomes 9 times shorter.
- Analogy: Instead of reading a 9,000-page book, you are now reading a 1,000-page book.
Speed & Efficiency: Because the list is so much shorter, the computer doesn't have to work as hard. It's like switching from a slow, winding dirt road to a high-speed highway. The paper reports a list half as long as the best previous method's (a compression ratio of 0.11 versus 0.22).
Better Quality: You might think, "If we group things together, do we lose detail?" Surprisingly, no. Because the computer isn't struggling with the sheer volume of data, it can focus its energy on getting the shape right. The result is a statue that looks sharp, smooth, and realistic, with no weird holes or glitches.

The Magic "Latent Space" (The Robot's Dream)

The paper also shows that this new way of thinking helps the robot "dream" up new shapes.

They trained the robot to look at a single photo of an object and then build a 3D model of it.
Because the robot learned to understand shapes as "faces" rather than "numbers," it created a very organized mental library (called a latent space).
When you show it a picture of a chair, it doesn't just guess; it first forms a compact description of the whole shape in that library and then builds the mesh face by face from it, even if it has never seen that specific chair before.

The Bottom Line

FACE is like upgrading from a typewriter that types one letter at a time to a printer that prints whole words at once.

Before: Slow, clunky, and prone to errors when the job got big.
Now: Fast, efficient, and capable of creating high-definition 3D worlds from simple inputs like point clouds or single images.

This breakthrough lowers the barrier for creating 3D content, meaning in the future, video games, movies, and virtual reality could be populated with detailed, realistic objects generated far faster than they can be modeled by hand.

1. Problem Statement

Current autoregressive (AR) models for 3D mesh generation suffer from a fundamental bottleneck: computational inefficiency due to long token sequences.

The Status Quo: Existing methods (e.g., MeshGPT, MeshXL) flatten a mesh into a 1D sequence of individual vertex coordinates. Since a triangle face consists of 3 vertices (9 coordinates), the sequence length $S$ is $9 \times N$ (where $N$ is the number of faces).
The Bottleneck: Transformer-based AR models rely on self-attention mechanisms with quadratic complexity $O(S^2)$. The massive sequence length required for high-fidelity meshes makes training and inference computationally prohibitive, limiting the resolution and complexity of generated geometries.
Limitations of Prior Solutions: Previous attempts to mitigate this via complex traversal algorithms or block indexing often introduce brittleness, disrupt global structure, or lead to exploding vocabulary sizes, addressing the symptom (length) rather than the root cause (semantic level).
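
The vertex-level serialization described above can be sketched as follows. This is a minimal illustration, assuming coordinates are already quantized to integers; the function name `flatten_to_coordinate_tokens` and the toy two-triangle mesh are hypothetical and are not taken from MeshGPT, MeshXL, or the paper.

```python
from typing import List, Tuple

Vertex = Tuple[int, int, int]          # quantized (x, y, z); an assumed layout
Face = Tuple[Vertex, Vertex, Vertex]   # one triangle


def flatten_to_coordinate_tokens(faces: List[Face]) -> List[int]:
    """Serialize a triangle list as one token per coordinate (9 per face)."""
    tokens: List[int] = []
    for face in faces:
        for vertex in face:
            tokens.extend(vertex)
    return tokens


# Two triangles sharing an edge (toy data, not from the paper).
faces = [
    ((0, 0, 0), (1, 0, 0), (0, 1, 0)),
    ((1, 0, 0), (1, 1, 0), (0, 1, 0)),
]
tokens = flatten_to_coordinate_tokens(faces)
print(len(faces), len(tokens))  # N = 2 faces, S = 9 * N = 18 tokens
```

Note that the shared edge is written out twice, once per face, which is part of why the flattened sequence grows so quickly with face count.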

2. Methodology: The FACE Framework

The authors propose FACE, a novel Autoregressive Autoencoder (ARAE) that reconceptualizes mesh generation by operating at the face level rather than the vertex level.

Core Innovation: "One-Face-One-Token"

Instead of treating 9 coordinates as 9 separate tokens, FACE treats each triangle face as a single unified token.

Sequence Reduction: This reduces the sequence length by a factor of 9 ($S = N$ instead of $9N$).
Complexity Impact: This theoretically reduces the computational cost of self-attention by a factor of $9^2 = 81$, enabling scalable generation of high-resolution meshes.
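
As a worked instance of these two figures, take a mesh with $N = 1000$ faces (an illustrative number, not one from the paper) and count only the pairwise attention interactions, ignoring constant factors:

$$
\frac{\text{attention cost with coordinate tokens}}{\text{attention cost with face tokens}}
= \frac{(9N)^2}{N^2}
= \frac{9000^2}{1000^2}
= \frac{8.1 \times 10^{7}}{1.0 \times 10^{6}}
= 81
$$

The factor of 81 applies to the attention term alone. Components whose cost grows linearly with sequence length shrink by a factor of about 9, and each face still has to be expanded back into 9 coordinates at the output, so the end-to-end speedup is likely smaller than 81.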

Architecture Components

The framework consists of two main components:

Shape Encoder (VecSet Encoder):
- Input: A point cloud $P$ (XYZ + Normals).
- Mechanism: Based on the 3DShape2VecSet architecture. It uses Farthest Point Sampling (FPS) to select query points, which attend to the full point cloud via Cross-Attention to aggregate global geometry.
- Output: A compact, structured latent representation called a VecSet ($C$), which serves as the global conditioning signal.
Autoregressive Face Decoder:
- Input: The latent VecSet $C$ and the sequence of previously generated faces.
- Face Embedding: A lightweight MLP projects the 9D coordinate vector of a face into a single $d_{model}$-dimensional token ("Face Pooling").
- Transformer Processing:
  - Causal Self-Attention: Models the local structure and connectivity of the mesh based on the sequence of generated faces.
  - Cross-Attention: Injects the global shape context from the Encoder's VecSet $C$ at every layer to ensure consistency with the target geometry.
- Face Decoding Head (CausalMLP):
  - Unlike standard parallel decoding, this head uses a CausalMLP to decode the 9 coordinates of a face sequentially.
  - The prediction of the $j$-th coordinate is conditioned on the latent face vector and the previously predicted coordinates within that same face. This hierarchical autoregression (face-level + coordinate-level) ensures robust generation.
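
The two nested levels of autoregression in the decoder can be sketched as a pair of loops. This is only a control-flow illustration under stated assumptions: `next_latent` and `predict_next` are hypothetical callables standing in for the transformer and the CausalMLP head, the toy stand-ins carry no learned behavior, and generating a fixed number of faces is an assumption, since the stopping rule is not described here.

```python
from typing import Callable, List, Sequence

# Hypothetical interfaces standing in for learned networks:
#   FaceLatentFn: previous faces -> latent vector for the next face
#                 (the role of the transformer with causal self-attention
#                 and cross-attention to the VecSet C).
#   CoordFn:      (face latent, coordinates decoded so far) -> next coordinate
#                 (the role of the CausalMLP head).
FaceLatentFn = Callable[[List[List[int]]], Sequence[float]]
CoordFn = Callable[[Sequence[float], Sequence[int]], int]


def decode_face(latent: Sequence[float], predict_next: CoordFn) -> List[int]:
    """Inner loop: decode the 9 coordinates of one face, one at a time."""
    coords: List[int] = []
    for _ in range(9):
        # Each coordinate sees the face latent and the coordinates before it.
        coords.append(predict_next(latent, coords))
    return coords


def generate_faces(next_latent: FaceLatentFn, predict_next: CoordFn,
                   num_faces: int) -> List[List[int]]:
    """Outer loop: one step per face, conditioned on all previous faces."""
    faces: List[List[int]] = []
    for _ in range(num_faces):
        latent = next_latent(faces)
        faces.append(decode_face(latent, predict_next))
    return faces


# Toy stand-ins so the sketch runs end to end.
def toy_latent(faces: List[List[int]]) -> Sequence[float]:
    return [float(len(faces))]


def toy_coord(latent: Sequence[float], prefix: Sequence[int]) -> int:
    return int(latent[0]) * 9 + len(prefix)


mesh = generate_faces(toy_latent, toy_coord, num_faces=2)
print(mesh)  # [[0, 1, ..., 8], [9, 10, ..., 17]]
```

The outer loop runs once per face, so the transformer sequence has length $N$; the 9 inner steps run in the lightweight head rather than as extra transformer positions.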

Image-to-Mesh Extension

The authors validate the versatility of the learned latent space by training a Diffusion Transformer (DiT).

The DiT denoises the latent VecSet conditioned on image features (extracted via DINOv3).
The pre-trained FACE decoder then reconstructs the mesh from the generated latent VecSet, enabling high-fidelity single-image-to-mesh generation without fine-tuning the decoder.

3. Key Contributions

Novel Paradigm: Introduction of the "one-face-one-token" strategy, shifting autoregressive generation from vertex-level to face-level semantics.
Unprecedented Efficiency: Achieves a compression ratio of 0.11 (sequence length relative to a 1-token-per-coordinate baseline), effectively halving the sequence length of the previous state-of-the-art (0.22).
State-of-the-Art Quality: Demonstrates that this efficiency gain does not compromise fidelity. FACE achieves SOTA reconstruction quality (lowest Hausdorff and Chamfer distances) on multiple benchmarks.
Versatile Latent Space: Shows that the learned latent space transfers to a downstream task, single-image-to-mesh generation via latent diffusion, which suggests it is robust and generalizable.
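
For readers unfamiliar with the two reconstruction metrics named above, here is a minimal brute-force sketch on small point sets. Conventions vary (squared versus unsquared distances, sums versus means, one-sided versus symmetric), and the paper's exact variant is not specified here, so the definitions below are one common choice and not necessarily the one behind the reported numbers.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def _nearest(p: Point, cloud: List[Point]) -> float:
    """Distance from p to its closest point in cloud (brute force)."""
    return min(math.dist(p, q) for q in cloud)


def chamfer_distance(a: List[Point], b: List[Point]) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbor distance, both ways."""
    a_to_b = sum(_nearest(p, b) for p in a) / len(a)
    b_to_a = sum(_nearest(q, a) for q in b) / len(b)
    return a_to_b + b_to_a


def hausdorff_distance(a: List[Point], b: List[Point]) -> float:
    """Symmetric Hausdorff distance: worst-case nearest-neighbor distance."""
    a_to_b = max(_nearest(p, b) for p in a)
    b_to_a = max(_nearest(q, a) for q in b)
    return max(a_to_b, b_to_a)


# Toy point sets: the second point is displaced by 0.5 along z.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)]
print(chamfer_distance(a, b))    # 0.5 (0.25 in each direction)
print(hausdorff_distance(a, b))  # 0.5
```

Chamfer distance averages the error over all points, while Hausdorff distance reports the single worst deviation, which makes it more sensitive to localized defects such as a missing fragment.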

4. Experimental Results

Datasets: Evaluated on Objaverse, Toys4K, and Famous (complex, iconic shapes).
Quantitative Performance:
- Reconstruction: FACE significantly outperforms baselines (MeshAnything, MeshAnythingV2, TreeMeshGPT, BPT).
  - Example: On the Famous dataset, FACE achieved a Hausdorff Distance of 0.077, compared to 0.143 for the next best method (BPT).
- Compression: Achieved a practical compression ratio of 0.11, compared to 0.22 for the previous best.
Qualitative Results:
- Generated meshes exhibit fewer topological errors (holes, disconnected fragments) and better preservation of sharp features compared to baselines.
- In image-to-mesh tasks, FACE produces more aligned and detailed structures (e.g., Lego figures, birds) compared to EdgeRunner.
Ablation Studies:
- Ordering: Simple lexicographical ZYX sorting outperformed complex graph traversal (DFS/BFS).
- Decoding: The CausalMLP head significantly outperformed parallel and attention-based decoding strategies.
- Scaling: A larger model (1.2B parameters, 1024-resolution quantization) showed even higher fidelity, confirming the framework's scalability.
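
The coordinate quantization and ZYX ordering mentioned in the ablations can be sketched as follows. The details are assumptions for illustration: coordinates are taken to lie in $[-1, 1]$, each face is rotated so its lowest vertex comes first (which keeps the winding direction), and faces are then sorted by their vertices. The paper's exact canonicalization may differ, and all function names here are hypothetical.

```python
from typing import List, Tuple

Vertex = Tuple[int, int, int]          # quantized (x, y, z)
Face = Tuple[Vertex, Vertex, Vertex]


def quantize(value: float, resolution: int = 1024,
             lo: float = -1.0, hi: float = 1.0) -> int:
    """Map a coordinate in [lo, hi] to an integer bin in [0, resolution - 1]."""
    t = (value - lo) / (hi - lo)
    return min(resolution - 1, max(0, int(t * resolution)))


def zyx_key(v: Vertex) -> Tuple[int, int, int]:
    """Sort key comparing z first, then y, then x."""
    return (v[2], v[1], v[0])


def canonical_face(face: Face) -> Face:
    """Rotate the face so its ZYX-lowest vertex is first, keeping winding."""
    start = min(range(3), key=lambda i: zyx_key(face[i]))
    return (face[start], face[(start + 1) % 3], face[(start + 2) % 3])


def sort_faces(faces: List[Face]) -> List[Face]:
    """Canonicalize each face, then sort faces lexicographically in ZYX."""
    rotated = [canonical_face(f) for f in faces]
    return sorted(rotated, key=lambda f: tuple(zyx_key(v) for v in f))


faces = [
    ((5, 5, 5), (0, 0, 1), (3, 3, 3)),
    ((1, 0, 0), (0, 1, 0), (0, 0, 0)),
]
print(sort_faces(faces))
print(quantize(-1.0), quantize(0.0), quantize(1.0))  # 0 512 1023
```

A fixed ordering like this gives every mesh a single canonical sequence, so the model does not have to learn that different traversals describe the same shape.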

5. Significance

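FACE represents a paradigm shift in 3D generative modeling. By treating the face, not the individual coordinate, as the fundamental building block of a mesh, the authors shorten the sequence ninefold. Self-attention remains quadratic in sequence length, but its cost drops by a theoretical factor of 81, which eases the bottleneck that has limited AR models.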

Scalability: The method lowers the barrier for generating high-resolution, complex 3D assets, making high-fidelity mesh generation computationally feasible.
Foundation for Future Work: The successful integration with diffusion models suggests that FACE provides a robust, generalizable representation for 3D shapes, paving the way for multi-modal 3D creation workflows (e.g., text-to-mesh, image-to-mesh).
Simplicity: The approach achieves these gains through architectural elegance rather than complex, lossy compression schemes, offering a simple and scalable solution for the industry.