LATO: 3D Mesh Flow Matching with Structured TOpology Preserving LAtents

Imagine you want to build a 3D model of a dragon for a video game. You have two main ways to do this in the world of AI:

The "Mud Sculptor" Approach (Current Standard): The AI starts with a giant block of digital clay (a dense grid). It carves away the excess to reveal the dragon's shape. The problem? The result is a messy pile of tiny, uneven triangles. It looks okay from a distance, but if a game developer tries to rig the dragon's wings to make them flap, the messy triangles break. It's like trying to build a house out of wet sand; it holds a shape, but you can't easily move the rooms around.
The "Lego Master" Approach (Explicit Autoregressive Methods): The AI tries to build the dragon piece by piece, snapping Lego bricks together one by one. This creates an explicit, well-organized structure, but it is slow. If the dragon is complex, the sequence of bricks becomes extremely long, which often exhausts memory and leaves the model truncated or broken.

Enter LATO: The "Architect with a Blueprint."

The paper introduces LATO, a new method that acts like a master architect who doesn't just sculpt mud or place bricks one by one. Instead, LATO builds a smart, invisible blueprint that knows exactly where every wall, window, and door should go before the building is even finished.

Here is how LATO works, broken down into simple analogies:

1. The Secret Ingredient: The "Vertex Displacement Field" (VDF)

Most AI models just look at a shape and say, "Is this point inside or outside the object?" LATO is smarter. It looks at a point on the surface and asks, "Where are the three corners of the triangle I'm standing on?"

The Analogy: Imagine you are standing on a floor tile. A normal AI just knows you are on the floor. LATO knows you are on a specific tile and can point to the three corners of that tile. It carries this "pointing" information everywhere. This allows the AI to understand the structure (the topology) of the object, not just its shape.

2. The Compression: The "Magic Squeeze" (T-Voxels)

Because carrying all that "pointing" information for a whole dragon is too heavy, LATO uses a Sparse Voxel VAE. Think of this as a magic compression algorithm.

The Analogy: Imagine you have a giant, messy room full of furniture. Instead of moving every single chair and table individually, you put them all into a few smart, self-organizing boxes. These boxes (called T-Voxels) only exist where there is furniture. If a part of the room is empty, the box doesn't exist. This keeps the data tiny and efficient. Crucially, these boxes remember how the furniture connects to each other.

3. The Construction: "Growing and Pruning"

When LATO needs to build the dragon, it doesn't start with the final model. It starts with a rough, low-resolution cloud of those smart boxes.

The Analogy: Think of it like a tree growing.
1. Grow: The AI starts with a few big branches (coarse boxes).
2. Split: It splits those branches into smaller twigs (subdividing the boxes).
3. Prune: It immediately cuts off the twigs that are empty air (pruning).
4. Repeat: It keeps splitting and pruning until it has the exact shape of the dragon.
- The Magic: Because it started with a "smart blueprint," it knows exactly where the vertices (the corners) are. It doesn't have to guess; it just reveals them.

4. The Connection: The "Wiring Diagram" (Connection Head)

Once the AI has placed all the points (vertices), it needs to know which points connect to which to form the triangles.

The Analogy: Imagine you have a pile of dots on a piece of paper. A normal AI might try to connect them randomly, creating a mess. LATO has a special "Wiring Head" that looks at the dots and instantly draws the lines between them, knowing exactly which ones should be connected to form a perfect triangle. It predicts the connections directly, skipping the messy "guessing" phase.

Why is this a Big Deal?

Speed: Unlike the "Lego Master" approach, which can take minutes for a detailed model, LATO is as fast as the "Mud Sculptor," generating complex models in 3 to 10 seconds.
Quality: Unlike the "Mud Sculptor" approach that leaves you with messy, unusable triangles, LATO gives you clean, organized triangles that game developers and animators can actually use.
Flexibility: It can handle "open" shapes (like a cup without a bottom) or weird shapes that usually break other AI models.

In Summary:
LATO is like a construction crew that doesn't just pour concrete (fast but messy) or lay bricks one by one (clean but too slow). Instead, they use a smart, self-expanding 3D grid that knows the rules of architecture. They grow the building from coarse to fine, cutting away the empty space as they go, resulting in a cleanly structured, game-ready 3D model in a matter of seconds.

1. Problem Statement

Current 3D generative models face a fundamental dichotomy between scalability and explicit topological control:

Implicit Field Approaches (e.g., SDF, Occupancy): Methods like TRELLIS, CLAY, and Hunyuan3D use latent diffusion to generate shapes. While scalable and geometrically faithful, they rely on isosurface extraction (e.g., Marching Cubes) to produce meshes. This results in dense, irregular triangulations that lack the structured topology required for downstream tasks like rigging, deformation, or game engine deployment. They also strictly require "watertight" training data, discarding open surfaces.
Explicit Mesh Approaches (e.g., Autoregressive Models): Methods like MeshGPT or MeshAnything treat mesh generation as a sequence modeling problem. While they produce explicit topology, they suffer from computational bottlenecks. Generating high-resolution meshes requires excessively long token sequences, leading to memory constraints, truncated training, and fragmented/broken surfaces.

The Core Challenge: How to generate 3D meshes that are both scalable (efficient inference, high resolution) and topology-aware (explicit, artist-friendly connectivity) without relying on post-hoc isosurfacing or suffering from sequence length limits.

2. Methodology: LATO

The authors propose LATO, a framework that bridges this gap by introducing a topology-preserving sparse voxel representation called T-Voxels. The pipeline consists of three main components:

A. Vertex Displacement Field (VDF) Representation

Instead of encoding raw points or occupancy, LATO represents a mesh $M=(V, F)$ using a Vertex Displacement Field.

Mechanism: For any point $p$ sampled on a triangular face, the field value $F(p)$ is defined as the set of three relative displacement vectors pointing from $p$ to the three vertices of that face. (Here $F(\cdot)$ denotes the field, not the face set $F$ in $M=(V, F)$.)
Advantage: Unlike discrete labels (vertex/edge/face) which create class imbalance and discontinuities, VDF provides a dense, continuous signal. It encodes both the surface geometry and the discrete vertex distribution/connectivity simultaneously, making it suitable for gradient-based diffusion learning.
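
To make the definition concrete, here is a minimal sketch of evaluating a VDF at one surface sample. It assumes the displacement is taken as vertex minus sample point; the helper names (`point_on_face`, `vdf`) and the toy triangle are illustrative and not taken from the paper.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Triangle = Tuple[Vec3, Vec3, Vec3]


def point_on_face(tri: Triangle, w: Tuple[float, float, float]) -> Vec3:
    """Point on a triangle from barycentric weights w (non-negative, summing to 1)."""
    return tuple(sum(w[k] * tri[k][axis] for k in range(3)) for axis in range(3))


def vdf(p: Vec3, tri: Triangle) -> List[Vec3]:
    """Displacements from sample p to the three vertices of the face it lies on."""
    return [tuple(v[i] - p[i] for i in range(3)) for v in tri]


# Toy face in the z = 0 plane (hypothetical data).
tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
p = point_on_face(tri, (0.5, 0.25, 0.25))
field = vdf(p, tri)

# Adding each displacement back to p recovers the face's vertices,
# which is why the field carries vertex information at every surface point.
recovered = [tuple(p[i] + d[i] for i in range(3)) for d in field]
```

Every sample on the same face points at the same three vertices, so the signal is dense over the surface while still identifying discrete vertex locations.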

B. Sparse Voxel VAE with T-Voxels

A Variational Autoencoder (VAE) compresses the VDF into a structured latent space called T-Voxels.

Encoder: Uses a sparse convolutional/transformer backbone to process point clouds infused with VDF features.
Decoder (Hierarchical Subdivision & Pruning):
- Subdivision: Starts with coarse voxels and progressively subdivides them (octree-like) to increase resolution.
- Pruning: A learnable "Pruning Head" discards empty voxels at each stage to maintain sparsity.
- Vertex Instantiation: The centroids of the final refined voxels become the predicted vertex locations ( $\hat{v}$ ).
Connection Head: A dedicated module predicts edge connectivity between vertex pairs. It uses cross-attention to aggregate global context and an MLP to predict the existence of edges ( $\hat{e}_{ij}$ ). This allows the model to recover the mesh topology directly without isosurface extraction.
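
The subdivide-and-prune loop of the decoder can be sketched as follows. This is a toy illustration only: the learned pruning head is replaced by a hand-written geometric test (`near_sphere`), and the grid sizes are far smaller than anything the paper uses. All function names are hypothetical.

```python
from typing import Callable, Iterable, List, Set, Tuple

Voxel = Tuple[int, int, int]  # integer grid index at the current resolution
Point = Tuple[float, float, float]


def subdivide(voxels: Set[Voxel]) -> Set[Voxel]:
    """Split every voxel into its 8 octree children at twice the resolution."""
    return {
        (2 * x + dx, 2 * y + dy, 2 * z + dz)
        for (x, y, z) in voxels
        for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)
    }


def centroid(v: Voxel, res: int) -> Point:
    """Centre of voxel v inside the unit cube at grid resolution res."""
    return tuple((c + 0.5) / res for c in v)


def decode(coarse: Iterable[Voxel], res: int, levels: int,
           keep: Callable[[Voxel, int], bool]) -> Tuple[List[Point], int]:
    """Alternate subdivision and pruning, then emit voxel centroids as vertices."""
    voxels = set(coarse)
    for _ in range(levels):
        res *= 2
        voxels = {v for v in subdivide(voxels) if keep(v, res)}
    return sorted(centroid(v, res) for v in voxels), res


def near_sphere(v: Voxel, res: int) -> bool:
    """Stand-in for the pruning head: keep voxels near a sphere of radius 0.4."""
    cx, cy, cz = centroid(v, res)
    r = ((cx - 0.5) ** 2 + (cy - 0.5) ** 2 + (cz - 0.5) ** 2) ** 0.5
    return abs(r - 0.4) <= 1.0 / res


coarse = [(x, y, z) for x in range(4) for y in range(4) for z in range(4)]
vertices, final_res = decode(coarse, res=4, levels=2, keep=near_sphere)
```

Because pruning happens at every level, the number of active voxels tracks the surface rather than the full volume, which is what keeps the representation sparse as resolution grows.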

C. Two-Stage Flow Matching

To generate meshes from images (or other conditions), LATO employs a cascaded Flow Matching process:

Stage 1 (Structure Synthesis): A flow matching model synthesizes a coarse sparse voxel occupancy grid (128³ resolution) from the input image. This establishes the global shape and handles open surfaces/non-manifold geometries.
Stage 2 (Topology Refinement): A second flow matching model takes the occupied voxels from Stage 1 as spatial anchors and generates the detailed T-Voxel features (topology and vertex offsets).
Decoding: The populated T-Voxels are passed to the VAE decoder to instantiate the final explicit mesh.
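
The cascade can be illustrated with a toy sampler. The sketch below assumes a straight-line (linear interpolation) flow and forward Euler integration, neither of which is confirmed by this summary, and it replaces both trained, image-conditioned networks with a closed-form velocity field (`toward`) that flows onto a fixed target. The four-cell grid and all target values are made up for illustration.

```python
import random
from typing import Callable, Dict, List

Vector = List[float]
Velocity = Callable[[Vector, float], Vector]


def euler_sample(velocity: Velocity, x0: Vector, steps: int = 20) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with forward Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so (1 - t) never reaches zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toward(target: Vector) -> Velocity:
    """Toy velocity field: exact straight-line flow onto a single target.

    A trained network conditioned on the input image would take its place.
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


rng = random.Random(0)


def noise(n: int) -> Vector:
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


# Stage 1: one occupancy score per coarse cell (positive means occupied).
stage1_target = [1.0, -1.0, 1.0, -1.0]  # hypothetical, 4 cells instead of a 128^3 grid
scores = euler_sample(toward(stage1_target), noise(4))
occupied = [i for i, s in enumerate(scores) if s > 0.0]

# Stage 2: latent features are generated only at the occupied cells.
stage2_targets: Dict[int, Vector] = {0: [0.2, 0.7], 2: [-0.5, 0.1]}  # hypothetical
features = {i: euler_sample(toward(stage2_targets[i]), noise(2)) for i in occupied}
```

The point of the split is visible in the last line: the second stage never spends computation on cells that the first stage marked as empty.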

3. Key Contributions

T-Voxels Latent Representation: A novel sparse voxel latent that explicitly encodes mesh topology (vertex positions and edge connectivity) rather than just surface presence.
Vertex Displacement Field (VDF): A continuous, dense supervision signal that bridges the gap between continuous surface fields and discrete mesh structures, enabling effective training on open surfaces and non-manifold assets.
Direct Topology Recovery: A decoding strategy that predicts vertex locations and edge connectivity directly, eliminating the need for Marching Cubes or heuristic meshing, thus producing "artist-friendly" edge flows.
Scalable Flow Matching: A two-stage generation process that decouples structural synthesis from topological refinement, achieving high-fidelity generation with inference times (3–10s) comparable to implicit models but with explicit topology.
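
As an illustration of direct topology recovery, the sketch below turns pairwise edge scores into triangles. The summary does not say how faces are assembled from predicted edges, so the rule used here (threshold the scores, then take every set of three mutually connected vertices as a face) is an assumption, and the scores are invented.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]
Face = Tuple[int, int, int]


def edges_from_scores(scores: Dict[Edge, float], threshold: float = 0.5) -> Set[Edge]:
    """Keep vertex pairs whose predicted edge probability reaches the threshold."""
    return {(min(i, j), max(i, j)) for (i, j), p in scores.items() if p >= threshold}


def faces_from_edges(num_vertices: int, edges: Set[Edge]) -> List[Face]:
    """List every triple of mutually connected vertices, each exactly once."""
    adj: List[Set[int]] = [set() for _ in range(num_vertices)]
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    faces: List[Face] = []
    for i in range(num_vertices):
        for j in sorted(n for n in adj[i] if n > i):
            # A common neighbour k of i and j closes the triangle (i, j, k).
            for k in sorted(n for n in adj[i] & adj[j] if n > j):
                faces.append((i, j, k))
    return faces


# Hypothetical scores for a square (vertices 0..3) split by the diagonal 0-2.
scores = {
    (0, 1): 0.97, (1, 2): 0.93, (2, 3): 0.95, (0, 3): 0.91,
    (0, 2): 0.88,  # diagonal that is kept
    (1, 3): 0.04,  # diagonal that is rejected
}
edges = edges_from_scores(scores)
faces = faces_from_edges(4, edges)
```

Here the square is recovered as the two triangles (0, 1, 2) and (0, 2, 3). On real meshes a plain three-clique rule can also pick up triangles that are not surface faces, so a practical pipeline would likely need extra filtering.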

4. Experimental Results

The paper evaluates LATO on artist-crafted meshes (G-Objaverse, Toys4K) and dense synthetic meshes (TRELLIS).

Quantitative Performance:
- Reconstruction: LATO outperforms state-of-the-art explicit methods (MeshGPT, MeshCraft, PivotMesh) in Chamfer Distance (CD), Hausdorff Distance (HD), and Normal Consistency (NC).
- Generation: Compared to autoregressive baselines (MeshAnything-v2, DeepMesh, BPT), LATO achieves superior alignment scores and generates hole-free meshes with well-formed topology.
Inference Efficiency:
- LATO generates meshes in 3–10 seconds on a single H100 GPU.
- In contrast, autoregressive methods scale poorly with mesh complexity, often taking minutes for high-fidelity outputs due to sequential token prediction.
Qualitative Analysis:
- Topology: Unlike implicit models that produce dense, irregular triangulations, LATO generates clean, structured edge flows suitable for animation and deformation.
- Completeness: Unlike autoregressive models that often produce fragmented components due to sequence truncation, LATO produces complete, watertight (or valid open) meshes.
- Scalability: Demonstrated successful generation of large-scale urban scenes by composing block-wise buildings, proving the compositional nature of the sparse voxel representation.
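
For readers unfamiliar with the reconstruction metrics named above, here is a brute-force sketch of Chamfer and Hausdorff distance on small point sets. Several conventions exist (squared versus unsquared distances, sum versus mean); the variant below uses unsquared distances and sums the two directional means, which may differ from the paper's exact definition.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def nearest(p: Point, cloud: Sequence[Point]) -> float:
    """Euclidean distance from p to its closest point in cloud."""
    return min(math.dist(p, q) for q in cloud)


def chamfer(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both directions."""
    return (sum(nearest(p, b) for p in a) / len(a)
            + sum(nearest(q, a) for q in b) / len(b))


def hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Hausdorff distance: the worst nearest-neighbour distance."""
    return max(max(nearest(p, b) for p in a),
               max(nearest(q, a) for q in b))


# Toy clouds: b matches a except for one outlier two units away.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
cd = chamfer(a, b)
hd = hausdorff(a, b)
```

The single outlier moves the Hausdorff distance to 2 while the Chamfer distance only rises to 2/3, which shows why the two metrics are reported together: one captures the worst-case error, the other the average.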

5. Significance and Impact

LATO combines the efficiency of sparse-voxel latent generation with the utility of explicit mesh topology, two properties that earlier methods traded off against each other.

Practical Utility: By producing meshes with correct topology directly, LATO makes AI-generated 3D assets immediately usable in industrial pipelines (e.g., game engines, VR, animation) without expensive manual retopology.
Data Efficiency: The ability to train on open surfaces and non-manifold geometries significantly expands the usable dataset for 3D generation, addressing the "data scarcity" problem caused by the watertightness requirement of implicit methods.
Future Direction: The work suggests a path toward overcoming the resolution limits of current voxel grids (potentially via octrees) while maintaining explicit topological control, setting a new standard for high-fidelity 3D content creation.