Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation

Imagine you are an architect trying to build a 3D house out of digital Lego bricks. In computer graphics, the "bricks" are flat polygons called faces, and the surface they form together is called a mesh. Most 3D models are built from triangles, but professional artists and video game developers prefer quadrilaterals (four-sided faces) because they are easier to animate, texture, and edit.

The paper introduces Mesh-Pro, a new AI system designed to build clean, quad-based 3D models. It targets three major problems of previous AI systems: they were slow to train, they made messy models with holes, and they couldn't learn from their mistakes effectively.

Here is the breakdown using simple analogies:

1. The Problem: The "Traffic Jam" of Training

Imagine a team of construction workers (the AI) trying to build a house.

Old Method (Synchronous RL): The boss tells all workers to start building. But some houses take 10 minutes, and others take 10 hours. The boss waits for the slowest worker to finish before giving the next instruction. Everyone else stands around doing nothing, wasting time and money. This is what happened with previous AI methods; they were incredibly inefficient.
The Mesh-Pro Solution (Asynchronous Framework): Mesh-Pro changes the rule. Instead of waiting, the boss lets workers keep building at their own pace. As soon as a worker finishes a house, they hand it to the boss for inspection, and the boss immediately updates the instructions for the next batch of workers. No one stands around waiting.
- Result: This makes the training process 3.75 times faster. It's like switching from a single-file line to a busy highway where cars keep moving.

2. The Algorithm: The "Smart Coach" (ARPO)

Once the workers are building fast, they need to know how to build better.

Old Method (DPO): Imagine a coach who only says, "This house is better than that one," without explaining why. The workers guess what to change, which is slow and sometimes leads to bad habits.
Old Method (GRPO): Imagine a coach who tries to calculate the perfect mathematical formula for every single brick. It's too complicated, so the workers get confused and improve only slowly.
The Mesh-Pro Solution (ARPO): Mesh-Pro uses a "Smart Coach" called Advantage-guided Ranking Preference Optimization (ARPO).
- It looks at a group of houses the workers built.
- It ranks them (Best to Worst).
- Crucially, it doesn't just say "Good job." It calculates how much better or worse each house is compared to the group average (the "Advantage").
- It tells the workers: "Focus on the specific features that made the best house stand out, but ignore the tiny flaws in the average ones."
- Result: The AI learns faster and, more importantly, learns to build new types of houses it hasn't seen before (better generalization).

3. The Blueprint: The "Diagonal-Aware" Language

To build a house, you need a language to describe the bricks.

The Problem: Previous AI tried to describe a square by saying, "Here is a triangle, and oh, by the way, add a fourth corner." This was confusing and often led to the AI drawing a weird shape or a hole in the wall.
The Mesh-Pro Solution: Mesh-Pro invented a new "language" (Tokenization). It treats a square as a triangle plus a secret code (a "diagonal flag") that tells the AI exactly how to connect the fourth corner.
- Analogy: Instead of saying "Draw a square," the AI says, "Draw a triangle, then add a diagonal line here." This removes the guesswork and ensures the walls are straight and the roof doesn't collapse.

4. The Safety Inspector: The "Ray-Casting" Reward

How does the AI know if the house is solid?

The Problem: Sometimes AI builds a house that looks good from the outside but has invisible holes or floating walls inside.
The Mesh-Pro Solution: The AI uses a "Ray-Casting" reward system. Imagine shining a flashlight (a ray) from every angle at the model.
- If the light hits a wall and bounces back correctly, the model gets a point.
- If too many rays pass through a hole or hit the back of a wall they shouldn't (a "back-face hit"), the model gets a zero.
- This forces the AI to build watertight, solid structures that don't fall apart.

The Big Picture

Mesh-Pro is like a super-efficient, fast-learning construction crew.

It never waits (Asynchronous training).
It has a smart coach that knows exactly what makes a house great (ARPO).
It speaks a clear language that prevents structural errors (Diagonal-aware tokens).
It uses flashlights to ensure there are no holes in the walls (Ray-based rewards).

The result? It can generate 3D models that look like they were hand-crafted by professional artists, with clean quad layouts, ready for video games and movies, and it trains in a fraction of the time that synchronous methods need.

1. Problem Statement

High-quality 3D mesh generation is critical for gaming and embodied intelligence, yet current methods face significant challenges:

Quality vs. Topology: Supervised learning often produces meshes with artifacts like holes, non-manifold surfaces, and disorganized topology.
Inefficiency of Offline RL: Existing Reinforcement Learning (RL) approaches for mesh generation rely on Offline Direct Preference Optimization (DPO). This suffers from low training efficiency because its preference data is static and pre-constructed, so it never reflects what the current policy actually generates.
Synchronous RL Bottlenecks: Online RL is theoretically superior but impractical for 3D meshes due to highly variable token lengths. Synchronous training causes severe GPU idle times while waiting for the longest sequences to finish; the paper reports that its asynchronous alternative trains 3.75× faster.
Generalization Issues: While methods like Group Relative Policy Optimization (GRPO) attempt explicit reward modeling, they often suffer from slow convergence and poor exploration-exploitation efficiency when constrained by complex reward distributions in 3D geometry.
Tokenization Flaws: Prior methods for mixed triangle-quadrilateral meshes often use leading tokens to declare face types prematurely, leading to inconsistent ordering and geometric artifacts.
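
To make the synchronous bottleneck above concrete, here is a minimal back-of-the-envelope sketch. It assumes one sequence per worker and the same cost per token on every worker; both are simplifications introduced here, and the function name `sync_idle_fraction` and the example lengths are hypothetical. The output is a toy number, not the paper's measured 3.75× figure.

```python
from typing import Sequence


def sync_idle_fraction(token_lengths: Sequence[int]) -> float:
    """Fraction of worker time spent idle in one synchronous step.

    Every worker must wait until the longest sequence finishes, so the
    step lasts `max(token_lengths)` token-steps on all workers.
    """
    longest = max(token_lengths)
    busy = sum(token_lengths)                 # token-steps of useful work
    capacity = longest * len(token_lengths)   # token-steps paid for
    return 1.0 - busy / capacity


# Four workers with very different mesh lengths (made-up values).
lengths = [1000, 2000, 8000, 9000]
print(f"idle fraction: {sync_idle_fraction(lengths):.3f}")
```

The wider the spread between the average and the longest sequence, the larger the idle fraction, which is why variable-length mesh sequences are a poor fit for synchronous rollouts.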

2. Methodology

The authors propose Mesh-Pro, a comprehensive framework integrating a novel tokenization scheme, a new RL algorithm, and an asynchronous training infrastructure.

A. Diagonal-Aware Mesh Tokenization

To address structural defects in mixed triangle-quad meshes, Mesh-Pro introduces a diagonal-aware tokenization scheme:

Canonical Ordering: Vertices are normalized and sorted lexicographically, ensuring every face sequence starts with its absolute minimum-index vertex.
Deferred Decision: Instead of declaring a face type (triangle vs. quad) at the start, the model generates a base triangle first. It then decides whether to terminate (triangle) or append a fourth vertex (quad).
Diagonal Encoding: For quadrilaterals, the internal diagonal orientation is explicitly encoded via an offset flag added to the fourth vertex's index. This reduces predictive burden and ensures consistent geometric representation.
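
The three steps above can be illustrated with a small sketch. The concrete token layout is an assumption made here for illustration, not the paper's specification: plain vertex indices occupy `[0, V)`, and a quad's fourth vertex is emitted as `V * (1 + diagonal) + index`, so the decoder can tell "a fourth vertex follows" from "a new face starts" only after the base triangle has been read. Vertex indices are assumed to already reflect the sorted vertex order, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple


def canonical_rotation(face: List[int]) -> List[int]:
    """Rotate a cyclic vertex list so its smallest index comes first."""
    k = face.index(min(face))
    return face[k:] + face[:k]


def encode_face(face: List[int], diagonal: int, num_vertices: int) -> List[int]:
    """Encode a triangle as 3 tokens, or a quad as base triangle + flagged vertex.

    diagonal = 0 means the quad is split along a-c, 1 means along b-d
    (a, b, c, d being the canonically rotated corners). Ignored for triangles.
    """
    rotated = canonical_rotation(list(face))
    if len(rotated) == 3:
        return rotated
    a, b, c, d = rotated
    if diagonal == 0:
        base, fourth = [a, b, c], d   # triangles (a,b,c) and (a,c,d)
    else:
        base, fourth = [a, b, d], c   # triangles (a,b,d) and (b,c,d)
    # Assumed layout: the diagonal flag is folded into the fourth token as an offset.
    return base + [num_vertices * (1 + diagonal) + fourth]


def decode_faces(tokens: List[int], num_vertices: int) -> List[Tuple[List[int], Optional[int]]]:
    """Invert encode_face over a concatenated token stream."""
    faces = []
    i = 0
    while i < len(tokens):
        a, b, c = tokens[i:i + 3]
        i += 3
        # Deferred decision: only now do we learn whether this face is a quad.
        if i < len(tokens) and tokens[i] >= num_vertices:
            flag, fourth = divmod(tokens[i] - num_vertices, num_vertices)
            i += 1
            quad = [a, b, c, fourth] if flag == 0 else [a, b, fourth, c]
            faces.append((quad, flag))
        else:
            faces.append(([a, b, c], None))
    return faces
```

In this layout a triangle never needs a leading type token: the absence of a flagged fourth token is itself the "terminate" decision.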

B. Asynchronous Online RL Framework

To overcome the latency of synchronous RL caused by variable sequence lengths:

Architecture: The system decouples Rollout Workers (which generate mesh data) from Trainer Workers (which update the model).
Mechanism: Rollout workers continuously sample data into a replay buffer. Trainer workers sample from this buffer to update the policy.
Efficiency: Because rollout and training run concurrently, GPUs no longer idle while the longest sequences finish, which yields a 3.75× speedup over synchronous RL. Outdated samples are discarded so that the trainer learns only from data generated by a recent policy.
Pre-Start Stage: A stabilization phase where the model adapts to the reward distribution before the main asynchronous loop begins, preventing early instability.
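
The decoupling and the discarding of outdated data can be sketched with a version-tagged buffer. This is a minimal single-process illustration, not the paper's infrastructure: the class names, the `max_staleness` parameter, and the first-in-first-out batching are all assumptions introduced here. In a real system the rollout and trainer workers would run in separate processes or on separate GPUs.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Rollout:
    tokens: List[int]
    reward: float
    policy_version: int  # version of the policy that generated this sample


class VersionedReplayBuffer:
    """FIFO buffer that drops rollouts produced by an outdated policy."""

    def __init__(self, max_staleness: int = 1) -> None:
        self.max_staleness = max_staleness
        self._items: Deque[Rollout] = deque()

    def add(self, rollout: Rollout) -> None:
        # Called by rollout workers whenever a mesh sequence finishes.
        self._items.append(rollout)

    def pop_batch(self, batch_size: int, current_version: int) -> List[Rollout]:
        # Called by the trainer: discard stale samples, then take a batch.
        self._items = deque(
            r for r in self._items
            if current_version - r.policy_version <= self.max_staleness
        )
        if len(self._items) < batch_size:
            return []  # not enough fresh data yet; the trainer retries later
        return [self._items.popleft() for _ in range(batch_size)]

    def __len__(self) -> int:
        return len(self._items)
```

Because workers call `add` as soon as each sequence ends, a short mesh never waits for a long one; the staleness check is what keeps the trainer close to on-policy despite that freedom.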

C. Advantage-guided Ranking Preference Optimization (ARPO)

ARPO is the core algorithm designed to balance training efficiency and generalization:

Hybrid Approach: It combines the fast, stable convergence of Ranking Preference Optimization (similar to DPO) with Explicit Advantage Guidance.
Mechanism: ARPO samples a group of rollouts, ranks them based on rewards, and calculates an Advantage Function ($A$). This advantage acts as a weighting mechanism in the loss function.
Benefit: High-reward samples are weighted more heavily, guiding the model to learn the underlying reward distribution explicitly. This overcomes the slow convergence of GRPO and the poor generalization of standard DPO.
Truncated Training: To handle long sequences, the framework uses truncated windows (36,864 tokens) during the RL phase, allowing the model to learn locally optimal decisions that lead to globally superior topology.
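
One way to picture "advantage as a weight inside a ranking loss" is the sketch below. The exact ARPO objective is not given in this summary, so the form used here is an assumption: advantages are group-normalized rewards, each sample gets a DPO-style implicit score `beta * (log pi - log pi_ref)`, and every ordered pair (better, worse) contributes a log-sigmoid ranking term weighted by the advantage gap. The function names and `beta` are hypothetical.

```python
import numpy as np


def group_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Normalize rewards within one group of rollouts."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    return -np.logaddexp(0.0, -x)


def advantage_weighted_ranking_loss(rewards, logp_policy, logp_ref, beta: float = 0.1) -> float:
    """Pairwise ranking loss where each pair is weighted by its advantage gap."""
    adv = group_advantages(rewards)
    # Implicit per-sample score, as in DPO-style objectives.
    score = beta * (np.asarray(logp_policy, dtype=float) - np.asarray(logp_ref, dtype=float))
    total, pairs = 0.0, 0
    n = len(adv)
    for i in range(n):
        for j in range(n):
            if rewards[i] > rewards[j]:  # sample i is ranked above sample j
                weight = adv[i] - adv[j]  # larger gap, stronger push
                total += -weight * log_sigmoid(score[i] - score[j])
                pairs += 1
    return total / max(pairs, 1)
```

Under this assumed form, pairs with nearly equal rewards contribute little, while a clearly better sample pulls the policy toward itself in proportion to how far it stands out from the group.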

D. Reward Design

The reward function $R(M_t)$ is multifaceted, combining:

Ray-Based Integrity Reward ($R_{ray}$): Casts rays from multiple directions to detect "bad faces" (broken surfaces or back-face hits). If the number of bad faces exceeds a threshold, the reward is zero. This is superior to boundary-edge detection as it handles multi-component objects correctly.
Topological Reward ($R_{topo}$): Quantifies structured edge flow by counting Quad Rings (closed loops) and Quad Lines (open strips).
Geometric Consistency: Uses Hausdorff distance to ensure the generated mesh aligns with the input point cloud.
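
The ray-based check can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: faces are assumed to be triangulated, a face counts as "bad" when it is the first thing a ray hits and it is hit from behind, and the reward is 1.0 whenever the bad-face count stays within the threshold (the source only states that it is zero above the threshold). The ray-triangle test is the standard Möller–Trumbore intersection; all function names are hypothetical.

```python
import numpy as np


def ray_triangle_hit(origin, direction, tri, eps: float = 1e-9):
    """Moller-Trumbore test. Returns (distance, back_facing) or None."""
    v0, v1, v2 = tri
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = e1 @ p
    if abs(det) < eps:
        return None  # ray is parallel to the triangle plane
    inv = 1.0 / det
    s = origin - v0
    u = (s @ p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = np.cross(s, e1)
    v = (direction @ q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = (e2 @ q) * inv
    if t <= eps:
        return None  # intersection lies behind the ray origin
    # The ray travels along the face normal, so it strikes the back side.
    back_facing = bool(np.cross(e1, e2) @ direction > 0.0)
    return t, back_facing


def count_bad_faces(vertices, faces, rays) -> int:
    """Count faces that are the nearest hit of some ray and are hit from behind."""
    bad = set()
    for origin, direction in rays:
        nearest = None
        for idx, face in enumerate(faces):
            hit = ray_triangle_hit(origin, direction, vertices[list(face)])
            if hit is not None and (nearest is None or hit[0] < nearest[0]):
                nearest = (hit[0], hit[1], idx)
        if nearest is not None and nearest[1]:
            bad.add(nearest[2])
    return len(bad)


def ray_integrity_reward(num_bad_faces: int, max_bad_faces: int = 0) -> float:
    # Hard gate: too many bad faces zeroes the reward (1.0 otherwise is assumed).
    return 0.0 if num_bad_faces > max_bad_faces else 1.0
```

For a closed, consistently oriented surface, rays cast from outside should always meet a front face first, so any back-face first hit signals a hole or a flipped face somewhere along that ray.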

3. Key Contributions

First Asynchronous Online RL Framework for 3D Meshes: Designed specifically to handle variable token lengths, achieving a 3.75× training speedup over synchronous methods.
ARPO Algorithm: A novel RL algorithm that explicitly leverages advantage functions within a ranking preference framework, offering a superior trade-off between training efficiency and generalization compared to DPO and GRPO.
Mesh-Pro System: Integrates asynchronous ARPO with diagonal-aware tokenization and ray-based rewards to generate artist-style, quadrilateral-dominated meshes with high fidelity and topological quality.

4. Experimental Results

The paper evaluates Mesh-Pro on both dense meshes (from Hunyuan3D 2.5) and artist-created meshes (Toys4k).

Quantitative Performance:
- Broken Ratio (BR, lower is better): Mesh-Pro achieves a 22% broken ratio on dense meshes and 32% on artist meshes, outperforming baselines such as QuadGPT (50%/39%) and DeepMesh (91%/64%).
- Topological Quality (QR): Achieves 81% quad ratio on dense meshes, surpassing QuadGPT (78%).
- User Study (US): Receives the highest subjective scores (5.2/5 for dense, 4.9/5 for artist meshes), indicating it closely matches professional artist standards.
- Geometric Metrics: Lowest Chamfer Distance (CD) and Hausdorff Distance (HD) compared to all baselines.
Ablation Studies:
- Asynchronous vs. Synchronous: Confirms the 3.75× efficiency gain.
- ARPO vs. DPO/GRPO: ARPO converges faster than GRPO and generalizes better than DPO.
- Tokenization: The diagonal-aware scheme reduces structural broken ratios significantly compared to previous tokenizers.
- Reward Design: Removing the ray-based reward causes a sharp rise in broken ratios, indicating that it is necessary for geometric integrity.

5. Significance

Paradigm Shift: Mesh-Pro demonstrates that online Reinforcement Learning can be effectively applied to 3D mesh generation, moving beyond the limitations of offline DPO.
Industrial Applicability: The generated meshes are "artist-style," meaning they possess clean edge flows and quad-dominant structures essential for UV unwrapping, texture painting, and animation rigging in the gaming and film industries.
Scalability: The asynchronous framework solves the computational bottlenecks of training on variable-length 3D data, paving the way for larger-scale 3D foundation models.
Future Impact: The work suggests a path toward RL-driven 3D generation achieving breakthroughs similar to those seen in text and image generation, potentially transforming how 3D assets are created.