MME: Mixture of Mesh Experts with Random Walk Transformer Gating

Imagine you are a museum curator trying to organize a massive collection of 3D objects, from tiny sharks to giant chairs. You have a team of three specialized guides, but each one has a very specific talent:

Guide A is amazing at recognizing men but gets confused by animals.
Guide B is a master at spotting horses but struggles with furniture.
Guide C is the best at identifying sharks but can't tell a chair from a table.

In the past, if you wanted to identify a new object, you might ask all three guides to guess and take the average of their answers. Or, you might just pick one guide and hope they are right. Both methods are inefficient because you aren't using the best person for the specific job at hand.

This paper introduces a brilliant new system called Mixture of Mesh Experts (MME). Think of it as hiring a super-smart "Gatekeeper" who stands at the entrance of your museum.

The Gatekeeper's Superpower

The Gatekeeper doesn't just guess; it learns exactly what each guide is good at. But how does it know?

The "Random Walk" Tour: Imagine the Gatekeeper sends a tiny robot on a "random walk" across the surface of the object. The robot hops from one point to another, tracing the shape.
The "Spotlight" Attention: As the robot walks, the Gatekeeper uses a "spotlight" (an attention mechanism) to focus on the most interesting parts of the walk. If the object is a horse, the Gatekeeper notices that Guide B is staring intently at the legs. If it's a shark, it sees Guide C focusing on the fins.
The Decision: Based on these clues, the Gatekeeper instantly decides: "This is a horse! Let's let Guide B make the final call."

This ensures that for every single object, the expert who is actually best at that specific type of object gets to make the decision.

The Tricky Balancing Act: The Coach

There's a catch. If the Gatekeeper lets the experts work in total isolation, they might become too specialized and forget how to help each other. But if they all try to be the same, they lose their unique talents.

The authors solved this with a Reinforcement Learning Coach.

Think of the training process like a sports season. The Coach has a magic dial (a variable called $\lambda$) that controls how much the experts should compete (diversity) vs. how much they should collaborate (similarity).
Early in training, the Coach might say, "You guys need to be different! Focus on your own strengths!" (High competition).
Later, the Coach might say, "Okay, now that you're experts, share what you learned with the others!" (High collaboration).
The Coach is smart enough to adjust this dial automatically at every training step, looking for the best balance. It's like a conductor tuning an orchestra in real time to ensure the music sounds right.

Why This Matters

The results are impressive. When the authors tested this system on famous 3D datasets:

Classification: It got 100% accuracy on some tests where the best individual experts only got 91% or 97%.
Retrieval: It found the right objects in a database more accurately than before. (The gain is in how well the results are ranked, not in speed.)
Segmentation: It could break down a complex object (like a human body) into parts (arms, legs, head) with incredible precision, fixing mistakes that individual experts made.

The Trade-off

Is there a downside? Yes: speed.
Because the system has to run three experts and the Gatekeeper, it takes more time and computer power to process each object. It's like having a team of three experts plus a manager instead of just one person. The paper reports that processing one object takes roughly twice as long as with a single expert (about 270 ms instead of about 128 ms), and argues that the jump in accuracy is worth it.

In a Nutshell

This paper is about building a smart team manager for 3D shapes. Instead of forcing one model to be good at everything, it gathers the best specialists, uses a clever "random walk" system to see which specialist is needed, and uses a smart coach to keep the team working together perfectly. The result? A system that sees 3D objects better than any single model ever could.

1. Problem Statement

Polygonal meshes are the standard representation for 3D surfaces in computer graphics. While numerous deep learning methods exist for mesh analysis (classification, retrieval, semantic segmentation), no single architecture excels across all object classes or datasets. For instance, MeshCNN may perform best on "Men," while MeshWalker excels on "Horses," and PD-MeshNet on "Sharks."

Existing approaches to combine models, such as Ensembles (averaging or voting predictions) or standard Mixtures of Experts (MoE), often fail to fully leverage the complementary strengths of heterogeneous architectures. Standard MoE frameworks typically assume homogeneous experts (same architecture) or use simple gating mechanisms that do not effectively identify which mesh regions are most informative for a specific expert's decision-making process. Furthermore, balancing the need for expert specialization (diversity) with knowledge sharing (similarity) is a non-trivial challenge, as static loss weighting often leads to suboptimal convergence.

2. Methodology: Mixture of Mesh Experts (MME)

The authors propose MME, a novel framework that integrates heterogeneous expert models using a specialized gating mechanism and a Reinforcement Learning (RL) based training strategy.

A. The Expert Environment

The system comprises a set of pre-trained, heterogeneous expert models (e.g., MeshCNN, MeshWalker, PD-MeshNet, AttWalk, MeshFormer, MeshNet).

Input: A batch of meshes.
Processing: Each expert processes the mesh independently to generate a prediction vector.
Selection: A Gate assigns a weight to each expert for every input mesh. The final prediction is the output of the expert with the highest weight (hard selection).
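
The difference between hard selection and plain ensemble averaging can be illustrated with a minimal sketch. The probabilities and gate weights below are made-up numbers for a single mesh, and the function names are hypothetical; none of them come from the paper.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def ensemble_average(expert_probs):
    """Baseline: average the class probabilities of all experts."""
    n_experts = len(expert_probs)
    n_classes = len(expert_probs[0])
    return [sum(p[c] for p in expert_probs) / n_experts for c in range(n_classes)]


def hard_select(expert_probs, gate_weights):
    """Hard selection: return the output of the expert with the highest gate weight."""
    chosen = argmax(gate_weights)
    return chosen, expert_probs[chosen]


# Hypothetical values: three experts, three classes, one input mesh.
expert_probs = [
    [0.60, 0.30, 0.10],  # expert 0
    [0.30, 0.60, 0.10],  # expert 1
    [0.55, 0.15, 0.30],  # expert 2
]
gate_weights = [0.2, 0.7, 0.1]  # the gate trusts expert 1 for this mesh

avg = ensemble_average(expert_probs)
chosen, selected = hard_select(expert_probs, gate_weights)
print("ensemble class:", argmax(avg))
print("selected expert:", chosen, "-> class:", argmax(selected))
```

In this toy case the averaged prediction and the gated prediction disagree, which is the situation where routing to the right expert matters.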

B. The Transformer Gate (Novelty 1)

Unlike traditional MoE gates that mirror the expert architecture or use simple 2D convolutions, the MME gate is designed specifically for 3D mesh topology:

Random Walk Extraction: The gate extracts random walks from the mesh surface. A random walk is a sequence of distinct vertices connected by edges. Consecutive steps describe local geometry, while a long walk covers enough of the surface to reflect the overall shape.
Transformer Architecture: The gate utilizes a Transformer (Encoder-Decoder) structure:
- Encoder: Takes the random walk sequence as input and applies Multi-Head Attention (MHA) to identify the most critical regions of the walk relevant to a specific expert.
- Decoder: Generates a weight vector of length $J$ (number of experts) for the input mesh.
Pre-training: Before the main training, the gate undergoes a pre-training phase where it is trained to "imitate" each expert's full prediction vector based on random walks. This teaches the gate to recognize the specific mesh regions each expert relies on for accurate classification.
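
To make the random walk step concrete, here is a minimal sketch of extracting one walk from a triangle mesh. It assumes the mesh is given as a list of vertex-index triples and that a walk simply stops when it reaches a dead end; the paper's actual sampling rules (walk length, restarts, features per step) are not specified here, and the function names are hypothetical.

```python
import random


def build_adjacency(faces):
    """Map each vertex index to the set of vertices it shares an edge with."""
    adjacency = {}
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            adjacency.setdefault(u, set()).add(v)
            adjacency.setdefault(v, set()).add(u)
    return adjacency


def random_walk(adjacency, length, rng):
    """Walk along edges without revisiting a vertex; stop early at a dead end."""
    current = rng.choice(sorted(adjacency))
    walk = [current]
    visited = {current}
    while len(walk) < length:
        candidates = sorted(adjacency[current] - visited)
        if not candidates:
            break  # every neighbour has already been visited
        current = rng.choice(candidates)
        walk.append(current)
        visited.add(current)
    return walk


# Toy mesh: a tetrahedron (4 vertices, 4 triangular faces).
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
adjacency = build_adjacency(faces)
walk = random_walk(adjacency, length=4, rng=random.Random(0))
print(walk)
```

In a full pipeline, the vertex indices of the walk would be turned into a feature sequence (for example, coordinates or step offsets) before being passed to the Transformer encoder; that part is omitted here.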

C. Dynamic Loss Balancing via Reinforcement Learning (Novelty 2)

The training objective involves two competing loss terms:

Diversity Loss: Encourages experts to specialize in different classes (standard MoE goal).
Similarity Loss: Encourages experts to learn from one another (using Kullback-Leibler Divergence) when beneficial.
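
As a rough sketch of how the two terms might be combined, the snippet below computes a KL divergence between two experts' predictions and adds it to a diversity term with weight `lam`. The additive form is an assumption chosen so that `lam = 0` leaves only the diversity term; the paper's exact loss, the direction of the KL term, and the numbers used are not given in this summary and are illustrative only.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def combined_loss(diversity_loss, similarity_loss, lam):
    """Assumed form: lam = 0 leaves only the diversity term."""
    return diversity_loss + lam * similarity_loss


# Hypothetical predictions of two experts for the same mesh.
stronger_expert = [0.7, 0.2, 0.1]
other_expert = [0.4, 0.4, 0.2]

similarity = kl_divergence(stronger_expert, other_expert)
total = combined_loss(diversity_loss=0.9, similarity_loss=similarity, lam=0.5)
print(round(similarity, 4), round(total, 4))
```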

Balancing these is difficult because the impact of the weighting factor ($\lambda$) is only observable at the end of training. The authors frame this as a Reinforcement Learning (RL) task:

Agent: An RL agent, trained with the Soft Actor-Critic (SAC) algorithm, chooses the weighting factor at each training iteration.
State: The current expert weights and batch accuracy.
Reward: The accuracy of the current batch.
Action: The updated weighting factor $\lambda_{t+1}$.
This allows the system to dynamically shift between promoting diversity and similarity throughout the training process to maximize long-term accuracy.
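
The state, action, and reward roles described above can be sketched as a loop. This is not SAC: the agent below is a deliberately simple hill-climbing stand-in that only shows where each quantity enters the loop, and `train_step` is a hypothetical placeholder whose accuracy curve is invented for illustration.

```python
import random


class HillClimbLambdaAgent:
    """Stand-in for the SAC agent: perturbs lambda and keeps changes that raise reward."""

    def __init__(self, lam=0.5, step=0.1, seed=0):
        self.lam = lam
        self.step = step
        self.rng = random.Random(seed)
        self.best_lam = lam
        self.best_reward = float("-inf")

    def act(self, state):
        # The state (expert weights, batch accuracy) is ignored by this toy agent.
        proposal = self.best_lam + self.rng.uniform(-self.step, self.step)
        self.lam = min(1.0, max(0.0, proposal))
        return self.lam

    def observe(self, reward):
        if reward > self.best_reward:
            self.best_reward = reward
            self.best_lam = self.lam


def train_step(lam):
    """Hypothetical training iteration: returns (expert_weights, batch_accuracy)."""
    accuracy = 1.0 - (lam - 0.3) ** 2  # invented curve that peaks at lam = 0.3
    expert_weights = [1 / 3, 1 / 3, 1 / 3]
    return expert_weights, accuracy


agent = HillClimbLambdaAgent()
state = ([1 / 3, 1 / 3, 1 / 3], 0.0)
for t in range(200):
    lam = agent.act(state)                      # action: weighting factor for this step
    expert_weights, accuracy = train_step(lam)  # one training iteration with that factor
    agent.observe(accuracy)                     # reward: accuracy of the current batch
    state = (expert_weights, accuracy)          # next state
print(round(agent.best_lam, 2))
```

A real SAC agent would learn a stochastic policy and value functions from these same (state, action, reward) tuples instead of greedily keeping the best value seen so far.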

3. Key Contributions

Heterogeneous MoE for 3D: The first framework to successfully integrate diverse 3D mesh analysis architectures (e.g., edge-based, face-based, random walk-based) into a unified MoE system.
Random Walk Transformer Gate: A novel gating mechanism that uses random walks and Transformer attention to identify mesh regions most relevant to specific experts, enabling precise routing of inputs.
RL-Based Loss Balancing: A dynamic training strategy that uses Reinforcement Learning to automatically adjust the trade-off between diversity and similarity losses, overcoming the limitations of static hyperparameter tuning.
State-of-the-Art Performance: The method achieves superior results across classification, retrieval, and semantic segmentation tasks.

4. Experimental Results

The authors evaluated MME on four classification datasets (SHREC11, Cube Engraving, ModelNet40, 3D-FUTURE), two retrieval datasets (ShapeNet-Core55, ModelNet40), and three segmentation datasets (Human Body, COSEG, PartNet).

Classification:
- Achieved 100.0% accuracy on SHREC11 and Cube Engraving (surpassing individual experts and ensembles).
- On the challenging 3D-FUTURE dataset, MME achieved 86.1%, significantly outperforming the best individual expert (AttWalk at 72.1%) and a standard ensemble (78.0%).
Retrieval:
- On ShapeNet-Core55, MME achieved 93.2% mAP and 93.8% NDCG, a substantial improvement over the next best method (Ensemble at 84.3% mAP).
Semantic Segmentation:
- On PartNet, MME improved accuracy by 6.7% over the best individual expert.
- On Human Body, it achieved 94.5% face accuracy.
Ablation Studies:
- Gate Design: The proposed Transformer gate outperformed simpler alternatives (FC layers, 3D convolutions, and other mesh networks).
- Dynamic $\lambda$: The RL-based dynamic weighting significantly outperformed static $\lambda$ settings (including $\lambda=0$, which is vanilla MoE).
- Expert Selection: The gate successfully learned to route specific object classes to the expert best suited for them (e.g., routing "Armchair" to AttWalk).
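
The retrieval figures above are reported as mAP and NDCG. For readers unfamiliar with these metrics, the following sketch shows their textbook definitions on a single ranked list. It assumes every relevant item appears in the ranked list and uses a log2 discount; a benchmark's official evaluation code may differ in details such as list truncation or averaging.

```python
import math


def average_precision(relevance):
    """AP for one query; `relevance` is a ranked list of 0/1 labels."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries):
    """mAP: the mean of AP over all queries."""
    return sum(average_precision(q) for q in queries) / len(queries)


def ndcg(gains):
    """NDCG for one query; `gains` is a ranked list of graded relevance scores."""
    def dcg(values):
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(values, start=1))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical ranked results for two queries (1 = relevant, 0 = not relevant).
ranked = [[1, 0, 1, 1, 0], [1, 1, 1, 0, 0]]
print(round(mean_average_precision(ranked), 4))
print(round(ndcg(ranked[0]), 4))
```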

5. Significance and Limitations

Significance:
This work demonstrates that combining diverse 3D learning paradigms is more effective than relying on a single architecture. By using random walks to understand where experts look and RL to manage how they learn together, MME sets a new benchmark for mesh analysis. The results indicate that "one size fits all" is not optimal for 3D data and that adaptive, specialized routing is a key to higher performance.

Limitations:

Computational Cost: The primary drawback is increased training and inference time. Inference time roughly doubles compared to a single expert (e.g., ~270 ms vs. ~128 ms per mesh) due to the gate computation and running multiple experts.
Convergence: While the method converges faster (10-15 epochs) than training individual complex models from scratch (90+ epochs), the total computational load per epoch is higher due to the multi-expert setup.

In conclusion, MME represents a significant advancement in 3D deep learning by effectively unifying heterogeneous models through a sophisticated, attention-based gating mechanism and a reinforcement learning-driven training strategy.