N4MC: Neural 4D Mesh Compression

N4MC is a novel neural framework that achieves state-of-the-art rate-distortion performance for 4D mesh compression. It converts irregular mesh sequences into regular 4D tensors and leverages transformer-based motion compensation to exploit temporal redundancy, enabling efficient, real-time decoding.

Guodong Chen, Huanshuo Dong, Mallesham Dasari

Published 2026-02-25

Imagine you are trying to send a high-definition, 3D animated movie of a dancer to a friend on a smartphone. The problem? A single second of this animation contains millions of tiny points and triangles (vertices and faces) that make up the 3D shape. Sending a whole hour of this would require a data plan the size of a small country.

N4MC is a new, clever way to shrink these massive 3D movies down to a size that fits in your pocket, while still looking perfect when you play them back on a VR headset or a phone.

Here is how it works, broken down into simple analogies:

1. The Problem: The "Frame-by-Frame" Bottleneck

Old methods of compressing 3D movies are like taking a photo of a dancer every single second and mailing them individually. Even if the dancer is just moving their arm slightly, you still send the entire body in every photo. It's incredibly wasteful.

Other methods try to say, "Here is the dancer's body, and here is a list of how the arm moved." But if the dancer does something weird (like a hand touching their face or clothes getting tangled), that "list of movements" breaks, and the animation looks like a glitchy mess.

2. The Solution: Turning 3D into a "Digital Fog"

N4MC starts by changing the way it sees the 3D world. Instead of looking at the dancer as a collection of millions of triangles, it turns the dancer into a 3D grid of "fog" (technically, a TSDF tensor, short for truncated signed distance function).

  • The Analogy: Imagine the dancer is inside a giant box of invisible fog. The fog is dense where the dancer's body is and empty where there is air.
  • Why this helps: This turns a messy, irregular shape (a human body) into a neat, uniform block of data. It's much easier to compress a uniform block of fog than a messy pile of triangles.
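The "fog" idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it samples a truncated signed distance function for a simple sphere on a regular grid, and every function name and parameter here is an assumption for demonstration.

```python
import numpy as np

def tsdf_grid(radius=0.5, resolution=32, trunc=0.1):
    """Sample a truncated signed distance function for a sphere on a regular grid."""
    # Regular grid of sample points in [-1, 1]^3
    coords = np.linspace(-1.0, 1.0, resolution)
    x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")
    # Signed distance to the sphere surface: negative inside, positive outside
    sdf = np.sqrt(x**2 + y**2 + z**2) - radius
    # Truncate: values far from the surface carry no shape information
    return np.clip(sdf, -trunc, trunc)

grid = tsdf_grid()
print(grid.shape)  # (32, 32, 32): a uniform tensor, unlike a messy triangle list
```

However irregular the input shape, the output is always a fixed-size, uniformly sampled block of numbers, which is exactly the kind of data standard compression machinery handles well.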

3. The Magic Trick: The "Smart Predictor" (Interpolation)

This is the secret sauce. N4MC doesn't try to compress every single frame of the video. Instead, it uses a Transformer (a type of AI) to act like a super-smart movie director.

  • The Analogy: Imagine you are drawing a flipbook. Instead of drawing every single page, you only draw the "Keyframes" (the start of a jump and the landing).
  • The AI's Job: The AI looks at the start and the end, and it guesses what the dancer looks like in the middle. It fills in the missing pages for you.
  • The Secret Ingredient: To make sure the AI doesn't guess wrong (like making the dancer's hand pass through their head), N4MC uses "Volume Tracking."
    • Think of this as placing invisible GPS trackers on the dancer's hands, feet, and head. The AI watches where these trackers go and uses that path to guide its guesses. This ensures the dancer moves naturally, even if they are doing complex, non-rigid moves like dancing.
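As a toy illustration of "guessing the middle" from tracked points, the sketch below linearly blends hypothetical tracker positions between two keyframes. The real system uses a learned transformer predictor; plain linear interpolation here is only a stand-in, and all names and coordinates are made up.

```python
import numpy as np

def interpolate_frames(trackers_start, trackers_end, num_mid):
    """Guess tracker positions for the frames between two keyframes."""
    frames = []
    for i in range(1, num_mid + 1):
        t = i / (num_mid + 1)  # fraction of the way through the movement
        frames.append((1 - t) * trackers_start + t * trackers_end)
    return frames

# Hypothetical 3D tracker positions (e.g. head and hand) at two keyframes
start = np.array([[0.0, 1.5, 0.0],    # head
                  [0.3, 1.0, 0.0]])   # hand at the side
end = np.array([[0.0, 1.6, 0.2],      # head
                [0.3, 1.4, 0.1]])     # hand raised

mid = interpolate_frames(start, end, num_mid=3)
# Only the two keyframes and the tracker paths need to be transmitted;
# the in-between frames are reconstructed on the receiving device.
```

The payoff is that the sender never transmits the in-between frames at all, only the anchors that let the receiver's predictor rebuild them.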

4. The Result: Tiny Files, Big Quality

Because N4MC only stores the "Keyframes" and the "GPS paths" (plus a tiny bit of data to help the AI guess the middle frames), the file size shrinks dramatically.

  • Compression: It shrinks a 4D mesh sequence (a 3D movie) to a small fraction of its original size, delivering better visual quality per bit than current industry standards.
  • Real-Time Playback: The best part? The "AI Director" is so lightweight that it can run on a Meta Quest 3 (a standalone VR headset) or an Android phone. You don't need a supercomputer to watch the movie; the device can decode the tiny file and reconstruct the full 3D dancer instantly.

Summary

Think of N4MC as a smart time-machine for 3D data.

  1. It turns messy 3D shapes into neat blocks of fog.
  2. It records only the start and end points of a movement.
  3. It uses invisible GPS trackers to teach an AI how to perfectly guess the middle moments.
  4. It sends this tiny "recipe" to your phone or VR headset, which cooks up the full, high-quality 3D animation in real-time.

This technology opens the door to streaming high-fidelity 3D worlds, digital twins, and VR experiences without needing a massive internet connection.
