PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction

Imagine you are looking at a single photograph of a messy living room. You can see the front of a sofa, the side of a coffee table, and maybe a lamp peeking out from behind a chair. But you can't see the back of the sofa, the bottom of the table, or what's hidden in the shadows.

The Problem:
For a long time, computers trying to turn that single photo into a 3D model have worked like a sculptor with wet clay. They "guess" the whole shape by filling in the invisible gaps with a smooth, blob-like volume (a signed distance field, or SDF), and then carve the final surface out of that blob. The result often looks fine from a distance but is really a dense tangle of thousands of tiny triangles. It is hard for artists to edit and expensive for computers to render.

The Solution: PixARMesh
The researchers behind PixARMesh decided to try a completely different approach. Instead of sculpting clay, they taught a computer to write a story.

Here is how it works, using some everyday analogies:

1. The "Autoregressive" Storyteller

Think of the computer not as a sculptor, but as a very smart writer who loves to finish sentences.

Old Way: The computer tries to draw the whole room at once, guessing where every wall and chair goes, then tries to smooth out the edges.
PixARMesh Way: The computer looks at your photo and says, "Okay, I see a chair. Let me write the story of that chair." It predicts the chair's position, then writes the story of its shape, piece by piece (token by token), just like a writer finishing a sentence word by word. Because it builds the object step-by-step, it naturally creates a clean, organized structure (a "mesh") that artists can actually use.

2. The "Pixel-Perfect" Detective

Usually, 3D models are built just by looking at the "shape" of the dots in space. But in a single photo, you only see the front of things.

The Analogy: Imagine trying to guess what a person looks like from the back, but you only have a photo of their front.
PixARMesh's Trick: It doesn't just look at the 3D dots; it looks at the colors and textures in the photo that match those dots. It's like a detective who says, "I see a wooden texture here in the photo, so I know the hidden back of this table must be wood, not metal." This helps the computer "hallucinate" (guess) the missing parts of the furniture with much higher accuracy.

3. The "Party Host" (Context Awareness)

When you are in a room, you know a chair belongs near a table, and a lamp belongs on a desk.

The Problem: If you look at a chair in isolation, you might guess it's floating in mid-air.
PixARMesh's Trick: The model acts like a party host who knows the whole room. Before it builds the chair, it looks at the "global scene" (the whole room context). It asks, "Where do chairs usually sit?" and "How big is this room?" This ensures that when it builds the chair, it places it in the right spot relative to the other objects, creating a coherent, logical room instead of a floating jumble of furniture.

4. The "One-Stop Shop"

In the past, building a 3D room was a multi-step assembly line:

Find the objects.
Guess their positions.
Build their shapes.
Run a complex math optimization to make sure they don't float or overlap.

PixARMesh does all of this in a single forward pass. It's like a master chef who doesn't chop the vegetables and then cook them in separate pots, but chops, seasons, and cooks everything in one synchronized motion.

Why Does This Matter?

It's "Artist-Ready": The output isn't a messy blob of data. It's a clean, lightweight 3D model with clear edges, just like a professional 3D artist would make. This means game developers and animators can use the result immediately without hours of cleanup.
It's Fast and Smart: By predicting the layout and the shape together, it avoids the "local minima" traps (getting stuck in a bad guess) that older methods suffer from.
It Works on Real Photos: Even though it was trained on computer-generated images, it can look at a real photo from your phone and build a decent 3D version of your living room.

In a Nutshell:
PixARMesh is like a 3D printer that fills in what the camera could not see. You show it a photo, and instead of guessing a blob and smoothing it, it "writes" a clean, logically arranged 3D room, object by object, using the visual clues in your picture to fill in the blanks.

Here is a detailed technical summary of the paper "PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction".

1. Problem Statement

Reconstructing a complete 3D indoor scene from a single RGB image is a fundamentally ill-posed problem due to:

Partial Observations: A single viewpoint provides only partial views of objects, with large portions occluded or unobserved.
Depth Ambiguity: A single image does not determine absolute depth or scale, so recovering accurate shapes and spatial layouts requires strong priors about indoor scenes.
Limitations of Existing Methods:
- Holistic SDF Approaches: Methods using Signed Distance Fields (SDFs) and volumetric grids are constrained by spatial resolution and struggle to produce high-fidelity, editable geometry.
- Compositional Approaches: Current pipelines often require separate stages for inpainting, object reconstruction, and layout optimization (e.g., point-cloud matching). These rely on optimization loops prone to local minima and often produce overly smooth, high-face-count meshes via Marching Cubes, which are not "artist-ready."
- Gap: While autoregressive mesh generators exist for single objects, no existing pipeline leverages native mesh representations for scene-level reconstruction.

2. Methodology: PixARMesh

PixARMesh is an end-to-end framework that performs autoregressive reconstruction of complete 3D scenes directly in mesh space, bypassing SDFs and post-hoc optimization.

Core Architecture

The framework builds upon pre-trained object-level mesh generative models (specifically EdgeRunner and BPT) and adapts them for scene-level tasks.

Input Processing:
- Takes a single RGB image $I$.
- Uses off-the-shelf models to extract:
  - Instance segmentation masks ($M$).
  - Monocular depth maps ($D$).
  - Image features ($F_{img}$).
- Back-projects depth to create a raw scene point cloud ($P_{scene}$) and per-object point clouds ($P_i$) based on masks.
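The back-projection step can be sketched with a standard pinhole camera model. This is a minimal illustration rather than the paper's implementation: the function name `backproject_depth` and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions introduced here, and the summary does not say how the authors obtain intrinsics.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy, mask=None):
    """Lift a depth map (H, W) to camera-space points (N, 3) with a pinhole model."""
    h, w = depth.shape
    # Pixel grid: u indexes columns (x), v indexes rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0                      # drop pixels without a depth estimate
    if mask is not None:
        valid &= mask.astype(bool)         # keep only one instance's pixels
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)

# Toy input: a flat wall 2 units away and a 2x2 instance mask.
depth = np.full((4, 4), 2.0)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

p_scene = backproject_depth(depth, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
p_obj = backproject_depth(depth, fx=2.0, fy=2.0, cx=2.0, cy=2.0, mask=mask)
```

Calling the function once without a mask and once per instance mask gives the scene cloud and the per-object clouds from the same depth map, so both live in the same camera frame.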
Pixel-Aligned Point-Cloud Encoder:
- Multi-modal Fusion: Unlike standard encoders that only use point coordinates, PixARMesh projects 3D points onto the 2D image plane to retrieve pixel-aligned image features.
- Fusion Block: Concatenates geometric features ($f_{pc}$) with aligned image features ($f_{img}$) and processes them through a Transformer-based fusion block.
- Benefit: This injects appearance cues into the geometry, significantly improving robustness to occlusion and ensuring global consistency.
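One way to picture the pixel-aligned lookup is the sketch below. It assumes a pinhole camera, a feature map stored as (H, W, C), and nearest-neighbour sampling; the summary does not state which interpolation the authors use, and the Transformer fusion block is replaced here by plain concatenation. The names `sample_pixel_aligned` and `fuse_inputs` are hypothetical.

```python
import numpy as np

def sample_pixel_aligned(points, feat_map, fx, fy, cx, cy):
    """Nearest-neighbour lookup of image features (H, W, C) for camera-space points (N, 3)."""
    h, w, _ = feat_map.shape
    z = points[:, 2]
    # Project each 3D point back onto the image plane.
    u = np.rint(points[:, 0] * fx / z + cx).astype(int)
    v = np.rint(points[:, 1] * fy / z + cy).astype(int)
    # Clamp so points that project slightly outside the image still index safely.
    u = np.clip(u, 0, w - 1)
    v = np.clip(v, 0, h - 1)
    return feat_map[v, u]

def fuse_inputs(f_pc, f_img):
    """Concatenate per-point geometric and image features (input to a fusion block)."""
    return np.concatenate([f_pc, f_img], axis=-1)

# Toy feature map with 2 channels, and two points in front of the camera.
feat_map = np.arange(4 * 4 * 2, dtype=float).reshape(4, 4, 2)
points = np.array([[0.0, 0.0, 2.0], [-1.0, -1.0, 2.0]])

f_img = sample_pixel_aligned(points, feat_map, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
fused = fuse_inputs(points, f_img)   # raw coordinates stand in for f_pc here
```

In this toy setup the first point projects to pixel (row 2, column 2) and the second to (row 1, column 1), so each point picks up the feature vector stored at its own pixel.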
Scene Context Aggregation:
- Instead of normalizing objects independently, the entire scene point cloud and all instances are normalized into a unified global coordinate frame.
- Cross-Attention: Each object's latent code attends to a global scene latent code via a cross-attention layer. This allows the model to use context from nearby objects to infer missing geometry for occluded parts.
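The cross-attention step can be illustrated with a single-head sketch in NumPy. Everything concrete here is an assumption for illustration: the latent sizes, the random projection matrices `w_q`, `w_k`, `w_v`, the residual connection, and the function name `cross_attention`. The actual layer in the paper may use multiple heads, normalization, and learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(z_obj, z_scene, w_q, w_k, w_v):
    """Object latents (N_obj, D) query scene latents (N_scene, D)."""
    q = z_obj @ w_q
    k = z_scene @ w_k
    v = z_scene @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product similarity
    attn = softmax(scores, axis=-1)           # each object token distributes weight over the scene
    # Residual connection keeps the object's own information alongside the context.
    return z_obj + attn @ v, attn

rng = np.random.default_rng(0)
d = 8
z_obj = rng.normal(size=(4, d))       # 4 latent tokens for one object
z_scene = rng.normal(size=(16, d))    # 16 latent tokens for the whole scene
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

z_ctx, attn = cross_attention(z_obj, z_scene, w_q, w_k, w_v)
```

Because the queries come from the object and the keys and values come from the scene, the output keeps the object's token count while mixing in information from the rest of the room.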
Unified Autoregressive Tokenization:
- The model predicts a single token stream containing Pose followed by Mesh.
- Pose Tokenization: Object poses (7-DoF bounding boxes) are encoded as sequences of 8 corner points, reusing the vertex token vocabulary of the mesh generator. This avoids introducing new token types.
- Mesh Tokenization: Uses the native tokenization of the base models (EdgeRunner uses EdgeBreaker-based compact tokens; BPT uses block-patch tokens).
- Sequence Structure: <BOS> -> [Pose Tokens] -> <SEP> -> [Mesh Tokens] -> <EOS>.
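The sequence layout can be made concrete with a toy tokenizer. This is a sketch under stated assumptions, not the EdgeRunner or BPT vocabulary: the bin count `N_BINS`, the special-token ids, the coordinate range of [-1, 1], and all function names are hypothetical. The only parts taken from the description above are the ordering of the sequence and the idea that box corners reuse the coordinate tokens.

```python
N_BINS = 512                                      # assumed coordinate resolution
BOS, SEP, EOS = N_BINS, N_BINS + 1, N_BINS + 2    # hypothetical special-token ids

def quantize(value, lo=-1.0, hi=1.0, n_bins=N_BINS):
    """Map a coordinate in [lo, hi] to an integer bin in [0, n_bins - 1]."""
    t = (min(max(value, lo), hi) - lo) / (hi - lo)
    return min(int(t * n_bins), n_bins - 1)

def pose_tokens(corners):
    """Flatten 8 (x, y, z) box corners into 24 coordinate tokens."""
    return [quantize(c) for corner in corners for c in corner]

def build_sequence(corners, mesh_tokens):
    """<BOS> -> pose tokens -> <SEP> -> mesh tokens -> <EOS>."""
    return [BOS] + pose_tokens(corners) + [SEP] + list(mesh_tokens) + [EOS]

def split_sequence(seq):
    """Recover the pose and mesh token spans from a full sequence."""
    sep = seq.index(SEP)
    return seq[1:sep], seq[sep + 1:-1]

# An axis-aligned box in the normalized scene frame and a stand-in mesh token list.
corners = [(x, y, z) for x in (-0.5, 0.5) for y in (-0.5, 0.5) for z in (-0.5, 0.5)]
mesh_tokens = [3, 7, 42]

seq = build_sequence(corners, mesh_tokens)
pose_part, mesh_part = split_sequence(seq)
```

Since the special-token ids sit outside the coordinate range, the first `<SEP>` unambiguously marks where the pose ends and the mesh begins.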
Decoding & Transformation:
- The Transformer decoder autoregressively generates pose tokens first, then mesh tokens.
- Pose-to-Mesh Alignment: The decoded global pose (bounding box corners) is used to compute an affine transformation ($T$) that maps the locally generated canonical mesh (unit cube) to the global scene coordinates.
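One plausible way to obtain $T$ from eight decoded corners is a least-squares fit against the corners of the canonical cube, sketched below. The canonical range of [-0.5, 0.5] per axis, the corner ordering, and the names `fit_affine` and `apply_affine` are assumptions; the summary does not say how the authors compute the transform.

```python
import numpy as np

# Canonical cube corners in [-0.5, 0.5]^3; the ordering is an assumed convention.
CANON = np.array([[x, y, z] for x in (-0.5, 0.5)
                            for y in (-0.5, 0.5)
                            for z in (-0.5, 0.5)])

def fit_affine(corners):
    """Least-squares 4x4 affine T such that T @ [canon, 1] ~= [corner, 1]."""
    src = np.hstack([CANON, np.ones((8, 1))])                 # (8, 4) homogeneous corners
    sol, *_ = np.linalg.lstsq(src, corners, rcond=None)       # (4, 3): linear part and offset
    T = np.eye(4)
    T[:3, :] = sol.T
    return T

def apply_affine(T, vertices):
    """Move canonical mesh vertices (N, 3) into the scene frame."""
    return vertices @ T[:3, :3].T + T[:3, 3]

# Build a test box: scale, rotate 90 degrees about the z axis, then translate.
scale = np.array([2.0, 1.0, 0.5])
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([1.0, 0.0, 3.0])
corners = (CANON * scale) @ R.T + t

T = fit_affine(corners)
placed = apply_affine(T, CANON)
```

A least-squares fit has the practical advantage that it still returns a usable transform when the decoded corners are slightly inconsistent, which is likely with quantized tokens.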

Training Strategy

Objective: Single next-token prediction (Cross-Entropy Loss).
Conditioning: The decoder is conditioned on the aggregated latent code ($z_{agg}$), which fuses geometry, pixel-aligned image features, and global scene context.
Joint Learning: By predicting pose and mesh in one sequence, the model learns to reason about geometry and layout simultaneously, allowing geometric cues to inform pose estimation and vice versa.
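Written out, a next-token objective of this kind usually takes the following form. The notation is an assumption introduced here: $s_1, \dots, s_L$ is the full token sequence (pose tokens, separator, mesh tokens, end token), $\theta$ denotes the decoder parameters, and only $z_{agg}$ comes from the description above.

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{L} \log p_\theta\left(s_t \mid s_{<t},\, z_{agg}\right)
$$

Because pose and mesh tokens share one sequence, the same sum covers both, which is what lets a single loss train layout and geometry together.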

3. Key Contributions

First Mesh-Native Autoregressive Scene Reconstruction: The first framework to perform single-view scene reconstruction directly in mesh space, eliminating the need for SDF-based decoding and Marching Cubes surface extraction.
Repurposed Generative Models: Successfully adapted object-level mesh generators (EdgeRunner/BPT) for scene-level tasks by integrating pixel-aligned image features and global scene context into the encoder.
Unified Pose-Mesh Prediction: Introduced a novel tokenization scheme that jointly predicts object poses and meshes in a single feed-forward pass, removing the need for error-prone post-hoc layout optimization.
Artist-Ready Output: Produces compact, high-fidelity meshes with clear structural boundaries suitable for downstream applications (editing, animation) without excessive face counts.

4. Experimental Results

Experiments were conducted on the synthetic 3D-FRONT dataset and real-world datasets (Pix3D, Matterport3D, ScanNet).

Quantitative Performance (3D-FRONT):
- Scene Level: PixARMesh achieves State-of-the-Art (SOTA) performance across all metrics (Chamfer Distance, CD-S, F-Score).
  - PixARMesh-BPT: CD (Scene) = $47.6 \times 10^{-3}$, F-Score = 32.26%.
  - PixARMesh-EdgeRunner: CD (Scene) = $49.1 \times 10^{-3}$, F-Score = 33.55%.
  - Outperforms diffusion-based SDF methods (e.g., DepR, MIDI) and holistic SDF methods (InstPIFu, Uni-3D).
- Object Level: Achieves the second-best performance, with F-Scores comparable to diffusion-based SDF models but with significantly lower mesh complexity.
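For readers unfamiliar with these metrics, the sketch below shows one common definition of Chamfer Distance and F-Score on sampled point sets. It is illustrative only: the summary does not give the paper's exact variant (squared or unsquared distances, number of sampled points, or the distance threshold), so the threshold `tau` and the function names are assumptions.

```python
import numpy as np

def nn_dist(a, b):
    """Distance from each point in a (N, 3) to its nearest neighbour in b (M, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)   # (N, M) pairwise distances
    return d.min(axis=1)

def chamfer_distance(pred, gt):
    """Symmetric Chamfer Distance: mean nearest-neighbour distance in both directions."""
    return nn_dist(pred, gt).mean() + nn_dist(gt, pred).mean()

def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = (nn_dist(pred, gt) < tau).mean()   # predicted points close to the ground truth
    recall = (nn_dist(gt, pred) < tau).mean()      # ground-truth points covered by the prediction
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two-point toy example: one point matches exactly, the other is off by 0.1.
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.1]])

cd = chamfer_distance(pred, gt)
f_tight = f_score(pred, gt, tau=0.05)
f_loose = f_score(pred, gt, tau=0.2)
```

Lower Chamfer Distance and higher F-Score are better, and the F-Score depends strongly on the chosen threshold, as the two thresholds in the toy example show.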
Qualitative Results:
- Generates geometrically coherent scenes with smooth surfaces and well-defined edges.
- Demonstrates strong generalization to real-world images despite being trained primarily on synthetic data.
Ablation Studies:
- Joint Modeling: Replacing joint pose-mesh modeling with a two-stage approach significantly degrades performance, which supports the benefit of unified reasoning.
- Pixel-Aligned Features: Removing image features causes the largest performance drop, highlighting their necessity for handling occlusion.
- Error Analysis: The model is robust to imperfect depth and segmentation inputs, though providing ground-truth depth and layout yields the highest theoretical fidelity (F-Score ~68% on scene level with oracle inputs).

5. Significance

PixARMesh represents a paradigm shift in 3D scene reconstruction:

Efficiency: Replaces complex optimization loops and multi-stage pipelines with a single, fast, feed-forward autoregressive pass.
Usability: By outputting native meshes rather than implicit fields, it directly addresses the needs of the graphics industry for editable, artist-ready assets.
Scalability: Demonstrates that autoregressive generative models, previously limited to single objects, can be effectively scaled to complex, multi-object indoor scenes with strong spatial reasoning capabilities.

This work bridges the gap between 2D visual understanding and 3D geometric generation, offering a viable alternative to traditional SDF-based pipelines for real-world applications.
