The Big Picture: The "Digital Sculptor's" Dilemma
Imagine you want to create a perfect 3D digital copy of a real-world object (like a vintage sneaker or a ceramic vase) using only a bunch of photos taken from different angles.
Currently, computer scientists have two main tools for this, but both have a major flaw:
- The "Shape-Only" Tool (MVS): This is great at figuring out the shape of the object (how bumpy the sole of the shoe is), but it's terrible at the texture. It might give you a perfectly shaped shoe, but the leather looks like a blurry, smeared mess.
- The "Photo-Only" Tool (NeRF/3DGS): This is amazing at making the object look photorealistic from any angle, but it's like a cloud of glowing dust. It doesn't have a solid "skin" (a mesh), so you can't easily edit it, bend it, or change the lighting without breaking the whole thing.
The Problem: Most methods treat the shape and the color as two separate problems. They build the shape first, then try to paint on it later. This often leads to a mismatch where the paint doesn't fit the bumps, making it impossible to edit the object later (like bending a finger or changing the light source).
The Solution: The "Smart Clay" Approach
This paper proposes a new way to build 3D objects. Instead of building the shape and then painting it, they do it simultaneously using a "Smart Clay" approach.
Think of it like this:
- The Mesh: This is your wireframe or the skeleton of the object.
- The Gaussians: These are like millions of tiny, glowing paint droplets that float around the object to make it look real.
The authors' secret sauce is joint optimization. They don't just move the wireframe; they move the wireframe and the paint droplets at the same time, making sure they always agree with each other.
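To make the "agree with each other" idea concrete, here is a toy sketch of joint optimization. This is not the paper's actual code: the real method optimizes mesh vertices and Gaussian parameters against rendered images, while this toy couples a single "shape" number and a single "color" number through one shared loss and steps both at once, so neither is frozen while the other trains.

```python
# Toy sketch of joint optimization (illustrative, not the paper's method):
# one loss couples a "shape" parameter and a "color" parameter, and gradient
# descent updates BOTH at every step so they stay consistent with each other.

def grad(shape, color, target_shape=2.0, target_color=3.0, coupling=0.5):
    """Gradients of a coupled quadratic loss:
    L = (s - s*)^2 + (c - c*)^2 + coupling * (s - s*) * (c - c*).
    The coupling term means the best color depends on the current shape."""
    ds = 2 * (shape - target_shape) + coupling * (color - target_color)
    dc = 2 * (color - target_color) + coupling * (shape - target_shape)
    return ds, dc

def jointly_optimize(shape=0.0, color=0.0, lr=0.1, steps=300):
    """Move shape AND color together at every iteration."""
    for _ in range(steps):
        ds, dc = grad(shape, color)
        shape -= lr * ds
        color -= lr * dc
    return shape, color
```

Run jointly, both parameters settle on their targets together (shape near 2.0, color near 3.0); a "shape first, paint later" pipeline would instead lock in a shape that the color term never got a vote on.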
How It Works: Three Simple Steps
1. The Rough Draft (The Coarse Mesh)
First, they take the photos and use existing AI (3D Gaussian Splatting) to create a "rough draft" of the object. It's like a sculptor throwing a big lump of clay on the table. It has the general shape and color, but it's messy. The edges are too smooth, and the details are blurry.
2. The "Texture-Aware" Sculpting (Remeshing)
This is the paper's biggest innovation. Usually, when a sculptor refines a model, they just look at the shape. If the shape is smooth, they make the triangles (the mesh faces) big. If the shape is bumpy, they make them small.
The Flaw in Old Methods: Imagine a duck with a smooth white wing that has a sharp green stripe.
- Old Method: The sculptor sees the wing is "smooth" (geometrically) and makes the triangles huge. But then, the green stripe gets stretched across a giant triangle, looking like a blurry smear.
- This Paper's Method: They tell the sculptor: "Don't just look at the shape! Look at the texture too!"
- If the color changes sharply (like the green stripe), the sculptor automatically cuts the triangles into tiny pieces to capture that detail.
- If the color is flat (like the white wing), they keep the triangles big to save space.
- The Analogy: It's like a tailor cutting fabric. If the fabric has a complex pattern, they cut small, precise pieces. If it's plain, they use big pieces. This prevents "color leakage" and keeps the details sharp.
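The duck-wing rule above can be sketched as a tiny decision function. The names and thresholds here are ours, purely for illustration: a geometry-only remesher would check only the geometric error, while a texture-aware one also subdivides when the colors across a face disagree sharply.

```python
import math

def color_variation(face_colors):
    """Largest pairwise RGB distance among a face's three vertex colors."""
    pairs = [(0, 1), (1, 2), (0, 2)]
    return max(math.dist(face_colors[i], face_colors[j]) for i, j in pairs)

def needs_split(face_colors, geo_error, geo_thresh=0.01, tex_thresh=0.2):
    """Texture-aware refinement test (illustrative thresholds): subdivide a
    face if EITHER its geometric error OR its color variation is large.
    A geometry-only remesher would test geo_error alone."""
    return geo_error > geo_thresh or color_variation(face_colors) > tex_thresh

# Flat white wing: smooth shape, flat color -> keep one big triangle.
white_wing = [(1, 1, 1), (1, 1, 1), (1, 1, 1)]
# Same smooth shape, but a green stripe crosses the face -> subdivide anyway,
# even though the geometry alone says "smooth".
striped_wing = [(1, 1, 1), (1, 1, 1), (0.1, 0.8, 0.1)]
```

With a tiny geometric error (0.001), `needs_split` keeps the white wing as one big triangle but splits the striped one, which is exactly what stops the green stripe from smearing across a giant face.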
3. The "Double Agent" Binding (Gaussian-Mesh Link)
Once the mesh is perfect, they need to make it editable. They create a "binding" between the solid mesh and the floating paint droplets (Gaussians).
- The Analogy: Imagine the mesh is a puppet, and the Gaussians are the strings.
- If you pull the puppet's arm (deform the mesh), the strings (Gaussians) move with it perfectly.
- If you change the lighting in the room, the object still looks realistic, because the high-quality paint droplets stay anchored to the puppet's surface instead of drifting off on their own.
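One common way to implement this puppet-and-strings binding (a standard technique; the function names and 2D setup here are ours, not necessarily the paper's exact formulation) is to store each Gaussian as barycentric coordinates on one mesh face. When the face's vertices move, re-evaluating those coordinates moves the Gaussian with them automatically.

```python
# Illustrative 2D sketch of Gaussian-to-face binding via barycentric
# coordinates. Names are hypothetical, not taken from the paper.

def bind(point, tri):
    """Express a 2D point as barycentric weights (u, v, w) on triangle tri."""
    (ax, ay), (bx, by), (cx, cy) = tri
    det = (bx - ax) * (cy - ay) - (cx - ax) * (by - ay)
    v = ((point[0] - ax) * (cy - ay) - (cx - ax) * (point[1] - ay)) / det
    w = ((bx - ax) * (point[1] - ay) - (point[0] - ax) * (by - ay)) / det
    return 1.0 - v - w, v, w

def world_position(bary, tri):
    """Recover a Gaussian's position from its binding on a (possibly moved) face."""
    u, v, w = bary
    (ax, ay), (bx, by), (cx, cy) = tri
    return (u * ax + v * bx + w * cx, u * ay + v * by + w * cy)

rest_face = [(0, 0), (1, 0), (0, 1)]
bary = bind((0.5, 0.2), rest_face)       # attach the "paint droplet"
bent_face = [(0, 0), (2, 0), (0, 1)]     # pull the puppet's arm
moved = world_position(bary, bent_face)  # droplet follows to (1.0, 0.2)
```

Because the weights (u, v, w) never change, deforming the mesh stretches the paint droplets along with the surface, which is why the texture bends naturally instead of tearing away.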
Why Does This Matter? (The "So What?")
Because they fixed the connection between shape and color, this new method opens up cool new possibilities:
- Relighting: You can take a 3D capture of a red car photographed in a dark garage, and the AI can instantly make it look like it's parked in bright sunlight, with realistic reflections and shadows.
- Deformation: You can take a 3D model of a human face and make it smile, or twist a vase, and the texture (the skin or the pattern) will stretch and bend naturally without looking like a glitchy video game.
- Editing: You can easily cut, paste, or modify parts of the object because it has a solid, clean structure (the mesh) rather than just a cloud of data.
The Results
The authors tested this on many objects from the DTU and DTC datasets.
- Accuracy: Their 3D models are sharper and closer to the real object than previous methods.
- Speed: It's fast. They can take a rough 3D model and refine it in a matter of minutes.
- Visuals: The textures are crisp. You can read the text on a toy airplane or see the stitching on a sneaker, which was blurry in older methods.
Summary
This paper is about teaching computers to sculpt and paint at the same time. By making the "skeleton" of the 3D object aware of the "skin" (texture), they create digital objects that are not only beautiful to look at but are also easy to bend, twist, and light up for movies, games, and virtual reality.