Imagine you have a black-and-white 3D movie of a bustling city. You can see the buildings, the cars, and the people moving around perfectly, but everything is in shades of gray. Your goal is to paint this world in full, vibrant color.
This is exactly the problem the paper LoGoColor tries to solve.
The Problem: The "Blurry Paintbrush" Effect
Previous methods tried to solve this by asking a smart AI (trained on 2D photos) to guess the color of every single angle of the 3D scene.
Think of it like this: You ask 100 different artists to paint the same tree from different angles.
- Artist A says, "The leaves are bright green."
- Artist B says, "No, they are dark green."
- Artist C says, "Actually, they look yellowish."
Old methods would take all 100 answers, mix them together in a blender, and say, "Okay, the tree is a muddy, average green." This averaging keeps the colors consistent (no view disagrees with another), but it kills the detail. The result is a dull, monotonous world where distinct objects (like a red apple or a blue sign) all blur into the same grayish-brown mush.
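A tiny numerical sketch makes the "blender" problem concrete. The color values below are invented for illustration (they are not from the paper); the point is just that averaging several saturated guesses always produces a less saturated result:

```python
# Hypothetical per-view color guesses for the same tree's leaves, as
# (R, G, B) triples in [0, 1]. Illustrative values, not from the paper.
guesses = [
    (0.2, 0.9, 0.2),   # "bright green"
    (0.1, 0.4, 0.1),   # "dark green"
    (0.7, 0.8, 0.2),   # "yellowish"
]

# Per-channel average across all views -- the "blender" step.
averaged = tuple(sum(c) / len(guesses) for c in zip(*guesses))
print(averaged)  # roughly (0.33, 0.70, 0.17): a muted, in-between green

def saturation(rgb):
    # Crude saturation proxy: spread between strongest and weakest channel.
    return max(rgb) - min(rgb)

# Averaging shrinks that spread, which is exactly the "muddy" effect.
print(saturation(guesses[0]), saturation(averaged))
```

No matter how vivid the individual guesses are, the channel-wise mean sits between them, so the blended color is always duller than the boldest opinion in the room.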
The Solution: The "Local-Global" Team
The authors realized that to get rich, diverse colors, you can't just blend everyone's opinions; you need a smarter strategy. They call their approach LoGoColor (Local-Global Colorization).
Here is how they do it, using a simple analogy:
1. Breaking the City into Neighborhoods (Local)
Instead of trying to paint the whole city at once, they divide the 3D scene into smaller "neighborhoods" or subscenes.
- The Strategy: They pick a few key "viewpoints" (like a photographer standing in the center of a neighborhood) to represent each area.
- The Benefit: This allows the AI to focus on the specific details of that neighborhood without getting confused by the rest of the city.
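To make the "neighborhoods" idea tangible, here is a toy sketch of splitting camera viewpoints into local groups. The grid-cell grouping and the example positions are my own stand-ins, not the paper's actual partitioning scheme:

```python
# Hypothetical camera positions (x, y) scattered around a scene.
cameras = [
    (0.5, 0.5), (1.2, 0.8), (0.9, 1.4),   # cluster near the origin
    (8.1, 7.9), (7.6, 8.4), (8.8, 8.2),   # cluster farther away
]

CELL = 5.0  # side length of one neighborhood cell (arbitrary choice)

neighborhoods = {}
for x, y in cameras:
    key = (int(x // CELL), int(y // CELL))  # grid cell this camera falls in
    neighborhoods.setdefault(key, []).append((x, y))

# A representative "photographer" viewpoint per neighborhood: the centroid.
for key, members in neighborhoods.items():
    cx = sum(x for x, _ in members) / len(members)
    cy = sum(y for _, y in members) / len(members)
    print(key, len(members), (round(cx, 2), round(cy, 2)))
```

Each group gets colorized with only its own nearby views in mind, which is what lets the model stay focused on local detail.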
2. The "Team Huddle" (Global Consistency)
Now, here is the tricky part. If you let the artists paint each neighborhood independently, the red car in Neighborhood A might look different from the red car in Neighborhood B. That's bad.
To fix this, they use a Multi-View Diffusion Model (a super-smart AI) as a "Team Captain."
- The Huddle: Before finalizing the colors, the Team Captain gathers all the neighborhood leaders. They look at each other's work and say, "Hey, that red car needs to match the red car over there."
- The Calibration: The AI adjusts the colors so that the whole city looks consistent, but without blending them into a muddy average. It preserves the unique "personality" of each object while making sure they all fit together.
3. Painting the Whole World
Once the "Team Captain" has agreed on a consistent color palette for the key viewpoints, the AI uses those as a reference to paint every single angle of the 3D scene.
- Because the reference is consistent, the final 3D model doesn't flicker or change colors as you walk around it.
- Because they didn't "blend" the colors, the details remain sharp and vibrant.
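The propagation step above can be sketched in a few lines. This is a deliberately simplified stand-in (nearest key viewpoint wins), not the paper's actual mechanism, but it shows why a shared set of calibrated references keeps colors stable as the camera moves:

```python
import math

# After the "team huddle", each key viewpoint carries an agreed-on color
# reference. These positions and labels are invented for illustration.
key_views = {
    (0.0, 0.0): "warm palette",
    (10.0, 10.0): "cool palette",
}

def palette_for(view):
    # Toy propagation: borrow the reference of the nearest key viewpoint.
    return min(key_views.items(), key=lambda kv: math.dist(kv[0], view))[1]

print(palette_for((1.0, 2.0)))   # warm palette
print(palette_for((9.0, 8.0)))   # cool palette
```

Because every novel viewpoint is tied back to the same calibrated references, walking through the scene never produces the flickering you would get from independently colorized frames.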
Why This Matters
- For VR and AR: Imagine putting on a headset to visit a museum. If the paintings are all muddy gray because the computer "averaged" the colors, it's boring. With LoGoColor, the paintings are vivid, and the statues have real skin tones.
- For Night Vision and Medical Imaging: Sometimes we only have black-and-white data (like thermal cameras or X-rays). LoGoColor can take that scary, gray data and turn it into a realistic, colorful 3D world that doctors or robots can actually understand and use.
In a Nutshell
Old methods were like a committee that voted on a color and picked the "average" result, leading to boring, gray worlds. LoGoColor is like a skilled director who organizes a team of painters, ensures they all agree on the big picture (Global), but lets them keep the unique, bright details of their specific scenes (Local). The result? A 3D world that is both consistent and bursting with life.