The Big Picture: Teaching a Robot to Find Its Way
Imagine you are trying to teach a robot to navigate a city. To do this, you need to show it thousands of photos of that city from different angles so it learns, "Okay, when I see this specific brick on this building, I know exactly where I am."
This is called Visual Localization. There are two main ways to teach the robot:
- The "Gist" Method (Camera Pose Regression): You show the robot a photo and ask, "Where are you?" The robot looks at the whole picture and makes one direct guess of its position and orientation. It's fast, but coarse: one rough answer for the entire image, like guessing your street address from a quick glance out the window.
- The "Pinpoint" Method (Scene Coordinate Regression - SCR): This is the method this paper focuses on. Here, you ask the robot to look at every single pixel in the photo and say, "That pixel is 5 meters to the left, 2 meters up." In other words, it predicts a 3D world coordinate for each pixel, and then solves for the camera's pose geometrically from those pixel-to-3D matches. It's like having a GPS that knows the exact location of every single brick. This is much more accurate, but it's very picky. If the photo is blurry or has a fake-looking building, the robot gets confused and fails.
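To make the contrast concrete, here is a minimal NumPy sketch of what the two methods actually output. The image size and variable names are made up for illustration; real networks are far larger:

```python
import numpy as np

H, W = 48, 64   # tiny toy image

# "Gist" method: the network regresses ONE pose per image,
# e.g. 3 translation values + 3 rotation values.
pose = np.zeros(6)

# "Pinpoint" method (SCR): the network predicts a 3D scene
# coordinate (X, Y, Z in the world frame) for EVERY pixel.
scene_coords = np.zeros((H, W, 3))

# Each pixel (u, v) paired with its predicted 3D point is a
# 2D-3D correspondence; a PnP solver with RANSAC (e.g. OpenCV's
# solvePnPRansac) then recovers the camera pose from them.
correspondences = [((u, v), scene_coords[v, u])
                   for v in range(H) for u in range(W)]
```

Because RANSAC can reject a handful of outlier correspondences, SCR tolerates a few bad pixels, but a photo full of them overwhelms it, which is exactly the problem described next.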
The Problem: The "Fake" Photos Are Too Fake
To teach the robot without taking millions of real photos, scientists use Neural View Synthesis (NVS). Think of this as a 3D printer that can generate new photos of the city from angles the robot has never seen before.
However, these 3D printers (like NeRF or 3D Gaussian Splatting) have a flaw: they can only copy what they've already seen.
- If you ask them to show a building from the back, but they only saw the front, they might just stretch the front wall or make a blurry mess.
- For the "Gist" method, a blurry mess is okay.
- For the "Pinpoint" method (SCR), a blurry mess is a disaster. If the robot thinks a fake, blurry pixel is a real brick, it will calculate the wrong location and crash.
The Dilemma: We want to use these synthetic photos to teach the robot more, but the photos are often "hallucinations" (fake details) that confuse the robot.
The Solution: PoI (Pixel of Interest)
The authors created a system called PoI (Pixel of Interest). Think of PoI as a super-smart editor that sits between the 3D printer and the robot student.
Here is how PoI works in three steps:
1. The "Magic Touch" (Diffusion Refinement)
First, they take the blurry, stretched-out photos from the 3D printer and run them through a Diffusion Model.
- Analogy: Imagine a sketch artist who drew a building but forgot the windows. A diffusion model is like a second artist who looks at the sketch and says, "I know what windows usually look like on this style of building," and paints them in.
- Result: The photos look much sharper and more realistic. But... they might still have some "fake" windows that don't actually exist in the real world.
2. The "Trust Filter" (The Core Innovation)
This is the most important part. Even after the "Magic Touch," some pixels are still unreliable. PoI acts like a bouncer at a club.
- It looks at every single pixel in the new photo.
- It asks: "Does this pixel match up with what we know about the real world?" To answer, it projects the pixel's predicted 3D coordinate back into the image and measures how far it lands from where it should be (this is called checking the reprojection error).
- The Bouncer's Decision:
- Pixel A: "You look perfect! You match the real building." -> Let it in. (This is a "Pixel of Interest").
- Pixel B: "You look weird. You're a hallucination." -> Kick it out.
- The Magic: PoI doesn't throw away the whole photo if it has a few bad pixels. It just throws away the bad pixels and keeps the good ones. It's like eating a pizza and picking out the burnt pepperoni slices but keeping the cheese.
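The bouncer's check can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the intrinsics, pose, and 5-pixel threshold are invented for the example. Each predicted 3D scene coordinate is projected back into the rendered view, and a pixel is kept only if it lands close to where it should:

```python
import numpy as np

def reprojection_error(scene_coords, pixels, K, R, t):
    """Per-pixel reprojection error (in pixels) of predicted 3D scene
    coordinates under the known camera pose of the rendered view."""
    cam = scene_coords @ R.T + t        # world frame -> camera frame
    proj = cam @ K.T                    # camera frame -> image plane
    uv = proj[:, :2] / proj[:, 2:3]     # perspective divide
    return np.linalg.norm(uv - pixels, axis=1)

# Toy pinhole camera at the origin: focal length 500 px,
# principal point at the image center (640x480 image).
K = np.array([[500.,   0., 320.],
              [  0., 500., 240.],
              [  0.,   0.,   1.]])
R, t = np.eye(3), np.zeros(3)

pixels = np.array([[320., 240.], [100., 50.]])
coords = np.array([[0., 0., 2.],    # projects exactly to (320, 240): good
                   [1., 1., 2.]])   # projects nowhere near (100, 50): bad

err = reprojection_error(coords, pixels, K, R, t)
keep = err < 5.0   # the "bouncer": only low-error Pixels of Interest pass
# keep == [True, False]
```

The key design choice matches the pizza analogy: the mask `keep` is per pixel, so one hallucinated region never forces PoI to discard an otherwise useful synthetic photo.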
3. The "Dynamic Teacher"
As the robot learns, PoI gets stricter.
- Early in training: The robot is a baby. PoI is lenient, letting in more pixels to help the robot learn fast.
- Later in training: The robot is smarter. PoI becomes a strict teacher, only letting in the pixels it trusts most. It gradually lowers the weight of the "fake" pixels until they don't confuse the robot anymore.
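One way to picture the stricter-over-time teacher is a per-pixel loss weight whose tolerance shrinks as training progresses. The schedule below (linear tightening of a tolerance `tau`, exponential down-weighting) is a hypothetical illustration of the idea, not the paper's exact formula:

```python
import numpy as np

def pixel_weight(error, step, total_steps, tau_start=10.0, tau_end=1.0):
    """Soft trust weight in (0, 1] for each pixel's loss term.

    Early in training the tolerance tau is loose; it tightens linearly,
    so unreliable (high-error) pixels fade out of the loss over time.
    (A made-up schedule for illustration only.)
    """
    frac = step / total_steps
    tau = tau_start + frac * (tau_end - tau_start)
    return np.exp(-error / tau)

err = np.array([0.5, 8.0])   # one good pixel, one suspect pixel
early = pixel_weight(err, step=0,   total_steps=100)   # lenient: both count
late  = pixel_weight(err, step=100, total_steps=100)   # strict: suspect pixel ~0
```

Early on, even the suspect pixel contributes noticeably; by the end, its weight has collapsed toward zero while the good pixel still carries most of its weight.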
Why This Matters
Before this paper, scientists tried to use these synthetic photos for the "Pinpoint" method, but it made the robot worse because the fake details were too confusing.
PoI changes the game by saying: "We don't need the whole photo to be perfect. We just need the right pixels to be perfect."
The Results
The team tested this on real-world datasets (like the "7Scenes" dataset, which is like a digital museum of rooms, and "Cambridge Landmarks," which is like a digital map of famous city squares).
- Without PoI: The robot got lost or confused.
- With PoI: The robot achieved state-of-the-art accuracy, outperforming previous methods on these benchmarks. It learned faster and made fewer mistakes, all while using the "fake" photos as a helpful supplement rather than a distraction.
Summary Analogy
Imagine you are trying to learn a new language by reading a book written by a translator who sometimes makes up words.
- Old Way: You read the whole book. You learn the language, but you also learn the made-up words, so you speak incorrectly.
- PoI Way: You have a smart editor (PoI). The editor reads the book, highlights the words that are definitely correct, and crosses out the made-up ones. You only study the highlighted words. You learn the language perfectly, and you learn it much faster because you have more material to study, but you aren't confused by the errors.
In short: PoI is a filter that saves us from the "hallucinations" of AI-generated photos, allowing us to use them to teach robots how to navigate the real world with extreme precision.