Image Compression Using Novel View Synthesis Priors

This paper proposes a model-based image compression technique for tetherless underwater remotely operated vehicles that leverages novel view synthesis priors and gradient descent optimization to achieve superior compression ratios and image quality, particularly in scenarios involving new objects within the scene.

Luyuan Peng, Mandar Chitre, Hari Vishnu, Yuen Min Too, Bharath Kalyan, Rajat Mishra, Soo Pieng Tan

Published Wed, 11 Ma

Imagine you are trying to talk to a friend who is deep underwater in a submarine, but the only way to communicate is through a very slow, crackly walkie-talkie. You want to send them a photo of a shipwreck you just found, but the "walkie-talkie" (acoustic communication) is so slow that sending a full photo would take forever. By the time the photo arrives, you've already moved on to the next spot.

This is the exact problem the researchers in this paper are trying to solve for underwater robots (ROVs).

Here is a simple breakdown of their solution, NVSPrior, using some everyday analogies.

The Problem: The "Slow Walkie-Talkie"

Underwater, you can't use Wi-Fi or radio waves (they don't travel well through water). You have to use sound waves (acoustics), which are like a very narrow pipe. You can send simple text commands easily, but trying to send a high-quality video or photo is like trying to pour a swimming pool of water through that narrow pipe. It just takes too long.

The Old Way: Sending the Whole Picture

Traditionally, if the robot wanted to send a photo, it would take the picture, squish it as small as possible (like zipping a file), and send it. But even the best "zippers" (like JPEG or WebP) still leave too much data for these slow underwater pipes. The robot would send maybe 1 or 2 pictures a second, which is too slow for a human to control the robot effectively.

The New Idea: The "Mental Map" Trick

The researchers came up with a clever trick. They realized that underwater inspection sites (like a shipwreck or an oil rig) don't change much from day to day. The rocks, the rust, and the structure are always there.

Instead of sending the whole picture every time, they decided to send a mental map and only the changes.

Here is how it works, step-by-step:

1. The "Training Camp" (Creating the Prior)

Before the robot goes on its real mission, it does a "training run." It swims around the site and takes hundreds of photos.

  • The Analogy: Imagine you are an artist who wants to draw a specific house. You spend a week studying the house from every angle. You memorize exactly where the windows, the door, and the chimney are. You create a perfect 3D mental model of the house in your head.
  • In the paper: This mental model is called a NVS (Novel View Synthesis) model. It's a digital 3D map of the underwater site stored on both the robot and the human operator's computer.
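The crucial detail in this step is that the robot and the operator must end up with *identical* copies of the scene model, so that both sides can later render the exact same prediction. In a real system that means training an NVS model and copying its weights to both ends; the toy below stands in for that with a deterministic "model" built from a shared seed (all names here are illustrative, not from the paper).

```python
import random

def build_prior(seed: int, n: int = 8):
    """Toy stand-in for a trained NVS model: a deterministic table of
    'scene' values that any party can rebuild from the same seed."""
    rng = random.Random(seed)          # deterministic given the seed
    return [rng.randrange(256) for _ in range(n)]

robot_prior = build_prior(42)          # stored on the ROV
operator_prior = build_prior(42)       # stored on the operator's computer

assert robot_prior == operator_prior   # identical "mental maps" on both ends
```

If the two copies ever drifted apart, the robot's "difference" messages would be decoded against the wrong background, so keeping the prior in sync is a precondition for everything that follows.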

2. The "Guessing Game" (During the Mission)

Now, the robot is on its real mission. It takes a new photo.

  • The Old Way: Send the whole photo.
  • The New Way: The robot looks at its 3D mental map and asks, "If I am standing here looking this way, what should the house look like?"
  • Using the NVS model, it renders (draws) a fake picture of what the scene should look like from its current position and viewing angle.
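A real NVS model maps a full 6-DoF camera pose to a rendered image. As a toy stand-in for "ask the mental map what I should see from here," the sketch below treats the shared map as one big 2D scene and a pose as an (x, y) window position, so rendering is just reading out the pixels that camera would see. Everything here (the scene pattern, window size, function names) is illustrative, not from the paper.

```python
# Toy shared scene: a 64x64 grid of "pixel" values known to both sides.
SCENE_W = SCENE_H = 64
scene = [[(x * 3 + y * 5) % 256 for x in range(SCENE_W)]
         for y in range(SCENE_H)]

def render(pose, w=16, h=16):
    """Predict the w x h view a camera at integer pose (x, y) would see."""
    px, py = pose
    return [row[px:px + w] for row in scene[py:py + h]]

view_a = render((10, 20))
view_b = render((11, 20))            # one step to the right: a shifted view
print(view_a[0][1] == view_b[0][0])  # overlapping pixels agree
```

The point of the sketch: given only a pose (a handful of numbers), both the robot and the operator can regenerate the same full-size predicted image locally, with nothing image-sized sent over the link.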

3. The "Difference" (The Secret Sauce)

The robot compares the Real Photo it just took with the Fake Photo it just drew from its mental map.

  • The Analogy: Imagine you are playing a game of "Spot the Difference."
    • If the scene hasn't changed, the Real Photo and the Fake Photo are identical. The "difference" is zero. You send nothing!
    • If there is a fish swimming by, or a new piece of trash, or a slight change in lighting, the two photos won't match perfectly.
    • The robot only calculates the tiny differences (the fish, the trash, the lighting shift).
  • In the paper: This is called the residual or difference image. Because most of the scene is already known (from the mental map), this "difference" file is tiny. It's like sending a note that says, "The house is the same, but there's a blue fish in the corner."
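The residual trick above can be shown end to end with a toy encoder. The sketch below uses two 16x16 byte "photos" that differ only where a small new object appears, and stdlib `zlib` as a stand-in for whatever entropy coder the paper actually uses; the image pattern and sizes are made up for illustration.

```python
import zlib

# Toy 16x16 grayscale "photos" as flat byte strings. The predicted view
# comes from the shared scene model; the real photo differs only where a
# new object (the "fish") appears.
W = H = 16
predicted = bytes((x * 7 + y * 13) % 256 for y in range(H) for x in range(W))

real = bytearray(predicted)
for y in range(4, 8):                    # a small new object in one corner
    for x in range(2, 6):
        real[y * W + x] = 255
real = bytes(real)

# Residual: per-pixel difference, mod 256 so each stays one byte.
residual = bytes((r - p) % 256 for r, p in zip(real, predicted))

sent_full = zlib.compress(real, 9)       # old way: compress the whole photo
sent_diff = zlib.compress(residual, 9)   # new way: compress the difference
print(len(sent_full), len(sent_diff))    # the mostly-zero residual is tiny

# Receiver side: rebuild the photo exactly from its own prior + residual.
restored = bytes((p + d) % 256
                 for p, d in zip(predicted, zlib.decompress(sent_diff)))
assert restored == real
```

Because the residual is mostly zeros, it compresses far smaller than the full photo, and the receiver still reconstructs the real image exactly.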

4. The "Refinement" (iNVS)

Sometimes, the robot isn't 100% sure of its exact location. If it guesses its location wrong by even a tiny bit, the "Fake Photo" will be slightly shifted, and the "Difference" will look like a messy blur (which is hard to compress).

  • The Solution: The paper introduces a smart algorithm called iNVS, which uses gradient descent to fine-tune the pose estimate. It's like a super-fast auto-correct.
  • The Analogy: Imagine you are trying to align two transparent sheets of paper. If they are slightly off, the image looks blurry. The iNVS algorithm nudges the "Fake Photo" sheet back and forth until it lines up perfectly with the Real Photo.
  • Once they are perfectly aligned, the "Difference" is just the actual new objects (the fish), making the file size incredibly small.
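The paper refines the pose with gradient descent so the rendered view lines up with the real photo before differencing. The toy below captures the same idea with a discrete hill-climbing search over (x, y) shifts instead of true gradients: from a slightly wrong pose estimate, it keeps nudging toward whichever neighbouring pose lowers the mismatch. The scene, the error metric, and the search are all illustrative stand-ins, not the paper's actual method.

```python
# Toy shared scene, as in the rendering sketch.
SCENE_W = SCENE_H = 64
scene = [[(x * 3 + y * 5) % 256 for x in range(SCENE_W)]
         for y in range(SCENE_H)]

def render(pose, w=16, h=16):
    px, py = pose
    return [row[px:px + w] for row in scene[py:py + h]]

def mse(a, b):
    """Mean squared pixel error between two equal-size views."""
    n = sum(len(row) for row in a)
    return sum((pa - pb) ** 2
               for ra, rb in zip(a, b)
               for pa, pb in zip(ra, rb)) / n

real_photo = render((23, 31))   # view from the camera's true position
pose = (20, 28)                 # robot's slightly wrong pose estimate

# Nudge the pose toward whichever neighbour lowers the mismatch, and stop
# when no neighbour improves on the current pose.
while True:
    best = min(((pose[0] + dx, pose[1] + dy)
                for dx in (-1, 0, 1) for dy in (-1, 0, 1)),
               key=lambda p: mse(render(p), real_photo))
    if best == pose:
        break
    pose = best

print(pose)  # recovers the true pose, so the residual would be all zeros
```

Once the poses agree, the rendered and real views match pixel for pixel in this toy, which is exactly the condition that makes the residual cheap to compress.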

Why is this a Big Deal?

The researchers tested this in a giant water tank and on real underwater datasets (like a coral reef and a sunken torpedo boat).

  • The Result: Their method produced files 2 to 4 times smaller than the best standard codecs (like WebP or JPEG).
  • The Benefit: Instead of getting 2 frames per second, the operator could get 10 frames per second. This makes the robot feel much more responsive, allowing for real-time control and inspection.
  • Robustness: Even when new things appeared in the scene (like a new metal structure or a fish), the system handled it well because it only had to send the new stuff, not the whole background.
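The frame-rate benefit follows from simple arithmetic: over a fixed-rate acoustic link, frames per second scales inversely with compressed frame size. The link rate and frame sizes below are made-up illustrative numbers, not figures from the paper.

```python
LINK_BPS = 60_000                  # hypothetical acoustic link, bits per second

def fps(frame_bytes: int) -> float:
    """Frames per second the link can sustain at a given frame size."""
    return LINK_BPS / (frame_bytes * 8)

jpeg_frame = 3_750                 # hypothetical JPEG-compressed frame, bytes
nvs_frame = jpeg_frame // 3        # residual frame ~3x smaller

print(fps(jpeg_frame), fps(nvs_frame))
```

Shrinking each frame by a factor of three triples the achievable frame rate over the same pipe, which is why smaller residuals translate directly into a more responsive robot.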

Summary

Think of it like sending a text message instead of a photo.

  • Old Way: "Here is a photo of the ocean floor." (Huge file, slow to send).
  • New Way: "The ocean floor looks exactly like the map we made yesterday, except there is a crab in the bottom left corner." (Tiny file, instant to send).

By using a shared "memory" of the underwater world, this technique allows robots to send high-quality video back to humans even through the slowest, narrowest underwater connections.