UFO-4D: Unposed Feedforward 4D Reconstruction from Two Images

Imagine you are looking at two snapshots of a busy street scene taken a split second apart. In the first photo, a car is driving past a building. In the second, the car has moved, and the camera has shifted slightly.

Your brain instantly figures out three things:

Where the objects are (the 3D shape of the car and building).
How they moved (the car drove forward, the building stayed still).
How you moved (the camera panned to the right).

Doing this mathematically is incredibly hard for computers, especially when you don't know exactly where the camera was pointing (unposed images). Usually, computers either have to run slow, heavy calculations to guess these answers, or they need a massive amount of pre-labeled training data, which is scarce for real-world scenes.

Enter UFO-4D.

Think of UFO-4D not as a calculator, but as a magical, instant 3D sculptor. Here is how it works, using some everyday analogies:

1. The "Magic Dust" (Dynamic 3D Gaussians)

Most 3D reconstruction tries to build a scene out of a million tiny, rigid Lego bricks. If a brick moves, you have to rebuild the whole wall.

UFO-4D uses something different: 3D "Magic Dust" (Gaussians). Imagine the scene is made of thousands of glowing, fuzzy clouds of paint.

Each cloud has a position, a color, and a velocity (a built-in instruction on how fast and in what direction it wants to move).
When the computer looks at your two photos, it doesn't just guess the shape; it instantly sprays this "magic dust" into the air to form the car, the building, and the road.
Because the dust has velocity instructions, the computer knows exactly how the car's dust will shift to match the second photo, and how the building's dust stays put.

2. The "One-Stop Shop" (Unified Feedforward)

Old methods are like hiring three different specialists: one to guess the shape, one to guess the motion, and one to guess the camera angle. They often disagree with each other, and you have to wait for them to argue it out (slow optimization).

UFO-4D is a super-genius general contractor.

It looks at the two photos and, in a single instant (a "feedforward" pass), it hands you the finished 3D model, the motion map, and the camera movement all at once.
Because it builds everything from the same set of magic dust, the shape and motion stay consistent with each other and with the camera movement, instead of being stitched together from separate guesses.

3. The "Self-Correcting Mirror" (Self-Supervision)

Here is the cleverest part. Usually, to teach a computer 3D, you need a teacher with a perfect answer key (labeled data). But perfect 3D data is rare.

UFO-4D uses a self-checking mirror.

It builds its 3D model, then it tries to "paint" the two original photos back onto a canvas using that model.
If the painted photo looks different from the real photo, the model knows, "Oops, I got the shape or motion wrong."
During training, that mismatch is used to correct the model. It doesn't need a human teacher for this; it just needs its own renderings to match the real photos. This allows it to learn from messy, real-world data where perfect answers don't exist.

4. The "Time Machine" (4D Interpolation)

Because the model knows the "velocity" of every single particle of dust, it can do something amazing: Time Travel.

If you want to see the scene at a time between the two photos, or from a camera angle that doesn't exist, UFO-4D just tells the dust clouds to move to their new positions and repaints the scene.

It can create a smooth, high-quality video of the car driving by, even if you only gave it two static photos.
It can show you the scene from new camera angles, including areas that were visible in only one of the two photos, because the dust clouds record where things are in 3D rather than how they looked in a single picture.

Why is this a big deal?

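Speed: A single feedforward pass replaces slow per-scene optimization, making real-time or near-real-time reconstruction possible.
Accuracy: Its motion error is about 3 times lower than the best competing methods on Stereo4D and KITTI, and its shape estimates are the most accurate or competitive on the tested benchmarks.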
Versatility: It solves the puzzle of "Shape," "Motion," and "Camera" all together, rather than trying to solve them separately.

In summary: UFO-4D is like giving a computer a pair of glasses that instantly turns flat photos into a living, breathing 3D world where every object knows how to move, and the computer can watch that world play out in slow motion or from any angle it wants.

1. Problem Statement

The paper addresses the challenge of dense 4D reconstruction (recovering 3D geometry, 3D motion, and camera pose) from a pair of unposed images (images whose camera poses are unknown).

Current Limitations: Existing methods typically rely on slow, test-time optimization pipelines (iterative solvers) that are computationally expensive and dependent on intermediate signals like depth or optical flow. Alternatively, recent feedforward models are often fragmented, handling specific tasks (e.g., just geometry or just motion) separately, or requiring camera poses as input.
Data Scarcity: Training robust models is hindered by a lack of large-scale, densely annotated 4D datasets. Real-world data often has sparse or noisy ground truth, while synthetic data suffers from domain gaps.
Goal: To create a unified, feedforward framework that reconstructs a dense, explicit 4D representation from two unposed images in a single pass, enabling joint estimation of geometry, motion, and pose without iterative optimization.

2. Methodology: UFO-4D

UFO-4D introduces a unified feedforward model that predicts Dynamic 3D Gaussian Splats (D-3DGS) and relative camera pose directly from an input image pair.

A. Core Representation: Dynamic 3D Gaussians

Instead of predicting per-pixel depth or flow maps, the model outputs a set of dynamic 3D Gaussians in a canonical coordinate system (defined by the first image).

Attributes: Each Gaussian $p$ is defined by:
- 3D center $\mu \in \mathbb{R}^3$
- 3D motion vector $v \in \mathbb{R}^3$ (forward for $t \to t+1$, backward for $t+1 \to t$)
- Rotation (quaternion $r$), Scale ($s$), View-dependent color (spherical harmonics $h$), and Opacity ($o$).
Temporal Alignment: To represent the scene at an intermediate time $t' = t + \Delta t$, the 3D centers are translated linearly: $\mu' = \mu + \Delta t \cdot v$. This allows for continuous 4D interpolation.
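
As a rough illustration of the representation and the linear temporal alignment above, the following sketch stores the listed attributes in a small container and moves the centers by $\Delta t \cdot v$. The class name `DynamicGaussians`, the method `at_time`, and the array shapes are assumptions made for this example, not the paper's code.

```python
from dataclasses import dataclass, replace

import numpy as np


@dataclass(frozen=True)
class DynamicGaussians:
    """Hypothetical container for N dynamic Gaussians (field names follow the text)."""
    mu: np.ndarray  # (N, 3) centers in the first camera's frame
    v: np.ndarray   # (N, 3) motion vectors from t to t+1
    r: np.ndarray   # (N, 4) rotation quaternions
    s: np.ndarray   # (N, 3) scales (assumed per-axis)
    h: np.ndarray   # (N, K, 3) spherical-harmonics color coefficients
    o: np.ndarray   # (N,) opacities

    def at_time(self, dt: float) -> "DynamicGaussians":
        """Return a copy with centers at mu + dt * v; other attributes are reused."""
        return replace(self, mu=self.mu + dt * self.v)


# Two toy Gaussians: a static one (building) and a moving one (car).
scene = DynamicGaussians(
    mu=np.array([[0.0, 0.0, 5.0], [1.0, 0.0, 4.0]]),
    v=np.array([[0.0, 0.0, 0.0], [0.5, 0.0, -1.0]]),
    r=np.tile([1.0, 0.0, 0.0, 0.0], (2, 1)),
    s=np.full((2, 3), 0.1),
    h=np.zeros((2, 1, 3)),
    o=np.ones(2),
)
halfway = scene.at_time(0.5)
print(halfway.mu)  # static Gaussian unchanged, moving one shifted by half of v
```

Only the centers change with time in this sketch; rotation, scale, color, and opacity are carried over unchanged, which matches the linear translation described in the text.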

B. Network Architecture

Inspired by DUSt3R and NoPoSplat, the architecture consists of:

Weight-Sharing Encoder: A ViT-based encoder processes both input images into tokens.
Token Integration: Image tokens are concatenated with a learnable Pose Token and an Intrinsics Token (derived from the camera intrinsics).
Decoder: A ViT-based decoder with cross-attention layers integrates information between the two views.
Heads:
- Center, Attributes, and Velocity Heads: Predict the Gaussian parameters ($\mu, r, s, h, o, v$) for each pixel.
- Pose Head: A 3-layer MLP that directly predicts the relative camera pose (translation $\tau$ and rotation $q$) between the two frames.
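
To make the pose head concrete, here is a minimal sketch of a 3-layer MLP that maps a pose token to a translation and a unit quaternion. The token width, hidden width, ReLU activations, 7-value output layout, identity-rotation bias, and random weights are all assumptions for illustration; the source only states that the head is a 3-layer MLP predicting $\tau$ and $q$.

```python
import numpy as np


def init_pose_head(rng, d_token=64, d_hidden=32):
    """Create weights for a 3-layer MLP: d_token -> d_hidden -> d_hidden -> 7."""
    sizes = [d_token, d_hidden, d_hidden, 7]
    params = [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
              for m, n in zip(sizes[:-1], sizes[1:])]
    # Assumption: bias the quaternion output toward the identity rotation (w = 1).
    params[-1][1][3] = 1.0
    return params


def pose_head(params, pose_token):
    """Map a pose token to (tau, q): 3-D translation and unit quaternion."""
    h = pose_token
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)  # ReLU on the two hidden layers only
    tau, q = h[:3], h[3:]
    q = q / (np.linalg.norm(q) + 1e-8)  # normalize so q is a valid rotation
    return tau, q


rng = np.random.default_rng(0)
params = init_pose_head(rng)
tau, q = pose_head(params, rng.normal(size=64))
print(tau.shape, q.shape, np.linalg.norm(q))
```

Normalizing the quaternion keeps the predicted rotation valid regardless of the raw network output, which is a common choice when regressing rotations directly.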

C. Differentiable 4D Rasterization

A critical component is the extension of the 3D Gaussian Splatting rasterizer to handle 4D data.

Unified Rendering: The rasterizer renders not only color images but also dense 3D point maps and 3D scene flow at any time step $t'$ and view.
Mechanism: By substituting the color term in the alpha-blending equation with other Gaussian attributes (e.g., position $\mu$ or velocity $v$), the system generates geometric and motion maps that are fully differentiable.
Benefit: This allows gradients from rendered outputs to backpropagate through the entire pipeline, enabling joint optimization of all heads.
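
The attribute-substitution idea can be sketched for a single pixel: the same front-to-back alpha-blending weights are applied to whichever per-Gaussian attribute is being rendered. The function name `composite` and the toy values are assumptions for this example; projection, depth sorting, and any normalization by accumulated opacity are omitted.

```python
import numpy as np


def composite(alphas, attrs):
    """Front-to-back alpha blending of per-Gaussian attributes for one pixel.

    alphas: (N,) effective opacities of the depth-sorted Gaussians at this pixel.
    attrs:  (N, C) one attribute per Gaussian (color, center mu, or velocity v).
    """
    alphas = np.asarray(alphas, dtype=float)
    attrs = np.asarray(attrs, dtype=float)
    # Transmittance reaching Gaussian i: T_i = prod_{j < i} (1 - alpha_j).
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = transmittance * alphas
    return weights @ attrs


alphas = [0.5, 0.5]                           # nearest Gaussian first
colors = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
velocities = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

pixel_color = composite(alphas, colors)       # blending weights are [0.5, 0.25]
pixel_flow = composite(alphas, velocities)    # same weights, different attribute
print(pixel_color, pixel_flow)
```

Because the weights are smooth functions of the opacities and the output is linear in the attributes, gradients from a rendered color, point, or flow map can reach every Gaussian parameter involved.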

D. Training Strategy: Semi-Supervised Learning

To overcome data scarcity, UFO-4D employs a hybrid loss function:
$L_{total} = L_{sup} + L_{self}$

Supervised Loss ($L_{sup}$): Uses sparse ground truth (where available) for motion, points, and pose.
Self-Supervised Loss ($L_{self}$):
- Photometric Loss: Minimizes MSE and LPIPS between input images and rendered images. This provides dense supervision independent of ground truth labels.
- Smoothness Loss: Encourages spatial smoothness in rendered point and motion maps, weighted by image edges to preserve boundaries.

Synergy: Because geometry, motion, and appearance share the same Gaussian primitives, supervising one modality (e.g., photometric loss on the image) inherently regularizes the others (geometry and motion).
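
The self-supervised terms can be sketched as follows, assuming a common edge-aware formulation in which gradients of the rendered map are down-weighted by $\exp(-|\nabla I|)$ at image edges. The exact weighting used by UFO-4D is not given in this summary, so the formula, function names, and toy data are assumptions. The LPIPS term is left out because it requires a pretrained network.

```python
import numpy as np


def photometric_mse(rendered, target):
    """Mean squared error between a rendered image and the input image."""
    return float(np.mean((rendered - target) ** 2))


def edge_aware_smoothness(field, image):
    """Penalize gradients of a rendered (H, W, C) map, less so at image edges."""
    d_field_x = np.abs(field[:, 1:] - field[:, :-1]).mean(axis=-1)
    d_field_y = np.abs(field[1:, :] - field[:-1, :]).mean(axis=-1)
    d_img_x = np.abs(image[:, 1:] - image[:, :-1]).mean(axis=-1)
    d_img_y = np.abs(image[1:, :] - image[:-1, :]).mean(axis=-1)
    # exp(-|dI|) is 1 in flat image regions and shrinks at strong edges.
    loss_x = (d_field_x * np.exp(-d_img_x)).mean()
    loss_y = (d_field_y * np.exp(-d_img_y)).mean()
    return float(loss_x + loss_y)


H, W = 4, 4
flow = np.zeros((H, W, 3))
flow[:, 2:] = 1.0                      # motion boundary between columns 1 and 2
flat_image = np.zeros((H, W, 3))
edge_image = np.zeros((H, W, 3))
edge_image[:, 2:] = 1.0                # image edge at the same location

loss_flat = edge_aware_smoothness(flow, flat_image)
loss_edge = edge_aware_smoothness(flow, edge_image)
print(loss_flat, loss_edge)            # the aligned image edge lowers the penalty
```

In this toy case the motion boundary is penalized less when it coincides with an image edge, which is the behavior the smoothness term is meant to encourage.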

3. Key Contributions

Unified Feedforward Model: The first architecture to jointly estimate dense 3D geometry, 3D motion, and camera pose from two unposed images in a single forward pass using Dynamic 3D Gaussians.
Robust Semi-Supervision: A framework that leverages differentiable rendering to utilize self-supervised photometric losses, effectively mitigating the lack of dense 4D ground truth annotations.
4D Interpolation Capability: The explicit 4D representation enables high-fidelity interpolation of images, depth, and motion at novel views and intermediate time steps, a capability not present in per-frame point cloud methods.
State-of-the-Art Performance: Demonstrates significant improvements over existing methods on standard benchmarks.

4. Experimental Results

The model was evaluated on Stereo4D, KITTI, Bonn, and Sintel datasets.

Geometry Estimation: UFO-4D achieves the lowest End-Point Error (EPE) for pointmaps and competitive depth accuracy (Abs. Rel.) across all datasets. It significantly outperforms competitors like DynaDUSt3R and MonST3R on Stereo4D.
Motion Estimation: The method achieves a 3× lower EPE on Stereo4D and KITTI compared to the best competing methods. It successfully disentangles object motion from camera ego-motion, producing sharp motion boundaries.
Pose Estimation: UFO-4D outperforms methods relying on iterative PnP solvers (like MonST3R and St4RTrack) in both Absolute Trajectory Error (ATE) and Relative Pose Error (RPE), suggesting that direct feedforward pose estimation can be more accurate and robust than solver-based estimation on these benchmarks.
Qualitative Analysis:
- Opacity as Confidence: The model learns to assign high opacity to visible regions and low opacity to occluded/disoccluded regions, effectively handling visibility changes.
- Ablation Studies: Removing the photometric loss or the rendering losses for points/motion significantly degrades performance, confirming the necessity of the unified 4D rasterization for joint optimization.
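
For reference, the End-Point Error used in the geometry and motion results is commonly computed as the mean Euclidean distance between predicted and ground-truth 3D vectors over valid pixels. The sketch below assumes that standard definition; the function name, the optional validity mask, and the toy arrays are illustrative, and the benchmarks' exact evaluation protocols may differ.

```python
import numpy as np


def end_point_error(pred, gt, valid=None):
    """Mean Euclidean distance between (H, W, 3) predictions and ground truth.

    valid: optional (H, W) boolean mask selecting pixels that have ground truth.
    """
    err = np.linalg.norm(pred - gt, axis=-1)
    if valid is not None:
        err = err[valid]
    return float(err.mean())


pred = np.zeros((2, 2, 3))
gt = np.tile([3.0, 4.0, 0.0], (2, 2, 1))   # every pixel is off by a 3-4-5 vector
gt[1, 1] = 0.0                             # one pixel predicted exactly

mask = np.ones((2, 2), dtype=bool)
mask[1, 1] = False                         # pretend this pixel has no ground truth

epe_all = end_point_error(pred, gt)
epe_masked = end_point_error(pred, gt, mask)
print(epe_all, epe_masked)
```

The mask matters for datasets with sparse ground truth, where only a subset of pixels can contribute to the metric.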

5. Significance and Impact

Efficiency: By replacing slow test-time optimization with a single feedforward pass, UFO-4D enables real-time or near-real-time 4D reconstruction.
Data Efficiency: The self-supervised approach reduces reliance on expensive, dense 4D annotations, making the model more applicable to real-world scenarios where such data is unavailable.
Unified Representation: The use of Dynamic 3D Gaussians as a bottleneck creates a "synergistic" effect where improvements in one task (e.g., better pose estimation) directly improve others (e.g., better motion estimation), solving the problem of fragmented task-specific models.
New Applications: The ability to interpolate 4D data opens doors for applications in video synthesis, AR/VR, and robotics, where understanding the continuous spatio-temporal evolution of a scene is crucial.

In conclusion, UFO-4D represents a paradigm shift in dynamic scene reconstruction, moving from iterative, task-specific pipelines to a unified, explicit, and efficient feedforward framework capable of handling the full spectrum of 4D perception tasks.