Imagine you are teaching a robot to understand a 3D world using only a video camera. The robot needs to answer questions like, "Where is the red chair?" or "How far is the table from the sofa?"
For a long time, the standard way to teach this robot was Supervised Fine-Tuning (SFT). Think of SFT as a strict teacher who only cares if the robot's final answer matches the textbook exactly, word-for-word.
- The Problem with the Old Way: The robot was trained to mimic the words of the correct answer. But in 3D space, the "words" are just numbers (coordinates). If the robot says the chair is at "1.0, 2.0, 3.0" instead of "1.0, 2.1, 3.0," the teacher marks it wrong because the numbers don't match perfectly. However, in the real world, that tiny difference might mean the robot is actually very close to the chair. The teacher was grading the robot on spelling, not on accuracy. This created a gap: the robot got good at copying text, but bad at actually understanding space.
Enter the new method from this paper: 3D-RFT (Reinforcement Fine-Tuning).
The New Approach: The Video Game Coach
Instead of a strict teacher, imagine a Video Game Coach who uses a scoring system based on real-world results. This is the core idea of 3D-RFT.
Here is how it works, using a simple analogy:
1. The Warm-Up (SFT)
First, the robot needs to learn the basics. We show it examples of how to talk about 3D objects so it doesn't get confused. It learns the "grammar" of 3D space. This is just a warm-up to get it ready.
2. The Real Training (Reinforcement Learning)
Now, the robot starts playing a game. Every time it tries to find an object or answer a question, it gets a score based on how well it actually did, not just how well it copied the answer key.
- The Old Way (SFT): "You wrote '1.0, 2.0, 3.0'. The answer key says '1.0, 2.1, 3.0'. You get a zero. Try to copy the key better next time."
- The New Way (3D-RFT): "You pointed to the chair. Let's measure the distance. You are 95% accurate! That's a great score! Now, try to get 98%."
The robot learns by trial and error, trying to maximize its score (like 3D IoU or F1-Score) rather than just copying text. It's like training a basketball player not by making them memorize a script of a perfect shot, but by letting them shoot thousands of times and rewarding them only when the ball actually goes through the hoop.
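To make the contrast concrete, here is a minimal sketch of the two scoring philosophies. The function names and the simple distance-based score are illustrative assumptions, not the paper's exact reward formulas:

```python
def sft_reward(prediction: str, answer: str) -> float:
    """Old way: exact text match. '1.0, 2.0, 3.0' vs '1.0, 2.1, 3.0' scores zero."""
    return 1.0 if prediction.strip() == answer.strip() else 0.0

def rft_reward(pred_xyz, true_xyz, scale: float = 1.0) -> float:
    """New way (sketch): the score shrinks smoothly with real-world distance,
    so a nearly-correct coordinate still earns most of the credit."""
    dist = sum((p - t) ** 2 for p, t in zip(pred_xyz, true_xyz)) ** 0.5
    return max(0.0, 1.0 - dist / scale)

# The near-miss from the example above:
print(sft_reward("1.0, 2.0, 3.0", "1.0, 2.1, 3.0"))  # 0.0 -- "copy the key better"
print(rft_reward((1.0, 2.0, 3.0), (1.0, 2.1, 3.0)))  # ~0.9 -- "you were very close"
```

The point is the shape of the signal: a graded, verifiable score gives the model something to climb, while an exact-match check gives it a cliff.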
Why is this a Big Deal?
The paper shows that this "Video Game Coach" approach pays off dramatically for 3D understanding:
- Small Models Beat Big Models: The researchers built a model called 3D-RFT-4B (which is relatively small). Because it was trained with this smart scoring system, it beat much larger, more expensive models (such as 8B-parameter models) at finding objects and understanding space. It's like a small, well-trained athlete beating a giant who has never practiced the actual sport.
- Fewer "Hallucinations": The old models often made up things that weren't there (like seeing a table where there was only a shadow). The new model, because it's rewarded for actual accuracy, learned to be much more careful and precise.
- Better Reasoning: It didn't just get better at finding things; it got better at thinking. When asked, "If I stand here, where is the door?", the model could figure out the spatial relationship much more accurately.
The Secret Sauce: Verifiable Rewards
The key innovation is Verifiable Rewards. In math or coding, you can easily check if an answer is right or wrong. In 3D vision, it's harder. The authors created a special "calculator" that takes the robot's answer, turns it into a 3D shape, and measures it against the real world to give a precise score.
- For finding objects: It calculates the "Intersection over Union" (IoU)—basically, how much does your drawn box overlap with the real object?
- For finding specific items: It checks if the robot found the right frame in the video and the right spot.
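Here is a minimal sketch of what that IoU "calculator" might look like for axis-aligned 3D boxes (real benchmarks may use rotated boxes; `iou_3d` and the box format here are illustrative assumptions):

```python
def iou_3d(box_a, box_b) -> float:
    """3D Intersection over Union for axis-aligned boxes.
    Each box is (xmin, ymin, zmin, xmax, ymax, zmax).
    Returns overlap volume / combined volume: 1.0 is a perfect match."""
    inter = 1.0
    for axis in range(3):
        lo = max(box_a[axis], box_b[axis])
        hi = min(box_a[axis + 3], box_b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap on this axis means no 3D overlap at all
        inter *= hi - lo
    volume = lambda b: (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    union = volume(box_a) + volume(box_b) - inter
    return inter / union

# A predicted 2x2x2 box shifted 0.5 along x against the ground-truth box:
print(iou_3d((0, 0, 0, 2, 2, 2), (0.5, 0, 0, 2.5, 2, 2)))  # 0.6
```

Because this score is computed geometrically against the ground truth, it is verifiable: the model can't be rewarded for fluent-sounding text that describes the wrong place.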
The Bottom Line
This paper says: "Stop teaching robots to memorize answers. Start teaching them to win the game."
By shifting from "copying the textbook" to "maximizing the score," the researchers created a system where AI models can truly understand 3D space, reason about it, and do it better than models twice their size. It's a massive step forward for robots that need to navigate our physical world, from self-driving cars to home helpers.