Direction-aware 3D Large Multimodal Models

This paper addresses the lack of ego poses in existing 3D benchmarks by introducing a new paradigm featuring PoseRecover and PoseAlign, which automatically recover camera poses and align point clouds with them, significantly enhancing the directional reasoning capabilities of 3D Large Multimodal Models.

Quan Liu, Weihao Xuan, Junjue Wang, Naoto Yokoya, Ling Shao, Shijian Lu

Published 2026-02-24

Imagine you are standing in a completely dark room, and someone hands you a 3D hologram of that room. They ask you, "What is on the left of the sofa?"

Here's the problem: The hologram is just a floating cloud of points. It doesn't know which way you are facing. Is the sofa's "left" the side near the window, or the side near the door? Without knowing where you (the observer) are standing and which way you are looking, the question "What is on the left?" is impossible to answer correctly. It's like asking, "Which way is North?" without knowing where you are on the map.

This is exactly the problem the paper "Direction-aware 3D Large Multimodal Models" solves.

The Problem: The "Blindfolded" AI

Currently, most AI models that understand 3D rooms are trained on benchmarks (like ScanRefer or ScanQA) where the "camera" (the AI's eyes) is missing. The datasets have the 3D room and the questions, but they forgot to save the record of where the camera was standing when the question was asked.

Because of this missing "self-location" data (called ego pose), the AI is essentially blindfolded. It tries to guess directions like "left" or "right" based on a global map, which leads to confusion and wrong answers.

The Solution: Two New Tools

The authors propose a simple but brilliant two-step fix to wake the AI up to its own position.

1. PoseRecover: The "Time Traveler" Detective

Since the original datasets forgot to save the camera's location, the authors built a tool called PoseRecover to find it.

  • The Analogy: Imagine you lost a specific photo in a massive library of 10,000 photos. You remember the photo had a "red chair" in it. PoseRecover is a detective that scans the entire library, finds every photo containing a red chair, and checks: "In this photo, is the red chair actually visible, or is it blocked by a wall?"
  • How it works: The tool looks at the 3D room and the question (e.g., "What is to the left of the bed?"). It then scans through all the original video frames of that room to find the specific camera angles where the bed is clearly visible. It picks the best angle that matches the question.
  • The Result: It recovers the "missing link"—the exact position and direction the camera was facing when the question was asked.
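To make the detective analogy concrete, here is a minimal sketch of the kind of visibility check PoseRecover-style frame selection relies on: project the target object's points into each candidate camera and keep the frame that sees the most of it. The function names (`recover_pose`, `score_frame`), the pinhole intrinsics, and the "fraction of points in frame" score are illustrative assumptions, not the paper's actual implementation (which also handles occlusion by walls).

```python
import numpy as np

def project_visible(points_world, pose_w2c, fx, fy, cx, cy, width, height):
    """Project 3D world points through a world-to-camera pose and a pinhole
    camera; return a boolean mask of points that land inside the image."""
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])
    pts_cam = (pose_w2c @ pts_h.T).T[:, :3]
    z = pts_cam[:, 2]
    in_front = z > 0                      # behind the camera -> invisible
    z_safe = np.where(in_front, z, 1.0)   # avoid dividing by zero/negative z
    u = fx * pts_cam[:, 0] / z_safe + cx
    v = fy * pts_cam[:, 1] / z_safe + cy
    in_frame = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return in_front & in_frame

def score_frame(object_points, pose_w2c, intrinsics, size):
    """Score a candidate frame by the fraction of object points it sees."""
    fx, fy, cx, cy = intrinsics
    w, h = size
    return project_visible(object_points, pose_w2c, fx, fy, cx, cy, w, h).mean()

def recover_pose(object_points, candidate_poses, intrinsics, size):
    """Pick the candidate camera pose that sees the most of the target object
    (e.g. the bed mentioned in the question)."""
    scores = [score_frame(object_points, P, intrinsics, size)
              for P in candidate_poses]
    return int(np.argmax(scores)), scores
```

In this toy form, scanning "every photo in the library" is just scoring each recorded video frame's pose and taking the argmax; a real system would add depth-based occlusion tests so a bed hidden behind a wall does not count as visible.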

2. PoseAlign: The "Rotating Table"

Now that we have the camera's location, we need to feed it to the AI. The authors tried three ways to do this, but one worked best.

  • The Analogy: Imagine you are sitting at a round table with a plate of food. If you want to know what is on your "left," you don't need to describe the table's coordinates to your brain; you just turn your head.
  • The Best Method (PoseAlign-Transform): Instead of trying to explain the camera's position to the AI using complex math or text (which is confusing), they simply rotate the 3D room so that the "camera's view" becomes the "AI's view."
    • If the camera was facing North, they spin the entire 3D room so North becomes "Forward" for the AI.
    • Now, when the AI sees the room, "Left" actually means "Left" relative to the camera. The AI doesn't need to guess; the geometry is already aligned.
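The "rotating table" trick above is just a change of coordinates: subtract the camera position, then rotate so the camera's viewing direction becomes the "forward" axis. Here is a minimal sketch under assumed conventions (x = right, y = forward, z = up in the observer's frame); the function name `align_to_view` and the convention choice are illustrative, not the paper's code.

```python
import numpy as np

def align_to_view(points_world, cam_position, cam_forward, up=(0.0, 0.0, 1.0)):
    """Re-express a point cloud in the observer's egocentric frame.
    Assumed output convention: x = right, y = forward, z = up, so
    'left of the camera' simply means aligned x < 0."""
    f = np.asarray(cam_forward, dtype=float)
    f = f / np.linalg.norm(f)
    u = np.asarray(up, dtype=float)
    r = np.cross(f, u)                 # right-hand rule: forward x up = right
    r = r / np.linalg.norm(r)
    u2 = np.cross(r, f)                # re-orthogonalized up
    R = np.stack([r, f, u2])           # rows: right, forward, up
    shifted = np.asarray(points_world, dtype=float) - np.asarray(cam_position, dtype=float)
    return shifted @ R.T               # rotate world coords into the view frame
```

After this transform, "left" and "right" are literal signs of a coordinate, so the model no longer has to infer the observer's heading from a global map; the geometry itself already encodes it.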

Why This Matters

The results are like turning on the lights in a dark room.

  • Before: The AI was guessing directions and getting them wrong about 30-50% of the time on tricky questions.
  • After: With the room rotated to match the camera's view, the AI's accuracy jumped significantly (up to 30% improvement in some tasks).

The Big Picture

This paper argues that for AI to truly understand 3D spaces (like a robot navigating a house), it needs to know where it is standing.

  • Old Way: Give the AI a map and ask, "What's on the left?" (AI: Confused. Which left?)
  • New Way: Give the AI the map, tell it "You are standing here, facing this way," and then ask, "What's on the left?" (AI: Ah, I see! The lamp is on the left.)

The authors show that you don't need to build a brand new, super-complex AI to do this. You just need to fix the data (PoseRecover) and rotate the room to match the view (PoseAlign). It's a simple, "free lunch" upgrade that makes existing AI models much smarter at understanding space.
