Direction-aware 3D Large Multimodal Models

This paper addresses the lack of ego poses in existing 3D benchmarks by introducing a new paradigm featuring PoseRecover and PoseAlign, which automatically recover camera poses and align point clouds with them, significantly enhancing the directional reasoning capabilities of 3D Large Multimodal Models.

Quan Liu, Weihao Xuan, Junjue Wang, Naoto Yokoya, Ling Shao, Shijian Lu

Published 2026-02-24

Imagine you are standing in a completely dark room, and someone hands you a 3D hologram of that room. They ask you, "What is on the left of the sofa?"

Here's the problem: The hologram is just a floating cloud of points. It doesn't know which way you are facing. Is the sofa's "left" the side near the window, or the side near the door? Without knowing where you (the observer) are standing and which way you are looking, the question "What is on the left?" is impossible to answer correctly. It's like asking, "Which way is North?" without knowing where you are on the map.

This is exactly the problem the paper "Direction-aware 3D Large Multimodal Models" solves.

The Problem: The "Blindfolded" AI

Currently, most AI models that understand 3D rooms are trained on benchmarks (like ScanRefer or ScanQA) where the "camera" (the AI's eyes) is missing. The datasets have the 3D room and the questions, but they forgot to save the record of where the camera was standing when the question was asked.

Because of this missing "self-location" data (called ego pose), the AI is essentially blindfolded. It tries to guess directions like "left" or "right" based on a global map, which leads to confusion and wrong answers.

The Solution: Two New Tools

The authors propose a simple but brilliant two-step fix to wake the AI up to its own position.

1. PoseRecover: The "Time Traveler" Detective

Since the original datasets forgot to save the camera's location, the authors built a tool called PoseRecover to find it.

  • The Analogy: Imagine you lost a specific photo in a massive library of 10,000 photos. You remember the photo had a "red chair" in it. PoseRecover is a detective that scans the entire library, finds every photo containing a red chair, and checks: "In this photo, is the red chair actually visible, or is it blocked by a wall?"
  • How it works: The tool looks at the 3D room and the question (e.g., "What is to the left of the bed?"). It then scans through all the original video frames of that room to find the specific camera angles where the bed is clearly visible. It picks the best angle that matches the question.
  • The Result: It recovers the "missing link"—the exact position and direction the camera was facing when the question was asked.
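To make the detective analogy concrete, here is a minimal sketch of the kind of visibility check PoseRecover-style frame selection relies on: project the target object's points into each candidate camera and keep the frame that sees the most of it. The function names (`recover_pose`, `score_frame`), the pinhole intrinsics, and the "fraction of points in frame" score are illustrative assumptions, not the paper's actual implementation (which also handles occlusion by walls).

```python
import numpy as np

def project_visible(points_world, pose_w2c, fx, fy, cx, cy, width, height):
    """Project 3D world points through a world-to-camera pose and a pinhole
    camera; return a boolean mask of points that land inside the image."""
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])
    pts_cam = (pose_w2c @ pts_h.T).T[:, :3]
    z = pts_cam[:, 2]
    in_front = z > 0                      # behind the camera -> invisible
    z_safe = np.where(in_front, z, 1.0)   # avoid dividing by zero/negative z
    u = fx * pts_cam[:, 0] / z_safe + cx
    v = fy * pts_cam[:, 1] / z_safe + cy
    in_frame = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return in_front & in_frame

def score_frame(object_points, pose_w2c, intrinsics, size):
    """Score a candidate frame by the fraction of object points it sees."""
    fx, fy, cx, cy = intrinsics
    w, h = size
    return project_visible(object_points, pose_w2c, fx, fy, cx, cy, w, h).mean()

def recover_pose(object_points, candidate_poses, intrinsics, size):
    """Pick the candidate camera pose that sees the most of the target object
    (e.g. the bed mentioned in the question)."""
    scores = [score_frame(object_points, P, intrinsics, size)
              for P in candidate_poses]
    return int(np.argmax(scores)), scores
```

In this toy form, scanning "every photo in the library" is just scoring each recorded video frame's pose and taking the argmax; a real system would add depth-based occlusion tests so a bed hidden behind a wall does not count as visible.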

2. PoseAlign: The "Rotating Table"

Now that we have the camera's location, we need to feed it to the AI. The authors tried three ways to do this, but one worked best.

  • The Analogy: Imagine you are sitting at a round table with a plate of food. If you want to know what is on your "left," you don't need to describe the table's coordinates to your brain; you just turn your head.
  • The Best Method (PoseAlign-Transform): Instead of trying to explain the camera's position to the AI using complex math or text (which is confusing), they simply rotate the 3D room so that the "camera's view" becomes the "AI's view."
    • If the camera was facing North, they spin the entire 3D room so North becomes "Forward" for the AI.
    • Now, when the AI sees the room, "Left" actually means "Left" relative to the camera. The AI doesn't need to guess; the geometry is already aligned.
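The "rotating table" trick above is just a change of coordinates: subtract the camera position, then rotate so the camera's viewing direction becomes the "forward" axis. Here is a minimal sketch under assumed conventions (x = right, y = forward, z = up in the observer's frame); the function name `align_to_view` and the convention choice are illustrative, not the paper's code.

```python
import numpy as np

def align_to_view(points_world, cam_position, cam_forward, up=(0.0, 0.0, 1.0)):
    """Re-express a point cloud in the observer's egocentric frame.
    Assumed output convention: x = right, y = forward, z = up, so
    'left of the camera' simply means aligned x < 0."""
    f = np.asarray(cam_forward, dtype=float)
    f = f / np.linalg.norm(f)
    u = np.asarray(up, dtype=float)
    r = np.cross(f, u)                 # right-hand rule: forward x up = right
    r = r / np.linalg.norm(r)
    u2 = np.cross(r, f)                # re-orthogonalized up
    R = np.stack([r, f, u2])           # rows: right, forward, up
    shifted = np.asarray(points_world, dtype=float) - np.asarray(cam_position, dtype=float)
    return shifted @ R.T               # rotate world coords into the view frame
```

After this transform, "left" and "right" are literal signs of a coordinate, so the model no longer has to infer the observer's heading from a global map; the geometry itself already encodes it.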

Why This Matters

The results are like turning on the lights in a dark room.

  • Before: The AI was guessing directions and getting them wrong about 30-50% of the time on tricky questions.
  • After: With the room rotated to match the camera's view, the AI's accuracy jumped significantly (up to 30% improvement in some tasks).

The Big Picture

This paper argues that for AI to truly understand 3D spaces (like a robot navigating a house), it needs to know where it is standing.

  • Old Way: Give the AI a map and ask, "What's on the left?" (AI: Confused. Which left?)
  • New Way: Give the AI the map, tell it "You are standing here, facing this way," and then ask, "What's on the left?" (AI: Ah, I see! The lamp is on the left.)

The authors show that you don't need to build a brand new, super-complex AI to do this. You just need to fix the data (PoseRecover) and rotate the room to match the view (PoseAlign). It's a simple, "free lunch" upgrade that makes existing AI models much smarter at understanding space.
