LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight

LocateAnything3D introduces a vision-language-model-native framework that reframes 3D detection as next-token prediction, using a Chain-of-Sight reasoning strategy to achieve state-of-the-art performance and zero-shot generalization on the Omni3D benchmark.

Yunze Man, Shihao Wang, Guowen Zhang, Johan Bjorck, Zhiqi Li, Liang-Yan Gui, Jim Fan, Jan Kautz, Yu-Xiong Wang, Zhiding Yu

Published 2026-02-24

Imagine you are trying to teach a robot how to navigate a messy living room. You could give it a 2D photo and say, "There's a chair there." But if the robot tries to walk over, it might trip because it doesn't know how far away the chair is, how big it is, or which way it's facing.

For a long time, Vision-Language Models (VLMs)—the AI brains that can see pictures and talk about them—were great at describing 2D images but terrible at understanding the 3D world. They were like a tour guide who could describe a painting perfectly but had no idea how deep the room was.

LocateAnything3D is a new method that finally teaches these AI models to "see" in 3D, and it does so by mimicking how humans naturally think. Here is the simple breakdown:

1. The Core Idea: "Chain-of-Sight" (CoS)

Think of the old way of doing 3D detection as trying to guess the entire shape of a mystery object in the dark all at once. It's hard, and you often get it wrong.

The authors propose a new way called Chain-of-Sight. Imagine you are looking at a photo of a coffee cup on a table. Instead of guessing the 3D shape immediately, your brain does this:

  1. Step 1 (The 2D Anchor): "Okay, I see a round shape in the middle of the photo. That's the cup." (You locate it in the picture).
  2. Step 2 (The 3D Leap): "Since it's in the middle and looks small, it must be a few feet away. It's probably 4 inches tall."

The AI does the exact same thing. It first predicts the 2D box (a rectangle on the screen) and then uses that box as a stepping stone to predict the 3D box (the actual size and distance in the real world).

The Analogy: It's like building a house. You don't just magically conjure the roof; you first lay the foundation (the 2D box), then build the walls (the size), and finally put on the roof (the rotation). By forcing the AI to lay the foundation first, the whole structure becomes much more stable.
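The "3D leap" in Step 2 can be made concrete with a toy pinhole-camera calculation: once the model has a 2D box and a depth estimate, metric size follows from simple geometry. The function name and the focal-length numbers below are illustrative assumptions, not values from the paper.

```python
# Toy version of the 2D-anchor -> 3D-leap step (illustrative, not the
# paper's exact method): back-project a 2D box height to metric height
# using the pinhole camera model.

def metric_height(box_height_px, depth_m, focal_px):
    """Real-world height of an object whose 2D box spans box_height_px
    pixels, seen at depth_m meters with focal length focal_px pixels."""
    return box_height_px * depth_m / focal_px

# A cup spanning 60 px, seen 1.5 m away with a 900 px focal length:
print(metric_height(60, 1.5, 900))  # → 0.1 (about 10 cm tall)
```

This is why the 2D anchor helps so much: the same pixel height is ambiguous on its own, but combined with depth it pins down real size.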

2. The Order Matters: "Near to Far"

When a human looks at a room, they usually notice the coffee cup on the table before the painting on the far wall. Earlier detection models simply scanned images in raster order (left to right, top to bottom). But in 3D that ordering is confusing, because a small, distant object can sit right next to a large, nearby one in the image.

LocateAnything3D teaches the AI to scan Near to Far.

  • Why? Objects close to the camera give the AI the strongest clues. Once the AI knows exactly where the "near" objects are, it can use them as a ruler to guess where the "far" objects are.
  • The Metaphor: Imagine trying to guess the distance of a mountain. If you don't know where the trees in the foreground are, the mountain looks ambiguous. But if you know the trees are 10 feet away, you can use them to estimate the mountain is 1,000 feet away. The AI uses nearby objects as a "ruler" for the rest of the scene.
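The ordering idea itself is simple to sketch: sort detections by camera depth before serializing them into the output sequence. The field names below are assumptions for illustration.

```python
# Sketch of near-to-far ordering: nearer objects are emitted first, so
# they can anchor the scale of the rest of the scene. Field names are
# illustrative, not the paper's schema.

def near_to_far(objects):
    """Sort detections by distance from the camera, nearest first."""
    return sorted(objects, key=lambda obj: obj["depth"])

scene = [
    {"name": "painting", "depth": 6.0},
    {"name": "cup",      "depth": 1.5},
    {"name": "chair",    "depth": 3.2},
]
print([obj["name"] for obj in near_to_far(scene)])
# → ['cup', 'chair', 'painting']
```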

3. The "Recipe" for Success

The paper introduces a specific "recipe" for the AI to follow, which they call a Curriculum (like a school syllabus):

  • First: Find the object in the 2D picture (The "Where").
  • Second: Figure out how big it is (The "How Big").
  • Third: Figure out which way it's turned (The "Which Way").

This order is crucial. It's much easier to guess "where" something is before guessing "how big" it is. If you get the location wrong, your guess about the size will be wildly off. This step-by-step approach stops the AI from getting confused and "hallucinating" (making things up).
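Putting the recipe together, each detection becomes a short token sequence emitted in curriculum order: 2D location first, then the 3D attributes. The token tags and field names below are a minimal sketch of this idea, not the paper's exact output schema.

```python
# Sketch of serializing one detection in Chain-of-Sight order:
# 2D anchor -> depth -> size -> rotation. Tags and fields are
# illustrative assumptions, not the paper's exact format.

def chain_of_sight_tokens(obj):
    """Serialize a detection: the 'where' before the 'how big' and 'which way'."""
    x1, y1, x2, y2 = obj["box2d"]    # pixel-space rectangle (the 2D anchor)
    w, h, l = obj["size3d"]          # metric width / height / length
    return " ".join([
        f"<2d> {x1} {y1} {x2} {y2}",   # first: where in the image
        f"<depth> {obj['depth']}",     # then: how far away
        f"<size> {w} {h} {l}",         # then: how big
        f"<rot> {obj['yaw']}",         # finally: which way it faces
    ])

cup = {"box2d": (120, 200, 180, 260),
       "depth": 1.5,
       "size3d": (0.08, 0.1, 0.08),
       "yaw": 0.0}
print(chain_of_sight_tokens(cup))
# → <2d> 120 200 180 260 <depth> 1.5 <size> 0.08 0.1 0.08 <rot> 0.0
```

Because the model decodes tokens left to right, this fixed ordering is what lets each later attribute condition on the earlier, easier ones.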

4. Why This is a Big Deal

  • It's "Open-World": You can ask the AI to find "a red chair," "a weirdly shaped vase," or "a cat," and it will work. It doesn't need to be pre-trained on a specific list of objects.
  • It's Flexible: You can talk to it ("Find the chair") or point at it on the screen ("Click here, what's the 3D box?"), and it understands both.
  • It sets the state of the art: On the challenging Omni3D benchmark, this method outperformed all previous models by a wide margin. It even beat models that were given "cheat codes" (ground-truth 2D boxes to start from).

Summary

LocateAnything3D is like teaching a robot to drive by first teaching it to recognize traffic signs on a flat map, and then teaching it how far away those signs are in the real world. By breaking the complex problem of 3D vision into a simple, step-by-step conversation (2D first, then 3D; near first, then far), the AI finally learns to perceive the world in a way that is safe, accurate, and ready for real-world tasks like self-driving cars or home robots.
